Set up a federated search dataset with a Splunk-managed data catalog that Splunk software creates and maintains for you.
- On your Splunk Cloud Platform deployment, in the Data Management app, at the Configure dataset step of the Create dataset workflow, select I don't have a catalog.
- Identify which of the following non-table formats your data is stored in: Parquet, CSV, or JSON. If your data is in JSON or CSV format, indicate whether it is compressed with Gzip or is Uncompressed.
- (Optional) If this dataset is updated on an ongoing basis and you want your Splunk-native data catalog to be updated automatically so that it stays consistent with the dataset it represents, answer Yes to Do you want to keep catalog in sync with dataset?
Set up an SQS queue and event notification for the Amazon S3 bucket that contains the dataset. Paste the ARN that you obtain when you set up the SQS queue into the SQS queue ARN field. For detailed instructions, see Set up automated updates for Splunk-native data catalogs in AWS.
Note: You can skip this step if your dataset is composed of historical data that is not subject to future updates, or if you are not interested in keeping your data catalog in sync with changes to your dataset.
- Indicate how you want to define the schema for your data catalog.
- Select Define schema manually if you want to manually determine the columns in your dataset. You can use a Field list view or JSON view. If you select JSON view, your input must match the JSON data schema (dataSchema). For more information, see JSON standards for the data and partition schemas.
- Select Discover schema via crawler if you want a crawler to scan a sample set of files from your dataset to infer the overall dataset schema. Use Number of files to scan to tell the crawler how many files it should scan. The crawler begins its scan after you complete the Update policies step.
Note: The crawler can be applied only to data that is in Parquet, CSV, or JSON format. All files sampled by the crawler must share the same schema. Inconsistent schemas across sampled files might result in a data catalog with incorrectly inferred fields.
- (Optional) Select Define the time field if your dataset contains time-series data and you intend to use time-based filtering or SPL2 time functions when you run federated searches over it.
- If you select Define the time field, fill out the Time settings: Time field, Time format, and Unix time field.
- If your dataset is partitioned into data subsets, indicate what those partitions are. Answer Are your partitions Hive-compatible?
- Yes: Define your partitions manually or let Splunk software identify the partitions for you using a crawler. If you choose to define your partitions manually, you can use a Field list view or a JSON view. If you select JSON view, your input must match the partition schema (dataPartition.PartitionSchema). For more information, see JSON standards for the data and partition schemas.
Note: When you choose to let Splunk software identify the partitions for you, the crawler cannot determine which of your partitions are time partitions. If your partitions include time partitions, define them manually in the next step.
- No: Define partitions manually using a Field list view or a JSON view.
- I don't have partitions: Select Next to go to the dataset Review page.
- (Optional) If you have time-based partitions, identify them under Time partition settings. See Identify time partitions.
- Select Next to move on to the Update policies step.
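
If you are unsure how to answer Are your partitions Hive-compatible?, the following minimal Python sketch might help you inspect a few object keys from your bucket before you start the workflow. It assumes the general Hive layout convention, in which each partition appears as a column=value directory segment in the object path. It is not a description of how Splunk software or its crawler evaluates partitions. The function names and the example keys are hypothetical.

```python
def parse_hive_partitions(key: str) -> list[tuple[str, str]]:
    """Return (column, value) pairs for each `column=value` directory segment.

    Assumption: partitions follow the common Hive convention, for example
    `year=2024/month=01/`. The final path segment is treated as the file name.
    """
    pairs = []
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        # Keep only segments that have a non-empty name and a non-empty value.
        if sep and name and value:
            pairs.append((name, value))
    return pairs


def looks_hive_compatible(keys: list[str]) -> bool:
    """True when every key carries the same non-empty, ordered partition columns."""
    column_orders = {
        tuple(name for name, _ in parse_hive_partitions(key)) for key in keys
    }
    # Exactly one column ordering must be present, and it must not be empty.
    return len(column_orders) == 1 and column_orders != {()}


if __name__ == "__main__":
    # Hypothetical object keys, for illustration only.
    hive_style = [
        "logs/year=2024/month=01/part-0.parquet",
        "logs/year=2024/month=02/part-0.parquet",
    ]
    plain_style = ["logs/2024/01/part-0.parquet"]
    print(looks_hive_compatible(hive_style))
    print(looks_hive_compatible(plain_style))
```

A check like this only looks at path naming. It cannot tell you which partitions are time partitions, so you still identify those yourself under Time partition settings.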