Create an Amazon S3 dataset for federated search that is backed by a Splunk-native data catalog

Set up an Amazon S3 dataset with a Splunk-native data catalog that Splunk software creates and maintains for you.
  1. On your Splunk Cloud Platform deployment, in the Data Management app, at the Configure dataset step of the Create dataset workflow, select I don't have a catalog.
  2. Identify the non-table format in which your data is stored: Parquet, CSV, or JSON. If your data is in JSON or CSV format, indicate whether it is compressed with Gzip or is Uncompressed.
  3. (Optional) If this dataset is updated on an ongoing basis, and you want your Splunk-native data catalog to be updated automatically so that it stays consistent with the dataset it represents, answer Yes to the question Do you want to keep catalog in sync with dataset?

    Set up an SQS queue and event notification for the Amazon S3 bucket that contains the dataset. Paste the ARN that you obtain when you set up the SQS queue into the SQS queue ARN field. For detailed instructions, see Set up automated updates for Splunk-native data catalogs in AWS.

    Note: You can skip this step if your dataset is composed of historical data that is not subject to future updates, or if you are not interested in keeping your data catalog in sync with changes to your dataset.
  4. Indicate how you want to define the schema for your data catalog.
    • Select Define schema manually if you want to manually determine the columns in your dataset. You can use a Field list view or JSON view. If you select JSON view, your input must match the JSON data schema (dataSchema). For more information, see JSON standards for the data and partition schemas.
    • Select Discover schema via crawler if you want a crawler to scan a sample set of files from your dataset to infer the overall dataset schema. Use Number of files to scan to tell the crawler how many files it should scan.
      Note: The crawler can be applied only to data that is in Parquet, CSV, or JSON format. All files sampled by the crawler must share the same schema. Inconsistent schemas across sampled files might result in a data catalog with incorrectly inferred fields in its schema.

      The crawler begins its scan after you go to the Review step and select Create. You can review the crawler-inferred schema on the Edit page for the dataset, after the crawler process has completed.

  5. (Optional) Select Define the time field if your dataset contains time-series data and you intend to use time-based filtering or SPL2 time functions when you run federated searches over it.

    If you select Define the time field, fill out the Time settings: Time field, Time format, and Unix time field.

    For more information, see Identify the time field in an Amazon S3 dataset.

  6. Indicate whether your dataset is partitioned and, if so, whether its partitions follow Hive formatting, by answering the question Are your partitions Hive-compatible?
    • Yes: Decide whether you want to Define partitions manually or let Splunk software Discover partitions via crawler.

      If you select Define partitions manually, you can use a Field list view or a JSON view. If you select JSON view, your input must match the partition schema (dataPartition.PartitionSchema). For more information, see JSON standards for the data and partition schemas.

      If you select Discover partitions via crawler, the crawler begins its scan after you select Create on the Review step. The crawler process might take a few minutes to complete.

      Note: If you have time partitions, you must ensure they are properly identified and defined. This is true whether you have selected Define partitions manually or Discover partitions via crawler. For more information and instructions, see Identify time partitions.
    • No: Define partitions manually using a Field list view or a JSON view. If you have time partitions, ensure that they are properly identified and defined. For more information, see Identify time partitions.
    • I don't have partitions: Select Next.
  7. Select Next to move on to the Update policies step.
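
The crawler note in step 4 says that all sampled files must share the same schema. Before you choose Discover schema via crawler, a quick local check of a few sample files can catch obvious mismatches. The following is a minimal sketch for CSV data only, assuming that the first row of each file is a header row. The function names and sample file contents are hypothetical, and the check is not part of Splunk software.

```python
import csv
import io


def csv_header(text):
    """Return the header row of a CSV document as a tuple of column names."""
    reader = csv.reader(io.StringIO(text))
    return tuple(next(reader, ()))


def find_schema_mismatches(samples):
    """Compare every sample's header against the first sample's header.

    `samples` maps a file name to its CSV text. Returns a dict of
    {file name: header} for the files whose header differs from the
    header of the first file.
    """
    names = list(samples)
    if not names:
        return {}
    reference = csv_header(samples[names[0]])
    return {
        name: csv_header(samples[name])
        for name in names[1:]
        if csv_header(samples[name]) != reference
    }


# Hypothetical sample files, inlined so that the sketch is self-contained.
SAMPLES = {
    "2024/01/a.csv": "ts,host,status\n1704067200,web-1,200\n",
    "2024/01/b.csv": "ts,host,status\n1704067260,web-2,404\n",
    "2024/02/c.csv": "ts,hostname,status,bytes\n1706745600,web-1,200,512\n",
}

for name, header in find_schema_mismatches(SAMPLES).items():
    print(f"{name}: header differs from the first file: {header}")
```

In this sketch only the third file is reported, because its columns differ from those of the first file. Files flagged this way are candidates for cleanup or for defining the schema manually instead.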
Next, go to Apply the dataset resource access policy to an AWS IAM role.
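
To answer the Are your partitions Hive-compatible? question in step 6, it can help to look at a few object keys from your bucket. By the general Hive convention, each partition appears in the path as a name=value directory segment. The following is a minimal sketch that extracts such segments from a key. The function names and example keys are hypothetical, and the exact rules that Splunk software applies might differ.

```python
def parse_hive_partitions(key):
    """Extract partition columns from Hive-style `name=value` path segments.

    Returns a dict of partition name -> value, in path order. The final
    segment (the file name) is ignored, as are segments without `=`.
    """
    partitions = {}
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name:
            partitions[name] = value
    return partitions


def is_hive_compatible(key, expected_names):
    """Return True if the key carries exactly the expected partitions, in order."""
    return list(parse_hive_partitions(key)) == list(expected_names)


# Hypothetical object keys for illustration.
HIVE_KEY = "logs/year=2024/month=01/day=15/part-0000.parquet"
PLAIN_KEY = "logs/2024/01/15/part-0000.parquet"

print(parse_hive_partitions(HIVE_KEY))
print(is_hive_compatible(PLAIN_KEY, ["year", "month", "day"]))
```

The first key yields the partitions year, month, and day, which suggests a Yes answer. The second key yields no partitions, so it corresponds to the No option, where partitions are defined manually.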