Create an Amazon S3 dataset for federated search that is backed by a Splunk-native data catalog

Set up an Amazon S3 dataset with a Splunk-native data catalog that Splunk software creates and maintains for you.

Your Splunk Cloud Platform deployment must be on version 10.4.2604 or higher.
Your user account on the Splunk Cloud Platform deployment must have a role with the edit_connections and edit_datasets capabilities. See Define roles on the Splunk platform with capabilities in the Splunk Cloud Platform Manage Users and Security manual.
You must have completed the Select data store, Configure connection, and Define dataset steps of the Create dataset workflow. See Define an Amazon S3 dataset.
If you plan to define dataset or partition schemas by hand using the JSON view, review the JSON standards for the data and partition schemas.

On your Splunk Cloud Platform deployment, in the Data Management app, at the Configure dataset step of the Create dataset workflow, select I don't have a catalog.
Identify which of the following non-table formats your data is stored in: Parquet, CSV, or JSON. If your data is in JSON or CSV format, also indicate whether it is compressed with Gzip or is Uncompressed.
(Optional) If this dataset is updated on an ongoing basis, and you want your Splunk-native data catalog to be updated automatically so that it stays consistent with the dataset it represents, answer Yes to the question Do you want to keep catalog in sync with dataset?
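If you are unsure how your files are stored, you can inspect their leading bytes before you answer. The following is a minimal Python sketch, not part of the Splunk workflow: it relies only on the standard gzip signature (bytes 1f 8b) and the PAR1 marker that opens and closes a Parquet file. The function name describe_file and the sample data are hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip stream
PARQUET_MAGIC = b"PAR1"    # Parquet files start and end with these four bytes


def describe_file(data: bytes) -> str:
    """Classify raw file bytes as 'parquet', 'gzip', or 'uncompressed'."""
    if len(data) >= 8 and data[:4] == PARQUET_MAGIC and data[-4:] == PARQUET_MAGIC:
        return "parquet"
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    return "uncompressed"


# In-memory samples standing in for objects downloaded from the S3 bucket.
plain_csv = b"host,status\nweb01,200\n"
gzipped_csv = gzip.compress(plain_csv)

print(describe_file(plain_csv))
print(describe_file(gzipped_csv))
```

The sketch does not distinguish CSV from JSON. It only tells you whether a file is Parquet, and whether a text file is gzip-compressed or uncompressed.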

Set up an SQS queue and event notification for the Amazon S3 bucket that contains the dataset. Paste the ARN that you obtain when you set up the SQS queue into the SQS queue ARN field. For detailed instructions, see Set up automated updates for Splunk-native data catalogs in AWS.

Note: You can skip this step if your dataset is composed of historical data that is not subject to future updates, or if you are not interested in keeping your data catalog in sync with changes to your dataset.
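For orientation, the two AWS-side documents involved in this kind of setup can be sketched in Python as plain dictionaries. This is an illustration under assumptions, not the Splunk-documented procedure: the queue ARN, bucket name, account ID, prefix, and the choice of event types are hypothetical placeholders, and the linked topic remains the authoritative source for the values your deployment needs. The dictionary shapes follow the standard SQS access policy and S3 bucket notification configuration formats.

```python
import json


def build_queue_policy(queue_arn: str, bucket_name: str, account_id: str) -> dict:
    """SQS access policy that lets one S3 bucket send messages to the queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {
                    "ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{bucket_name}"},
                    "StringEquals": {"aws:SourceAccount": account_id},
                },
            }
        ],
    }


def build_bucket_notification(queue_arn: str, prefix: str) -> dict:
    """S3 notification configuration that reports object changes under a prefix."""
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                # Assumed event types; confirm the required ones in the Splunk topic.
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
                },
            }
        ]
    }


# Hypothetical placeholder values for illustration only.
queue_arn = "arn:aws:sqs:us-east-1:123456789012:example-catalog-sync"
policy = build_queue_policy(queue_arn, "example-dataset-bucket", "123456789012")
notification = build_bucket_notification(queue_arn, "logs/")

print(json.dumps(policy, indent=2))
print(json.dumps(notification, indent=2))
```

The ARN passed to both functions is the same value that you paste into the SQS queue ARN field.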
Indicate how you want to define the schema for your data catalog.
- Select Define schema manually if you want to manually determine the columns in your dataset. You can use a Field list view or JSON view. If you select JSON view, your input must match the JSON data schema (dataSchema). For more information, see JSON standards for the data and partition schemas.
- Select Discover schema via crawler if you want a crawler to scan a sample set of files from your dataset to infer the overall dataset schema. Use Number of files to scan to tell the crawler how many files it should scan.
  Note: The crawler can be applied only to data that is in Parquet, CSV, or JSON format. All files sampled by the crawler must share the same schema. Inconsistent schemas across sampled files might result in a data catalog with incorrectly inferred fields in its schema.
  
  The crawler begins its scan after you go to the Review step and select Create. You can review the crawler-inferred schema on the Edit page for the dataset, after the crawler process has completed.
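Because the crawler expects all sampled files to share one schema, it can help to spot-check a few files before you create the dataset. The following Python sketch is a hypothetical pre-check that you run yourself, not a description of how the crawler works: it compares only the header rows of CSV samples, and the file names and contents are invented for illustration.

```python
import csv
import io


def csv_header(text: str) -> tuple:
    """Return the header row of a CSV document as a tuple of column names."""
    reader = csv.reader(io.StringIO(text))
    return tuple(next(reader, []))


def find_schema_mismatches(samples: dict) -> dict:
    """Compare every sample's header with the first sample's header.

    Returns a mapping of file name to header for each file that differs.
    """
    names = list(samples)
    if not names:
        return {}
    reference = csv_header(samples[names[0]])
    return {
        name: csv_header(samples[name])
        for name in names[1:]
        if csv_header(samples[name]) != reference
    }


# Hypothetical sample contents standing in for files read from the bucket.
samples = {
    "a.csv": "host,status\nweb01,200\n",
    "b.csv": "host,status\nweb02,404\n",
    "c.csv": "host,status_code\nweb03,500\n",
}

mismatches = find_schema_mismatches(samples)
print(mismatches)  # {'c.csv': ('host', 'status_code')}
```

This check looks at column names and their order only. It does not detect files whose columns have the same names but different data types.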
(Optional) Select Define the time field if your dataset contains time-series data and you intend to use time-based filtering or SPL2 time functions when you run federated searches over it.

If you select Define the time field, fill out the Time settings: Time field, Time format, and Unix time field.

For more information, see Identify the time field in an Amazon S3 dataset.
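Before you fill out the time settings, it can be useful to confirm how timestamps actually appear in your data. The following Python sketch contrasts a formatted timestamp string with a Unix epoch value. It assumes strptime-style format variables such as %Y and %H, which is an assumption on my part and not something this topic confirms. The function names and sample values are hypothetical, and the linked topic defines what each setting accepts.

```python
from datetime import datetime, timezone


def parse_formatted(value: str, time_format: str) -> datetime:
    """Parse a timestamp string with a strptime-style format (assumed syntax)."""
    return datetime.strptime(value, time_format)


def parse_unix(value) -> datetime:
    """Interpret a value as seconds since 1970-01-01 00:00:00 UTC."""
    return datetime.fromtimestamp(float(value), tz=timezone.utc)


# Hypothetical sample values taken from a time column in the dataset.
formatted = parse_formatted("2024-03-15T08:30:00", "%Y-%m-%dT%H:%M:%S")
epoch = parse_unix(1710491400)

print(formatted.isoformat())  # 2024-03-15T08:30:00
print(epoch.isoformat())      # 2024-03-15T08:30:00+00:00
```

If parse_formatted raises ValueError for your sample values, the format string does not match the data and needs to be adjusted before you enter it.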
Indicate whether your dataset is partitioned, and if so, whether its partitions follow Hive formatting. Answer the question Are your partitions Hive-compatible? with one of the following options:
- Yes: Decide whether you want to Define partitions manually or let Splunk software Discover partitions via crawler.
  
  If you select Define partitions manually, you can use a Field list view or a JSON view. If you select JSON view, your input must match the partition schema (dataPartition.PartitionSchema). For more information, see JSON standards for the data and partition schemas.
  
  If you select Discover partitions via crawler, the crawler begins its scan after you select Create on the Review step. The crawler process might take a few minutes to complete.
  
  Note: If you have time partitions, you must ensure they are properly identified and defined. This is true whether you have selected Define partitions manually or Discover partitions via crawler. For more information and instructions, see Identify time partitions.
- No: Define partitions manually using a Field list view or a JSON view. If you have time partitions, ensure they are defined. See Identify time partitions.
- I don't have partitions: Select Next.
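To decide how to answer the Hive-compatibility question, you can check whether your object keys use the Hive convention of name=value path segments. The following is a minimal Python sketch under that general convention. It does not reproduce the exact rules that Splunk software or its crawler applies, and the function names and sample keys are hypothetical.

```python
def parse_hive_partitions(key: str) -> dict:
    """Extract Hive-style partitions (name=value path segments) from an object key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # skip the final segment (the file name)
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


def is_hive_compatible(keys: list) -> bool:
    """True when every key carries the same non-empty, ordered partition names."""
    name_lists = [list(parse_hive_partitions(key)) for key in keys]
    if not name_lists or not name_lists[0]:
        return False
    return all(names == name_lists[0] for names in name_lists)


# Hypothetical object keys for illustration.
hive_keys = [
    "logs/year=2024/month=03/day=15/part-0000.parquet",
    "logs/year=2024/month=03/day=16/part-0000.parquet",
]
plain_keys = ["logs/2024/03/15/part-0000.parquet"]

print(parse_hive_partitions(hive_keys[0]))  # {'year': '2024', 'month': '03', 'day': '15'}
print(is_hive_compatible(hive_keys))        # True
print(is_hive_compatible(plain_keys))       # False
```

Keys like those in plain_keys are partitioned by path position rather than by name, which corresponds to the No option, where you define the partitions manually.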
Select Next to move on to the Update policies step.

Go to Apply the dataset resource access policy to an AWS IAM role.
