Create an Amazon S3 dataset for federated search that is backed by a Splunk-native data catalog

Set up a federated search dataset with a Splunk-managed data catalog that Splunk software creates and maintains for you.

Your Splunk Cloud Platform deployment must be on version 10.4.2604 or higher.
Your user account on the Splunk Cloud Platform deployment must have a role with the edit_datasets and edit_federated_providers capabilities. See Define roles on the Splunk platform with capabilities in the Splunk Cloud Platform Manage Users and Security manual.
You must have completed the Select data store, Configure connection, and Define dataset steps of the Create dataset workflow. See Define an Amazon S3 dataset.
If you plan to define dataset or partition schemas by hand using the JSON view, review the JSON standards for the data and partition schemas.

On your Splunk Cloud Platform deployment, in the Data Management app, at the Configure dataset step of the Create dataset workflow, select I don't have a catalog.
Identify whether your data is stored in one of the following non-table formats: Parquet, CSV, or JSON. If your data is in JSON or CSV format, indicate whether the data is compressed with Gzip, or is Uncompressed.
(Optional) If this dataset is updated on an ongoing basis and you want your Splunk-native data catalog to stay consistent with it automatically, answer Yes to Do you want to keep catalog in sync with dataset?

Set up an SQS queue and event notification for the Amazon S3 bucket that contains the dataset. Paste the ARN that you obtain when you set up the SQS queue into the SQS queue ARN field. For detailed instructions, see Set up automated updates for Splunk-native data catalogs in AWS.

Note: You can skip this step if your dataset is composed of historical data that is not subject to future updates, or if you are not interested in keeping your data catalog in sync with changes to your dataset.
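Before you paste a value into the SQS queue ARN field, it can help to confirm that it is an ARN and not a queue URL. The following Python sketch checks a string against the general AWS ARN layout for SQS (arn:partition:sqs:region:account-id:queue-name). The function name parse_sqs_arn and the sample ARN are hypothetical, and the check covers standard queue names only. It is not part of Splunk software and does not confirm that the queue exists or is configured correctly.

```python
import re

# General ARN layout for an SQS queue:
#   arn:<partition>:sqs:<region>:<12-digit account ID>:<queue name>
# Assumption: standard queue names (letters, digits, hyphens, underscores).
SQS_ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>aws|aws-us-gov|aws-cn):sqs:"
    r"(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):"
    r"(?P<queue>[A-Za-z0-9_-]{1,80})$"
)


def parse_sqs_arn(arn: str) -> dict:
    """Split an SQS queue ARN into its parts, or raise ValueError."""
    match = SQS_ARN_PATTERN.match(arn.strip())
    if match is None:
        raise ValueError(f"Not a recognizable SQS queue ARN: {arn!r}")
    return {
        "region": match.group("region"),
        "account": match.group("account"),
        "queue": match.group("queue"),
    }


if __name__ == "__main__":
    # Hypothetical ARN used only for illustration.
    print(parse_sqs_arn("arn:aws:sqs:us-east-1:123456789012:catalog-updates"))
```

A queue URL such as https://sqs.us-east-1.amazonaws.com/123456789012/catalog-updates fails this check, which is the most common mix-up this sketch is meant to catch.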
Indicate how you want to define the schema for your data catalog.
- Select Define schema manually if you want to manually determine the columns in your dataset. You can use a Field list view or JSON view. If you select JSON view, your input must match the JSON data schema (dataSchema). For more information, see JSON standards for the data and partition schemas.
- Select Discover schema via crawler if you want a crawler to scan a sample set of files from your dataset to infer the overall dataset schema. Use Number of files to scan to tell the crawler how many files it should scan. The crawler begins its scan after you complete the Update policies step.
  Note: The crawler can be applied only to data that is in Parquet, CSV, or JSON format. All files sampled by the crawler must share the same schema. Inconsistent schemas across sampled files might result in a data catalog with incorrectly inferred fields.
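Because the crawler expects every sampled file to share one schema, you might want to spot-check a few files yourself before you rely on it. The following Python sketch compares field names that you have already collected from sample files (for example, CSV headers or top-level JSON keys) and reports the files that differ from the majority. The function name find_schema_outliers and the sample file names are hypothetical. This is an independent pre-check, not a description of how the Splunk crawler infers schemas.

```python
from collections import Counter


def find_schema_outliers(file_fields: dict[str, list[str]]) -> list[str]:
    """Return the names of files whose field set differs from the most common one.

    file_fields maps a file name to the list of field names found in that file.
    Field order is ignored; only the set of names is compared.
    """
    if not file_fields:
        return []
    # Reduce each file's fields to an order-independent signature.
    signatures = {name: tuple(sorted(fields)) for name, fields in file_fields.items()}
    # Treat the signature shared by the most files as the expected schema.
    expected, _count = Counter(signatures.values()).most_common(1)[0]
    return sorted(name for name, sig in signatures.items() if sig != expected)


if __name__ == "__main__":
    # Hypothetical sample: two files agree, one is missing a field.
    sample = {
        "part-0001.csv": ["timestamp", "host", "bytes"],
        "part-0002.csv": ["host", "timestamp", "bytes"],
        "part-0003.csv": ["timestamp", "host"],
    }
    print(find_schema_outliers(sample))
```

If this kind of check reports outliers, consider defining the schema manually instead of relying on the crawler's sample.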
(Optional) Select Define the time field if your dataset contains time-series data and you intend to use time-based filtering or SPL2 time functions when you run federated searches over it.
If you select Define the time field, fill out the Time settings: Time field, Time format, and Unix time field.
For more information, see Identify the dataset time field.
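Before you enter a Time format, you can test a candidate format string against a few sample values from your data. The following Python sketch assumes strptime-style format variables such as %Y and %H. This topic does not state which syntax the Time format field accepts, so confirm it in Identify the dataset time field. The function names and sample values are hypothetical.

```python
from datetime import datetime, timezone


def parses_with_format(value: str, time_format: str) -> bool:
    """Return True if value matches the strptime-style time_format exactly."""
    try:
        datetime.strptime(value, time_format)
    except ValueError:
        return False
    return True


def unix_to_utc(seconds: float) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    candidate_format = "%Y-%m-%dT%H:%M:%S"
    for sample_value in ("2024-05-01T12:30:00", "05/01/2024 12:30"):
        print(sample_value, parses_with_format(sample_value, candidate_format))
    # If the field holds Unix epoch seconds, check that values convert to plausible dates.
    print(unix_to_utc(1714566600))
```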
If your dataset is partitioned into data subsets, indicate what those partitions are. Answer the question Are your partitions Hive-compatible?
- Yes: Define your partitions manually or let Splunk software identify the partitions for you using a crawler. If you choose to define your partitions manually, you can use a Field list view or a JSON view. If you select JSON view, your input must match the partition schema (dataPartition.PartitionSchema). For more information, see JSON standards for the data and partition schemas.
  Note: When you choose to let Splunk software identify the partitions for you, the crawler cannot determine which of your partitions are time partitions. If your partitions include time partitions, define them manually in the next step.
- No: Define partitions manually using a Field list view or a JSON view.
- I don't have partitions: Select Next to go to the dataset Review page.
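If you are not sure whether your partitions are Hive-compatible, look at the object keys in your bucket. Hive-style partitioning encodes each partition as a key=value directory in the path, for example year=2024/month=05/. The following Python sketch extracts those pairs from an object key, so an empty result suggests the layout is not Hive-style. The function name parse_hive_partitions and the sample keys are hypothetical, and the sketch does not reflect how Splunk software detects partitions.

```python
def parse_hive_partitions(object_key: str) -> dict[str, str]:
    """Return the key=value partition pairs found in an object key's directories."""
    partitions: dict[str, str] = {}
    # Every path segment except the last one (the file name) is a directory.
    for segment in object_key.strip("/").split("/")[:-1]:
        key, separator, value = segment.partition("=")
        # Keep only well-formed key=value directories.
        if separator and key and value:
            partitions[key] = value
    return partitions


if __name__ == "__main__":
    # Hive-style layout: partitions are named in the path.
    print(parse_hive_partitions("logs/year=2024/month=05/day=01/part-0000.parquet"))
    # Not Hive-style: the path carries values only, so nothing is extracted.
    print(parse_hive_partitions("logs/2024/05/01/part-0000.parquet"))
```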
(Optional) If you have time-based partitions, identify them under Time partition settings. See Identify time partitions.
Select Next to move on to the Update policies step.

Go to Apply the dataset resource access policy to an AWS IAM role.
