Define an Amazon S3 dataset

Define the preliminary settings for an Amazon S3 dataset, including its name, connection, location, and dataset type.

After you define an Amazon S3 connection, you define Amazon S3 datasets for use in federated searches.

Each dataset you define is backed by a data catalog. The data catalog enables efficient federated searches of the Amazon S3 dataset that it represents. Depending on the type of dataset you create, your dataset can be backed by a catalog that you own and operate, such as an AWS Glue or Apache Iceberg REST catalog, or by a Splunk catalog that Splunk software maintains.

This task guides you through preliminary definition steps for an Amazon S3 dataset, which include deciding whether the dataset will facilitate data routing and federated search, or just federated search.

  • Your Splunk Cloud Platform deployment must be on version 10.4.2604 or higher.
  • Your user account on the Splunk Cloud Platform deployment must have a role with the edit_datasets and edit_federated_providers capabilities. See Define roles on the Splunk platform with capabilities in the Splunk Cloud Platform Manage Users and Security manual.
  • You must have an Amazon Web Services (AWS) account and an AWS IAM role with permissions that let you attach and modify custom trust policies and permissions policies for IAM roles. Contact your AWS administrator for assistance with AWS permissions. See IAM role creation in the AWS Identity and Access Management User Guide.
  1. On your Splunk Cloud Platform deployment, in Splunk Web, open the Data Management app.
  2. Select Datasets > Create dataset to enter the Create dataset workflow.
  3. On Get started, select Amazon S3. Then select Next.
  4. On Configure connection, determine whether you want to use an existing Amazon S3 connection or create a new Amazon S3 connection.
    • If a suitable Amazon S3 connection exists that you want to associate your dataset with, select Associated connection, and choose an existing Amazon S3 connection from the drop-down list. If the connection you select is ready to be used and its details are correct, select Next.
    • If no Amazon S3 connection exists that you want to associate your dataset with, select Create connection to define a new connection for your dataset. See Create an Amazon S3 connection. When you have successfully created a new Amazon S3 connection, select Next.
  5. On Define dataset, provide a Dataset name.
    The dataset name can contain only alphanumeric characters, underscores, and hyphens.
  6. (Optional) Provide a Dataset description.
  7. Provide an Amazon S3 location. This is a file path within the Amazon S3 general purpose bucket that contains the dataset you want to search.
    Note: Amazon S3 paths that terminate in a file object are not valid for Splunk federated providers. Use only locations that end in an Amazon S3 folder. For example, this is a valid location path: s3://bucket1/path1/my_csv_data/

    Amazon S3 locations can contain only alphanumeric characters and the following special characters: /!=_.*'():

  8. Decide whether the dataset you are creating uses Data Routing + Federated Search or just Federated Search.
    Data Routing and Federated Search dataset
    • Dataset usage: Collects data sent to Amazon S3 from your Splunk Cloud Platform deployment using Edge Processor or Ingest Processor. You can optionally run federated searches over this dataset.
    • Connection requirements: Must support one or both of the following abilities: "Send data from Edge Processor" and "Send data from Ingest Processor". Can optionally support the "Run federated search" ability.
    • Data catalog management: Splunk software manages the data catalog for this dataset.

    Federated Search dataset
    • Dataset usage: A dataset you maintain in Amazon S3. You run federated searches over this dataset.
    • Connection requirements: Must support the "Run federated search" ability.
    • Data catalog management: You can apply your own Apache Iceberg REST or AWS Glue data catalog to this dataset, or arrange for it to be referenced by a Splunk-native catalog. With the Splunk-native catalog, you can manually define the data schema and partition fields, or you can let Splunk software run a crawler to automatically infer them.
  9. Select Next to move on to the Configure dataset step.
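
The naming and location rules in steps 5 and 7, and the connection requirements in step 8, can be checked before you enter values in the workflow. The following Python sketch is illustrative only and is not part of any Splunk API. The function names are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, a trailing slash is assumed to mark an Amazon S3 folder, and the s3:// prefix is assumed from the example path in step 7. The allowed-character list is taken literally from step 7.

```python
from __future__ import annotations

import re

# Step 5: alphanumeric characters, underscores, and hyphens only.
# Assumption: "alphanumeric" means ASCII letters and digits.
_NAME_RE = re.compile(r"[A-Za-z0-9_-]+")

# Step 7: alphanumeric characters plus / ! = _ . * ' ( ) :
# (the character list is copied literally from the step text).
_LOCATION_RE = re.compile(r"[A-Za-z0-9/!=_.*'():]+")

# Ability labels as quoted in step 8.
_ROUTING_ABILITIES = {
    "Send data from Edge Processor",
    "Send data from Ingest Processor",
}
_SEARCH_ABILITY = "Run federated search"


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name uses only the characters allowed in step 5."""
    return _NAME_RE.fullmatch(name) is not None


def is_valid_s3_location(location: str) -> bool:
    """Return True if the location looks like a folder path per step 7.

    Assumptions: the path starts with s3://, names a bucket, and a
    trailing slash indicates a folder rather than a file object.
    """
    prefix = "s3://"
    if not location.startswith(prefix):
        return False
    bucket = location[len(prefix):].split("/", 1)[0]
    if not bucket:
        return False
    if not location.endswith("/"):
        return False
    return _LOCATION_RE.fullmatch(location) is not None


def eligible_dataset_types(abilities: set[str]) -> list[str]:
    """List the dataset types a connection with these abilities can back."""
    types = []
    if abilities & _ROUTING_ABILITIES:
        types.append("Data Routing and Federated Search")
    if _SEARCH_ABILITY in abilities:
        types.append("Federated Search")
    return types


if __name__ == "__main__":
    print(is_valid_dataset_name("my_dataset-01"))
    print(is_valid_s3_location("s3://bucket1/path1/my_csv_data/"))
    print(eligible_dataset_types({"Run federated search"}))
```

A check like this only catches formatting problems early. The Create dataset workflow remains the authority on whether a name, location, or connection is accepted.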

Proceed to the next step. Which step applies depends on the dataset type you selected and, if you are creating a dataset that supports only federated search, on the type of catalog you are using: