Create an Amazon S3 dataset for Ingest Processor pipelines

Create an Amazon S3 dataset in the Data Management app for your Ingest Processor pipelines to send data to a specific bucket.

To send data from Ingest Processor to an Amazon S3 bucket, create an Amazon S3 dataset in the Data Management app on Splunk Cloud Platform. You can then use the dataset as a pipeline destination.

You can optionally configure the dataset to also support federated searches, so that you can use the same dataset to write data to and read data from Amazon S3.

The dataset uses an Amazon S3 connection for authentication. You can create multiple datasets that use the same connection.

Your Splunk Cloud Platform deployment must be on version 10.4.2604 or higher.
Note: If your Splunk Cloud Platform deployment does not meet this requirement, see Create a legacy Amazon S3 destination for Ingest Processor.
Your user account on the Splunk Cloud Platform deployment must have the edit_datasets and admin_all_objects capabilities. For more information, see the following pages:
- Manage users for the Ingest Processor solution
- Define roles on the Splunk platform with capabilities in the Splunk Cloud Platform Manage Users and Security manual
You have an Amazon Web Services (AWS) account and an AWS IAM role with permissions that let you attach and modify custom trust policies and permissions policies for IAM roles. Contact your AWS administrator for assistance with AWS permissions.
The Amazon S3 bucket that you want to send data to does not have Object Lock turned on.

Note: Object Lock cannot be turned off after it is turned on, so you might need to create a new bucket. For more information, see https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock-configure.html in the Amazon Simple Storage Service (S3) User Guide.
You must have an Amazon S3 connection that authenticates to the Amazon S3 bucket that you want the dataset to represent. For more information, see Create an Amazon S3 connection for Ingest Processor pipelines.
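
One way to confirm the Object Lock prerequisite before you start is to query the bucket's Object Lock configuration. The following is a minimal sketch, not part of the Splunk procedure. It assumes you pass in a boto3 S3 client, which provides `get_object_lock_configuration`, and the bucket name in the usage comment is hypothetical.

```python
def bucket_has_object_lock(s3_client, bucket_name):
    """Return True if Object Lock is turned on for the bucket, False otherwise.

    s3_client is assumed to be a boto3 S3 client, or any object with a
    compatible get_object_lock_configuration method.
    """
    try:
        response = s3_client.get_object_lock_configuration(Bucket=bucket_name)
    except Exception as exc:
        # botocore's ClientError exposes the AWS error code in exc.response.
        # S3 returns this code when the bucket has no Object Lock configuration.
        error_response = getattr(exc, "response", None)
        if isinstance(error_response, dict):
            code = error_response.get("Error", {}).get("Code")
            if code == "ObjectLockConfigurationNotFoundError":
                return False
        raise
    config = response.get("ObjectLockConfiguration", {})
    return config.get("ObjectLockEnabled") == "Enabled"


# Example usage (assumes boto3 is installed and AWS credentials are configured;
# "my-ingest-bucket" is a hypothetical bucket name):
#
#   import boto3
#   print(bucket_has_object_lock(boto3.client("s3"), "my-ingest-bucket"))
```

If the function returns True, the bucket does not meet the prerequisite and you might need to create a new bucket.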

In Splunk Cloud Platform, select Data Management from the Apps panel.
Navigate to the Datasets page, and then select Create dataset.
On the Select data store page, select Amazon S3, then select Next.
On the Configure connection page, do one of the following:
- If you have already created the necessary Amazon S3 connection, select it from the Associated connection drop-down list and then select Next.
- If you have not created the connection yet, select Create connection. You are prompted to navigate away from the current screen to create the connection. See Create an Amazon S3 connection for Ingest Processor pipelines for more information.

On the Define dataset page, configure the following options, and then select Next:


Option name	Configuration instructions
Dataset name	Enter a unique name for your dataset.
Dataset description	(Optional) Enter a description for your dataset.
Amazon S3 location	Enter the path location for the Amazon S3 bucket you want to send data to.
Usage	Select Data routing and federated search. Note: If you set Usage to Federated search, then the dataset cannot be used as a pipeline destination and can only be used in federated searches.

On the Configure dataset page, configure the following Data routing options:


Option name	Configuration instructions
Output schema	Select Pipeline output. Note: Avoid selecting Splunk HTTP Event Collector (HEC), especially if you intend to run federated searches on this dataset. Schema inference does not always work as expected on events that use the HEC schema.
Output format	Select the file format that you want to use to store your data in the Amazon S3 bucket. If you select Parquet, be aware that the following limitations apply: For best results, you must process and route your data using a pipeline that's created from a template instead of using a custom-configured pipeline. Pipeline templates can ensure that the schema of the resulting Parquet output is compatible with federated searches. If you use a custom-configured pipeline that changes the schema of the events, the dataset will format the event according to the HEC event schema in order to produce Parquet output that is at least partially compatible with federated searches. The resulting events contain the top-level fields described in Event metadata in the Splunk Cloud Platform Get Data In manual.
Compression type	Select the compression format for your data.
File name prefix	(Optional) Enter a prefix for the name of the file that contains your output data in S3. By default, file names are 32-digit UUIDs (universally unique identifiers) that are autogenerated by the system. You can make the file name more human-readable by adding a prefix. For example, if the autogenerated UUID is `2f0ff66a-e87a-4af5-befb-18dcafa6012f`, then entering `financial-report` in the File name prefix field changes the resulting file name to `financial-report-2f0ff66a-e87a-4af5-befb-18dcafa6012f`.
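
The effect of the File name prefix option can be illustrated with a short sketch. This only mirrors the example in the table above. The function name is hypothetical, and the real file names are generated by the system, not by this code.

```python
import uuid


def output_file_name(prefix=None, file_id=None):
    """Build an illustrative file name from an optional prefix and a UUID.

    file_id stands in for the autogenerated UUID; a random one is
    created when none is supplied.
    """
    file_id = file_id or str(uuid.uuid4())
    return f"{prefix}-{file_id}" if prefix else file_id


print(output_file_name("financial-report", "2f0ff66a-e87a-4af5-befb-18dcafa6012f"))
# financial-report-2f0ff66a-e87a-4af5-befb-18dcafa6012f
```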

(Optional) To adjust the maximum number of events that this dataset can send in each batch of output data, expand Advanced settings and enter your desired maximum number of events in the Batch size field.

Note: In most cases, the default Batch size value is sufficient. The actual size of each batch can vary depending on the rate at which the Ingest Processor is sending out data.

If the connection that you're using with this dataset is configured to support the Federated Search ability, then you can configure the following Federated search options:


Option name	Configuration instructions
Activate	Toggle this switch to either allow or disallow federated searches for this dataset.
Do you want to keep catalog in sync with dataset?	(Optional) If federated searches are allowed for this dataset, you can arrange for the Splunk-managed data catalog to be updated automatically whenever data is added to or removed from the dataset. To configure automated catalog updates, you first must create an SQS queue for the Amazon S3 bucket and set up event notifications for that bucket. Then, set this option to Yes and enter the Amazon Resource Name (ARN) of the SQS queue in the SQS queue ARN field. For detailed instructions, see Set up automated updates for Splunk-native data catalogs in AWS in the Splunk Cloud Platform Federated Search manual. If you don't want to configure automated catalog updates, then set this option to No.
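
Before you paste a value into the SQS queue ARN field, it can help to check that it has the general shape of an SQS ARN (`arn:partition:sqs:region:account-id:queue-name`). The following is a rough sketch of such a check. The function name is hypothetical, and the check does not confirm that the queue exists or that its event notifications are set up.

```python
def parse_sqs_queue_arn(arn):
    """Split an SQS queue ARN into its parts, raising ValueError if malformed."""
    parts = arn.strip().split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}")
    prefix, partition, service, region, account_id, queue_name = parts
    if prefix != "arn" or not partition:
        raise ValueError("ARN must start with 'arn:' followed by a partition")
    if service != "sqs":
        raise ValueError(f"expected service 'sqs', got '{service}'")
    if not region or not queue_name:
        raise ValueError("region and queue name must not be empty")
    # AWS account IDs are 12-digit numbers.
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account ID must be 12 digits")
    return {
        "partition": partition,
        "region": region,
        "account_id": account_id,
        "queue_name": queue_name,
    }


# The ARN below is a made-up example value.
print(parse_sqs_queue_arn("arn:aws:sqs:us-east-2:444455556666:queue1")["queue_name"])
# queue1
```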

Select Next.
On the Update policies page, complete the following steps:
1. Depending on whether your Amazon S3 bucket is encrypted using the AWS Key Management Service (KMS), do one of the following:
  - If your bucket is KMS-encrypted, select Yes. Then, in the AWS KMS key ARNs field, enter the ARN of the KMS key used to encrypt the bucket.
  - If your bucket is not KMS-encrypted, select No.
2. Select Generate policies.
3. Select Copy to copy the generated resource access policies.

On another browser tab, navigate to the AWS Management Console. Depending on which authentication method the connection uses, do one of the following:

Option	Description
The connection uses Access key authentication.	Create an IAM role, and add the resource access policies to it. Then, configure the IAM user associated with the access keys to use this role. For more information, see the following AWS documentation: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html; https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage.html; and https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_permissions-to-switch.html, all in the AWS Identity and Access Management User Guide.
The connection uses IAM role authentication.	Add the resource access policies to the IAM role that you're using to authenticate the connection. For more information, see https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage.html in the AWS Identity and Access Management User Guide.
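
If you prefer to script the policy update instead of using the AWS Management Console, the step of adding a permissions policy to an IAM role might look like the following sketch. It assumes you have saved a copied permissions policy as JSON text and that you pass in a boto3 IAM client, which provides `put_role_policy`. The function name, role name, policy name, and file name are hypothetical, and the sketch attaches the policy as an inline policy, which is only one of the ways to add a policy to a role.

```python
import json


def attach_inline_policy(iam_client, role_name, policy_name, policy_json):
    """Attach a permissions policy to an IAM role as an inline policy.

    iam_client is assumed to be a boto3 IAM client, or any object with a
    compatible put_role_policy method.
    """
    # Parse first so that malformed JSON fails locally, before any AWS call.
    policy_document = json.loads(policy_json)
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy_document),
    )


# Example usage (assumes boto3 is installed and AWS credentials are configured;
# all names below are hypothetical):
#
#   import boto3
#   with open("generated-policy.json") as f:
#       attach_inline_policy(
#           boto3.client("iam"),
#           role_name="ingest-processor-s3-role",
#           policy_name="ingest-processor-s3-access",
#           policy_json=f.read(),
#       )
```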

Return to the browser tab where you are creating the dataset, and select Next on the Update policies page. If you are prompted to confirm whether you have updated your AWS policies, select Confirm.
On the Review page, ensure that all the entered information is correct, and then select Create Dataset to create your dataset.

You now have an Amazon S3 dataset that can access the data in your Amazon S3 bucket.

To send data from Ingest Processor to your Amazon S3 bucket, create a pipeline that uses the Amazon S3 dataset as a destination. Then, apply the pipeline to Ingest Processor. For more information, see the following pages:

For information about running federated searches on Amazon S3 datasets, see Run federated searches over Amazon S3 datasets in the Splunk Cloud Platform Federated Search manual.
