Create an Amazon S3 dataset for Ingest Processor pipelines

Create an Amazon S3 dataset in the Data Management app for your Ingest Processor pipelines to send data to a specific bucket.

To send data from Ingest Processor to an Amazon S3 bucket, create an Amazon S3 dataset in the Data Management app on Splunk Cloud Platform. You can then use the dataset as a pipeline destination.

You can optionally configure the dataset to also support federated searches, so that you can use the same dataset to write data to and read data from Amazon S3.

The dataset uses an Amazon S3 connection for authentication. You can create multiple datasets that use the same connection.

  1. In Splunk Cloud Platform, select Data Management from the Apps panel.
  2. Navigate to the Datasets page, and then select Create dataset.
  3. On the Select data store page, select Amazon S3, then select Next.
  4. On the Configure connection page, do one of the following:
    • If you have already created the necessary Amazon S3 connection, select it from the Associated connection drop-down list and then select Next.
    • If you have not created the connection yet, select Create connection. You are prompted to navigate away from the current screen to create the connection. See Create an Amazon S3 connection for Ingest Processor pipelines for more information.
  5. On the Define dataset page, configure the following options, and then select Next:
    Option name | Configuration instructions
    Dataset name | Enter a unique name for your dataset.
    Dataset description | (Optional) Enter a description for your dataset.
    Amazon S3 location | Enter the path location for the Amazon S3 bucket you want to send data to.
    Usage | Select Data routing and federated search.
      Note: If you set Usage to Federated search, then the dataset cannot be used as a pipeline destination and can only be used in federated searches.
  6. On the Configure dataset page, configure the following Data routing options:
    Option name | Configuration instructions
    Output schema | Select Pipeline output.
      Note: Avoid selecting Splunk HTTP Event Collector (HEC), especially if you intend to run federated searches on this dataset. Schema inference does not always work as expected on events that use the HEC schema.
    Output format | Select the file format that you want to use to store your data in the Amazon S3 bucket.
      If you select Parquet, be aware that the following limitations apply:
      • For best results, process and route your data using a pipeline that's created from a template instead of a custom-configured pipeline. Pipeline templates can ensure that the schema of the resulting Parquet output is compatible with federated searches.
      • If you use a custom-configured pipeline that changes the schema of the events, the dataset formats the events according to the HEC event schema in order to produce Parquet output that is at least partially compatible with federated searches. The resulting events contain the top-level fields described in Event metadata in the Splunk Cloud Platform Get Data In manual.
    Compression type | Select the compression format for your data.
    File name prefix | (Optional) Enter a prefix for the name of the file that contains your output data in S3.
      By default, file names are autogenerated UUIDs (universally unique identifiers) that contain 32 hexadecimal digits. You can make the file name more human-readable by adding a prefix. For example, if the autogenerated UUID is 2f0ff66a-e87a-4af5-befb-18dcafa6012f, then entering financial-report in the File name prefix field changes the resulting file name to financial-report-2f0ff66a-e87a-4af5-befb-18dcafa6012f.

  7. (Optional) To adjust the maximum number of events that this dataset can send in each batch of output data, expand Advanced settings and enter your desired maximum number of events in the Batch size field.
    Note: In most cases, the default Batch size value is sufficient. The actual size of each batch can vary depending on the rate at which the Ingest Processor is sending out data.
  8. If the connection that you're using with this dataset is configured to support the Federated Search ability, then you can configure the following Federated search options:
    Option name | Configuration instructions
    Activate | Toggle this switch to either allow or disallow federated searches for this dataset.
    Do you want to keep catalog in sync with dataset? | (Optional) If federated searches are allowed for this dataset, you can arrange for the Splunk-managed data catalog to be updated automatically whenever data is added to or removed from the dataset.
      • To configure automated catalog updates, you first must create an SQS queue for the Amazon S3 bucket and set up event notifications for that bucket. Then, set this option to Yes and enter the Amazon Resource Name (ARN) of the SQS queue in the SQS queue ARN field. For detailed instructions, see Set up automated updates for Splunk-native data catalogs in AWS in the Splunk Cloud Platform Federated Search manual.
      • If you don't want to configure automated catalog updates, then set this option to No.

  9. Select Next.
  10. On the Update policies page, complete the following steps:
    1. Depending on whether your Amazon S3 bucket is encrypted using the AWS Key Management Service (KMS), do one of the following:
      • If your bucket is KMS-encrypted, select Yes. Then, in the AWS KMS key ARNs field, enter the ARN of the KMS key used to encrypt the bucket.

      • If your bucket is not KMS-encrypted, select No.

    2. Select Generate policies.
    3. Select Copy to copy the generated resource access policies.
  11. In another browser tab, navigate to the AWS Management Console. Depending on which authentication method the connection uses, do one of the following:
    Option | Description
    The connection uses Access key authentication. | Create an IAM role, and add the resource access policies to it. Then, configure the IAM user associated with the access keys to use this role.
      For more information, see the following AWS documentation:

    The connection uses IAM role authentication. | Add the resource access policies to the IAM role that you're using to authenticate the connection.
      For more information, see https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage.html in the AWS Identity and Access Management User Guide.

  12. Return to the browser tab where you are creating the dataset, and select Next on the Update policies page. If you are prompted to confirm whether you have updated your AWS policies, select Confirm.
  13. On the Review page, ensure that all the entered information is correct, and then select Create Dataset to create your dataset.
You now have an Amazon S3 dataset that can access the data in your Amazon S3 bucket.
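
If you prefer to script the policy update from step 11 instead of using the AWS Management Console, the following Python sketch shows one possible approach for a connection that uses IAM role authentication. It assumes that the text you copied on the Update policies page is a single JSON policy document and that you attach it as an inline policy. The function name, role name, and policy name are hypothetical. The only AWS call used is the IAM put_role_policy operation, which is invoked on a client object that you supply.

```python
import json


def attach_generated_policy(iam_client, role_name, policy_name, policy_json):
    """Attach a copied policy document to an IAM role as an inline policy.

    Assumption: policy_json is the text copied from the Update policies
    page and is a single JSON policy document.
    """
    # Fail early with ValueError if the pasted text is not valid JSON.
    json.loads(policy_json)

    # put_role_policy adds or updates an inline policy on the named role.
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=policy_json,
    )
```

With the AWS SDK for Python installed, you could pass boto3.client("iam") as the iam_client argument, together with the name of the role that authenticates the connection. Review the copied policies before attaching them, because the sketch only checks that the text parses as JSON.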

To send data from Ingest Processor to your Amazon S3 bucket, create a pipeline that uses the Amazon S3 dataset as a destination. Then, apply the pipeline to Ingest Processor. For more information, see the following pages:

For information about running federated searches on Amazon S3 datasets, see Run federated searches over Amazon S3 datasets in the Splunk Cloud Platform Federated Search manual.
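
To predict the object names that appear in your bucket, the File name prefix behavior described in step 6 can be approximated with the following Python sketch. It only illustrates the naming pattern from the financial-report example. The function name is hypothetical, and the sketch assumes that the prefix and the UUID are joined with a single hyphen and that no file extension is added.

```python
import uuid


def output_file_name(prefix=None, file_id=None):
    """Return an illustrative output file name.

    Assumption: the name is "<prefix>-<uuid>" when a prefix is set,
    and the bare UUID otherwise.
    """
    # Generate a random UUID when the caller does not supply one.
    file_id = file_id or str(uuid.uuid4())
    return f"{prefix}-{file_id}" if prefix else file_id


print(output_file_name("financial-report", "2f0ff66a-e87a-4af5-befb-18dcafa6012f"))
```

Running the sketch with the UUID from step 6 prints financial-report-2f0ff66a-e87a-4af5-befb-18dcafa6012f, which matches the example file name in that step.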