Overview of Federated Search for Amazon S3

Use Federated Search for Amazon S3 to perform searches over datasets stored in Amazon S3 buckets and retrieve the results in your Splunk Cloud Platform instance.

Use Federated Search for Amazon S3 to perform remote searches directly on your Amazon S3 buckets and retrieve the results in your Splunk Cloud Platform instance for correlation, enrichment, and analysis, without the cost and complexity of data restoration.

Connections and datasets

Federated Search for Amazon S3 is part of the Data Management app, where you'll set up your federated search experience through the definition of connections and datasets.
Connection
An Amazon S3 connection defines how Splunk software securely authenticates a link between your Splunk Cloud Platform deployment and a remote dataset in Amazon S3. Amazon S3 connections are reusable and can be associated with multiple Amazon S3 datasets. Amazon S3 connections do not specify what data is searchable.
Dataset
An Amazon S3 dataset is a searchable data object that is associated with a single Amazon S3 connection. Each Amazon S3 dataset is defined by an Amazon S3 location.
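
The relationship between connections and datasets can be sketched as a small data model. This is only an illustration of the definitions above, not the product's internal representation: the class names, field names, and bucket paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Connection:
    """Describes how the deployment authenticates to Amazon S3.

    A connection does not say what data is searchable. (Hypothetical model.)
    """
    name: str


@dataclass(frozen=True)
class S3Dataset:
    """A searchable data object tied to one connection and one S3 location."""
    name: str
    connection: S3Connection
    s3_location: str


# One connection reused by two datasets (all names are made up).
conn = S3Connection(name="security-archive")
datasets = [
    S3Dataset("firewall_logs", conn, "s3://example-bucket/firewall/"),
    S3Dataset("dns_logs", conn, "s3://example-bucket/dns/"),
]

# Each dataset references a single connection; the connection is shared.
shared_connections = {d.connection.name for d in datasets}
```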

Two core workflows

Amazon S3 connections and datasets support two workflows. One workflow facilitates a combination of data routing and federated search. The other workflow is only for federated search.

Data routing and federated search
With this workflow, you configure a connection and a dataset that support an Edge Processor pipeline, an Ingest Processor pipeline, or both. The pipeline sends Edge Processor or Ingest Processor data to the dataset, which is an Amazon S3 location that serves as the pipeline destination. You can optionally configure the dataset to support federated searches, so that you can use the same dataset to write data to and read data from Amazon S3.
Federated search only
With this workflow, you configure a connection and dataset that are specifically for federated search of data that you store at an Amazon S3 location. Select this option when you want to focus on search of data that you are storing in Amazon S3 and do not require a data routing solution.

Data catalog requirement

Federated Search for Amazon S3 searches apply filtering and statistical functions to data catalogs that contain column, schema, and partition definitions for datasets in your Amazon S3 buckets. This means that a data catalog must be associated with each Amazon S3 dataset you intend to search.

When you create a Federated search only dataset, there are three setup routes you can take depending on the kind of data catalog you want to back the dataset with:

  • AWS Glue catalog table: A federated search dataset that is referenced by a table in an AWS Glue catalog that you maintain. This AWS Glue data catalog table can be formatted as a Delta Lake or Apache Iceberg table, or it can have a non-table format, meaning it is a traditional external table formatted over JSON or Parquet files.

  • Apache Iceberg REST catalog: A federated search dataset that is referenced by an Apache Iceberg REST catalog that you own and maintain, such as Nessie, Polaris, or a custom proxy.
    Note: Federated Search for Amazon S3 currently does not support Apache Iceberg REST catalogs that require authorization.
    Note: The AWS Glue Apache Iceberg REST catalog interface is not supported. If you want to use AWS Glue in conjunction with Apache Iceberg, select AWS Glue as the type of catalog you want to represent your dataset and set Iceberg as the catalog's table format. See Create an Amazon S3 dataset that is backed by an AWS Glue catalog table.
  • Splunk-native catalog: If you do not have a catalog, you can let Splunk software build one for you. You can let Splunk software use a crawler to infer the schema and partitions for your dataset, or you can opt to define the schema and partitions manually.

Note: When you create a Data routing and federated search dataset, a Splunk-native catalog is built for it.
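
The catalog rules in this section can be read as a small decision check. The following sketch encodes only what the text states; the enum names, function name, and parameters are hypothetical and do not correspond to any Splunk API.

```python
from enum import Enum
from typing import List


class Workflow(Enum):
    ROUTING_AND_SEARCH = "Data routing and federated search"
    SEARCH_ONLY = "Federated search only"


class Catalog(Enum):
    AWS_GLUE = "AWS Glue catalog table"
    ICEBERG_REST = "Apache Iceberg REST catalog"
    SPLUNK_NATIVE = "Splunk-native catalog"


def catalog_problems(
    workflow: Workflow,
    catalog: Catalog,
    rest_requires_authorization: bool = False,
    rest_is_aws_glue_interface: bool = False,
) -> List[str]:
    """Return the reasons a catalog choice conflicts with this section."""
    problems = []
    if workflow is Workflow.ROUTING_AND_SEARCH and catalog is not Catalog.SPLUNK_NATIVE:
        problems.append("Data routing and federated search datasets get a Splunk-native catalog.")
    if catalog is Catalog.ICEBERG_REST:
        if rest_requires_authorization:
            problems.append("Iceberg REST catalogs that require authorization are not supported.")
        if rest_is_aws_glue_interface:
            problems.append("Use the AWS Glue catalog type with the Iceberg table format instead.")
    return problems
```

An empty list means the choice does not conflict with any rule stated here; it does not guarantee that dataset creation succeeds.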

Supported file types and data formats

Federated Search for Amazon S3 supports the following file types and data formats.

  • CSV or CSV-type formats
  • JSON and ndjson (new-line delimited JSON)
  • Parquet

For more information about the file types supported by AWS Glue tables, see Creating tables using the console in the AWS Glue Developer Guide.

Federated Search for Amazon S3 supports data originating from Edge Processor and Ingest Processor. See About the Edge Processor solution or About Ingest Processor.

Federated Search for Amazon S3 does not support data in Dynamic Data Self-Storage (DDSS) format.

Supported compression types

Federated Search for Amazon S3 supports GZIP compression for data in JSON or CSV format.

Federated searches of compressed files might take longer to complete than federated searches of uncompressed files.
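
As a rough pre-flight check, the format and compression lists above can be expressed as a lookup. This is a sketch under assumptions: the function name is hypothetical, ndjson is treated as a JSON format for the GZIP rule, and any combination the text does not list is reported as "not listed" rather than as known to fail.

```python
from typing import Optional

LISTED_FORMATS = {"csv", "json", "parquet"}
GZIP_FORMATS = {"csv", "json"}  # the text lists GZIP only for JSON and CSV


def is_listed_as_supported(data_format: str, compression: Optional[str] = None) -> bool:
    """True if the format/compression pair appears in the lists above."""
    fmt = data_format.lower()
    if fmt == "ndjson":
        fmt = "json"  # assumption: ndjson counts as a JSON format
    if fmt not in LISTED_FORMATS:
        return False
    if compression is None:
        return True
    return compression.lower() == "gzip" and fmt in GZIP_FORMATS
```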

Supported encryption standards

Federated Search for Amazon S3 supports the following encryption standards.

  • Server-side encryption with Amazon S3-Managed Keys (SSE-S3)
  • Server-side encryption with the AWS Key Management Service (SSE-KMS)

Federated Search for Amazon S3 supports SSE-S3 without any additional setup requirements. For more information, see Using server-side encryption with Amazon S3-managed encryption keys (SSE-S3) in the Amazon Simple Storage Service User Guide.

Federated Search for Amazon S3 supports only customer-managed SSE-KMS keys. See Using server-side encryption with AWS KMS keys (SSE-KMS) in the Amazon Simple Storage Service User Guide.

Federated Search for Amazon S3 supports SSE-KMS encryption at the Amazon S3 bucket level and at the AWS Glue Data Catalog level.

SSE-KMS support requires some setup when you define a dataset. See Apply the dataset resource access policy to an AWS IAM role.

For more information about KMS encryption pricing, see AWS Key Management Service Pricing on the AWS website.
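
The encryption rules above can be summarized in a small classifier. It is a sketch, not a Splunk or AWS utility: the function name and return strings are hypothetical. The inputs are assumed to be the values AWS reports, namely the S3 server-side encryption algorithm ("AES256" for SSE-S3, "aws:kms" for SSE-KMS) and the KeyManager field from the KMS DescribeKey response ("CUSTOMER" or "AWS").

```python
from typing import Optional


def encryption_setup(sse_algorithm: Optional[str], key_manager: Optional[str] = None) -> str:
    """Classify a bucket's default encryption against the rules in this section."""
    if sse_algorithm == "AES256":
        # SSE-S3 is supported without additional setup.
        return "supported: no additional setup"
    if sse_algorithm == "aws:kms":
        if key_manager == "CUSTOMER":
            # Only customer-managed SSE-KMS keys are supported.
            return "supported: apply the dataset resource access policy"
        return "not supported: SSE-KMS key is not customer managed"
    return "not covered by this section"
```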

Supported Amazon S3 storage classes

Federated Search for Amazon S3 can search objects in Amazon S3 buckets that are in the following Amazon S3 data storage classes, with some restrictions and exceptions.

  • Standard

  • Standard-Infrequent Access (Standard-IA)

  • One Zone-Infrequent Access (One Zone-IA)

  • Reduced Redundancy

  • Intelligent-Tiering (see the restrictions that follow).

Federated Search for Amazon S3 can search objects in the Amazon S3 Intelligent-Tiering storage class as long as those objects are not archived in the Archive Access and Deep Archive Access tiers. Federated Search for Amazon S3 does not support the Archive Access and Deep Archive Access tiers.

Federated searches of objects that have the S3 Intelligent-Tiering storage class have the same latency and throughput as searches of objects that have the S3 Standard storage class.
Note: Federated Search for Amazon S3 currently does not support any Glacier storage classes.
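
The storage class rules above can be sketched as a per-object check. The function name is hypothetical, and the mapping from the class names in this section to Amazon S3 API values (for example STANDARD_IA and ONEZONE_IA) is an assumption. The inputs correspond to the StorageClass and ArchiveStatus values that Amazon S3 reports for an object.

```python
from typing import Optional

SEARCHABLE_CLASSES = {
    "STANDARD",
    "STANDARD_IA",
    "ONEZONE_IA",
    "REDUCED_REDUNDANCY",
    "INTELLIGENT_TIERING",
}
ARCHIVED_TIERS = {"ARCHIVE_ACCESS", "DEEP_ARCHIVE_ACCESS"}


def is_searchable_object(storage_class: Optional[str], archive_status: Optional[str] = None) -> bool:
    """True if the object's storage class is in the supported list above."""
    # HeadObject omits the storage class for S3 Standard objects.
    effective_class = storage_class or "STANDARD"
    if effective_class not in SEARCHABLE_CLASSES:
        return False  # includes the Glacier classes
    if effective_class == "INTELLIGENT_TIERING" and archive_status in ARCHIVED_TIERS:
        return False  # archived Intelligent-Tiering objects are excluded
    return True
```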

Restrictions

Federated Search for Amazon S3 includes the following restrictions:
  • Federated Search for Amazon S3 is available only to Splunk Cloud Platform users with deployments in AWS regions.

  • Federated Search for Amazon S3 does not support the following kinds of Splunk Cloud Platform deployments:

    • Deployments in Google Cloud and Microsoft Azure regions.

    • FedRAMP High and DoD IL5 deployments.

  • You cannot use Federated Search for Amazon S3 to search Amazon S3 buckets that are configured to be Requester Pays buckets. If Requester Pays is turned on for your Amazon S3 bucket and you try to run a federated search over that bucket, Splunk Cloud Platform rejects your search.

    Searches of Amazon S3 buckets that are configured to be Requester Pays buckets incur data transfer charges in accordance with the Amazon S3 pricing schedule located at Amazon S3 Pricing on the Amazon Web Services (AWS) website.
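
Because searches over Requester Pays buckets are rejected, it can help to check a bucket before you define a dataset on it. The following sketch assumes a client that behaves like a boto3 S3 client, whose get_bucket_request_payment call returns a Payer value of "Requester" or "BucketOwner". The helper name and the stand-in client class are hypothetical; the stand-in exists only so that the example runs without AWS credentials.

```python
def is_requester_pays(s3_client, bucket: str) -> bool:
    """True if Requester Pays is turned on for the bucket."""
    response = s3_client.get_bucket_request_payment(Bucket=bucket)
    return response.get("Payer") == "Requester"


class FakeS3Client:
    """Stand-in for a real S3 client, used only for illustration."""

    def __init__(self, payer: str):
        self._payer = payer

    def get_bucket_request_payment(self, Bucket: str) -> dict:
        return {"Payer": self._payer}


blocked = is_requester_pays(FakeS3Client("Requester"), "example-bucket")
allowed = not is_requester_pays(FakeS3Client("BucketOwner"), "example-bucket")
```

With boto3 installed and credentials configured, the same helper could be called as is_requester_pays(boto3.client("s3"), "your-bucket-name").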

What you need to get started

To get started with federated search of Amazon S3 data, you need the following:
  • A Splunk Cloud Platform (SCP) deployment.

  • A user account on the SCP deployment that has a role with the edit_datasets and edit_federated_providers capabilities. See Define roles on the Splunk platform with capabilities in the Splunk Cloud Platform Manage Users and Security manual.

  • An Amazon Web Services (AWS) account with data in Amazon S3 buckets that conforms to the supported file types, data formats, and compression types.

Activate Federated Search for Amazon S3

To activate Federated Search for Amazon S3 for your Splunk Cloud Platform deployment, contact your Splunk Sales representative. As part of this activation, you acquire a data scan entitlement that is based on the amount of remote Amazon S3 data, in terabytes, that you are projected to search over the upcoming year. Data scan entitlements are made up of Data Scan Units (DSUs). Each DSU is equivalent to 10 TB of data scanning capabilities.

You have one pool of DSUs that you share between the federated search products you use. For example, if you use both Federated Search for Amazon S3 and Federated Search for Microsoft Azure, you will share one pool of DSUs among the searches you run for both products.

Note: If you are an existing user of legacy Federated Search for Amazon S3 with a license to use that product, you do not need to activate Federated Search for Amazon S3 in the Data Management app. Go to the Data Management app to create connections and datasets to replace your existing providers and indexes, and use your existing DSU pool to run federated searches.

For more information about DSUs, see Splunk Offerings Purchase Capacity and Limitations.
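
The entitlement arithmetic can be sketched as follows. The projected volumes are made-up numbers, the function name is hypothetical, and rounding up to whole DSUs is an assumption; the text states only that each DSU is equivalent to 10 TB and that one pool is shared across federated search products.

```python
import math

TB_PER_DSU = 10  # from the text: each DSU covers 10 TB of data scanning


def dsus_needed(projected_tb_by_product: dict) -> tuple:
    """Return (exact DSUs, DSUs rounded up) for one shared pool."""
    total_tb = sum(projected_tb_by_product.values())
    exact = total_tb / TB_PER_DSU
    return exact, math.ceil(exact)


# Hypothetical yearly projections, in terabytes, for two products.
exact, whole = dsus_needed({"Amazon S3": 42.0, "Microsoft Azure": 13.0})
# 55 TB in total: exact is 5.5, whole is 6
```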

Checklist of tasks to set up Federated Search for Amazon S3

Use this checklist to guide you through the cross-account setup of Federated Search for Amazon S3.

Step Task Description
1 Create an Amazon S3 connection

A connection defines how Splunk software authenticates the link between your Splunk platform deployment and Amazon S3, so that you can run federated searches over Amazon S3 datasets.

2 Define an Amazon S3 dataset

Provide baseline information for your dataset, including its name, the AWS account it's associated with, and the Amazon S3 location that contains it. Link it to a connection. Determine whether the dataset will support both data routing and federated search or just federated search.

3a Create an Amazon S3 dataset for data routing and federated search

Create a dataset that supports sending of Edge Processor or Ingest Processor data through pipelines to an Amazon S3 location.

You can optionally turn on federated search functionality for this dataset, so you can run searches over the data that gets sent to it.

3b Create an Amazon S3 dataset for federated search that is backed by an AWS Glue catalog table

Create a dataset that is only for federated search and that is backed by an AWS Glue data catalog that you own.

3c Create an Amazon S3 dataset for federated search that is backed by an Iceberg REST catalog

Create a dataset that is only for federated search and that is backed by an Apache Iceberg REST catalog that you own.

3d Create an Amazon S3 dataset for federated search that is backed by a Splunk-native data catalog

Create a dataset that is only for federated search and that is backed by a data catalog that is operated by Splunk software. You can define the dataset schema and partition fields yourself, or you can let Splunk software use a crawler to automatically infer the schema and partition fields.

4 Apply the dataset resource access policy to an AWS role

Splunk software generates an AWS policy that grants access to the Amazon S3 bucket that contains the dataset you want to run federated searches over. You must manually apply this resource access policy to the IAM role that is associated with the connection for this dataset.

5 Give your users role-based access control of remote datasets

After you have successfully created an Amazon S3 dataset, give your users role-based access to it.

6 Run federated searches over remote datasets with SPL2

After you have successfully created a dataset definition, run federated searches over that dataset with SPL2.

Measure your DSU usage

When you use Federated Search for Amazon S3, you acquire a data scan entitlement that is based on the amount of remote Amazon S3 data, in terabytes, that you are projected to search over the upcoming year. Data scan entitlements are made up of Data Scan Units (DSUs). Each DSU is equivalent to 10 TB of data scanning capabilities.

If you want to get an idea of what your DSU usage requirements might be, run the following search for a time range long enough to be representative of your use of the product.

index=_cmc_summary source=federated-search-daily-usage
| fillnull value="" app user fs_type scenarios savedsearch_name
| stats max(dailyFsBytes) as maxDailyFsBytes by app user fs_type scenarios savedsearch_name _time
| stats sum(maxDailyFsBytes) as cumulativeFSUsage by app user fs_type scenarios
| eventstats sum(cumulativeFSUsage) as totalBytes
| eval totalFSGB = cumulativeFSUsage / (1024*1024*1024)
| eval totalFSDSU = totalFSGB / 1024 / 10
| eval DSU_Total = totalBytes / (1024*1024*1024*1024) / 10
| fields app user fs_type scenarios cumulativeFSUsage totalFSGB totalFSDSU DSU_Total
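
To make the arithmetic in this search easier to follow, here is a rough Python equivalent of its aggregation steps. It is a sketch for illustration, not a replacement for the search: the function name and sample records are hypothetical, the field names are taken from the search, and the sample values (including "s3" for fs_type) are made up.

```python
from collections import defaultdict

BYTES_PER_TB = 1024 ** 4  # the search divides by 1024 four times
TB_PER_DSU = 10           # from the text: 1 DSU = 10 TB of scanning
GROUP_FIELDS = ("app", "user", "fs_type", "scenarios")


def dsu_usage(records):
    """Return ({group: DSUs}, total DSUs) for a list of usage records."""
    # Like fillnull plus the first stats: keep the maximum bytes per
    # group, saved search, and timestamp.
    daily_max = defaultdict(int)
    for r in records:
        group = tuple(r.get(f) or "" for f in GROUP_FIELDS)
        key = (group, r.get("savedsearch_name") or "", r["_time"])
        daily_max[key] = max(daily_max[key], r["dailyFsBytes"])

    # Like the second stats: sum those maxima per group.
    cumulative = defaultdict(int)
    for (group, _name, _time), max_bytes in daily_max.items():
        cumulative[group] += max_bytes

    # Convert bytes to DSUs, per group (totalFSDSU) and overall (DSU_Total).
    per_group = {g: b / BYTES_PER_TB / TB_PER_DSU for g, b in cumulative.items()}
    total = sum(cumulative.values()) / BYTES_PER_TB / TB_PER_DSU
    return per_group, total


TB = BYTES_PER_TB
sample = [
    {"app": "search", "user": "ana", "fs_type": "s3", "scenarios": "adhoc",
     "_time": "2024-01-01", "dailyFsBytes": 5 * TB},
    {"app": "search", "user": "ana", "fs_type": "s3", "scenarios": "adhoc",
     "_time": "2024-01-01", "dailyFsBytes": 10 * TB},
    {"app": "search", "user": "ana", "fs_type": "s3", "scenarios": "adhoc",
     "_time": "2024-01-02", "dailyFsBytes": 10 * TB},
    {"app": "search", "user": "raj", "fs_type": "s3", "scenarios": "scheduled",
     "savedsearch_name": "nightly", "_time": "2024-01-01", "dailyFsBytes": 20 * TB},
]
per_group, total = dsu_usage(sample)
```

In this sample, the two records for the same user and timestamp collapse to the larger value (10 TB), so each user accounts for 20 TB, or 2 DSUs, and the overall total is 4 DSUs.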