Define Amazon S3 federated provider details

Note: This topic covers the Provider details step of the workflow for creating an Amazon S3 federated provider. Before you attempt this step, you must first complete the Provider basics step. See Begin defining an Amazon S3 federated provider.

To set up federated search for Amazon S3 on your Splunk Cloud Platform deployment, you must define one or more federated providers for that deployment. A federated provider definition gives your Splunk Cloud Platform deployment the means to establish a connection with a specific AWS account and search over specific datasets in that AWS account.

In this task you do these things to set up the details of a federated provider definition:

  • Splunk software uses AWS Glue tables to facilitate federated searches of remote Amazon S3 datasets. Identify whether you want Splunk software to generate AWS Glue tables for your Amazon S3 datasets, or if you want to use AWS Glue tables that you have previously created for your Amazon S3 datasets. This decision affects how you will fill out the rest of the Amazon S3 federated provider definition.
    • Splunk software can generate AWS Glue tables for Amazon S3 datasets that are composed entirely of either AWS CloudTrail logs or default format VPC flow logs.
    • Your federated provider definition can support a mix of Splunk-managed and customer-created AWS Glue tables.
  • If you have customer-created AWS Glue tables for Amazon S3 locations that you intend to search, list the AWS Glue tables and supply the name of the AWS Glue database to which the Glue tables belong.
  • Provide Amazon S3 locations for the datasets you intend to search. Each location is an Amazon S3 file path.
  • If you use server-side encryption through the AWS Key Management Service, provide the AWS KMS key Amazon resource names (ARNs) for the following things:
    • All AWS KMS encrypted S3 general purpose buckets that you intend to search.
    • Your AWS Glue Data Catalog, if it has AWS KMS encryption.
  • Confirm your awareness of the risks of federated search.
  • Generate AWS Identity and Access Management (IAM) policies for the federated provider.

General prerequisites

  • You must have the following things:
    • A role on your Splunk Cloud Platform deployment with the admin_all_objects capability.
    • An AWS account and an AWS IAM role with permissions that let you attach and modify policies for Amazon S3 locations and an AWS Glue Data Catalog. Contact your AWS administrator for assistance with permissions.
  • Turn on token authentication for your Splunk Cloud Platform deployment. See Enable or disable token authentication in Securing Splunk Cloud Platform.

Customer-created AWS Glue table prerequisites

If you are adding your own customer-created AWS Glue tables to the federated provider definition, gather information for each customer-created AWS Glue table that you are associating with the Amazon S3 federated provider definition. See Create an AWS Glue table.

Note: If you plan to search only Amazon S3 general purpose buckets that contain either AWS CloudTrail log datasets or default format VPC flow log datasets, and you will have Splunk software create and manage the AWS Glue tables that facilitate federated searches of those datasets, you can ignore the following prerequisites.
  • Obtain the values of the Name, Database, and Location fields for each customer-created AWS Glue table you are adding to the definition. To find this information for a specific table, open the AWS Glue console, and select Tables.
  • Each customer-created AWS Glue table you add to a specific Amazon S3 federated provider definition must do the following things:
    • Share the same AWS Glue database.
    • Reference an Amazon S3 location path that contains file objects that share the same file type and compression type.
    • Have a column for each field in the Amazon S3 dataset contained by the Amazon S3 location path.
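The shared-database requirement above can be checked before you fill out the definition. The following is a minimal sketch, not part of the Splunk workflow. It assumes that you have already fetched each table description as a dictionary shaped like the "Table" value that the boto3 Glue client returns from get_table. The function name summarize_glue_tables is hypothetical.

```python
def summarize_glue_tables(tables):
    """Return (Name, Database, Location) rows for customer-created Glue tables.

    `tables` is assumed to be a list of dicts shaped like the "Table" value
    in a boto3 Glue get_table response. Raises ValueError if the tables do
    not all belong to the same AWS Glue database.
    """
    rows = [
        (t["Name"], t["DatabaseName"], t["StorageDescriptor"]["Location"])
        for t in tables
    ]
    databases = sorted({database for _, database, _ in rows})
    if len(databases) > 1:
        raise ValueError(f"tables span multiple AWS Glue databases: {databases}")
    return rows
```

With boto3 installed and AWS credentials configured, each dictionary could come from boto3.client("glue").get_table(DatabaseName=..., Name=...)["Table"]. The sketch does not check file types, compression types, or column coverage.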

Steps

  1. On your Splunk Cloud Platform deployment, in Splunk Web, at the Provider details step, indicate whether you have AWS CloudTrail or default format VPC flow log datasets that you want Splunk software to create AWS Glue tables for.
  2. If you indicate you do want Splunk software to create AWS Glue tables for you, indicate whether all of your datasets have AWS CloudTrail or default format VPC flow log source types. If you do not select this checkbox, you are indicating that you will be providing some customer-created AWS Glue tables for this federated provider definition.
  3. If you have customer-created AWS Glue tables for this federated provider definition, Splunk software provides the AWS region for your Splunk Cloud Platform deployment. Confirm that your Glue Data Catalog resources for this provider, such as the AWS Glue database and your customer-created AWS Glue tables, reside in this AWS region.
  4. Specify the following settings for your Amazon S3 federated provider:
    • AWS Glue database (required if you have customer-created Glue tables)
    • AWS Glue tables (required if you have customer-created Glue tables)
    • Amazon S3 locations (required in all conditions)
    • AWS KMS key ARNs (required if your AWS Glue Data Catalog or S3 general purpose buckets use SSE-KMS encryption)
    For detailed descriptions of these settings, including the conditions under which they are required, see Specify Amazon S3 Provider settings.
  5. Select Consent about the risk of using federated search and Confirmation that Requester Pays is turned off.
  6. Select Generate policy to create the following AWS IAM policies on the Update policies page:
    • A Glue Data Catalog resource policy, if you have customer-created AWS Glue tables.
    • A separate Amazon S3 bucket policy for each Amazon S3 general purpose bucket discovered in the location paths listed in Amazon S3 locations.
    • One or more AWS KMS key policies, if you added ARNs to the AWS KMS key ARNs field.

See Update your AWS policies to learn how to update your AWS account with the generated policies.

Specify Amazon S3 provider settings

Note: Specifying Amazon S3 provider settings is part of the Define an Amazon S3 federated provider task.

The following table explains the conditions under which each of the Amazon S3 provider settings is required.

Setting name Requirement condition
AWS Glue database Required only if you have manually created AWS Glue tables for the datasets that you intend to search.
Do not provide an AWS Glue database value if all of the Amazon S3 datasets you intend to search are composed of AWS CloudTrail data or default format VPC flow log data, and you want Splunk software to create and manage the AWS Glue tables for those datasets.
AWS Glue tables Required only if you have created AWS Glue tables for Amazon S3 datasets that you intend to search.
Do not provide values for AWS Glue tables if all of the Amazon S3 datasets you intend to search are composed of AWS CloudTrail data or default format VPC flow log data, and you want Splunk software to create and manage the AWS Glue tables for those datasets.
Amazon S3 locations Required in all conditions.
AWS KMS key ARNs Required only if your AWS Glue Data Catalog or the Amazon S3 general purpose buckets that contain the data that you want to search have server-side encryption through the AWS Key Management Service (SSE-KMS encryption).
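
The requirement conditions in this table can be summarized as a small decision function. This is an illustrative sketch only. The function name and its two boolean parameters are hypothetical, and the second parameter stands for SSE-KMS encryption on either the buckets or the AWS Glue Data Catalog.

```python
def required_settings(has_customer_glue_tables: bool, uses_sse_kms: bool) -> list:
    """List the provider settings that must be filled in for a given setup."""
    # Amazon S3 locations is required in all conditions.
    required = ["Amazon S3 locations"]
    # The two Glue settings apply only to customer-created AWS Glue tables.
    if has_customer_glue_tables:
        required += ["AWS Glue database", "AWS Glue tables"]
    # Key ARNs apply only when SSE-KMS encryption is in use.
    if uses_sse_kms:
        required.append("AWS KMS key ARNs")
    return required
```

For example, a provider whose AWS Glue tables are all Splunk-managed and whose buckets are not SSE-KMS encrypted needs only Amazon S3 locations.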

Federated provider name

Enter a unique name for the federated provider. The provider name can contain only alphanumeric characters, underscores, and hyphens.
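
If you generate provider names in a script, you can check the character rule ahead of time. The following is a minimal sketch that assumes "alphanumeric" means ASCII letters and digits. The function name is hypothetical. The sketch cannot check uniqueness, which depends on the providers that already exist on your deployment.

```python
import re

# Letters, digits, underscores, and hyphens only; at least one character.
_PROVIDER_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_provider_name(name: str) -> bool:
    """Return True if the name uses only the permitted characters."""
    return _PROVIDER_NAME.fullmatch(name) is not None
```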

AWS account ID

Enter the 12-digit ID for the AWS account that contains the Amazon S3 datasets that you want to search with this federated provider.

AWS region

The Splunk Cloud Platform attempts to automatically identify the AWS region of your deployment. If the Splunk Cloud Platform is successful, it populates the AWS region setting with the appropriate value.

Note: The AWS region of the AWS Glue database for this federated provider must match the AWS region for your Splunk Cloud Platform deployment.

You cannot set this field on your own.

AWS Glue database

Enter the name of the AWS Glue database that contains the AWS Glue tables listed in AWS Glue tables. AWS Glue database names can contain only lowercase letters, numbers, underscores, and hyphens. An AWS Glue database name can have no more than 255 characters.

The AWS Glue database setting has the following restrictions:

  • The AWS Glue database specified by AWS Glue database must contain all of the AWS Glue tables listed in AWS Glue tables.
  • The AWS Glue database must have the same AWS region as your Splunk Cloud Platform deployment.
  • A federated provider definition can have only 1 AWS Glue database name.

Check the Tables page in the AWS Glue console to see the Database assignments for individual AWS Glue tables.

Note: Do not provide an AWS Glue database value if all of your AWS Glue tables will be Splunk-managed.

AWS Glue tables

Enter 1 or more customer-created AWS Glue tables that you want to associate with this federated provider. Separate AWS Glue table names with commas. AWS Glue table names can contain only lowercase letters, numbers, underscores, and hyphens. AWS Glue table names can be no longer than 255 characters.

All AWS Glue tables listed in AWS Glue tables must meet these requirements:

  • Belong to the AWS Glue database specified by AWS Glue database.
  • Reference an Amazon S3 location path listed in Amazon S3 locations.

To get AWS Glue table names, check the Tables page in the AWS Glue console. AWS Glue table names appear in the Names column.

For more information about creating AWS Glue tables, see Create an AWS Glue table.

Note: Do not provide AWS Glue tables values if all of your AWS Glue tables will be Splunk-managed.
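
The naming rules for the AWS Glue tables field can be sketched as a parser for the comma-separated value. This is an illustration under stated assumptions, not how Splunk software validates the field. The function name parse_glue_tables is hypothetical, and the sketch assumes that whitespace around commas is not significant.

```python
import re

# Lowercase letters, numbers, underscores, and hyphens; 1 to 255 characters.
_GLUE_NAME = re.compile(r"[a-z0-9_-]{1,255}")


def parse_glue_tables(field_value: str) -> list:
    """Split a comma-separated table list and reject names that break the rules."""
    names = [part.strip() for part in field_value.split(",") if part.strip()]
    invalid = [name for name in names if not _GLUE_NAME.fullmatch(name)]
    if invalid:
        raise ValueError(f"invalid AWS Glue table names: {invalid}")
    return names
```

The same pattern applies to the AWS Glue database name, which follows the same character rules.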

Amazon S3 locations

Amazon S3 locations are file paths in Amazon S3 general purpose buckets that contain the data that you want to search.

Enter 1 or more Amazon S3 location paths. Separate location paths with commas. Amazon S3 locations can contain only alphanumeric characters and the following special characters: /!=_.*'():

Amazon S3 paths that terminate in a file object are not valid for Splunk federated providers. Use only locations that end in an Amazon S3 folder.

For example, this is an invalid location path because it ends in a file object: s3://bucket1/path1/my_csv_data/data.csv

Provide location paths for the following things:

  • Each AWS CloudTrail log dataset or default format VPC flow log dataset in your Amazon S3 general purpose buckets for which you want Splunk software to create an AWS Glue table.
  • Each customer-created AWS Glue table you list in AWS Glue tables. Each AWS Glue table can be associated with only 1 location path. A single location path can be specified in multiple unrelated AWS Glue table definitions.

AWS KMS key ARNs

Do you use the AWS Key Management Service to apply server-side encryption (SSE-KMS) to the data stored in your Amazon S3 general purpose buckets or the metadata in your AWS Glue Data Catalog?

If you do, enter into AWS KMS key ARNs the Amazon resource names (ARNs) for the AWS KMS keys that encrypt data in your Amazon S3 general purpose buckets or metadata in your AWS Glue Data Catalog.

Note: Federated search for Amazon S3 supports only customer-managed AWS KMS keys. In addition, each KMS key ARN you provide in this field must belong to the AWS account you specify with the AWS account ID setting.

For more information about AWS KMS keys, see AWS KMS concepts in the AWS Key Management Service Developer Guide.
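
The account-ownership rule in the preceding note can be checked from the ARN text itself, because an AWS KMS key ARN has the form arn:<partition>:kms:<region>:<account-id>:key/<key-id>. The following is a minimal sketch with a hypothetical function name. It cannot tell whether a key is customer-managed, which the text of an ARN does not reveal.

```python
def kms_key_arn_problems(arn: str, account_id: str) -> list:
    """Return a list of problems found with one KMS key ARN (empty if none)."""
    problems = []
    if not (len(account_id) == 12 and account_id.isascii() and account_id.isdigit()):
        problems.append("AWS account ID must be exactly 12 digits")
    # Expected fields: arn, partition, service, region, account ID, resource.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "kms":
        problems.append("not an AWS KMS ARN")
        return problems
    owner, resource = parts[4], parts[5]
    # Assumption: only key/ resources qualify, not alias/ resources.
    if not resource.startswith("key/"):
        problems.append("ARN does not identify a KMS key")
    if owner != account_id:
        problems.append("key belongs to a different AWS account")
    return problems
```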

Get AWS KMS key ARNs for your Amazon S3 general purpose buckets

To get the AWS KMS key ARNs for your Amazon S3 general purpose buckets, go to your Amazon S3 console and review the buckets that are associated with the locations you have listed in Amazon S3 locations for this provider. Follow these steps to obtain an AWS KMS key ARN that is associated with an Amazon S3 general purpose bucket:

  1. In the Amazon S3 console, navigate to the General purpose buckets page.
  2. Select the Name of the general purpose bucket with AWS KMS encryption.
  3. Select the general purpose bucket Properties tab.
  4. Inspect the Default encryption section. If the Encryption type is Server-side encryption with AWS Key Management Service keys (SSE-KMS), copy the Encryption key ARN that appears below it. Select the copy icon next to the ARN to ensure an accurate copy and paste operation.
  5. Paste the copied ARN into the AWS KMS key ARNs field in your federated provider definition.

If you copy multiple ARNs into AWS KMS key ARNs, separate them with comma characters.

Note: If additional AWS KMS keys encrypt data within the general purpose bucket, you must provide their key ARNs within the AWS KMS key ARNs list.
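
If you prefer a script to the console, the default encryption settings for a bucket are also available through the AWS API. The following is a sketch that reads a response dictionary shaped like the one that the boto3 S3 client returns from get_bucket_encryption. The function name is hypothetical. The returned value might be a key ID or alias rather than a full ARN, so confirm it in the AWS KMS console before you paste it into the field.

```python
def default_sse_kms_keys(encryption_response: dict) -> list:
    """Collect the KMS key identifiers used for a bucket's default SSE-KMS encryption."""
    config = encryption_response.get("ServerSideEncryptionConfiguration", {})
    keys = []
    for rule in config.get("Rules", []):
        default = rule.get("ApplyServerSideEncryptionByDefault", {})
        # "aws:kms" is the algorithm value that corresponds to SSE-KMS.
        if default.get("SSEAlgorithm") == "aws:kms" and "KMSMasterKeyID" in default:
            keys.append(default["KMSMasterKeyID"])
    return keys
```

With boto3 installed and credentials configured, the input could come from boto3.client("s3").get_bucket_encryption(Bucket="bucket1"), where bucket1 is a placeholder name. This covers only the default bucket encryption. It does not find additional keys that encrypt individual objects.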

Get the AWS KMS key ARN for your AWS Glue Data Catalog

Follow these steps to get the AWS KMS key ARN for your AWS Glue Data Catalog:

  1. In the AWS Glue console, select Data Catalog and then Catalog settings in the left-hand navigation bar.
  2. Under Encryption options, if Metadata encryption is selected and you find an AWS KMS key ARN in the AWS KMS key for metadata encryption field, copy it. Select the copy icon to ensure an accurate copy and paste operation.
  3. Paste the copied ARN into the AWS KMS key ARNs field in your federated provider definition.

If AWS KMS key ARNs contains a list of other ARNs when you add the Data Catalog AWS KMS key ARN, make sure you separate the new ARN from the others with a comma character.
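
The same lookup can be scripted. The sketch below reads a response dictionary shaped like the one that the boto3 Glue client returns from get_data_catalog_encryption_settings. The function name is hypothetical, and the sketch treats any mode other than DISABLED as encrypted, which is an assumption.

```python
def data_catalog_kms_key(settings_response: dict):
    """Return the KMS key used for Data Catalog metadata encryption, or None."""
    settings = settings_response.get("DataCatalogEncryptionSettings", {})
    at_rest = settings.get("EncryptionAtRest", {})
    # A missing or DISABLED mode means that metadata encryption is off.
    if at_rest.get("CatalogEncryptionMode", "DISABLED") == "DISABLED":
        return None
    return at_rest.get("SseAwsKmsKeyId")
```

With boto3 installed and credentials configured, the input could come from boto3.client("glue").get_data_catalog_encryption_settings().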

Supply correctly-formatted Amazon S3 locations for Splunk-managed AWS Glue table generation

If you plan to search an AWS CloudTrail log dataset or a default format VPC flow log dataset and you want Splunk software to create the AWS Glue tables for that dataset, you must provide an Amazon S3 locations file path to that dataset that is formatted in a manner that Splunk software expects. If Splunk software gets an incorrectly formatted file path, it cannot generate an AWS Glue table for it.

The general rule for AWS CloudTrail log datasets and default format VPC flow log datasets is to provide location paths that stop at the AWSLogs/ folder.

The AWSLogs/ folder is always followed by 1 or more folders with 12-digit AWS account ID numbers in their names. Do not include AWS account ID folders in the Amazon S3 locations value.

For example, this is a correct Amazon S3 locations value for a default format VPC flow log dataset:

s3://splunkmanagedglue-vpcflow/default/text/hive/AWSLogs/

And these are incorrect Amazon S3 locations values for the same default format VPC flow log dataset:

  • s3://splunkmanagedglue-vpcflow/default/text/
  • s3://splunkmanagedglue-vpcflow/default/text/hive/AWSLogs/aws-account-id=100000000299/
  • s3://splunkmanagedglue-vpcflow/default/text/hive/AWSLogs/*

Note: The fact that multiple AWS account ID folders can exist within an AWSLogs folder means that Amazon S3 locations path values can represent datasets that span multiple AWS accounts.

See Syntax for AWS CloudTrail log location paths.

See Syntax for default format VPC flow log location paths.
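
The rule of stopping at the AWSLogs/ folder can be sketched as a helper that trims a deeper path. This is an illustration with a hypothetical function name, not something Splunk software provides. It assumes that the folder is spelled AWSLogs, as in the examples in this topic.

```python
def trim_to_awslogs(location: str) -> str:
    """Cut an S3 URI so that it ends at its AWSLogs/ folder."""
    if not location.startswith("s3://"):
        raise ValueError("location must start with s3://")
    # Drop a trailing wildcard and make sure the path ends with a slash.
    candidate = location.rstrip("*")
    if not candidate.endswith("/"):
        candidate += "/"
    marker = "/AWSLogs/"
    index = candidate.find(marker)
    if index == -1:
        raise ValueError("location does not contain an AWSLogs/ folder")
    return candidate[: index + len(marker)]
```

Applied to the incorrect examples above, the helper returns the correct value for the paths that go past AWSLogs/ and raises an error for the path that stops before it.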

Steps for getting an Amazon S3 location path that Splunk software can use for AWS Glue table generation

Follow these steps to get an Amazon S3 locations path that Splunk software can use to generate an AWS Glue table for an AWS CloudTrail log dataset or a default format VPC flow log dataset.

  1. Go to the Amazon S3 console in your AWS account and navigate to the General purpose buckets page.
  2. Select the Name of the general purpose bucket that contains the dataset that you want to search.
  3. Continue selecting folder Name values for the dataset you want to search, until you reach a folder named AWSLogs/.
  4. Select Copy S3 URI to copy the location file path to the AWSLogs/ folder for your dataset.
  5. Paste the URI into the Amazon S3 locations setting in your federated provider definition.

Repeat this process for each AWS CloudTrail log dataset or default format VPC flow log dataset for which you want Splunk software to create an AWS Glue table.

For detailed information about using the Amazon S3 console to review and manage general purpose bucket contents, see Working with objects in Amazon S3 in the Amazon Simple Storage Service (S3) User Guide.

Syntax for AWS CloudTrail log location paths

When an AWS CloudTrail log dataset is associated with 1 AWS account ID, its Amazon S3 location path has the following syntax:

s3://<general-purpose-bucket-name>/<additional-prefix-folders>/AWSLogs/

The <additional-prefix-folders>/ might not be present in the location path. These can be one or more additional folders that people optionally set up to differentiate between multiple datasets that are being stored in the same Amazon S3 general purpose bucket.

An AWS CloudTrail dataset can be associated with multiple AWS account IDs. When there are multiple AWS account IDs associated with an AWS CloudTrail dataset, its Amazon S3 location path might include an AWS organization ID. Here is the Amazon S3 location path syntax for such AWS CloudTrail datasets:

s3://<general-purpose-bucket-name>/<additional-prefix-folders>/o-<organization-ID>/AWSLogs/

Syntax for default format VPC flow log location paths

Amazon S3 location paths for default format VPC flow log datasets have the following syntax:

s3://<general-purpose-bucket-name>/<additional-prefix-folders>/<file-format>/<partition-style>/AWSLogs/

The <additional-prefix-folders> might not be present in the location path. These can be one or more additional folders that people optionally set up to differentiate between multiple datasets that are being stored in the same Amazon S3 general purpose bucket.

The <file-format> can be either parquet or text. The <partition-style> can be either hive or non-hive.
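
As a worked example of this syntax, the following sketch assembles a location path from its parts. The function name and parameters are hypothetical, and the sketch assumes that the file format and partition style appear in the path as the literal folder names listed above.

```python
def vpc_flow_log_location(bucket: str, file_format: str,
                          partition_style: str, prefix: str = "") -> str:
    """Build an Amazon S3 location path for a default format VPC flow log dataset."""
    if file_format not in {"parquet", "text"}:
        raise ValueError("file_format must be 'parquet' or 'text'")
    if partition_style not in {"hive", "non-hive"}:
        raise ValueError("partition_style must be 'hive' or 'non-hive'")
    # The optional prefix can hold one or more folders, such as "team/prod".
    prefix_folders = [folder for folder in prefix.strip("/").split("/") if folder]
    parts = [bucket, *prefix_folders, file_format, partition_style, "AWSLogs"]
    return "s3://" + "/".join(parts) + "/"
```

For example, vpc_flow_log_location("splunkmanagedglue-vpcflow", "text", "hive", prefix="default") yields s3://splunkmanagedglue-vpcflow/default/text/hive/AWSLogs/.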

Shorten location paths to represent multiple customer-created AWS Glue tables

If you have designed multiple AWS Glue tables that represent different datasets within the same Amazon S3 general purpose bucket, you can submit a shortened location path to the Amazon S3 locations list that represents some or all of those AWS Glue tables.

For example, the following 2 location paths represent 2 different datasets within a general purpose bucket named bucket1:

Location                           AWS Glue table name
s3://bucket1/path1/my_csv_data     table_csv
s3://bucket1/path1/my_json_data    table_json

You can provide each path separately in the Amazon S3 locations list.

Alternatively, you can add a single shortened location to the Amazon S3 locations list that captures both table_csv and table_json:

s3://bucket1/path1/

You can also use a wildcard ( * ) to capture both AWS Glue tables:

s3://bucket1/path1/my*

Note: Use wildcards only at the end of Amazon S3 locations.
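
The shortening described in this section amounts to finding the deepest folder that several location paths share. The following is a minimal sketch with hypothetical function names. It also includes a check for the rule that wildcards belong only at the end of a location.

```python
import posixpath


def shortened_location(locations):
    """Return the deepest shared folder of several S3 location paths."""
    keys = []
    for location in locations:
        if not location.startswith("s3://"):
            raise ValueError("location must start with s3://")
        # Treat "bucket/path" as an absolute POSIX path for comparison.
        keys.append("/" + location[len("s3://"):].strip("/"))
    common = posixpath.commonpath(keys)
    if common == "/":
        raise ValueError("locations do not share a bucket")
    return "s3:/" + common + "/"


def wildcard_only_at_end(location: str) -> bool:
    """Return True if any * in the location is its final character."""
    return "*" not in location[:-1]
```

For the 2 example locations in this section, shortened_location returns s3://bucket1/path1/. A shortened path also covers any other data under that folder, so confirm that this is what you intend before you use it.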