Map an Amazon S3 federated index to a Splunk-managed AWS Glue table for a default format VPC flow log dataset

Note: This topic covers the Set up federated index step of the workflow for adding a new Amazon S3 federated provider. You cannot follow this step until you complete the steps that precede it in the workflow. See the checklist of tasks to set up Federated Search for Amazon S3.

This topic shows you how to create an Amazon S3 federated index that maps to a Splunk-managed AWS Glue table for a default format VPC flow log dataset, so you can run federated searches over that data.

If you want to search Amazon S3 datasets composed of AWS CloudTrail log data, see Map an Amazon S3 federated index to a Splunk-managed AWS Glue table for an AWS CloudTrail log dataset.

If you have manually created AWS Glue tables for your Amazon S3 datasets, see Map a federated index to a customer-created AWS Glue table.

After you define an Amazon S3 federated provider for your Splunk Cloud Platform deployment, you create federated indexes for use in federated searches. Each federated index you create maps to a specific AWS Glue table, which in turn references an Amazon S3 dataset. You invoke federated indexes in your federated searches to tell Splunk software which Amazon S3 dataset you intend to search.

The Splunk platform creates federated indexes on the search head of your Splunk Cloud Platform deployment.

This task guides you through the process of creating a federated index that maps to a Splunk-managed AWS Glue table for a default format VPC flow log dataset. Splunk software creates an AWS Glue table based on the information you provide in this task, and manages it thereafter.

In this task, you do these things:

Provide the name of the federated index.
Select the AWS Glue table (Splunk managed: VPC flow log) dataset type.
Supply an Amazon S3 location path that points to the default format VPC flow log dataset to which this federated index will be mapped.
Declare either start or end as the Time field for this VPC flow log dataset.
List the AWS Region values that can be used as partition keys for your searches of the VPC flow log dataset.

If a federated provider has Amazon S3 locations for several VPC flow log datasets over which you want to run federated searches, define a separate federated index for each VPC flow log dataset.

Prerequisites

A role on your Splunk Cloud Platform deployment that has the admin_all_objects capability.
Datasets in your Amazon S3 buckets that are composed entirely of default format AWS VPC flow log data.
An Amazon S3 federated provider that is already defined and set up for the creation of Splunk-managed AWS Glue tables. See Define an Amazon S3 federated provider.

Note: Splunk software cannot generate or manage AWS Glue tables for VPC flow log datasets that have non-default or custom log record formats. If your VPC flow log dataset has a non-default or custom log record format, you must create an AWS Glue table for it yourself. See Create an AWS Glue table.

Steps

On your Splunk Cloud Platform deployment, in Splunk Web, at the Set up federated index step of the Add a new Amazon S3 provider workflow, use the following table to specify the settings for your federated index.

Note: You might also come to this collection of new federated index settings when you edit a federated provider or select Add federated index on the Federated indexes list page.


Setting	Description
Federated index name	Enter a unique name for the federated index. Federated index names have the following restrictions: They can contain only letters, numbers, underscores, and hyphens. They must begin with a letter or number. They cannot be more than 2,048 characters in length. They cannot be named kvstore. You can use this string in a longer name, like abc_kvstore.
Dataset type	Select AWS Glue table (Splunk managed: VPC flow log).
Amazon S3 location	Select the Amazon S3 location path for the default format VPC flow log dataset that you will search with this federated index. Splunk software creates an AWS Glue table that represents this dataset, and the federated index maps to that AWS Glue table. The list contains the same Amazon S3 location paths that appear in the definition of the federated provider with which this federated index is associated. For VPC flow log datasets, the location path must end at the `AWSLogs/` folder. If the location path you need does not end at the `AWSLogs/` folder, you might need to correct it in the definition of the federated provider. See Supply correctly-formatted Amazon S3 locations for Splunk-managed AWS Glue table generation.

Wait for Splunk software to retrieve permissions for the VPC flow log dataset and prepopulate field values. This process can take some time to complete.
Determine whether the declared Time field for this federated index is start or end. This will be the primary time field you use for sdselect searches that invoke this federated index. Time field is set to start by default.
A VPC flow log record represents a network flow in your VPC. The start field is the time, in numeric UNIX format, when the first packet of the network flow was received within the record's aggregation interval. The end field is the time, in numeric UNIX format, when the last packet of the network flow was received within the record's aggregation interval.
Note: When you use the end field in sdselect searches, you must put the field name in single quotes, like this: 'end'.
See Use time fields in sdselect searches.
For Max search time range, specify the maximum relative time range within which searches of the VPC flow log dataset return results.
Max search time range applies to the time partitions in your data. For example, if you run a search over the last 3 years and Max search time range is set to 1 year, the search returns results only from the time partitions that fall within the last year.
Federated searches with time ranges of 2 years or more might suffer from reduced search performance. If you occasionally need to run searches over data that is older than the Max search time range, consider setting up additional federated indexes with larger Max search time range values.
For example, you might run most of your searches over a federated index with a Max search time range of 1 year. But you very occasionally have to run searches over data that is between 1 and 2 years of age, and for those searches you can set up a second federated index with a Max search time range of 2 years.
Review the AWS regions list. Splunk software prepopulates it with the AWS regions by which the VPC flow log dataset is partitioned. You can update the list. For example, if you want to restrict searches of the VPC flow log dataset to a subset of the prepopulated AWS regions, you can delete values from the list.
You can restore deleted regions. For more information about obtaining AWS regions by which VPC flow log datasets are partitioned, see Get partition key values for a default format VPC flow log dataset.
Alternatively, you can provide a wildcard symbol (*) to apply all available AWS regions to the federated index definition.
Note: When you use a wildcard symbol for AWS regions in a federated index definition, you must include a WHERE clause that filters results by pk_region when you invoke that federated index in an sdselect search. See sdselect command WHERE clause operations.
Select Save to save the federated index configuration.
(Optional) Give your users access to the federated index. To run searches over the remote dataset to which the federated index maps, your users must have access permissions for the federated index. See Give your users role-based access control of federated indexes.
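The naming restrictions listed for the Federated index name setting can be checked before you enter a name in Splunk Web. The following Python sketch is illustrative only and is not part of Splunk software. The function name index_name_problems is hypothetical, "letters" is assumed to mean ASCII letters, and the uniqueness requirement cannot be checked locally, so it is left out.

```python
import re

# Limits taken from the Federated index name restrictions in the settings table.
MAX_NAME_LENGTH = 2048
ALLOWED_CHARS = re.compile(r"[A-Za-z0-9_-]+")   # assumption: ASCII letters only
FIRST_CHAR = re.compile(r"[A-Za-z0-9]")


def index_name_problems(name):
    """Return a list of rule violations for a candidate name (empty list = ok)."""
    if not name:
        return ["name is empty"]
    problems = []
    if not ALLOWED_CHARS.fullmatch(name):
        problems.append("contains characters other than letters, numbers, "
                        "underscores, and hyphens")
    if not FIRST_CHAR.match(name):
        problems.append("does not begin with a letter or number")
    if len(name) > MAX_NAME_LENGTH:
        problems.append("longer than 2,048 characters")
    if name == "kvstore":
        problems.append("is the reserved name kvstore")
    return problems


if __name__ == "__main__":
    # Hypothetical candidate names, for illustration only.
    for candidate in ("vpc_flow_prod", "abc_kvstore", "kvstore", "_flows"):
        print(candidate, "->", index_name_problems(candidate) or "ok")
```

A name such as abc_kvstore passes because only the exact name kvstore is disallowed.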

How partitions optimize searches of VPC flow log datasets

Partitioning is an organization strategy for large datasets that makes it possible for you to search them efficiently. When you partition your data, you organize it into a hierarchical directory structure based on the distinct values of 1 or more fields in the data. Files in VPC flow log datasets are partitioned by time: they are organized into folders by year, month, and day. As a result, a search for a specific date needs to read only the files in the folder for that date.

Because VPC flow log datasets have a stable schema, definitions for federated indexes that map to Splunk-managed AWS Glue tables come with default partition time field values that you cannot change.

However, all VPC flow log datasets are also partitioned by two other fields (or "keys"): AWS account ID and AWS region. When your federated index definition includes a set of partition keys, you can run efficient and cost-effective sdselect searches of the VPC flow log dataset to which the federated index maps.
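To see why partition keys narrow a search, consider which folders a search would have to read for a given set of regions and days. The following Python sketch is a conceptual illustration only. It does not describe how Splunk software or AWS Glue prunes partitions internally. The function name partition_prefixes, the bucket name, and the account ID are hypothetical, and the sketch assumes the non-Hive folder layout with no additional prefix folders.

```python
from datetime import date, timedelta


def partition_prefixes(base, account_id, regions, first_day, last_day):
    """Build one folder prefix per (day, region) pair in the requested range.

    Assumes the non-Hive layout:
    <base>AWSLogs/<account-id>/vpcflowlogs/<region>/<year>/<month>/<day>/
    """
    prefixes = []
    day = first_day
    while day <= last_day:
        for region in regions:
            prefixes.append(
                f"{base}AWSLogs/{account_id}/vpcflowlogs/{region}/"
                f"{day:%Y/%m/%d}/"
            )
        day += timedelta(days=1)
    return prefixes


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    wanted = partition_prefixes(
        "s3://example-bucket/",
        "123456789012",
        ["us-east-1", "us-west-2"],
        date(2024, 3, 1),
        date(2024, 3, 3),
    )
    for prefix in wanted:
        print(prefix)
```

Two regions over three days resolve to six folders. Every other region and date folder in the dataset can be skipped, which is the source of the efficiency and cost savings.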

Get partition key values for a default format VPC flow log dataset

When Splunk software prepopulates fields for a VPC flow log federated index, it identifies the AWS account IDs and AWS regions that you can use as partition fields in your federated searches of the dataset. You might delete AWS regions values to restrict searches of the VPC flow log dataset. If you later decide to restore some of those AWS regions values, how would you go about doing that?

To get values for the AWS regions field, go to the Amazon S3 console and inspect the full Amazon S3 location path for the dataset. The <AWS-region> folder in the following VPC flow log location path syntax example shows you where the values for the AWS regions field can be found:

s3://<general-purpose-bucket-name>/<additional-prefix-folders>/<file-format>/<partition-style>/AWSLogs/<AWS-account-ID>/<aws-service>/<AWS-region>/<year>/<month>/<day>/<filename>

For example, in the Amazon S3 console, when you open the <aws-service>/ folder for a VPC flow log dataset, you'll see the AWS regions the dataset is associated with.

Note: The <aws-service>/ folder will be named aws-service=vpcflowlogs/ if the dataset uses Hive-style partitioning or just vpcflowlogs/ if it does not.
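If you have a list of full object paths for the dataset, for example copied from the Amazon S3 console, you can also pull the region values out of them with a short script. The following Python sketch follows the location path syntax shown above. The function names and sample paths are hypothetical. For Hive-style paths, it assumes that every folder after AWSLogs/ uses a key=value name, as the aws-service=vpcflowlogs/ folder does, and it keeps only the value.

```python
# Folder order after AWSLogs/, per the location path syntax in this topic.
FOLDER_NAMES = ("account_id", "service", "region", "year", "month", "day")


def parse_vpc_flow_log_path(path):
    """Split a VPC flow log object path into its partition folder values."""
    _, marker, tail = path.partition("/AWSLogs/")
    if not marker:
        raise ValueError("path does not contain an AWSLogs/ folder")
    folders = [part for part in tail.split("/") if part][:len(FOLDER_NAMES)]
    if len(folders) < 3:
        raise ValueError("path ends before the AWS region folder")
    # Hive-style folders look like key=value; keep only the value.
    values = [folder.split("=", 1)[-1] for folder in folders]
    return dict(zip(FOLDER_NAMES, values))


def regions_in(paths):
    """Return the sorted, distinct AWS region values found in the paths."""
    return sorted({parse_vpc_flow_log_path(p)["region"] for p in paths})


# Hypothetical sample paths: one non-Hive, one Hive-style.
SAMPLE_PATHS = [
    "s3://example-bucket/AWSLogs/123456789012/vpcflowlogs/us-east-1/"
    "2024/03/01/file1.log.gz",
    "s3://example-bucket/logs/AWSLogs/aws-account-id=123456789012/"
    "aws-service=vpcflowlogs/aws-region=eu-west-1/"
    "year=2024/month=03/day=01/file2.log.gz",
]

if __name__ == "__main__":
    print(regions_in(SAMPLE_PATHS))
```

The resulting values are the ones you can enter in the AWS regions field to restore regions that you deleted earlier.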

Optionally identify all possible AWS region partition keys with a wildcard

If a VPC flow log dataset is associated with a large number of AWS regions and you do not want to take the time to enter every key value into the AWS regions field, you can save time by entering a wildcard symbol (*) into the field instead. The wildcard symbol indicates that all possible key values for the field are applied to the federated index definition.

Note: When you use a wildcard symbol for AWS regions in a federated index definition, you must include a WHERE clause that filters results by pk_region when you invoke that federated index in an sdselect search. See sdselect command WHERE clause operations.

Search your VPC flow log datasets

After you set up federated indexes that map to AWS Glue tables for VPC flow log datasets, you can use the sdselect command to search those datasets. See sdselect command overview.

Delete a federated index

You can delete a federated index that maps to an AWS Glue table for a dataset that you no longer need to search. You can also delete federated indexes when your data scanning entitlements are depleted, to prevent unintentional usage.

Prerequisites

A role on your Splunk Cloud Platform deployment that has the admin_all_objects capability.
A federated index for Federated Search for Amazon S3 that you want to delete.

Steps

On your Splunk Cloud Platform deployment, in Splunk Web, select Settings, then Federation.
On the Federated index tab, identify a federated index that you want to delete.
Select Delete for the index you want to delete.