Aggregate event data using Ingest Processor
Learn how to aggregate event data using Ingest Processor to optimize data flow and reduce log volume by processing partial aggregations in batches.
You can create a pipeline that aggregates your incoming event data to reduce the volume of raw logs being sent to your destination.
For example, when working with network flow record data, you can compute the sum of total bytes transferred from each source IP address. You can then send each aggregation as a single event to your data destination to be indexed.
Data is aggregated using the stats command in an SPL2 pipeline. See the stats command documentation in the SPL2 Search Reference manual for more information.
Differences between ingest-time aggregations and search-time aggregations
Because data streams through Ingest Processor pipelines continuously, the stats command performs partial aggregations in batches. A batch can contain one or many events, depending on the rate of data flow through your pipelines and on your data source configurations. If aggregable data spans multiple batches, each batch is processed independently, and each batch of partial aggregations is routed as events to your pipeline destination. To retrieve a complete, finalized aggregation of your data, you must run a search using the stats command in Splunk platform after your data is indexed. See Processing aggregations at search-time for more information.
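The batching behavior described above can be illustrated with a minimal Python sketch (illustrative only, not Ingest Processor code, with made-up flow events): each batch is aggregated on its own, and a final pass over the partial results, analogous to the search-time stats command, recovers the complete totals.

```python
from collections import defaultdict

def partial_sums(batch, key, value):
    """Aggregate one batch independently, as the pipeline's stats command does."""
    out = defaultdict(int)
    for event in batch:
        out[event[key]] += event[value]
    return dict(out)

# Three independent batches of flow events (illustrative data).
batches = [
    [{"src_ip": "10.0.0.1", "bytes": 500}, {"src_ip": "10.0.0.1", "bytes": 500}],
    [{"src_ip": "10.0.0.1", "bytes": 250}, {"src_ip": "10.0.0.2", "bytes": 100}],
    [{"src_ip": "10.0.0.2", "bytes": 900}],
]

# Each batch is aggregated on its own, so the same key can appear in
# the output of several batches.
partials = [partial_sums(b, "src_ip", "bytes") for b in batches]

# The search-time stats pass merges the partial results into final totals.
final = defaultdict(int)
for p in partials:
    for key, value in p.items():
        final[key] += value

print(dict(final))
# {'10.0.0.1': 1250, '10.0.0.2': 1000}
```

Because the partial sums are merged with another sum, the final totals are identical to aggregating all events at once, which is why the batched output can be finalized at search time.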
The following table shows which optional arguments of the stats command can be used in each context:

| Optional argument | Can be used in an Ingest Processor pipeline | Can be used in Splunk platform search |
|---|---|---|
| all-num | Yes | No |
| batch_id | Yes | No |
| batch_time | Yes | No |
| by-clause | Yes | Yes |
| delim | Yes | No |
| instance_id | Yes | No |
| partitions | Yes | No |
| span | Yes | Yes |
Optimizing aggregation efficiency
Using the stats command in an Ingest Processor pipeline produces partial aggregations in batches, reducing the number of events sent to your indexers. You can make these partial aggregations more efficient to maximize event reduction. Efficiency depends on batch size: larger batches yield greater event reduction. How you configure batch size depends on your data source. See the following table for notes on each source protocol.
| Source protocol | Configuration information |
|---|---|
| HTTP Event Collector (HEC) | Adjust the batch size directly in the upstream HEC client. The maximum batch time span and batch size for an Ingest Processor pipeline aggregation are 4 seconds and 30MB, respectively. Raising the batch timeout or maximum batch size on an upstream HEC client beyond these limits does not further improve aggregation efficiency. |
| S2S | No configuration is available to adjust batch size for aggregation efficiency. |
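The relationship between batch size and event reduction can be sketched in Python (a simplified model with made-up numbers, not pipeline behavior): each aggregated batch emits at most one event per distinct group, so fewer, larger batches emit fewer total events.

```python
import math

def emitted_events(num_events, batch_size, distinct_groups):
    """Events emitted after per-batch aggregation, assuming each batch
    contains events from every distinct group (illustrative worst case)."""
    num_batches = math.ceil(num_events / batch_size)
    # One output event per group per batch, never more than the input count.
    return min(num_batches * distinct_groups, num_events)

# 10,000 incoming events with 10 distinct BY-clause values (made-up numbers).
for batch_size in (10, 100, 1000):
    print(batch_size, emitted_events(10_000, batch_size, 10))
# 10 10000    <- tiny batches: one event per group per batch, no reduction
# 100 1000    <- 100 batches x 10 groups
# 1000 100    <- 10 batches x 10 groups
```

Under this model, increasing the batch size by 10x cuts the emitted event count by 10x until the per-group floor is reached, which is why larger batches are more efficient.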
Managing aggregations
Aggregation patterns
Different types of aggregations can be created with the stats command in an Ingest Processor pipeline depending on how many fields you want your indexed events to retain.
- Summary pattern: Choosing a limited number of fields to aggregate in your pipeline filters out the unselected fields and outputs a distilled summary of your data. See Create a pipeline to aggregate your data for an example of the summary pattern.
- Passthrough pattern: If you want to retain all fields from your events while still reducing the overall number of events being indexed, include every field that you want to keep in the BY clause of your stats command. See below for an example of the passthrough pattern in practice.
For example, suppose the following events enter the pipeline:

| server_name | server_ip | file_requested | bytes_out | sourcetype |
|---|---|---|---|---|
| web-01 | 10.1.2.3 | /index.html | 500 | web |
| web-01 | 10.1.2.3 | /index.html | 500 | web |
| web-02 | 10.1.2.4 | /css/main.css | 1200 | web |
The pipeline aggregates these events using a statement like the following:

```
... | stats sum(bytes_out) AS bytes_out BY server_name, server_ip, file_requested, sourcetype
| rename orig_sourcetype AS sourcetype
| eval _raw=json_delete(_raw, "orig_sourcetype")
```
The pipeline outputs the aggregated results as the following events:

| _raw | server_name | server_ip | file_requested | bytes_out | sourcetype |
|---|---|---|---|---|---|
| {"server_name": "web-01", "server_ip": "10.1.2.3", "file_requested": "/index.html", "bytes_out": 1000} | web-01 | 10.1.2.3 | /index.html | 1000 | web |
| {"server_name": "web-02", "server_ip": "10.1.2.4", "file_requested": "/css/main.css", "bytes_out": 1200} | web-02 | 10.1.2.4 | /css/main.css | 1200 | web |
Processing aggregations at search-time
Once an applied pipeline is aggregating your data in batches and sending those batches to your Splunk platform destination, you can finalize the batched aggregations by running a search in Splunk platform using the stats command. Each statistical function run in an Ingest Processor pipeline has a corresponding SPL1 query that finalizes the aggregation. See the following examples for these search statements.
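The pairing of pipeline function and finalizing search follows from how each statistic merges across batches: partial sums and partial counts are summed, the minimum of partial minimums gives the global minimum, the maximum of partial maximums gives the global maximum, and an average must be rebuilt at search time from a merged sum and a merged count. A minimal Python sketch of the merge rules (illustrative values only, matching the web-01 sample events below split across two batches):

```python
# Partial results from two independently aggregated batches.
partials = [
    {"sum": 500, "count": 1, "min": 500, "max": 500},
    {"sum": 8500, "count": 1, "min": 8500, "max": 8500},
]

merged_sum = sum(p["sum"] for p in partials)      # finalize sum(x) with sum()
merged_count = sum(p["count"] for p in partials)  # finalize count() with sum()
merged_min = min(p["min"] for p in partials)      # finalize min(x) with min()
merged_max = max(p["max"] for p in partials)      # finalize max(x) with max()
merged_avg = merged_sum / merged_count            # avg(x) needs both sum and count

print(merged_sum, merged_count, merged_min, merged_max, merged_avg)
# 9000 2 500 8500 4500.0
```

Note that an average of partial averages would be wrong whenever batches hold different numbers of events, which is why the average example further below indexes a partial sum and a partial count instead.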
The following examples use this set of incoming events:

| server_name | server_ip | file_requested | bytes_out | sourcetype |
|---|---|---|---|---|
| web-01 | 10.1.2.3 | /index.html | 500 | web |
| web-01 | 10.1.2.3 | /images/logo.png | 8500 | web |
| web-02 | 10.1.2.4 | /css/main.css | 1200 | web |
To sum bytes_out for each server in your pipeline:

```
... | stats sum(bytes_out) as bytes_out BY server_name
| eval sourcetype="web:summary"
```
The pipeline sends the following events to be indexed:

| server_name | bytes_out | _raw | sourcetype |
|---|---|---|---|
| web-01 | 9000 | {"server_name": "web-01", "bytes_out":9000} | web:summary |
| web-02 | 1200 | {"server_name": "web-02", "bytes_out":1200} | web:summary |
To finalize the sum at search time, sum the partial sums:

```
index=<myindex> sourcetype="web:summary" | stats sum(bytes_out) BY server_name
```
To count events for each server in your pipeline:

```
... | stats count() as event_count BY server_name
| eval sourcetype="web:summary"
```
The pipeline sends the following events to be indexed:

| server_name | event_count | _raw | sourcetype |
|---|---|---|---|
| web-01 | 2 | {"server_name": "web-01", "event_count": 2} | web:summary |
| web-02 | 1 | {"server_name": "web-02", "event_count": 1} | web:summary |
To finalize the count at search time, sum the partial counts:

```
index=<myindex> sourcetype="web:summary" | stats sum(event_count) BY server_name
```
To get the minimum bytes_out for each server in your pipeline:

```
... | stats min(bytes_out) as min_bytes_out BY server_name
| eval sourcetype="web:summary"
```

The pipeline sends the following events to be indexed:

| server_name | min_bytes_out | _raw | sourcetype |
|---|---|---|---|
| web-01 | 500 | {"server_name": "web-01", "min_bytes_out":500} | web:summary |
| web-02 | 1200 | {"server_name": "web-02", "min_bytes_out":1200} | web:summary |

To finalize the minimum at search time, take the minimum of the partial minimums:

```
index=<myindex> sourcetype="web:summary" | stats min(min_bytes_out) BY server_name
```
To get the maximum bytes_out for each server in your pipeline:

```
... | stats max(bytes_out) as max_bytes_out BY server_name
| eval sourcetype="web:summary"
```
The pipeline sends the following events to be indexed:

| server_name | max_bytes_out | _raw | sourcetype |
|---|---|---|---|
| web-01 | 8500 | {"server_name": "web-01", "max_bytes_out":8500} | web:summary |
| web-02 | 1200 | {"server_name": "web-02", "max_bytes_out":1200} | web:summary |
To finalize the maximum at search time, take the maximum of the partial maximums:

```
index=<myindex> sourcetype="web:summary" | stats max(max_bytes_out) BY server_name
```
Averages cannot be finalized from partial averages, so emit a partial sum and a partial count from your pipeline:

```
... | stats sum(bytes_out) as bytes_out, count(bytes_out) as event_count BY server_name
| eval sourcetype="web:summary"
```
The pipeline sends the following events to be indexed:

| server_name | bytes_out | event_count | _raw | sourcetype |
|---|---|---|---|---|
| web-01 | 9000 | 2 | {"server_name": "web-01", "bytes_out":9000, "event_count":2} | web:summary |
| web-02 | 1200 | 1 | {"server_name": "web-02", "bytes_out":1200, "event_count":1} | web:summary |
At search time, merge the sums and counts, then divide to finalize the average:

```
index=<myindex> sourcetype="web:summary"
| stats sum(bytes_out) AS bytes_out, sum(event_count) AS event_count BY server_name
| eval bytes_avg=bytes_out/event_count
```
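The arithmetic behind this average example can be checked with a short Python sketch (illustrative only, reusing the bytes_out values from the sample events above): the indexed partial sum and event count per server are merged and divided, exactly as the eval step does.

```python
# Raw bytes_out values per server, taken from the sample input events.
raw = {"web-01": [500, 8500], "web-02": [1200]}

summary = {}
for server, values in raw.items():
    # The pipeline indexes the partial sum and event count per server...
    bytes_out, event_count = sum(values), len(values)
    # ...and the search divides the merged sum by the merged count.
    summary[server] = (bytes_out, event_count, bytes_out / event_count)

print(summary)
# {'web-01': (9000, 2, 4500.0), 'web-02': (1200, 1, 1200.0)}
```

Averaging the raw events directly gives the same result, confirming that no information needed for the average is lost by indexing only the partial sum and count.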
To aggregate by time span as well as by field, suppose the following events enter the pipeline:

| _time | server_name | server_ip | file_requested | bytes_out | sourcetype |
|---|---|---|---|---|---|
| 2025-01-01 12:00:05 | web-01 | 10.1.2.3 | /index.html | 500 | web |
| 2025-01-01 12:01:22 | web-01 | 10.1.2.3 | /images/logo.png | 8500 | web |
| 2025-01-01 12:00:28 | web-02 | 10.1.2.4 | /css/main.css | 1200 | web |
Aggregate bytes_out for each server over one-minute spans in your pipeline:

```
… | stats sum(bytes_out) as bytes_out BY span(_time, 1m), server_name
| eval sourcetype="web:summary"
```
The pipeline sends the following events to be indexed:

| _time | server_name | bytes_out | _raw | sourcetype |
|---|---|---|---|---|
| 2025-01-01 12:00:00 | web-01 | 500 | {"server_name": "web-01", "bytes_out": 500, "_time":1735732800} | web:summary |
| 2025-01-01 12:01:00 | web-01 | 8500 | {"server_name": "web-01", "bytes_out": 8500, "_time":1735732860} | web:summary |
| 2025-01-01 12:00:00 | web-02 | 1200 | {"server_name": "web-02", "bytes_out": 1200, "_time":1735732800} | web:summary |
To finalize at search time, sum the partial sums within each time bucket:

```
index=<myindex> sourcetype="web:summary" | stats sum(bytes_out) BY _time, server_name
```
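The span bucketing in this example can be sketched in Python (an illustrative model, not SPL2 internals): flooring each event's epoch timestamp to the start of its minute reproduces the one-minute buckets shown in the output table above.

```python
from datetime import datetime, timezone

def minute_bucket(epoch):
    """Floor an epoch timestamp to the start of its minute, like span(_time, 1m)."""
    return epoch - (epoch % 60)

# The sample events above as (epoch, server, bytes_out) tuples.
events = [
    (1735732805, "web-01", 500),   # 2025-01-01 12:00:05
    (1735732882, "web-01", 8500),  # 2025-01-01 12:01:22
    (1735732828, "web-02", 1200),  # 2025-01-01 12:00:28
]

# Sum bytes_out grouped by (minute bucket, server), like BY span(_time, 1m), server_name.
totals = {}
for epoch, server, bytes_out in events:
    key = (minute_bucket(epoch), server)
    totals[key] = totals.get(key, 0) + bytes_out

for (bucket, server), total in sorted(totals.items()):
    stamp = datetime.fromtimestamp(bucket, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    print(stamp, server, total)
# 2025-01-01 12:00:00 web-01 500
# 2025-01-01 12:00:00 web-02 1200
# 2025-01-01 12:01:00 web-01 8500
```

Because each emitted event carries its bucket start time in _time, the finalizing search can group BY _time and still merge partial sums that fall in the same minute.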