Aggregate event data using Edge Processor
Learn how to aggregate event data using Edge Processor to optimize data flow and reduce log volume by processing partial aggregations.
You can create a pipeline that aggregates your incoming event data to reduce the volume of raw logs being sent to your destination.
For example, assume that your Edge Processor receives 20 network flow logs that are emitted by 5 servers in your network, and that each log includes the number of bytes sent out from a given server. You can aggregate the data so that each log shows the sum of the bytes sent out by each server, potentially reducing the number of logs sent to your data destination from 20 to 5.
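The reduction described above can be sketched in Python (a toy illustration with randomly generated events; the server names and field names are hypothetical, not Edge Processor output):

```python
import random

# 20 network flow logs emitted by 5 servers, each with a bytes_out value.
random.seed(0)
servers = [f"server-{i}" for i in range(1, 6)]
logs = [{"server": random.choice(servers), "bytes_out": random.randint(100, 1000)}
        for _ in range(20)]

# Aggregate: one summed record per server, analogous to
# `stats sum(bytes_out) BY server`.
totals = {}
for log in logs:
    totals[log["server"]] = totals.get(log["server"], 0) + log["bytes_out"]

# The 20 raw logs collapse into at most 5 aggregated records,
# and no bytes_out information is lost in the sums.
print(len(logs), "->", len(totals))
```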
To aggregate data, you can use the stats SPL2 command in your pipeline. However, be aware that the stats command works differently in Edge Processor pipelines compared to when you use it in Splunk platform searches.
See the following sections on this page:

- For detailed information about aggregations in Edge Processor pipelines and how they differ from search-time aggregations, see How Edge Processors aggregate streaming data and Differences between ingest-time aggregations and search-time aggregations.
- For instructions on configuring aggregations in your pipelines, see Create a pipeline to aggregate your data.
- For examples of various ingest-time and search-time aggregation configurations, see Managing aggregations.

For comprehensive reference information about the stats command, see stats command overview in the SPL2 Search Reference.
How Edge Processors aggregate streaming data
Edge Processors aggregate continuously streaming data by holding and aggregating the incoming events within a given state window.
Edge Processors work with incoming data that is continuously streaming through the applied pipelines. To aggregate this data, each Edge Processor instance holds the incoming events and aggregates the collected events until specific conditions are met, at which point the instance emits the result to the next processing action in the pipeline. This process is repeated as more events flow through the pipeline.
The interval between when the Edge Processor instance starts holding events and when it performs the aggregation is known as a "state window". The state window determines the scope of the data that an Edge Processor instance includes in each aggregation, as well as the rate at which the instance emits aggregated data.
The number of incoming events that each state window contains is determined by several factors. The following table summarizes these determining factors and the related settings that you can configure, if applicable:
| Determining factor | Configuration | For more information |
|---|---|---|
| The maximum length of time for which the Edge Processor instance is permitted to hold and aggregate events before emitting the result. | The @maxdelay setting in the stats command in the Edge Processor pipeline. | See Create a pipeline to aggregate your data. |
| The maximum amount of disk space that the Edge Processor instance is permitted to use to hold and aggregate events before emitting the result. | The @maxdisk setting in the stats command in the Edge Processor pipeline. | See Create a pipeline to aggregate your data. |
Aggregations reduce the number of events that your Edge Processors send to your data destinations. You can improve the efficiency of these aggregations and maximize event reduction by including more events in each aggregation. For example, you can increase the @maxdelay and @maxdisk settings in your Edge Processor pipeline in order to increase the size of the state window and include more events per aggregation. However, be aware that increasing those values can cause the Edge Processor to take longer to emit each aggregation.
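The state-window mechanics described above can be sketched in Python. This is a simplified illustration, not Edge Processor internals: a single event-count limit stands in for the @maxdelay (time) and @maxdisk (disk space) conditions that close a real state window.

```python
from collections import defaultdict

class StateWindow:
    """Toy model of an Edge Processor state window: hold and aggregate
    incoming events, then emit the partial result when a limit is hit."""

    def __init__(self, max_events=3):
        # A real Edge Processor closes the window on @maxdelay or @maxdisk;
        # here a simple event-count limit plays that role.
        self.max_events = max_events
        self.sums = defaultdict(int)
        self.count = 0
        self.emitted = []

    def ingest(self, server, bytes_out):
        self.sums[server] += bytes_out  # aggregate while holding events
        self.count += 1
        if self.count >= self.max_events:
            self.flush()

    def flush(self):
        if self.count:
            # Emit one partial aggregate for the window, then reset it.
            self.emitted.append(dict(self.sums))
            self.sums.clear()
            self.count = 0

window = StateWindow(max_events=3)
for event in [("web-01", 500), ("web-01", 500), ("web-02", 1200), ("web-01", 300)]:
    window.ingest(*event)
window.flush()  # emit whatever the open window still holds
print(window.emitted)  # two partial aggregates, not one final result
```

A larger `max_events` (like a larger @maxdelay or @maxdisk) would merge more events into each emitted aggregate, at the cost of a longer wait before the result is emitted.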
Differences between ingest-time aggregations and search-time aggregations
Learn how ingest-time aggregations in Edge Processor pipelines differ from search-time aggregations in the Splunk platform, including their data scope and supported stats command options.
Ingest-time aggregations apply to a different scope of data compared to search-time aggregations. Additionally, the stats SPL2 command supports different configuration options depending on whether you are using it in a pipeline or a search.
Scope of the data being aggregated
Due to the continuous nature of streaming data, each aggregation that the Edge Processor performs is scoped to the events that are processed by a single instance within a specific state window. Because these aggregations are based on a subset of all the incoming events, they are partial aggregations. By contrast, the aggregations performed through searches in the Splunk platform are finalized aggregations that are based on a complete set of indexed events.
If you want to produce a finalized aggregation of the data that is handled by your Edge Processors, you need to aggregate your data again at the data destination. For example, you would start by configuring an Edge Processor pipeline to perform partial aggregations and send the results to the Splunk platform. Then, you would run a search using the stats command in the Splunk platform to return an aggregated result that is based on all the indexed events.
For more information about finalizing your aggregations through searches, see Processing aggregations at search-time.
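The two-stage flow can be illustrated in Python (hypothetical data; each inner list stands for the events that one state window holds):

```python
# Stage 1: each state window produces a partial sum per server, which is
# what the pipeline's `stats sum(bytes_out) BY server_name` emits.
windows = [
    [("web-01", 500), ("web-02", 1200)],   # window 1
    [("web-01", 500)],                     # window 2
]

partials = []
for events in windows:
    sums = {}
    for server, bytes_out in events:
        sums[server] = sums.get(server, 0) + bytes_out
    partials.append(sums)

# Stage 2: finalize at the destination by summing the partial sums,
# like the search-time `stats sum(bytes_out) BY server_name`.
final = {}
for partial in partials:
    for server, subtotal in partial.items():
        final[server] = final.get(server, 0) + subtotal

print(final)  # {'web-01': 1000, 'web-02': 1200}
```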
Supported configuration options for the stats command
The stats command supports different configuration options depending on whether you are using it in an Edge Processor pipeline or in a Splunk platform search:
- In pipelines, the stats command supports the @maxdelay and @maxdisk annotations. You can configure these annotations to adjust the size of the state window, which determines the scope of the data included in each aggregation.
- Pipelines and searches support different subsets of the optional arguments for the stats command.
The following table summarizes which optional arguments are available for use in pipelines as opposed to searches:
| Optional argument | Can be used in an Edge Processor pipeline | Can be used in a Splunk platform search |
|---|---|---|
| all-num | Yes | No |
| by-clause | Yes | Yes |
| delim | Yes | No |
| mode | Yes | No |
| partitions | Yes | No |
| prestats | Yes | No |
| span | Yes | Yes |
Additionally, pipelines and searches support different statistical functions. For more information on the statistical functions that can be used with the stats command in a pipeline, see SPL2 statistical functions for Edge Processor pipelines. For example, although the avg function is not supported in Edge Processor pipelines, you can still calculate an aggregation with an average of values by using the sum and count functions in your pipeline, and then dividing the sum by the count at search time. See Processing aggregations at search-time for a full pipeline and search statement example.
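The sum-and-count workaround can be sketched in Python (hypothetical partial results; in the real workflow the division happens in a search-time eval):

```python
# Partial aggregates emitted by the pipeline: sum(bytes_out) and count()
# per server, per state window. avg itself is never computed at ingest time.
partials = [
    {"server": "web-01", "bytes_sum": 9000, "count": 2},
    {"server": "web-01", "bytes_sum": 3000, "count": 1},
]

# Finalize: sum the sums, sum the counts, then divide.
total_bytes = sum(p["bytes_sum"] for p in partials)
total_count = sum(p["count"] for p in partials)
avg_bytes = total_bytes / total_count
print(avg_bytes)  # 4000.0

# Averaging the partial averages instead would be wrong:
wrong = (9000 / 2 + 3000 / 1) / 2  # 3750.0, not the true average
```

This also shows why avg cannot simply be computed per state window: averages of partial averages do not equal the average over all events, while sums and counts combine losslessly.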
Create a pipeline to aggregate your data
Create a data pipeline to aggregate and process event data using the pipeline editor.
Managing aggregations
Learn more about common aggregation patterns and how to finalize the partial aggregations from Edge Processor pipelines.
Aggregation patterns
Use the stats command to aggregate data with either a summary pattern that reduces both events and fields or a passthrough pattern that reduces events while preserving the original schema.
- Summary pattern: Reduce both the number of events and the number of fields in the events. This aggregation pattern reduces overall data volume as well as the size of the resulting events, but also changes the event schema. When aggregating data using the summary pattern, make sure to assign a different sourcetype value to the aggregated events so that events with different schemas are not categorized under the same source type.
- Passthrough pattern: Reduce the number of events while retaining all the original event fields. This aggregation pattern reduces overall data volume without changing the event schema. When aggregating data using the passthrough pattern, there is no need to change the sourcetype value of the aggregated events.
The following examples demonstrate how to configure an Edge Processor pipeline to aggregate data using the summary pattern or the passthrough pattern:
Both examples use this sample data as input:
| server_name | server_ip | file_requested | bytes_out | sourcetype |
|---|---|---|---|---|
| web-01 | 10.1.2.3 | /index.html | 500 | web |
| web-01 | 10.1.2.3 | /index.html | 500 | web |
| web-02 | 10.1.2.4 | /css/main.css | 1200 | web |
Example: Summary pattern
The following is the SPL2 syntax for a stats command in a pipeline that aggregates data using the summary pattern:
| stats mode=summary sum(bytes_out) AS bytes_out BY server_name
The aggregated results look like this:
| server_name | bytes_out | _raw |
|---|---|---|
| web-01 | 1000 | {"server_name": "web-01", "bytes_out": 1000} |
| web-02 | 1200 | {"server_name": "web-02", "bytes_out": 1200} |
Example: Passthrough pattern
The following is the SPL2 syntax for a stats command in a pipeline that aggregates data using the passthrough pattern:
| stats mode=passthrough sum(bytes_out) AS bytes_out BY server_name, server_ip, file_requested, sourcetype
The aggregated results look like this:
| _raw | server_name | server_ip | file_requested | bytes_out | sourcetype |
|---|---|---|---|---|---|
| {"server_name": "web-01", "server_ip": "10.1.2.3", "file_requested": "/index.html", "bytes_out": 1000, "sourcetype": "web"} | web-01 | 10.1.2.3 | /index.html | 1000 | web |
| {"server_name": "web-02", "server_ip": "10.1.2.4", "file_requested": "/css/main.css", "bytes_out": 1200, "sourcetype": "web"} | web-02 | 10.1.2.4 | /css/main.css | 1200 | web |
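The difference between the two patterns comes down to the grouping key, which can be sketched in Python using the sample data above (a conceptual illustration, not Edge Processor internals):

```python
events = [
    {"server_name": "web-01", "server_ip": "10.1.2.3",
     "file_requested": "/index.html", "bytes_out": 500, "sourcetype": "web"},
    {"server_name": "web-01", "server_ip": "10.1.2.3",
     "file_requested": "/index.html", "bytes_out": 500, "sourcetype": "web"},
    {"server_name": "web-02", "server_ip": "10.1.2.4",
     "file_requested": "/css/main.css", "bytes_out": 1200, "sourcetype": "web"},
]

def aggregate(events, by):
    """Group by the given fields and sum bytes_out,
    like `stats sum(bytes_out) BY ...`."""
    groups = {}
    for e in events:
        key = tuple(e[f] for f in by)
        groups[key] = groups.get(key, 0) + e["bytes_out"]
    return [dict(zip(by, k)) | {"bytes_out": v} for k, v in groups.items()]

# Summary pattern: group by server_name only; the other fields are dropped,
# so the event schema changes.
summary = aggregate(events, ["server_name"])

# Passthrough pattern: group by every original field, so the schema survives.
passthrough = aggregate(
    events, ["server_name", "server_ip", "file_requested", "sourcetype"])
```

In both cases the two duplicate web-01 events collapse into one record; only the passthrough result still carries server_ip, file_requested, and sourcetype.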
Processing aggregations at search-time
Use the Search & Reporting app to finalize partial aggregations from Edge Processor pipelines.
If you configure your pipeline to send the partially aggregated data to your Splunk platform deployment, you can finalize those aggregations by running a search in the Search & Reporting app using the stats command.
The following examples demonstrate how to write SPL searches that finalize aggregations from Edge Processor pipelines:
All examples use the following sample data as input:
| _time | server_name | server_ip | file_requested | bytes_out | sourcetype |
|---|---|---|---|---|---|
| 2025-01-01 12:00:05 | web-01 | 10.1.2.3 | /index.html | 500 | web |
| 2025-01-01 12:01:22 | web-01 | 10.1.2.3 | /images/logo.png | 8500 | web |
| 2025-01-01 12:00:28 | web-02 | 10.1.2.4 | /css/main.css | 1200 | web |
Example: Sum
The SPL2 statement of the Edge Processor pipeline is as follows:
$pipeline = | from $source
| stats mode=summary sum(bytes_out) as bytes_out BY server_name
| eval sourcetype="web:summary"
The aggregated results look like this:
| server_name | bytes_out | _raw | sourcetype |
|---|---|---|---|
| web-01 | 9000 | {"server_name": "web-01", "bytes_out": 9000} | web:summary |
| web-02 | 1200 | {"server_name": "web-02", "bytes_out": 1200} | web:summary |
To finalize the aggregation, run the following SPL search, where my_index is the name of an index:
index=<my_index> sourcetype="web:summary" | stats sum(bytes_out) BY server_name
Example: Count
The SPL2 statement of the Edge Processor pipeline is as follows:
$pipeline = | from $source
| stats mode=summary count() as event_count BY server_name
| eval sourcetype="web:summary"
The aggregated results look like this:
| server_name | event_count | _raw | sourcetype |
|---|---|---|---|
| web-01 | 2 | {"server_name": "web-01", "event_count": 2} | web:summary |
| web-02 | 1 | {"server_name": "web-02", "event_count": 1} | web:summary |
To finalize the aggregation, run the following SPL search, where my_index is the name of an index:
index=<my_index> sourcetype="web:summary" | stats sum(event_count) BY server_name
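Note that the finalizing search sums the partial counts rather than counting again; counting the aggregated events would return the number of partial results, not the number of original events. A small Python check of this distinction (hypothetical partial counts):

```python
# Partial counts emitted by two state windows for the same server.
partial_counts = [2, 3]

# Correct finalization: sum(event_count) gives the total original events.
total_events = sum(partial_counts)        # 5

# Incorrect finalization: count() over the partial results would return
# the number of partial aggregates instead.
number_of_partials = len(partial_counts)  # 2

print(total_events, number_of_partials)
```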
Example: Min
The SPL2 statement of the Edge Processor pipeline is as follows:
$pipeline = | from $source
| stats mode=summary min(bytes_out) as min_bytes_out BY server_name
| eval sourcetype="web:summary"
The aggregated results look like this:
| server_name | min_bytes_out | _raw | sourcetype |
|---|---|---|---|
| web-01 | 500 | {"server_name": "web-01", "min_bytes_out": 500} | web:summary |
| web-02 | 1200 | {"server_name": "web-02", "min_bytes_out": 1200} | web:summary |
To finalize the aggregation, run the following SPL search, where my_index is the name of an index:
index=<my_index> sourcetype="web:summary" | stats min(min_bytes_out) BY server_name
Example: Max
The SPL2 statement of the Edge Processor pipeline is as follows:
$pipeline = | from $source
| stats mode=summary max(bytes_out) as max_bytes_out BY server_name
| eval sourcetype="web:summary"
The aggregated results look like this:
| server_name | max_bytes_out | _raw | sourcetype |
|---|---|---|---|
| web-01 | 8500 | {"server_name": "web-01", "max_bytes_out": 8500} | web:summary |
| web-02 | 1200 | {"server_name": "web-02", "max_bytes_out": 1200} | web:summary |
To finalize the aggregation, run the following SPL search, where my_index is the name of an index:
index=<my_index> sourcetype="web:summary" | stats max(max_bytes_out) BY server_name
Example: Avg
The avg statistical function is not supported in Edge Processor pipelines. To calculate an aggregation with an average of values, start by using the sum and count functions in your pipeline. Then, use a search to divide the finalized sum by the finalized count.
The SPL2 statement of the Edge Processor pipeline is as follows:
$pipeline = | from $source
| stats mode=summary sum(bytes_out) as bytes_out, count(bytes_out) as event_count BY server_name
| eval sourcetype="web:summary"
The aggregated results look like this:
| server_name | bytes_out | event_count | _raw | sourcetype |
|---|---|---|---|---|
| web-01 | 9000 | 2 | {"server_name": "web-01", "bytes_out": 9000, "event_count": 2} | web:summary |
| web-02 | 1200 | 1 | {"server_name": "web-02", "bytes_out": 1200, "event_count": 1} | web:summary |
To finalize the aggregation, run the following SPL search, where my_index is the name of an index:
index=<my_index> sourcetype="web:summary"
| stats sum(bytes_out) AS bytes_out, sum(event_count) AS event_count BY server_name
| eval bytes_avg=bytes_out/event_count
Example: Span
You can use the span function to group aggregations by time span. For example, you can group aggregations into 1-minute time spans based on the event timestamps stored in the _time field.
The SPL2 statement of the Edge Processor pipeline is as follows:
$pipeline = | from $source
| stats mode=summary sum(bytes_out) as bytes_out BY span(_time, 1m), server_name
| eval sourcetype="web:summary"
The aggregated results look like this:
| _time | server_name | bytes_out | _raw | sourcetype |
|---|---|---|---|---|
| 2025-01-01 12:00:00 | web-01 | 500 | {"server_name": "web-01", "bytes_out": 500, "_time":1735732800} | web:summary |
| 2025-01-01 12:01:00 | web-01 | 8500 | {"server_name": "web-01", "bytes_out": 8500, "_time":1735732860} | web:summary |
| 2025-01-01 12:00:00 | web-02 | 1200 | {"server_name": "web-02", "bytes_out": 1200, "_time":1735732800} | web:summary |
To finalize the aggregation, run the following SPL search, where my_index is the name of an index:
index=<my_index> sourcetype="web:summary" | stats sum(bytes_out) BY _time, server_name
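The span(_time, 1m) grouping can be sketched in Python: each timestamp is truncated to the start of its 1-minute bucket before grouping (the epoch values correspond to the sample timestamps above; this is a conceptual illustration, not Edge Processor internals):

```python
# Events as (epoch_time, server_name, bytes_out); the times match the
# sample data (2025-01-01 12:00:05, 12:01:22, and 12:00:28 UTC).
events = [
    (1735732805, "web-01", 500),
    (1735732882, "web-01", 8500),
    (1735732828, "web-02", 1200),
]

SPAN = 60  # span(_time, 1m): 1-minute buckets

sums = {}
for t, server, bytes_out in events:
    bucket = t - (t % SPAN)  # truncate to the start of the span
    key = (bucket, server)
    sums[key] = sums.get(key, 0) + bytes_out

for (bucket, server), total in sorted(sums.items()):
    print(bucket, server, total)
```

The two web-01 events land in different 1-minute buckets, so they stay separate, matching the aggregated results table above.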