Aggregate event data using Edge Processor

Learn how to aggregate event data using Edge Processor to optimize data flow and reduce log volume by processing partial aggregations.

You can create a pipeline that aggregates your incoming event data to reduce the volume of raw logs being sent to your destination.

For example, assume that your Edge Processor receives 20 network flow logs that are emitted by 5 servers in your network, and that each log includes the number of bytes sent out from a given server. You can aggregate the data so that each log shows the sum of the bytes sent out by each server, potentially reducing the number of logs sent to your data destination from 20 to 5.
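The reduction in the example above can be sketched in Python. This is purely illustrative (the log data and server names are made up); it shows how summing per server collapses 20 logs into 5 aggregated records:

```python
from collections import defaultdict

# Hypothetical flow logs: (server, bytes_out). 20 logs from 5 servers.
logs = [(f"server-{i % 5}", 100) for i in range(20)]

# Aggregate: one record per server with the summed byte count.
totals = defaultdict(int)
for server, bytes_out in logs:
    totals[server] += bytes_out

print(len(logs), "->", len(totals))  # 20 -> 5
```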

To aggregate data, you can use the stats SPL2 command in your pipeline. However, be aware that the stats command works differently in Edge Processor pipelines compared to when you use it in Splunk platform searches.

For comprehensive reference information about the stats command, see stats command overview in the SPL2 Search Reference.

How Edge Processors aggregate streaming data

Edge Processors aggregate continuously streaming data by holding and aggregating the incoming events within a given state window.

Edge Processors work with incoming data that is continuously streaming through the applied pipelines. To aggregate this data, each Edge Processor instance holds the incoming events and aggregates the collected events until specific conditions are met, at which point the instance emits the result to the next processing action in the pipeline. This process is repeated as more events flow through the pipeline.

The interval between when the Edge Processor instance starts holding events and when it performs the aggregation is known as a "state window". The state window determines the scope of the data that an Edge Processor instance includes in each aggregation, as well as the rate at which the instance emits aggregated data.

The number of incoming events that each state window contains is determined by several factors. The following table summarizes these determining factors and the related settings that you can configure, if applicable:

Determining factor Configuration For more information
The maximum length of time for which the Edge Processor instance is permitted to hold and aggregate events before emitting the result. The @maxdelay setting in the stats command in the Edge Processor pipeline. See Create a pipeline to aggregate your data.
The maximum amount of disk space that the Edge Processor instance is permitted to use to hold and aggregate events before emitting the result. The @maxdisk setting in the stats command in the Edge Processor pipeline. See Create a pipeline to aggregate your data.

Aggregations reduce the number of events that your Edge Processors send to your data destinations. You can improve the efficiency of these aggregations and maximize event reduction by including more events in each aggregation. For example, you can increase the @maxdelay and @maxdisk settings in your Edge Processor pipeline in order to increase the size of the state window and include more events per aggregation. However, be aware that increasing those values can cause the Edge Processor to take longer to emit each aggregation.
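The state-window behavior described above can be modeled with a small Python sketch. This is a toy illustration only, not how Edge Processor is actually implemented: events are buffered and partial sums are emitted once either the time limit (analogous to @maxdelay) or the size budget (analogous to @maxdisk) is reached.

```python
import json
from collections import defaultdict

class StateWindow:
    """Toy model of a state window: hold incoming events and emit a
    partial aggregation when either the time or size limit is hit."""

    def __init__(self, max_delay_s=30, max_bytes=1_000_000):
        self.max_delay_s = max_delay_s
        self.max_bytes = max_bytes
        self.reset(start_time=0)

    def reset(self, start_time):
        self.start_time = start_time
        self.buffered_bytes = 0
        self.sums = defaultdict(int)

    def add(self, event, now):
        emitted = None
        # Flush if the window has been open longer than max_delay_s.
        if now - self.start_time >= self.max_delay_s and self.sums:
            emitted = dict(self.sums)
            self.reset(start_time=now)
        self.sums[event["server_name"]] += event["bytes_out"]
        self.buffered_bytes += len(json.dumps(event))
        # Flush if the buffered events exceed the size budget.
        if self.buffered_bytes >= self.max_bytes:
            emitted = dict(self.sums)
            self.reset(start_time=now)
        return emitted  # partial aggregation, or None if still buffering

# Feed three events; the third arrives after the 30-second delay and
# triggers emission of the partial sums buffered so far.
w = StateWindow(max_delay_s=30)
w.add({"server_name": "web-01", "bytes_out": 500}, now=0)
w.add({"server_name": "web-01", "bytes_out": 500}, now=10)
partial = w.add({"server_name": "web-02", "bytes_out": 100}, now=35)
```

Raising `max_delay_s` or `max_bytes` in this sketch widens the window, which mirrors how increasing @maxdelay and @maxdisk includes more events per aggregation at the cost of slower emission.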

Differences between ingest-time aggregations and search-time aggregations

Learn how ingest-time aggregations in Edge Processor pipelines differ from search-time aggregations in the Splunk platform, including their data scope and supported stats command options.

Ingest-time aggregations apply to a different scope of data compared to search-time aggregations. Additionally, the stats SPL2 command supports different configuration options depending on whether you are using it in a pipeline or a search.

Scope of the data being aggregated

Due to the continuous nature of streaming data, each aggregation that the Edge Processor performs is scoped to the events that are processed by a single instance within a specific state window. Because these aggregations are based on a subset of all the incoming events, they are partial aggregations. By contrast, the aggregations performed through searches in the Splunk platform are finalized aggregations that are based on a complete set of indexed events.

If you want to produce a finalized aggregation of the data that is handled by your Edge Processors, you need to aggregate your data again at the data destination. For example, you would start by configuring an Edge Processor pipeline to perform partial aggregations and send the results to the Splunk platform. Then, you would run a search using the stats command in the Splunk platform to return an aggregated result that is based on all the indexed events.

For more information about finalizing your aggregations through searches, see Processing aggregations at search-time.
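The two-stage process above can be sketched in Python. The partial records below are hypothetical, shaped like the summary-mode output shown later on this page; the second pass plays the role of the search-time aggregation:

```python
from collections import defaultdict

# Partial aggregations emitted by two hypothetical state windows.
partials = [
    {"server_name": "web-01", "bytes_out": 500},   # window 1
    {"server_name": "web-01", "bytes_out": 8500},  # window 2
    {"server_name": "web-02", "bytes_out": 1200},  # window 2
]

# "Search-time" pass: summing the partial sums yields the finalized sum,
# the same result a single aggregation over all raw events would give.
final = defaultdict(int)
for p in partials:
    final[p["server_name"]] += p["bytes_out"]

assert final == {"web-01": 9000, "web-02": 1200}
```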

Supported configuration options for the stats command

The stats command supports different configuration options depending on whether you are using it in an Edge Processor pipeline or in a Splunk platform search:

  • In pipelines, the stats command supports the @maxdelay and @maxdisk annotations. You can configure these annotations to adjust the size of the state window, which determines the scope of the data included in each aggregation.

  • Pipelines and searches support different subsets of the optional arguments for the stats command.

The following table summarizes which optional arguments are available for use in pipelines as opposed to searches:

Optional argument Can be used in an Edge Processor pipeline Can be used in Splunk platform search
all-num   Yes
by-clause Yes Yes
delim   Yes
mode Yes  
partitions   Yes
prestats Yes  
span Yes Yes

Additionally, pipelines and searches support different statistical functions. For more information on the statistical functions that can be used with the stats command in a pipeline, see SPL2 statistical functions for Edge Processor Pipelines.

Note: While the avg function is not supported in Edge Processor pipelines, you can still calculate an aggregation with an average of values by using the sum and count functions in your pipeline, and then dividing the sum by the count at search time. See Processing aggregations at search-time for a full pipeline and search statement example.
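The sum-and-count workaround can be verified with a short Python sketch (the partial records are hypothetical): carrying partial sums and counts through the pipeline and dividing at the end yields the true average.

```python
from collections import defaultdict

# Partial (sum, count) pairs per server, as a pipeline using the
# sum and count functions would emit them across state windows.
partials = [
    {"server": "web-01", "bytes_out": 500, "event_count": 1},
    {"server": "web-01", "bytes_out": 8500, "event_count": 1},
    {"server": "web-02", "bytes_out": 1200, "event_count": 1},
]

totals = defaultdict(lambda: [0, 0])
for p in partials:
    totals[p["server"]][0] += p["bytes_out"]
    totals[p["server"]][1] += p["event_count"]

# Dividing the finalized sum by the finalized count gives the average.
averages = {s: b / n for s, (b, n) in totals.items()}
```

Note that averaging the partial averages directly would be wrong whenever windows hold different numbers of events, which is why the sum and count must be carried separately.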

Create a pipeline to aggregate your data

Create a data pipeline to aggregate and process event data using the pipeline editor.

To create a pipeline that aggregates the data being processed by each Edge Processor instance, use the Summarize data action in the pipeline editor to specify the fields you want to summarize.
  1. Navigate to the Pipelines page, select New pipeline, and then select Edge Processor pipeline.
  2. Follow the on-screen instructions to define a partition, optionally enter sample data, and select a data destination. For detailed instructions, see Create pipelines for Edge Processors.

    Be aware of the following when creating a pipeline for aggregations:

    • Your sample data must contain accurate examples of the values that you want to summarize into aggregations. For example, the following sample events represent requested files from a server and contain aggregable data:

      server_name server_ip file_requested bytes_out sourcetype
      web-01 10.1.2.3 /index.html 500 web
      web-01 10.1.2.3 /images/logo.png 8500 web
      web-02 10.1.2.4 /css/main.css 1200 web
    • On the Select a destination page, if you select a Splunk platform destination, you can configure index routing. As a best practice, avoid sending both non-aggregated and aggregated data to the same index.

    After you complete the on-screen instructions, the pipeline editor displays the SPL2 statement for your pipeline.
  3. Select the plus icon in the Actions section and then select Summarize data.
    The Summarize data dialog box opens. By default, the Aggregations section shows an unconfigured aggregation.
  4. (Optional) Set the Maximum delay option to the maximum length of time for which the Edge Processor can hold and aggregate incoming events before it emits the result.
  5. (Optional) Set the Maximum disk usage option to the maximum amount of disk space that the Edge Processor can use to hold and aggregate incoming events before it emits the result.
  6. (Optional) If you plan to group your aggregations by the host, source, sourcetype, or index metadata fields, then set Output mode to one of the following:
    Output mode name Description
    Summary mode The Edge Processor adds the prefix orig_ to the names of those metadata fields, so that you can use them to differentiate aggregated data from non-aggregated data.
    Passthrough mode The Edge Processor does not change the names of those metadata fields.
  7. In the Aggregations section, do the following:
    1. Set the Field option to the name of the event field that you want to aggregate.
    2. Set the Calculation option to the type of calculation that you want to perform on the specified field.
    3. (Optional) To specify the name of the event field that stores the aggregated data, select the Edit alias icon and then enter the desired field name.
    4. (Optional) You can configure multiple aggregations by selecting the Add aggregations icon and then repeating steps 7a-c.
    For example, using the sample data shown in step 2, you can calculate the sum of the values in the bytes_out field by setting the Field option to bytes_out and the Calculation option to Sum. To keep the name of the aggregated field as bytes_out instead of changing it to sum(bytes_out), set the alias of the aggregation to bytes_out.
  8. (Optional) In the Group by section, you can specify one or more fields to group the aggregations by. Do the following:
    1. Select the Add group by icon.
    2. In the new entry that appears in the Group by section, set the Field option to the name of the event field that you want to group the aggregations by.
    3. (Optional) Repeat steps 8a-b as needed to define more groupings for the aggregations.
    Continuing the example from step 7, you can group the sum of bytes_out by the server names given in the server_name field. To do this, set the Field option to server_name.
  9. Select Apply.

    The pipeline editor adds a stats command to your pipeline. If you added sample data to your pipeline, the preview results panel shows the results of the aggregation.

    For example, if you used the sample data and example configurations described in the preceding steps, the pipeline editor shows the following SPL2 statement:
    SPL2
    $pipeline = | from $source
    | @maxdisk("1MB") @maxdelay("30seconds") stats mode=summary sum(bytes_out) as bytes_out by server_name
    | into $destination;
    The aggregated results look like this:
    server_name bytes_out _raw
    web-01 9000 {"server_name": "web-01", "bytes_out": 9000}
    web-02 1200 {"server_name": "web-02", "bytes_out": 1200}
    Note: Your pipeline keeps the event fields specified in the Aggregations and Group by sections and drops the unspecified fields. In the example above, the pipeline drops the server_ip and file_requested fields because they are not included in the aggregation.
  10. (Optional) If you plan to work with the aggregated data in the Splunk platform using operations that require the data to be in prestats format, then add the prestats argument to the stats command. In the SPL2 editor, do one of the following:
    • To store prestats data values in the _raw field, type prestats=raw inside the stats command, in the space before the aggregation expression. For example:
      SPL2
      $pipeline = | from $source
      | stats prestats=raw mode=summary sum(bytes_out) as bytes_out by server_name 
      | into $destination;
    • To store prestats data values in top-level event fields, type prestats=fields inside the stats command, in the space before the aggregation expression. For example:
      SPL2
      $pipeline = | from $source
      | stats prestats=fields mode=summary sum(bytes_out) as bytes_out by server_name 
      | into $destination;
  11. Save your pipeline, and then apply it to your Edge Processors as needed. For more information, see Apply pipelines to Edge Processors.
You now have a pipeline that calculates aggregations from your event data.

Managing aggregations

Learn more about common aggregation patterns and how to finalize the partial aggregations from Edge Processor pipelines in the following sections:

Aggregation patterns

Use the stats command to aggregate data with either a summary pattern that reduces both events and fields or a passthrough pattern that reduces events while preserving the original schema.

When configuring an aggregation using the stats command, you specify which fields from the original events to retain in the aggregated results. The following are two common aggregation patterns that are based on the number of fields retained:
Summary pattern
Reduce both the number of events and the number of fields in the events. This aggregation pattern reduces overall data volume as well as the size of the resulting events, but also changes the event schema. When aggregating data using the summary pattern, make sure to assign a different sourcetype value to the aggregated events so that events with different schemas are not categorized under the same source type.
Passthrough pattern
Reduce the number of events while retaining all the original event fields. This aggregation pattern reduces overall data volume without changing the event schema. When aggregating data using the passthrough pattern, there is no need to change the sourcetype value of the aggregated events.

The following examples demonstrate how to configure an Edge Processor pipeline to aggregate data using the summary pattern or the passthrough pattern:

Both examples use this sample data as input:

server_name server_ip file_requested bytes_out sourcetype
web-01 10.1.2.3 /index.html 500 web
web-01 10.1.2.3 /index.html 500 web
web-02 10.1.2.4 /css/main.css 1200 web

Example: Summary pattern

The following is the SPL2 syntax for a stats command in a pipeline that aggregates data using the summary pattern:
CODE
| stats mode=summary sum(bytes_out) AS bytes_out BY server_name
Compared to the original input data, the aggregated results contain fewer events and fields.
server_name bytes_out _raw
web-01 1000 {"server_name": "web-01", "bytes_out": 1000}
web-02 1200 {"server_name": "web-02", "bytes_out": 1200}

Example: Passthrough pattern

The following is the SPL2 syntax for a stats command in a pipeline that aggregates data using the passthrough pattern:

CODE
| stats mode=passthrough sum(bytes_out) AS bytes_out BY server_name, server_ip, file_requested, sourcetype
Compared to the original input data, the aggregated results contain fewer events but the schema of the data remains unchanged.
_raw server_name server_ip file_requested bytes_out sourcetype
{"server_name": "web-01", "server_ip": "10.1.2.3", "file_requested": "/index.html", "bytes_out": 1000, "sourcetype": "web"} web-01 10.1.2.3 /index.html 1000 web
{"server_name": "web-02", "server_ip": "10.1.2.4", "file_requested": "/css/main.css", "bytes_out": 1200, "sourcetype": "web"} web-02 10.1.2.4 /css/main.css 1200 web
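The difference between the two patterns comes down to the group-by keys. The following Python sketch (illustrative only, using the sample data above) shows that grouping by server_name alone drops the other fields, while grouping by all original fields preserves the schema:

```python
from collections import defaultdict

events = [
    {"server_name": "web-01", "server_ip": "10.1.2.3",
     "file_requested": "/index.html", "bytes_out": 500, "sourcetype": "web"},
    {"server_name": "web-01", "server_ip": "10.1.2.3",
     "file_requested": "/index.html", "bytes_out": 500, "sourcetype": "web"},
    {"server_name": "web-02", "server_ip": "10.1.2.4",
     "file_requested": "/css/main.css", "bytes_out": 1200, "sourcetype": "web"},
]

def aggregate(events, group_keys):
    """Sum bytes_out over events that share the same group_keys values."""
    sums = defaultdict(int)
    for e in events:
        sums[tuple(e[k] for k in group_keys)] += e["bytes_out"]
    return [dict(zip(group_keys, k), bytes_out=v) for k, v in sums.items()]

# Summary pattern: group by server_name only; other fields are dropped.
summary = aggregate(events, ["server_name"])

# Passthrough pattern: group by all original fields; schema is preserved.
passthrough = aggregate(
    events, ["server_name", "server_ip", "file_requested", "sourcetype"])
```

Both results contain two events instead of three, but only the passthrough result still carries server_ip, file_requested, and sourcetype.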

Processing aggregations at search-time

Use the Search & Reporting app to finalize partial aggregations from Edge Processor pipelines.

If you configure your pipeline to send the partially aggregated data to your Splunk platform deployment, you can finalize those aggregations by running a search in the Search & Reporting app using the stats command.

The following examples demonstrate how to write SPL searches that finalize aggregations from Edge Processor pipelines:

All examples use the following sample data as input:

_time server_name server_ip file_requested bytes_out sourcetype
2025-01-01 12:00:05 web-01 10.1.2.3 /index.html 500 web
2025-01-01 12:01:22 web-01 10.1.2.3 /images/logo.png 8500 web
2025-01-01 12:00:28 web-02 10.1.2.4 /css/main.css 1200 web

Example: Sum

The SPL2 statement of the Edge Processor pipeline is as follows:
SPL2
$pipeline = | from $source 
| stats mode=summary sum(bytes_out) as bytes_out BY server_name
| eval sourcetype="web:summary"
The aggregated results look like this:
server_name bytes_out _raw sourcetype
web-01 9000 {"server_name": "web-01", "bytes_out":9000} web:summary
web-02 1200 {"server_name": "web-02", "bytes_out":1200} web:summary
To finalize the aggregation, run the following SPL search, where my_index is the name of an index:
CODE
index=my_index sourcetype="web:summary"
| stats sum(bytes_out) AS bytes_out BY server_name

Example: Count

The SPL2 statement of the Edge Processor pipeline is as follows:
SPL2
$pipeline = | from $source 
| stats mode=summary count() as event_count BY server_name
| eval sourcetype="web:summary"
The aggregated results look like this:
server_name event_count _raw sourcetype
web-01 2 {"server_name": "web-01", "event_count": 2} web:summary
web-02 1 {"server_name": "web-02", "event_count":1} web:summary
To finalize the aggregation, run the following SPL search, where my_index is the name of an index:
CODE
index=my_index sourcetype="web:summary"
| stats sum(event_count) AS event_count BY server_name

Example: Min

The SPL2 statement of the Edge Processor pipeline is as follows:
SPL2
$pipeline = | from $source 
| stats mode=summary min(bytes_out) as min_bytes_out BY server_name
| eval sourcetype="web:summary"
The aggregated results look like this:
server_name min_bytes_out _raw sourcetype
web-01 500 {"server_name": "web-01", "min_bytes_out":500} web:summary
web-02 1200 {"server_name": "web-02", "min_bytes_out":1200} web:summary
To finalize the aggregation, run the following SPL search, where my_index is the name of an index:
CODE
index=my_index sourcetype="web:summary"
| stats min(min_bytes_out) AS min_bytes_out BY server_name

Example: Max

The SPL2 statement of the Edge Processor pipeline is as follows:
SPL2
$pipeline = | from $source 
| stats mode=summary max(bytes_out) as max_bytes_out BY server_name
| eval sourcetype="web:summary"
The aggregated results look like this:
server_name max_bytes_out _raw sourcetype
web-01 8500 {"server_name": "web-01", "max_bytes_out":8500} web:summary
web-02 1200 {"server_name": "web-02", "max_bytes_out":1200} web:summary
To finalize the aggregation, run the following SPL search, where my_index is the name of an index:
CODE
index=my_index sourcetype="web:summary"
| stats max(max_bytes_out) AS max_bytes_out BY server_name

Example: Avg

The avg statistical function is not supported in Edge Processor pipelines. To calculate an aggregation with an average of values, start by using the sum and count functions in your pipeline. Then, use a search to divide the finalized sum by the finalized count.

The SPL2 statement of the Edge Processor pipeline is as follows:
SPL2
$pipeline = | from $source 
| stats mode=summary sum(bytes_out) as bytes_out, count(bytes_out) as event_count BY server_name
| eval sourcetype="web:summary"
The aggregated results look like this:
server_name bytes_out event_count _raw sourcetype
web-01 9000 2 {"server_name": "web-01", "bytes_out":9000, "event_count": 2} web:summary
web-02 1200 1 {"server_name":"web-02", "bytes_out":1200, "event_count": 1} web:summary
To finalize the aggregation, run the following SPL search, where my_index is the name of an index:
CODE
index=my_index sourcetype="web:summary"
| stats sum(bytes_out) AS bytes_out, sum(event_count) AS event_count BY server_name
| eval avg_bytes_out=bytes_out/event_count

Example: Span

You can use the span function to group aggregations by time span. For example, you can group aggregations into 1-minute time spans based on the event timestamps stored in the _time field.

The SPL2 statement of the Edge Processor pipeline is as follows:

SPL2
$pipeline = | from $source 
| stats mode=summary sum(bytes_out) as bytes_out BY span(_time, 1m), server_name 
| eval sourcetype="web:summary"

The aggregated results look like this:

_time server_name bytes_out _raw sourcetype
2025-01-01 12:00:00 web-01 500 {"server_name": "web-01", "bytes_out": 500, "_time":1735732800} web:summary
2025-01-01 12:01:00 web-01 8500 {"server_name": "web-01", "bytes_out": 8500, "_time":1735732860} web:summary
2025-01-01 12:00:00 web-02 1200 {"server_name": "web-02", "bytes_out": 1200, "_time":1735732800} web:summary

To finalize the aggregation, run the following SPL search, where my_index is the name of an index:

CODE
index=my_index sourcetype="web:summary"
| stats sum(bytes_out) AS bytes_out BY _time, server_name
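The bucketing that span(_time, 1m) performs can be sketched in Python (illustrative only; timestamps match the sample data above): each epoch timestamp is floored to its 1-minute boundary before grouping.

```python
from collections import defaultdict

# Events with epoch timestamps matching the span example's sample data.
events = [
    {"_time": 1735732805, "server_name": "web-01", "bytes_out": 500},   # 12:00:05
    {"_time": 1735732882, "server_name": "web-01", "bytes_out": 8500},  # 12:01:22
    {"_time": 1735732828, "server_name": "web-02", "bytes_out": 1200},  # 12:00:28
]

# Floor each timestamp to its 1-minute bucket, then sum per bucket and server.
sums = defaultdict(int)
for e in events:
    bucket = e["_time"] - e["_time"] % 60
    sums[(bucket, e["server_name"])] += e["bytes_out"]
```

This produces three buckets rather than two, because web-01's two events fall into different minutes.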