Remove duplicate fields from Edge Processor pipelines
Remove duplicate fields from pipelines using the dedup command.
The SPL2 dedup command removes events that contain an identical combination of values for the fields that you specify.
This lets you specify the number of duplicate events to keep for each value of a single field, or for each combination of values among several fields.
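As an illustration of keeping more than one event per value, the following sketch retains up to two events for each host within a batch. It assumes that the optional count argument described for dedup in the SPL2 manual is accepted in Edge Processor pipelines, and it reuses the batch_id() pattern from the examples later in this topic. Treat it as a starting point and confirm the result in a pipeline preview.

```spl2
from $source | dedup 2 host, batch_id()
```

Omitting the count keeps only the first event for each combination of values.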
Overview
Removing duplicate fields from your pipeline requires the following tasks:
- Identify your data source: Determine the pipeline input and the fields causing duplication.
- Select deduplication strategy: Choose between a visual UI configuration or custom SPL2 code.
- Define scope: Specify the fields for the dedup command.
- Configure time constraints: Set the span and TTL to define how long the processor should "remember" an event.
- Validate: Run the pipeline in "Preview" mode to verify the reduction in event volume.
How duplicates are identified
Whether an event is identified as a duplicate depends on how far back in time, and across how many events, the processor can remember earlier events. Deduplication effectiveness therefore depends on the runtime context (batch, instance, or inter-batch) and on the configured memory and TTL constraints.
Steps
Configure deduplication either in the Data Management UI or with custom SPL2 code.
Configure pipeline to remove duplicate events using the Data Management UI
Complete the following steps to remove duplicate fields from your pipeline.
- On the Pipelines page of your Data Management instance, navigate to the pipeline that you want to deduplicate, and then select Edit.
- Select the plus icon next to Actions.
- Select Remove duplicates for.
- On the Remove duplicate field values page, set your desired deduplication parameters, and then select Apply.
- Select Next to confirm the deduplicated data.
- Run a preview of your pipeline to verify your changes.
- Select Done to confirm changes.
Configure pipeline to remove duplicate events using custom SPL2 code
Complete the following steps to remove duplicate fields from your pipeline.
- On the Pipelines page of your Data Management instance, navigate to the pipeline that you want to deduplicate, and then select Edit.
- In the pipeline editor menu, navigate to the fields that you want to deduplicate.
- Create your desired SPL2 code.
- Select the Preview button to review your changes.
- Save your changes.
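The steps above might produce a pipeline statement like the following sketch. It assumes the standard pipeline form, in which $source and $destination are bound to the data source and destination configured for the pipeline, and it uses host only as an example field. Replace host with the fields that cause duplication in your data.

```spl2
$pipeline = | from $source | dedup host, batch_id() | into $destination;
```

Including batch_id() in the field list follows the pattern used in the examples in this topic and limits the comparison to events within the same batch.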
Examples of deduplication searches
The following are examples of SPL2 searches that use the dedup command.
Deduplicate by host within a batch.
from $source | dedup host, batch_id()
Time-Interval Deduplication: Using spans to manage high-volume event streams.
from $source | eval field_with_batch_id = batch_id() | dedup host, field_with_batch_id, span(_time, 5m)
Memory-Constrained Environments: Using @maxmem as a runtime hint.
from $source | @maxmem('1GB') dedup host, batch_id()
See also
For more information, see the following topics:
How the SPL2 dedup command works in the SPL2 manual.
dedup command overview in the SPL2 manual.
dedup command usage in the SPL2 manual.
dedup command examples in the SPL2 manual.