Remove duplicate fields from pipelines

Remove duplicate fields from Edge Processor pipelines

Remove duplicate fields from pipelines using the dedup command.

The SPL2 dedup command removes events that contain an identical combination of values for the fields that you specify.

You can specify the number of duplicate events to keep for each value of a single field, or for each combination of values across several fields.
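
The behavior described above can be modeled outside of a pipeline. The following Python sketch is an illustration only, not the Edge Processor implementation: the `dedup` function, its `keep` parameter, and the sample events are hypothetical, and it assumes that a missing field is treated as the value `None`, which the real command may handle differently.

```python
from collections import Counter
from typing import Any, Iterable


def dedup(events: Iterable[dict[str, Any]], fields: list[str], keep: int = 1) -> list[dict[str, Any]]:
    """Keep at most `keep` events per combination of values in `fields`."""
    seen: Counter = Counter()
    kept = []
    for event in events:
        # Assumption: a missing field counts as the value None.
        key = tuple(event.get(field) for field in fields)
        if seen[key] < keep:
            seen[key] += 1
            kept.append(event)
    return kept


events = [
    {"host": "web-1", "status": 200},
    {"host": "web-1", "status": 200},
    {"host": "web-1", "status": 500},
    {"host": "web-2", "status": 200},
]

by_host = dedup(events, ["host"])                    # one event per host
by_host_status = dedup(events, ["host", "status"])   # one event per (host, status) pair
two_per_host = dedup(events, ["host"], keep=2)       # up to two events per host
```

Listing more fields narrows what counts as a duplicate, so fewer events are removed. Raising the number of events to keep has a similar effect.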

Overview

Removing duplicate events from your pipeline involves the following tasks:

  1. Identify your data source: Determine the pipeline input and the fields causing duplication.

  2. Select deduplication strategy: Choose between a visual UI configuration or custom SPL2 code.

  3. Define scope: Specify the fields for the dedup command.

  4. Configure time constraints: Set the span and TTL to define how long the processor should "remember" an event.

  5. Validate: Run the pipeline in "Preview" mode to verify the reduction in event volume.

How duplicates are identified

Whether an event is identified as a duplicate depends on how far back in time, and across how many events, the processor can remember previously seen values. Deduplication effectiveness therefore depends on the runtime context (batch, instance, or inter-batch) and on the configured memory and TTL constraints.
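
To make the effect of these limits concrete, the following Python sketch models a deduplicator whose memory is bounded by both age and size. It is a conceptual illustration under stated assumptions, not how the Edge Processor is implemented: the `BoundedDedup` class and its `ttl_seconds` and `max_keys` parameters are hypothetical names, and events are assumed to arrive in time order.

```python
from collections import OrderedDict
from typing import Hashable


class BoundedDedup:
    """Remembers recently seen keys, limited by age (TTL) and by count."""

    def __init__(self, ttl_seconds: float, max_keys: int) -> None:
        self.ttl_seconds = ttl_seconds
        self.max_keys = max_keys
        # Maps each remembered key to the time it was first seen, oldest first.
        self._first_seen = OrderedDict()

    def is_duplicate(self, key: Hashable, event_time: float) -> bool:
        # Forget keys that are older than the TTL.
        while self._first_seen:
            oldest_key, seen_at = next(iter(self._first_seen.items()))
            if event_time - seen_at < self.ttl_seconds:
                break
            del self._first_seen[oldest_key]
        if key in self._first_seen:
            return True
        # Forget the oldest key when the memory limit is reached.
        if len(self._first_seen) >= self.max_keys:
            self._first_seen.popitem(last=False)
        self._first_seen[key] = event_time
        return False


tracker = BoundedDedup(ttl_seconds=300, max_keys=2)
results = [
    tracker.is_duplicate("web-1", 0),    # first time seen
    tracker.is_duplicate("web-1", 60),   # still remembered, so a duplicate
    tracker.is_duplicate("web-1", 400),  # TTL expired, so no longer a duplicate
    tracker.is_duplicate("web-2", 410),
    tracker.is_duplicate("web-3", 420),  # memory full, so web-1 is forgotten
    tracker.is_duplicate("web-1", 430),  # web-1 was forgotten, so not a duplicate
]
```

In this model, a repeated value goes undetected whenever it arrives after the TTL has elapsed or after its key has been pushed out of memory, which is why a shorter TTL or a smaller memory budget lets more duplicates through.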

Steps

Configure deduplication using either the Data Management UI or custom SPL2 code.

Configure a pipeline to remove duplicate events using the Data Management UI

Complete the following steps to remove duplicate events from your pipeline.

  1. On the Pipelines page of your Data Management instance, navigate to the pipeline that you want to deduplicate, and then select Edit.
  2. Select the plus icon (+) next to Actions.
  3. Select Remove duplicates for.
  4. On the Remove duplicate field values page, set your desired deduplication parameters, and then select Apply.
  5. Select Next to confirm the deduplicated data.
  6. Run a preview of your pipeline to verify your changes.
  7. Select Done to confirm changes.

Configure a pipeline to remove duplicate events using custom SPL2 code

Complete the following steps to remove duplicate events from your pipeline.

  1. On the Pipelines page of your Data Management instance, navigate to the pipeline that you want to deduplicate, and then select Edit.
  2. In the pipeline editor menu, navigate to the fields that you want to deduplicate.
  3. Create your desired SPL2 code.
  4. Select the Preview button to review your changes.
  5. Save your changes.

Examples of deduplication searches

The following are examples of SPL2 searches that use the dedup command.

Batch Deduplication: Deduplicating by host within a batch.

SPL2
from $source | dedup host, batch_id()

Time-Interval Deduplication: Using spans to manage high-volume event streams.

SPL2
from $source | eval field_with_batch_id = batch_id() | dedup host, field_with_batch_id, span(_time, 5m)
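
The idea behind combining a batch identifier with a time span can be sketched in Python as follows. This is a conceptual model only, under several assumptions that are not confirmed by the source: `_time` is in epoch seconds, time windows are aligned to multiples of the span, and the hypothetical `batch` field stands in for the value returned by batch_id().

```python
SPAN_SECONDS = 5 * 60  # mirrors span(_time, 5m)


def dedup_by_span(events, fields, span_seconds=SPAN_SECONDS):
    """Keep the first event per combination of field values and time window."""
    seen = set()
    kept = []
    for event in events:
        # Index of the time window that this event falls into.
        window = int(event["_time"] // span_seconds)
        key = (tuple(event.get(field) for field in fields), window)
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept


events = [
    {"_time": 0, "host": "web-1", "batch": 1},
    {"_time": 120, "host": "web-1", "batch": 1},  # same host, batch, and window
    {"_time": 310, "host": "web-1", "batch": 1},  # falls in the next 5-minute window
    {"_time": 320, "host": "web-1", "batch": 2},  # belongs to a different batch
]
kept = dedup_by_span(events, ["host", "batch"])
kept_times = [event["_time"] for event in kept]
```

Because the window is part of the key, a value that repeats in a later window is kept again, so each host can still produce one event per window per batch.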

Memory-Constrained Environments: Using @maxmem as a runtime hint.

SPL2
from $source | @maxmem('1GB') dedup host, batch_id()

See also

dedup command

dedup command overview

dedup command examples in the Splunk SPL2 manual.