Edge Processor Validated Architecture
The Edge Processor solution is a data processing engine that works at the edge of your network. Use the Edge Processor solution to filter, mask, and transform your data close to its source before routing the processed data to Splunk and S3. Edge Processors are hosted on your infrastructure so data doesn't leave the edge until you want it to.
Overall benefits
The Edge Processor solution:
- Provides the capability to process and route data at arbitrary boundaries.
- Enables real-time processing pipelines to be applied to data en route.
- Uses SPL2 to author processing pipelines (see the sketch after this list).
- Can scale both vertically and horizontally to meet processing demand.
- Can be deployed in a highly available, load-balanced configuration as part of your data availability strategy.
- Enables flexible scaling of data volumes and infrastructure.
- Use cases and enablers:
- Organizational requirements for centralized data egress
- Routing, forking, or cloning events to multiple Splunk deployments and supported data lakes or object stores
- Filtering, masking, transforming, enriching, or otherwise modifying data in motion
- Reducing certificate or security complexity
- Removing or reducing ingestion load on the indexing tier by offloading initial data processing, such as line breaking
- Relocating data processing away from the indexers to reduce the need for rolling restarts and other availability disruptions
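To make the pipeline concept concrete, here is a minimal SPL2 sketch in the shape used by Edge Processor pipelines. The filter condition, masking pattern, and the $source and $destination bindings are illustrative assumptions, not a prescribed configuration:
```
// Minimal illustrative pipeline: drop DEBUG events, mask card-like numbers,
// then route everything else to the configured destination.
$pipeline = | from $source
    | where NOT match(_raw, /DEBUG/)
    | eval _raw = replace(_raw, /\d{13,16}/, "<masked>")
    | into $destination;
```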
Architecture and topology
Refer to the Splunk documentation for the latest high-level system architecture of an Edge Processor deployment. However, several key points about the default topology are critical to understand for a successful Edge Processor solution deployment:
- The term Edge Processor refers to a logical grouping of instances that share the same configuration.
- All pipelines deployed to an Edge Processor run concurrently on every instance in that Edge Processor.
- Edge Processor instances are not aware of one another. There is no inter-instance communication, synchronization, balancing, or other coordination.
- All Edge Processor instances are managed by a Splunk Data Management control plane. That control plane can be either customer-hosted or Splunk-hosted.
- There are two planes, or domains, of data on an Edge Processor instance: the control plane and the data plane.
- The control plane is used to collect and transmit instance configuration and instance telemetry. The data on this plane is independent of the customer data being routed through pipelines on the data plane. Data on this plane is sent only to the Edge Processor management infrastructure.
- The data plane is where the pipelines run, and where all of the routing, masking, filtering, and transformation of customer data occurs. The destination of data on this plane is determined by the pipeline configuration.
- The default destination configuration is shared: it governs both the telemetry destination and the default pipeline behavior.
Edges and data domains
An edge in the context of Edge Processor and Splunk data routing refers to any step between the source and the destination where event control is required. Some common edges are data centers, clouds and cloud availability zones, network segments, VLANs, and infrastructure that manages regulated or compliance-related data (PII, HIPAA, PCI, GDPR, and so on). Data domains represent the relationships between the data source and the edge. When data traverses an edge, the data has left the originating data domain.
In Splunk terms, edges generally correlate with other "intermediate forwarding" concepts, and you will often see edges discussed alongside Edge Processor, heavy and universal forwarders, and OTel, all of which can serve as intermediate forwarders and effect change on events.
Intermediate routing close to source
Edge Processors can be deployed close to the source, logically within the same data domain as the data source.
| Benefits | Limitations |
|---|---|
| | |
Intermediate routing close to destination
Edge Processors can be deployed close to the destination, logically in a different data domain than the data source, often within the same data domain as the destination.
| Benefits | Limitations |
|---|---|
| | |
Multi-hop
There is no restriction on the number of Edge Processors that events can travel through. In situations where it is desirable to have event processing near the data source as well as near the destination, Edge Processors can send to and receive from one another.
| Benefits | Limitations |
|---|---|
| | |
Each hop can append its own marker to events to make the path auditable, for example:
| eval hops=hops+"[your own marker here]"
Management Strategy
Once you have decided where Edge Processor instances will be provisioned, you must decide how to manage the deployment of pipelines to those instances.
Control plane
Splunk Cloud managed control plane
| Benefits | Limitations |
|---|---|
| | |
Customer managed control plane
| Benefits | Limitations |
|---|---|
| | |
Per-domain configuration
In most cases, each data domain that requires event processing will have its own Edge Processor, and the instances belonging to that Edge Processor will be specific to that data domain. Organizing in this manner generally results in easy-to-understand naming and data flow, and offers the best reporting fidelity. When organized this way, all instances in each Edge Processor belong to the same Splunk output group for load balancing purposes.
| Benefits | Limitations |
|---|---|
| | |
Stretched configuration
When the data source types, ports, processing requirements, and destinations are the same across more than one data domain, all of the instances can belong to the same Edge Processor. There is no requirement that each data domain have its own Edge Processor.
| Benefits | Limitations |
|---|---|
| | |
Input Load Balancing
Edge Processor supports three input types: Splunk (S2S), HTTP Event Collector (HEC), and syslog over UDP or TCP. Edge Processor has no specific configuration or support for load balancing these protocols, so follow the best practices for each protocol, using the Edge Processor instances as targets. High-level strategies and architectures for each input type are covered below, but protocol-specific optimization and tuning is out of scope for this validated architecture.
Splunk
Splunk universal and heavy forwarders should continue to use outputs.conf with the list of Edge Processor instances as a server group to enable output load balancing. Forwarders behave the same with Edge Processors as output targets as they do with indexers or other forwarders.
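For example, a forwarder's outputs.conf might define a server group of Edge Processor instances like the following (hostnames and port are placeholders for your environment):
```
# outputs.conf on a universal or heavy forwarder (illustrative values)
[tcpout]
defaultGroup = edge_processors

[tcpout:edge_processors]
server = ep1.example.com:9997, ep2.example.com:9997, ep3.example.com:9997
```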
| Benefits | Limitations |
|---|---|
| | |
HTTP Event Collector (HEC)
Where more than one Edge Processor instance is configured to receive HEC events, some form of independent load balancing must be used. There is no mechanism in Edge Processor that load balances HEC traffic.
The following sections provide examples of load balancing HEC traffic to Edge Processor:
Network or application load balancers
The most common approach to load balancing HEC uses the same technology that balances most other HTTP traffic: classic network load balancers or application load balancers. Purpose-built load balancers offer a managed solution that provides a single endpoint to intelligently distribute HEC traffic across one or more Edge Processors.
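As a hedged illustration, a minimal nginx configuration that proxies HEC traffic across two Edge Processor instances might look like the following (hostnames, ports, and certificate paths are assumptions):
```
# Illustrative nginx reverse proxy in front of Edge Processor HEC listeners
upstream edge_processors {
    server ep1.example.com:8088;
    server ep2.example.com:8088;
}
server {
    listen 8088 ssl;
    ssl_certificate     /etc/nginx/tls/hec.crt;
    ssl_certificate_key /etc/nginx/tls/hec.key;
    location /services/collector {
        proxy_pass https://edge_processors;
    }
}
```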
| Benefits | Limitations |
|---|---|
| | |
DNS round robin
DNS round robin is a simple load balancing method used to distribute network traffic across multiple servers. In this approach, DNS is configured to rotate through a list of Edge Processor IP addresses associated with a single domain name. When a HEC source makes a request to the domain name, the DNS server responds with one of the IP addresses from the list.
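For illustration, the zone records for such a name might look like the following (names, IPs, and TTL are assumptions; a low TTL limits how long clients keep resolving a failed instance):
```
; Illustrative round-robin A records for a shared HEC endpoint
hec.example.com.  60  IN  A  10.0.1.11
hec.example.com.  60  IN  A  10.0.1.12
hec.example.com.  60  IN  A  10.0.1.13
```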
| Benefits | Limitations |
|---|---|
| | |
Client-side load balancing
In scenarios where HEC integration is scripted or access to the code is otherwise available, the responsibility for load balancing, queuing, retrying, and monitoring availability can be managed by the sending client or mechanism. This design decentralizes the complexity from the central network infrastructure to the individual sending entities. Each sender can dynamically adjust to downstream Edge Processor instance availability by implementing retry mechanisms that respond to errors and timeouts. The client implementation can be made as reliable as required.
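A minimal Python sketch of this pattern follows; the endpoints, token, and retry policy are assumptions, and a production client would add TLS validation, batching, and persistent queuing as required:
```python
# Illustrative client-side load balancing for HEC: round-robin with retry.
import itertools, json, time, urllib.request

ENDPOINTS = itertools.cycle([
    "https://ep1.example.com:8088/services/collector/event",  # placeholder hosts
    "https://ep2.example.com:8088/services/collector/event",
])
TOKEN = "00000000-0000-0000-0000-000000000000"  # placeholder HEC token

def send_event(event: dict, retries: int = 3) -> bool:
    payload = json.dumps({"event": event}).encode("utf-8")
    for attempt in range(retries):
        url = next(ENDPOINTS)  # rotate to the next Edge Processor instance
        req = urllib.request.Request(url, data=payload, headers={
            "Authorization": "Splunk " + TOKEN,
            "Content-Type": "application/json",
        })
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                return resp.status == 200
        except OSError:
            time.sleep(2 ** attempt)  # back off, then try another instance
    return False
```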
| Benefits | Limitations |
|---|---|
| | |
Syslog
In many cases, systems that use syslog can only specify a single server name or IP in their syslog configuration. There are several approaches to distributing or load balancing the syslog traffic among the Edge Processors to avoid a single point of failure. The guidance provided in this document as it relates to load balancing the syslog protocol is not specific to Edge Processor and is intentionally brief. Refer to About Splunk Validated Architectures or Splunk Connect for Syslog for more information. See https://datatracker.ietf.org/doc/html/rfc5424 for a relevant syslog request for comments (RFC).
DNS round robin
DNS round robin uses a single friendly name to represent a list of possible listeners. The same considerations as with DNS round robin for HEC apply to syslog. In particular, cached DNS records can result in stalled data flows even when healthy instances are available.
Network load balancer
The same considerations as with Network load balancers for HEC apply to syslog with some notable differences:
- Syslog often relies on the sender's IP address, which is typically lost behind a load balancer. If the sending IP address is needed, the load balancer requires special configuration.
- There is no specific syslog health check that can be used by a load balancer.
- BGP or other layer 3 load balancing strategies may be considered for very large or sensitive syslog environments. Consult with a Splunk Architect or Professional Services in this case.
For further guidance and best practices for load balancing syslog traffic, refer to the Syslog Validated Architecture.
Port mapping
Similar to Splunk Heavy Forwarders using TCP or UDP listeners to receive syslog data, Edge Processor can be configured to open one or more arbitrary ports on which syslog data is received. RFC compliance, source, and sourcetype are assigned to each port and to each event arriving on that port. The data sources, use cases, pipeline structure, and destinations all need to be considered when choosing a syslog port assignment strategy as port assignment can directly affect pipeline structure and performance.
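As a hedged example of a port assignment strategy, an rsyslog sender could map device types to distinct Edge Processor ports (host, ports, and program names are assumptions):
```
# Illustrative rsyslog (RainerScript) rules routing device types to dedicated ports
if $programname == "asa" then {
    action(type="omfwd" target="ep-syslog.example.com" port="5514" protocol="tcp")
}
if $programname == "pan" then {
    action(type="omfwd" target="ep-syslog.example.com" port="5515" protocol="tcp")
}
```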
Considerations
There are several configuration patterns to consider when building your Edge Processor syslog topology.
| Consideration | Details |
|---|---|
| RFC Assignment | RFC formatting requirements are strictly enforced. You must ensure that the RFC represented in the data matches the RFC configured in the Edge Processor. When the RFC does not match, the produced field data will not be reliable. |
| Timestamping | When syslog data is sent from Edge Processor via the Splunk protocol, the next Splunkd instance (heavy forwarder or indexer) that receives the events performs timestamping via props and transforms, so accurate host or sourcetype timezone assignment is critical. When syslog data is sent from Edge Processor via the HEC protocol, the pipeline that processes the syslog events must set _time; if _time is not set, Splunk uses the current time. |
| Port Contention and Pipeline Complexity | In environments with many syslog data sources using the same Edge Processor port, pipeline contention and complexity can limit the effectiveness of a single pipeline dedicated to syslog processing. When many syslog sources use the same port, multiple pipelines reduce pipeline complexity, improve reporting, and result in more consistent, scalable data flows. |
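For the timestamping consideration above, a hedged SPL2 sketch of setting _time before sending over HEC might look like the following; the header regex and time format assume RFC 3164-style timestamps and are illustrative only:
```
// Illustrative: extract the syslog header timestamp and assign it to _time
$pipeline = | from $source
    | rex field=_raw /^(?P<ts>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})/
    | eval _time = strptime(ts, "%b %d %H:%M:%S")
    | into $destination;
```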
Implementation patterns
There are several configuration patterns to consider when building your Edge Processor syslog topology. While the following examples are the most common implementations, it's important to keep in mind that these port and pipeline configurations are not mutually exclusive and in practice the actual topology tends to evolve over time.
Sourcetype per port
In this configuration, each unique device and sourcetype is assigned a specific port, and each sourcetype is processed by a unique pipeline. This results in one pipeline per sourcetype, and multiple ports may supply data to the same pipeline when they are assigned the same sourcetype.
| Benefits | Limitations |
|---|---|
| | |
Multiple sourcetypes per port, single pipeline
In this configuration, one or more syslog sources share a single generic sourcetype, and a single pipeline is responsible for processing that generic sourcetype into events with distinct sourcetypes.
| Benefits | Limitations |
|---|---|
| | |
Multiple sourcetypes per port, cloned streams
As with the prior configuration, one or more syslog sources share a single generic sourcetype, and the pipelines are responsible for detecting and processing events into distinct sourcetypes. However, in this configuration a distinct pipeline is used for each unique sourcetype: the initial partition is the generic sourcetype, and an initial filter in each pipeline selects the events for its specific sourcetype (a sketch of one such pipeline follows the table below).
| Benefits | Limitations |
|---|---|
| | |
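A hedged sketch of one such cloned pipeline follows; the generic partition sourcetype, the match pattern, and the resulting sourcetype are illustrative assumptions:
```
// Illustrative: one of several pipelines sharing the generic syslog partition.
// This pipeline selects only Cisco ASA events and re-assigns their sourcetype.
$pipeline = | from $source
    | where match(_raw, /%ASA-\d-\d+/)
    | eval sourcetype = "cisco:asa"
    | into $destination;
```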
Splunk Destination
When sending events to Splunk, both S2S and HEC are validated and supported protocols. S2S is the default protocol and is configured as part of the first-time Edge Processor setup.
S2S - Splunk to Splunk
| Benefits | Limitations |
|---|---|
| | |
HEC - HTTP Event Collector
| Benefits | Limitations |
|---|---|
| | |
Asynchronous load balancing from Splunk agents
To learn more about Splunk asynchronous load balancing see the Splunkd intermediate forwarding validated architecture.
Asynchronous load balancing spreads events more evenly across all available servers in the tcp output group of a forwarder. Traditionally the output server list consists of indexers, but the configuration is also valid when the output servers are Edge Processors. When configuring outputs from high-volume forwarders to Edge Processors, asynchronous load balancing can improve throughput.
Funnel with caution - indexer starvation
To learn more about indexer starvation see the intermediate forwarding validated architecture.
The use of Edge Processors introduces the same funneling concerns as when heavy forwarders are used as intermediate forwarders. Consolidation of forwarder connections and events into a small intermediate tier can introduce bottlenecks and performance degradation when compared to environments where intermediate forwarding is not used.
Indexer distribution
For on-premises and Splunk Cloud Platform classic stacks, it is critical that all indexers are configured individually to ensure proper distribution of events and to avoid clumping of data due to pinned connections.
Resiliency and queueing
Pipelines, queuing, and data loss
Data received by an Edge Processor is stored in memory while it passes through the processing pipeline. Once an event has left the processor and been delivered to the exporter, it is queued on disk. The event remains in the disk-backed queue until the exporter successfully sends it to the destination.
Provisioning Edge Processor instances with disk storage capable of supporting pipeline throughput requirements is essential. Every event passing through a pipeline is written to disk. Align the expected events per second (EPS) and event sizes for each destination with the disk assigned to that destination's queue to ensure sufficient IOPS and available storage.
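As an illustrative calculation (not sizing guidance): at an assumed 10,000 events per second averaging 500 bytes, a destination queue is written at roughly 5 MB/s, so buffering a one-hour destination outage requires on the order of 18 GB of queue storage for that destination, plus enough IOPS headroom to drain the backlog while new events continue to arrive.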
Data from a source is only ever processed by a single Edge Processor instance. Even in scenarios where events are eligible for multiple pipelines, any given event is only processed by a single Edge Processor instance. Edge Processor instances are unaware of other Edge Processor instances and data is never synchronized or otherwise reconciled between instances. Because of this, the domain for data loss is the amount of unprocessed data in memory on any given instance.
Full queues and backpressure
A persistent queue is assigned for each unique destination (exporter) and is shared among all pipelines that use those destinations.
- When a destination becomes unavailable, its queue will begin to fill with events from all pipelines that use that destination.
- Once a queue is full, all pipelines using that destination will begin applying backpressure to clients.
Backpressure in pipelines with multiple destinations
Consider a single pipeline that routes cisco:asa data to both Splunk and an Object Store destination. If the Object Store destination is unavailable for an extended period of time and its queue fills, all cisco:asa data stops until the full queue empties.
Backpressure in pipelines with single destinations
Now consider two pipelines for cisco:asa that share the same partition, one routing to Splunk and the other to an Object Store. If the Object Store destination is unavailable for an extended period of time and its queue fills, the cisco:asa data destined for Splunk continues to be delivered while the data destined for the Object Store stops.
Backpressure behavior
Depending on the transport type and the architecture of the pipelines and destinations, client-side backpressure can vary. In all cases, upstream clients should expect and handle rejected payloads.
When the target destination resumes normal operation and the queue can drain, expect significant load: the queue empties as fast as the destination system can accept the data.
Duplicates after full queues
When a queue fills, and particularly when a pipeline has both a destination with a full queue and another destination that is not blocked, duplicate events can be generated as a result of the random nature of event distribution.
HTTP Event Collector acknowledgment
The HEC data input for Edge Processor supports acknowledgment of events. From the client perspective, this feature operates the same as HEC acknowledgment on Splunk. However, whereas the Splunk implementation of HEC acknowledgment can monitor the true indexing status, Edge Processor considers an event acknowledged once it has been received by the instance's exporter queue. Some time may pass between delivery of the event to the queue and its receipt by the destination index, so sending agents may register an event as delivered before the data is indexed or searchable.
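A hedged Python sketch of polling acknowledgment status follows, using the standard HEC acknowledgment exchange as seen from the client; the hostname, token, and channel are placeholders, and an acknowledged event means only that it reached the exporter queue:
```python
# Illustrative HEC acknowledgment polling against an Edge Processor instance
import json, urllib.request

URL = "https://ep1.example.com:8088/services/collector/ack"  # placeholder host
HEADERS = {
    "Authorization": "Splunk 00000000-0000-0000-0000-000000000000",  # placeholder
    "X-Splunk-Request-Channel": "11111111-1111-1111-1111-111111111111",
    "Content-Type": "application/json",
}

def ack_status(ack_ids):
    """Return the ack map, e.g. {"0": True}: True means the event reached the
    exporter queue, not that it has been indexed or is searchable."""
    req = urllib.request.Request(URL, data=json.dumps({"acks": ack_ids}).encode(),
                                 headers=HEADERS)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)["acks"]
```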
| Benefits | Limitations |
|---|---|
| | |
Size and scale
Monitoring
There are many dimensions available for monitoring an Edge Processor and its pipelines. You can review all of the available metrics using mcatalog. The list of metrics will grow over time, so it's best to review all available metrics and dimensions in your environment:
| mcatalog values(metric_name) WHERE index=_metrics AND sourcetype="edge-metrics"
No single metric can tell you it's time to scale up or down. Instead, monitor key metrics across your Edge Processors to establish baseline, expected usage. In particular:
- Throughput in and out
- Event counts in and out
- Exporter queue size
- CPU & Memory consumption
- Unique sourcetypes and agents sending data
Additionally, consider measuring event lag by comparing index time vs. event time as a general practice for GDI health, irrespective of the Edge Processor.
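Once you've identified relevant metric names with mcatalog, a search of the following shape (the metric name is a placeholder to fill in from your environment) can help establish baselines:
| mstats avg("<metric name from mcatalog>") WHERE index=_metrics AND sourcetype="edge-metrics" span=5m BY host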
Scale up or down
As data processing requirements change, you'll have to decide whether to scale by altering the resources available to your Edge Processor instances or by altering the number of instances doing the processing. As with most technology, Edge Processor instances can scale both vertically and horizontally depending on the circumstances, and scaling one way versus the other leads to different outcomes. The following are some common scenarios and the most common scaling response:
| Scenario Examples | Scale Example |
|---|---|
| Need to improve indexer event distribution and avoid funneling; spread out persistent queues and require less disk space per instance; improve resiliency and reduce the impact of instance failures | Scale out |
| Data pipeline complexity increases; significant event size or event volume increases; long persistent queue requirements | Scale up |
For most purposes consider any substantial change to any of the following as cause to evaluate scale:
- Event volume, both the number and size of events.
- Number of forwarders or data sources.
- Number and complexity of pipelines.
- Change in target destinations.
- Risk tolerance.
Any change to these factors will play a role in the overall resource consumption and processing speed of Edge Processor instances.