Upgrade Edge Processor to automatically reconnect

Upgrade Edge Processors on Splunk Enterprise so that they can reconnect to the control plane automatically after an extended disconnect.

Upgrade your Splunk Edge Processor (EP) instances to enable the automatic reconnection feature, which was introduced in version 10.4 of Splunk Enterprise. The feature uses OAuth2 authentication to let EP nodes automatically refresh their tokens and reconnect after extended periods of disconnection. This eliminates the prior vulnerability where a node disconnected for longer than two hours would fail to re-authenticate.

Why a fresh install is required

The new automatic reconnection feature cannot be applied through standard package management or OpAmp Service (OAS) package reconciliation, because enabling it requires the following tasks:

  • Generate new RSA key pairs.
  • Initialize the splunk-edge-cmp supervisor process with a new OAuth2 client identity.
  • Register the node under the new OAuth2 authentication flow.

Every EP instance must be fully uninstalled and reinstalled using the latest script from the Data Management UI.

Edge Processor deployments that are not upgraded will continue to operate normally using the previous user identity authentication method. However, those nodes remain susceptible to authentication failures if they are disconnected for longer than the 2-hour token expiration window.

Concepts and terminology

  • Edge Processor (EP): A logical deployment unit. A pipeline is deployed to an EP.
  • EP Instance: A physical server that executes pipelines and processes data. An EP has one or more instances, and instances on different versions can coexist within the same EP during a rolling upgrade.
  • OAS (OpAmp Service): Manages binary updates (splunksup, edge) during an Enterprise upgrade. It does not handle splunk-edge-cmp updates, which require manual reinstallation.

Prerequisites

Before you upgrade your Edge Processor deployment to enable automatic reconnection, make sure that you have the following:

  • A healthy, multi-instance EP deployment (see the following note on single-instance deployments).
  • A Splunk Enterprise instance upgraded to version 10.4 or higher.
  • Access to the Data Management UI to download the latest installation script.
  • Administrative access to the edge instances.
  • A clear picture of current data flow and upstream sender configuration.
Note: Running a single EP instance in a production environment is not a best practice and is strongly discouraged. A single instance cannot be upgraded in a rolling fashion without interrupting data flow. If you are running one instance, add a second before proceeding. N+1 is the minimum viable configuration for this procedure.

Upstream sender considerations

Review the following information on how upstream senders are connected to your EP instances. This information determines the load balancing behavior during the rolling upgrade.

Sender type | Load balancing method | Behavior during instance outage
Universal forwarders/Heavy forwarders | outputs.conf target list | Splunk protocol handles availability, backoff, and back-pressure automatically. An offline instance is marked unavailable; traffic shifts to remaining instances.
HEC clients | Load balancers (NLB or ALB) with health checks | Load balancer removes unhealthy back-ends and redistributes traffic. HEC clients should handle non-200 responses with retry/dead letter queue (DLQ) logic, regardless of EP.
Syslog servers | DNS round-robin or load balancer | TCP syslog should have minimal loss with load balancing. Connections re-establish to available instances. UDP syslog is inherently lossy and has no resiliency, regardless of destination.
Note: Confirm your load balancer health checks and client failover behavior are functioning correctly before starting the upgrade. Do not proceed if the load balancer is not actively health-checking back-ends.
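
The retry and dead letter queue (DLQ) behavior recommended for HEC clients can be sketched as follows. This is a minimal illustration, not Splunk client code: the function name send_with_retry, the send callable (assumed to return an HTTP status code), the attempt count, and the backoff delays are all hypothetical choices made for the example.

```python
import time
from typing import Callable, List


def send_with_retry(
    event: str,
    send: Callable[[str], int],
    dead_letters: List[str],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver one event; park it in dead_letters if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            status = send(event)  # hypothetical sender returning an HTTP status code
        except OSError:
            status = None  # connection refused or reset counts as a failed attempt
        if status == 200:
            return True
        if attempt < max_attempts - 1:
            # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
            sleep(base_delay * (2 ** attempt))
    # All attempts failed: keep the event for later replay instead of dropping it.
    dead_letters.append(event)
    return False


# Demo with a fake sender: two 503 responses, then success. Sleep is stubbed out.
responses = iter([503, 503, 200])
dlq: List[str] = []
delivered = send_with_retry("event-1", lambda _e: next(responses), dlq, sleep=lambda _s: None)
print(delivered, dlq)  # True []
```

Passing the sender and the sleep function as parameters keeps the sketch independent of any particular HTTP library and makes the retry path easy to exercise before the upgrade.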

Capacity planning

Determine the minimum number of instances required to sustain the current data flow before taking any instances offline. This minimum number of instances is your capacity floor. You must not drop below it at any point during the upgrade.

Best practice:

max_concurrent_offline = total_instances - capacity_floor

For example, with 10 instances where 8 are needed to handle peak load, you can take down 2 instances at a time. Adjust this based on your actual throughput metrics and any headroom in your pipeline.
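
The calculation can be sketched in a few lines of Python. The capacity_floor helper is an assumption added for illustration: it estimates the floor by dividing peak throughput by a per-instance capacity figure, which presumes throughput scales roughly linearly with instance count. The throughput numbers are hypothetical and should be replaced with your own metrics.

```python
import math


def capacity_floor(peak_throughput: float, per_instance_capacity: float) -> int:
    """Estimate the minimum instance count needed to carry peak load (assumes linear scaling)."""
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    return max(1, math.ceil(peak_throughput / per_instance_capacity))


def max_concurrent_offline(total_instances: int, floor: int) -> int:
    """Number of instances that can be offline at once without dropping below the floor."""
    if floor > total_instances:
        raise ValueError("capacity floor exceeds the number of instances; add instances first")
    return total_instances - floor


# Hypothetical figures: peak load of 7.5 units, each instance handles 1.0 unit.
floor = capacity_floor(peak_throughput=7.5, per_instance_capacity=1.0)
print(floor, max_concurrent_offline(10, floor))  # 8 2
```

A result of 0 means there is no headroom for a rolling upgrade, and instances should be added before proceeding.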

Upgrade steps

Perform the following steps to upgrade your Edge Processor instances.

Phase 1: Upgrade the Enterprise instance

  1. Upgrade your Splunk Enterprise instance to version 10.4.
  2. During this upgrade, OAS will automatically update the splunksup and edge binaries on connected EP nodes to their latest versions.
  3. After the upgrade completes, your EP nodes will remain healthy and functional, but they will not yet have automatic reconnection capability. They are still operating under the legacy user identity authentication method.
Note: Only EP nodes that are connected with valid, non-expired tokens will receive binary updates from OAS during the Enterprise upgrade. Ensure all EP instances are in a healthy state before initiating the Enterprise upgrade.

Phase 2: Rolling instance reinstallation

Execute the following cycle for each batch of instances. Keep the number of concurrently offline instances at or below max_concurrent_offline so that the deployment never drops below the capacity floor determined in the Capacity planning section.

  1. Identify the eligible batch.

    Select a set of instances to migrate that does not exceed max_concurrent_offline. During the initial upgrade cycle, all instances will be on the pre-OAuth2 version.

  2. Offboard the batch.

    Note:

    The script-based removal below applies to bare metal and VM deployments. For Docker or Kubernetes deployments, use the native platform mechanisms to properly uninstall and remove running containers.

    Run the uninstall script on each instance in the current batch using the commands provided in the Data Management UI.

    The script removes existing EP configurations and cleans up the agent identity. Verify that each instance has been fully offboarded before proceeding. Review and confirm that existing agent identities are cleaned up.

  3. Confirm load balancer removal.

    Verify that each offboarded instance has been removed from active rotation:

    • For NLB/ALB (HEC traffic): Check the load balancer target group health status. Offboarded instances should appear as unhealthy and be removed from rotation.
    • For Forwarders: Confirm in outputs.conf or using forwarder metrics that the offline instances are no longer receiving data. The Splunk protocol will handle this automatically, but verification is good practice.
    • For Syslog: Confirm via your load balancer or DNS configuration that syslog traffic is not being routed to offline instances.

    Do not proceed to reinstallation until you have confirmed the batch is fully out of rotation. Reinstalling while an instance is still receiving traffic can result in data loss from the in-memory and persistent queues.

  4. Reinstall Edge Processor.

    Note:

    The script-based reinstallation below applies to bare metal and VM deployments. For Docker or Kubernetes deployments, use the native platform mechanisms to bring up new instances with the updated image rather than running the install script directly (for example, update the image tag and roll the deployment in Kubernetes, or replace the container in Docker).

    1. From the Data Management UI, download the latest EP installation script.
    2. Run the script on each instance in the current batch.

    The script will perform the following tasks:

    • Initialize the new splunk-edge-cmp supervisor process.
    • Generate required RSA key pairs.
    • Register the instance with the new OAuth2 authentication flow.
  5. Confirm return to service.

    Once reinstallation is complete, perform the following steps:

    1. Verify that the new instances register with the OpAmp Service using the new OAuth2 client.
    2. Confirm that each instance returns to a healthy status in the Data Management UI.
    3. Confirm that the load balancer has returned the new instances to active rotation (health checks should pass automatically once the instance is healthy).
    4. Observe data flow for a brief stabilization period before proceeding.
  6. Repeat for each eligible instance.

    Return to Step 1 and select the next eligible batch of instances. Continue until all instances have been migrated to the new OAuth2 authentication model.
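
The batch selection in this cycle can be planned ahead of time. The following is a minimal sketch, assuming a hypothetical list of host names and a helper named plan_batches. It only groups instances into batches; offboarding, reinstallation, and verification still happen through the steps above, and a new batch should start only after the previous one has returned to service.

```python
from typing import List


def plan_batches(instances: List[str], capacity_floor: int) -> List[List[str]]:
    """Split instances into batches no larger than total_instances - capacity_floor."""
    batch_size = len(instances) - capacity_floor
    if batch_size < 1:
        raise ValueError("no headroom for a rolling upgrade; add instances first")
    return [instances[i:i + batch_size] for i in range(0, len(instances), batch_size)]


# Hypothetical host names: 10 instances with a capacity floor of 8.
hosts = [f"ep-{n:02d}" for n in range(1, 11)]
for number, batch in enumerate(plan_batches(hosts, capacity_floor=8), start=1):
    print(f"batch {number}: {batch}")
```

With these example numbers, the plan contains five batches of two instances each.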

Post-upgrade verification

After all instances have been migrated, confirm that the following conditions are met:

  • All EP instances show healthy status in the Data Management UI
  • All instances are registered under the new OAuth2 client identity
  • No instances remain on the legacy user identity authentication method
  • Data is flowing through all instances as expected
  • Load balancer health checks are passing for all instances
  • Token rotation is operating without manual intervention. This can be validated after 2 hours by confirming that the nodes remain connected without a reconnection event.
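
If you track these checks in a script or spreadsheet export, the checklist can be expressed as a simple evaluation over per-instance records. This is a sketch only: the InstanceStatus fields, the "oauth2" label, and the unmet_checks helper are hypothetical, and the values would need to be gathered by hand from the Data Management UI, your load balancer, and your own monitoring.

```python
from dataclasses import dataclass
from typing import List

TOKEN_WINDOW_SECONDS = 2 * 60 * 60  # the 2-hour token expiration window


@dataclass
class InstanceStatus:
    name: str
    healthy: bool              # status shown in the Data Management UI
    auth_method: str           # "oauth2" or "legacy" (hypothetical labels)
    data_flowing: bool
    lb_check_passing: bool
    connected_seconds: int     # time observed since reinstallation
    reconnect_events: int      # reconnection events seen in that time


def unmet_checks(status: InstanceStatus) -> List[str]:
    """Return the checklist items this instance does not yet satisfy."""
    problems = []
    if not status.healthy:
        problems.append("not healthy")
    if status.auth_method != "oauth2":
        problems.append("still on legacy authentication")
    if not status.data_flowing:
        problems.append("no data flowing")
    if not status.lb_check_passing:
        problems.append("load balancer health check failing")
    if status.connected_seconds < TOKEN_WINDOW_SECONDS:
        problems.append("observed for less than 2 hours")
    elif status.reconnect_events > 0:
        problems.append("reconnection event seen during observation")
    return problems


# Hypothetical records for two instances.
fleet = [
    InstanceStatus("ep-01", True, "oauth2", True, True, 3 * 3600, 0),
    InstanceStatus("ep-02", True, "legacy", True, True, 600, 0),
]
for instance in fleet:
    print(instance.name, unmet_checks(instance) or "all checks met")
```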

Data loss reference

Review the following scenarios to identify potential causes of data loss.

Scenario | Risk | Mitigation
Instance improperly shut down during migration. | Data in the persistent queue or in memory can be lost. | Always use the uninstall script to offboard; confirm offboarding before reinstalling.
Forwarder sends to an offline instance. | Minimal. The forwarder marks the instance unavailable and queues locally. | Splunk protocol handles this automatically.
HEC load balancer has no healthy back-ends. | Upstream HEC clients receive a non-200 response. Behavior depends on client resilience. | Maintain the capacity floor; ensure HEC clients implement retry/DLQ.
UDP syslog during instance outage. | Data loss is expected. UDP has no delivery guarantee. | Minimize batch size; accept some loss or switch to TCP syslog.
TCP syslog during instance outage. | Minimal with load balancing in place. | Confirm TCP syslog load balancing before starting.