Upgrade Edge Processor to automatically reconnect
Upgrade Edge Processors on Splunk Enterprise to successfully reconnect to the control plane after an extended disconnect.
Upgrade your Splunk Edge Processor (EP) instances to enable the automatic reconnection feature, which was introduced in version 10.4 of Splunk Enterprise. The feature uses OAuth2 authentication to let EP nodes automatically refresh their tokens and reconnect after extended periods of disconnection. This eliminates the prior vulnerability where a node disconnected for longer than two hours would fail to re-authenticate.
Why a fresh install is required
The new automatic reconnection feature cannot be applied through standard package management or OAS package reconciliation. The following tasks are required:
- Generate new RSA key pairs.
- Initialize the `splunk-edge-cmp` supervisor process with a new OAuth2 client identity.
- Register the node under the new OAuth2 authentication flow.
Every EP instance must be fully uninstalled and reinstalled using the latest script from the Data Management UI.
Concepts and terminology
- Edge Processor (EP): A logical deployment unit. A pipeline is deployed to an EP.
- EP Instance: A physical server that executes pipelines and processes data. An EP has one or more instances, and instances on different versions can coexist within the same EP during a rolling upgrade.
- OAS (OpAmp Service): Manages binary updates (`splunksup`, `edge`) during an Enterprise upgrade. It does not handle `splunk-edge-cmp` updates, which require manual reinstallation.
Prerequisites
Prerequisites for upgrading your Edge Processor deployment to enable automatic reconnection.
- A healthy, multi-instance EP deployment (see the following note on single-instance deployments).
- A Splunk Enterprise instance upgraded to version 10.4 or higher.
- Access to the Data Management UI to download the latest installation script.
- Administrative access to the edge instances.
- A clear picture of current data flow and upstream sender configuration.
Upstream sender considerations
Review the following information on how upstream senders are connected to your EP instances. This information determines the load balancing behavior during the rolling upgrade.
| Sender type | Load balancing method | Behavior during instance outage |
|---|---|---|
| Universal forwarders/Heavy forwarders | `outputs.conf` target list | Splunk protocol handles availability, backoff, and back-pressure automatically. An offline instance is marked unavailable; traffic shifts to remaining instances. |
| HEC clients | Load balancers (NLB or ALB) with health checks | Load balancer removes unhealthy back-ends and redistributes traffic. HEC clients should handle non-200 responses with retry/dead letter queue (DLQ) logic, regardless of EP. |
| Syslog servers | DNS round-robin or load balancer | TCP syslog should have minimal loss with load balancing. Connections re-establish to available instances. UDP syslog is inherently lossy and has no resiliency, regardless of destination. |
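The HEC row above recommends retry and dead letter queue (DLQ) logic on the client side. The following Python sketch shows one way that pattern could look. The `send_with_retry` name, the `send` callable, and the attempt and backoff values are all hypothetical; `send` stands in for whatever HTTP call your client already makes and is assumed to return the HTTP status code.

```python
import time
from typing import Callable, List


def send_with_retry(
    event: dict,
    send: Callable[[dict], int],
    dead_letters: List[dict],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver one event; park it in the DLQ if every attempt fails.

    `send` is a hypothetical callable that posts the event and returns the
    HTTP status code. It is not part of any Splunk API.
    """
    for attempt in range(max_attempts):
        try:
            status = send(event)
        except OSError:
            # Connection-level failure: treat it like a non-200 response.
            status = None
        if status == 200:
            return True
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * (2 ** attempt))
    # All attempts failed: keep the event for later replay instead of dropping it.
    dead_letters.append(event)
    return False
```

Events left in `dead_letters` can be replayed once the load balancer reports healthy back-ends again. In practice the DLQ would usually be durable storage rather than an in-memory list.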
Capacity planning
Determine the minimum number of instances required to sustain the current data flow before taking any instances offline. This minimum number of instances is your capacity floor. You must not drop below it at any point during the upgrade.
Best practice:
max_concurrent_offline = total_instances - capacity_floor
For example, with 10 instances where 8 are needed to handle peak load, you can take down 2 instances at a time. Adjust this based on your actual throughput metrics and any headroom in your pipeline.
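As a quick check of the arithmetic above, the formula can be written as a small Python helper. The function name and the validation rules are illustrative assumptions, not part of any Splunk tooling.

```python
def max_concurrent_offline(total_instances: int, capacity_floor: int) -> int:
    """Return how many instances can be offline at once without
    dropping below the capacity floor."""
    if total_instances < 0 or capacity_floor < 0:
        raise ValueError("instance counts must not be negative")
    if capacity_floor > total_instances:
        raise ValueError("capacity floor exceeds the number of instances")
    return total_instances - capacity_floor


# The example from the text: 10 instances, 8 needed at peak load.
print(max_concurrent_offline(10, 8))  # prints 2
```

A result of 0 means that no instance can be taken offline without dropping below the capacity floor.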
Upgrade steps
Perform the following steps to upgrade your Edge Processor instances.
Phase 1: Upgrade the Enterprise instance
- Upgrade your Splunk Enterprise instance to version 10.4.
- During this upgrade, OAS automatically updates the `splunksup` and `edge` binaries on connected EP nodes to their latest versions.
- After the upgrade completes, your EP nodes remain healthy and functional, but they do not yet have automatic reconnection capability. They still operate under the legacy user identity authentication method.
Phase 2: Rolling instance reinstallation
Execute the following cycle for each batch of instances. Keep the number of concurrently offline instances at or below max_concurrent_offline so that the deployment never drops below the capacity floor described in Capacity planning.
1. Identify the eligible batch.
Select a set of instances to migrate that does not exceed `max_concurrent_offline`. During the initial upgrade cycle, all instances will be on the pre-OAuth2 version.
2. Offboard the batch.
Note: The script-based removal below applies to bare metal and VM deployments. For Docker or Kubernetes deployments, use the native platform mechanisms to properly uninstall and remove running containers.
Run the uninstall script on each instance in the current batch using the commands provided in the Data Management UI.
The script removes existing EP configurations and cleans up the agent identity. Before proceeding, verify that each instance has been fully offboarded and that its existing agent identity has been cleaned up.
3. Confirm load balancer removal.
Verify that each offboarded instance has been removed from active rotation:
- For NLB/ALB (HEC traffic): Check the load balancer target group health status. Offboarded instances should appear as unhealthy and be removed from rotation.
- For forwarders: Confirm in `outputs.conf` or through forwarder metrics that the offline instances are no longer receiving data. The Splunk protocol handles this automatically, but verification is good practice.
- For syslog: Confirm through your load balancer or DNS configuration that syslog traffic is not being routed to offline instances.
Do not proceed to reinstallation until you have confirmed the batch is fully out of rotation. Reinstalling while an instance is still receiving traffic can result in data loss from the in-memory and persistent queues.
4. Reinstall Edge Processor.
Note: The script-based reinstallation below applies to bare metal and VM deployments. For Docker or Kubernetes deployments, use the native platform mechanisms to bring up new instances with the updated image rather than running the install script directly (for example, update the image tag and roll the deployment in Kubernetes, or replace the container in Docker).
- From the Data Management UI, download the latest EP installation script.
- Run the script on each instance in the current batch.
The script performs the following tasks:
- Initialize the new `splunk-edge-cmp` supervisor process.
- Generate required RSA key pairs.
- Register the instance with the new OAuth2 authentication flow.
5. Confirm return to service.
Once reinstallation is complete, perform the following steps:
- Verify that the new instances register with the OpAmp Service using the new OAuth2 client.
- Confirm that each instance returns to a healthy status in the Data Management UI.
- Confirm that the load balancer has returned the new instances to active rotation (health checks should pass automatically once the instance is healthy).
- Observe data flow for a brief stabilization period before proceeding.
6. Repeat for each remaining batch.
Return to step 1 and select the next eligible batch of instances. Continue until all instances have been migrated to the new OAuth2 authentication model.
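The rolling cycle above can be summarized as a loop. The following Python sketch only illustrates the ordering and the two gates (out of rotation before reinstall, healthy before the next batch). The `offboard`, `out_of_rotation`, `reinstall`, and `healthy` callables are hypothetical placeholders for the manual or scripted actions in your environment; none of them correspond to a real Splunk API.

```python
from typing import Callable, Iterator, List, Sequence


def plan_batches(instances: Sequence[str], max_offline: int) -> Iterator[List[str]]:
    """Split the instance list into batches of at most max_offline."""
    if max_offline < 1:
        raise ValueError("max_offline must be at least 1 to take an instance offline")
    for start in range(0, len(instances), max_offline):
        yield list(instances[start:start + max_offline])


def rolling_reinstall(
    instances: Sequence[str],
    max_offline: int,
    offboard: Callable[[str], None],          # hypothetical: run the uninstall script
    out_of_rotation: Callable[[str], bool],   # hypothetical: load balancer / forwarder check
    reinstall: Callable[[str], None],         # hypothetical: run the install script
    healthy: Callable[[str], bool],           # hypothetical: status in the Data Management UI
) -> List[str]:
    """Migrate instances batch by batch and return the migrated hosts in order."""
    migrated: List[str] = []
    for batch in plan_batches(instances, max_offline):
        for host in batch:
            offboard(host)
        # Gate 1: never reinstall while an instance still receives traffic.
        still_serving = [h for h in batch if not out_of_rotation(h)]
        if still_serving:
            raise RuntimeError(f"still in rotation: {still_serving}")
        for host in batch:
            reinstall(host)
        # Gate 2: do not start the next batch until this one is back in service.
        unhealthy = [h for h in batch if not healthy(h)]
        if unhealthy:
            raise RuntimeError(f"not healthy after reinstall: {unhealthy}")
        migrated.extend(batch)
    return migrated
```

The sketch stops at the first failed gate rather than retrying, which leaves the decision about waiting or rolling back to the operator. A real runbook would likely poll each gate with a timeout and add a stabilization pause between batches.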
Post-upgrade verification
After all instances have been migrated, confirm the following:
- All EP instances show healthy status in the Data Management UI.
- All instances are registered under the new OAuth2 client identity.
- No instances remain on the legacy user identity authentication method.
- Data is flowing through all instances as expected.
- Load balancer health checks are passing for all instances.
- Token rotation is operating without manual intervention. This can be validated after 2 hours by confirming that the nodes remain connected without a reconnection event.
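Part of the checklist above can be expressed as a small script once you have collected the status of each instance. The following Python sketch assumes a hypothetical `InstanceStatus` record that you populate by hand or from your own monitoring; the field names and the `"oauth2"` label are illustrative and are not taken from any Splunk API. It treats a session that has stayed connected for at least two hours as evidence that token rotation works, which mirrors the last check in the list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# The text suggests validating token rotation after 2 hours.
TOKEN_WINDOW = timedelta(hours=2)


@dataclass
class InstanceStatus:
    name: str
    healthy: bool
    auth_method: str           # hypothetical labels, for example "oauth2" or "legacy"
    connected_since: datetime  # start of the current uninterrupted connection


def verify(statuses: List[InstanceStatus], now: datetime) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    for s in statuses:
        if not s.healthy:
            problems.append(f"{s.name}: not healthy")
        if s.auth_method != "oauth2":
            problems.append(f"{s.name}: still on {s.auth_method} authentication")
        if now - s.connected_since < TOKEN_WINDOW:
            problems.append(f"{s.name}: connected for less than 2 hours, token rotation not yet confirmed")
    return problems
```

An empty result covers only the health, identity, and token rotation checks. Data flow and load balancer health still need to be confirmed separately.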
Data loss reference
Review the following scenarios to identify potential causes of data loss.
| Scenario | Risk | Mitigation |
|---|---|---|
| Instance improperly shut down during migration. | Data in the persistent queue or in memory can be lost. | Always offboard with the uninstall script; confirm the offboard before reinstalling. |
| Forwarder sends to an offline instance. | Minimal. The forwarder marks the instance unavailable and queues data locally. | Splunk protocol handles this automatically. |
| HEC load balancer has no healthy back-ends. | Upstream HEC clients receive non-200 response. Behavior depends on client resilience. | Maintain capacity floor; ensure HEC clients implement retry/DLQ. |
| UDP syslog during instance outage. | Data loss is expected. UDP has no delivery guarantee. | Minimize batch size; accept some loss or switch to TCP syslog. |
| TCP syslog during instance outage. | Minimal with load balancing in place. | Confirm TCP syslog load balancing before starting. |