Cross-Region Disaster Recovery service description
Use the Splunk Cloud Platform Cross-Region Disaster Recovery service to maintain your critical Splunk infrastructure during a cloud service provider (CSP) outage.
Overview of Cross-Region Disaster Recovery
The service runs in conjunction with your Splunk Cloud Platform environment. Splunk Cloud Platform runs on several CSPs, such as Amazon Web Services (AWS). After you sign up and pay for the service, Splunk Support works with you to set it up. You might then need to perform additional post-integration steps.
After setup is complete, the service replicates your Splunk Cloud Platform environment in its primary CSP region to another environment in a secondary region. This replication happens automatically and requires no action from you. Replication begins when you start using the service and continues for as long as you maintain the service.
Definition of a qualified regional disaster
Splunk continuously monitors the health of the cloud service provider region that hosts your Splunk Cloud Platform environment. If service in that environment degrades significantly, and Splunk determines through its monitoring methods that the degradation is due to a failure in the cloud service, Splunk declares a disaster and initiates disaster recovery procedures for your environment. This action happens automatically and doesn't require action or acknowledgment from you.
Splunk uses the following conditions to determine that a qualified regional disaster is imminent or taking place:
- Data ingestion into your Splunk Cloud Platform environment through forwarders or HTTP Event Collector inputs is blocked
- An outage severely blocks or prevents indexing
- An outage severely affects or prevents search
- No one is able to log into Splunk Cloud Platform
- Splunk can confirm a failure with the cloud service provider
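As an illustration only, the conditions above can be modeled as a simple check. This sketch assumes, based on the description in this section, that a disaster qualifies when at least one service symptom is present and a cloud service provider failure is confirmed. The RegionHealth type, its field names, and the is_qualified_regional_disaster function are hypothetical. They do not represent Splunk's actual monitoring logic or any Splunk API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionHealth:
    """Hypothetical snapshot of the conditions listed above."""
    ingestion_blocked: bool = False      # forwarder or HEC ingestion is blocked
    indexing_blocked: bool = False       # indexing is severely blocked or prevented
    search_affected: bool = False        # search is severely affected or prevented
    login_unavailable: bool = False      # no one can log in
    csp_failure_confirmed: bool = False  # a CSP failure can be confirmed


def is_qualified_regional_disaster(health: RegionHealth) -> bool:
    """Assumed rule: at least one service symptom AND a confirmed CSP failure."""
    service_degraded = any((
        health.ingestion_blocked,
        health.indexing_blocked,
        health.search_affected,
        health.login_unavailable,
    ))
    return service_degraded and health.csp_failure_confirmed
```

Under this assumed rule, degraded search without a confirmed CSP failure would not qualify, which mirrors the requirement that the degradation be due to a failure in the cloud service.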
If you think your Splunk Cloud Platform environment is suffering from service degradation, you can file a support case. Splunk Support reviews the information in your case and determines whether to declare a disaster based on that information and other factors related to the current operation of the CSP region. See Report a potential Splunk Cloud Platform service failure for details on how to report an outage.
What happens when Splunk declares a qualified regional disaster
After Splunk determines that a qualified regional disaster is in progress, it begins what is known as a failover. There are two types of failovers in Cross-Region Disaster Recovery: planned and unplanned.
- A planned failover occurs when you ask Splunk to test your Splunk Cloud Platform environment for disaster recovery purposes. It requires a specific procedure to run. During a planned failover, the primary Splunk Cloud Platform environment runs as normal, and Splunk does not introduce artificial failures of any kind. See Schedule a planned disaster recovery for your Splunk Cloud Platform environment for more information about planned failovers and how to start them.
- Any other kind of failover is unplanned.
When an unplanned failover happens, Splunk recovers your Splunk Cloud Platform environment from the failing primary CSP region to the secondary CSP region. It does this by reconfiguring networking caches to point to the Splunk Cloud Platform instance in the secondary region rather than the failing primary region. Because Splunk continuously replicates data from the primary to the secondary region, data loss is minimal. See Cross-Region Disaster Recovery service level agreements and limitations for specific details on what you can expect of the backup Splunk Cloud Platform environment during an unplanned failover.
What happens after a failover
After the failover completes, Splunk maintains your Splunk Cloud Platform environment in the secondary CSP region until the primary CSP region that hosts your Splunk Cloud Platform environment recovers. During that time, data ingestion, indexing, searching, reporting, and other Splunk Cloud Platform functions happen in the secondary CSP region.
This state continues until Splunk determines that the primary CSP region has recovered sufficiently to support your Splunk Cloud Platform environment again.
Recovery to the primary region
After Splunk determines that the primary CSP region that hosts your Splunk Cloud Platform environment has recovered and is available for hosting your Splunk Cloud Platform environment, it declares an end to the disaster and works with you to perform what is known as a "failback" to the primary region.
Splunk attempts to fail back your Splunk Cloud Platform environment as quickly as possible after it declares a disaster over. Splunk Support needs your input to schedule a maintenance window for this procedure. Failbacks must happen within two weeks of the primary CSP region being restored to full service. Splunk does not expect any loss of ingested data during a failback.
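When you plan a maintenance window with Splunk Support, it can help to work out the latest acceptable date in advance. The following is a minimal sketch of that arithmetic. The function names are hypothetical, and the sketch assumes that the two-week limit is measured from the time the primary CSP region is restored to the start of the maintenance window, which this description does not specify.

```python
from datetime import datetime, timedelta, timezone

# "Within two weeks" of the primary region being restored, per this section.
FAILBACK_WINDOW = timedelta(weeks=2)


def failback_deadline(primary_restored_at: datetime) -> datetime:
    """Return the latest time by which the failback should happen."""
    return primary_restored_at + FAILBACK_WINDOW


def window_is_valid(window_start: datetime, primary_restored_at: datetime) -> bool:
    """Assumed check: the maintenance window starts after restoration
    and no later than the failback deadline."""
    deadline = failback_deadline(primary_restored_at)
    return primary_restored_at <= window_start <= deadline


restored = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(failback_deadline(restored).isoformat())  # 2024-03-15T00:00:00+00:00
```

Using timezone-aware timestamps avoids ambiguity when your operational contacts and the CSP region are in different time zones.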
After the failback is complete, all Splunk Cloud Platform operations resume in the primary CSP region that hosts your Splunk Cloud Platform environment.
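The overall sequence of failover, operation in the secondary region, and failback can be pictured as a small state machine. This is a conceptual sketch for readers only. The state names, event names, and next_state function are hypothetical and do not correspond to any Splunk API or internal implementation.

```python
from enum import Enum


class DRState(Enum):
    PRIMARY_ACTIVE = "primary_active"      # normal operation in the primary region
    FAILING_OVER = "failing_over"          # failover in progress
    SECONDARY_ACTIVE = "secondary_active"  # operating in the secondary region
    FAILING_BACK = "failing_back"          # failback in progress


# Hypothetical events mapped to the transitions described in this topic.
TRANSITIONS = {
    (DRState.PRIMARY_ACTIVE, "disaster_declared"): DRState.FAILING_OVER,
    (DRState.FAILING_OVER, "failover_complete"): DRState.SECONDARY_ACTIVE,
    (DRState.SECONDARY_ACTIVE, "failback_started"): DRState.FAILING_BACK,
    (DRState.FAILING_BACK, "failback_complete"): DRState.PRIMARY_ACTIVE,
}


def next_state(state: DRState, event: str) -> DRState:
    """Return the state that follows `event`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}") from None
```

In this model, a failback can start only from the secondary region, which matches the order of events described above.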
Communications from Splunk during the disaster recovery process
Throughout the disaster recovery process, Splunk automatically notifies the operational contacts that you specify about failover and failback events as they occur. These notifications include details such as start and end time stamps.