Manage a High Availability Deployment
This page describes how to manage and troubleshoot Controllers as a high availability (HA) pair.
Set Up Monitoring for the HA Pair
You can set up monitoring for your HA pair by installing another Controller to act as the monitoring Controller.
Set Up App Agents for Monitoring
Install and Set Up Machine Agents for Monitoring
- Install the Machine Agent on the primary Controller box. Do not start the agent.
- Repeat step 1 for the secondary Controller.
- Configure the Machine Agent properties for both Machine Agents by editing the controller-info.xml file located in the <machine_agent_home>/conf directory.
  - Update the <controller-host> to the monitoring Controller's IP.
  - Model the rest of your controller-info.xml file.
- Start both Machine Agents.
- In the Enterprise Console UI, select your Controller Monitor Platform, and navigate to the Controller page.
- Click on External URL on the widget to open the UI of the monitoring Controller.
- Log in to the Controller. You should be able to see the monitoring application for both the primary and secondary Controllers.
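If you manage several Machine Agents, the <controller-host> update described above can be scripted. The following is a minimal sketch rather than a supported tool: set_controller_host is a hypothetical helper name, and the sketch assumes that the <controller-host> element sits on a single line of the file.

```shell
# Hypothetical helper (not part of the product): rewrite the <controller-host>
# value in a Machine Agent controller-info.xml file.
# Assumes the element is on one line and the new host contains no "|" or "&".
set_controller_host() {
  conf_file="$1"   # path to the agent's conf/controller-info.xml
  new_host="$2"    # IP or hostname of the monitoring Controller
  tmp_file="${conf_file}.tmp"

  # Write to a temporary file first so a failed sed never truncates the original.
  sed "s|<controller-host>[^<]*</controller-host>|<controller-host>${new_host}</controller-host>|" \
    "$conf_file" > "$tmp_file" && mv "$tmp_file" "$conf_file"
}
```

Run it once per agent before starting the agents, passing the path to that agent's controller-info.xml and the monitoring Controller's IP. All other elements in the file are left unchanged.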
Bouncing the Primary Controller Without Triggering Failover
The Enterprise Console does not allow you to stop and start the primary Controller without initiating failover. To work around this, perform the following steps:
Starting and Stopping the Controller
The Enterprise Console does not allow you to shut down the primary Controller. However, you can restart the secondary Controller via the start and stop Controller commands.
To start or stop the Controller manually, use the following commands:
- To start:
bin/platform-admin.sh start-controller-appserver --with-db
- To stop:
bin/platform-admin.sh stop-controller-appserver --with-db
Automatic Failover
The Enterprise Console includes the High Availability (HA) module, which uses the Controller Watchdog for auto-failover. Auto-failover is enabled while the watchdog script is running and disabled while it is stopped.
You can also disable or enable automatic failover through the CLI.
To disable or enable the Controller Watchdog with the CLI, use the following commands:
- To stop the Controller Watchdog:
./platform-admin.sh submit-job --job stop-controller-watchdog --service controller
- To start the Controller Watchdog:
./platform-admin.sh submit-job --job start-controller-watchdog --service controller
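The two watchdog jobs above can be wrapped in one small function so that enabling and disabling auto-failover reads the same way in scripts. This is a sketch under stated assumptions: set_auto_failover and the PLATFORM_ADMIN variable are hypothetical names introduced here, and only the two submit-job commands shown above are used.

```shell
# Path to the Enterprise Console CLI; override it if you run from another directory.
PLATFORM_ADMIN="${PLATFORM_ADMIN:-./platform-admin.sh}"

# Hypothetical wrapper: "enable" starts the Controller Watchdog, "disable" stops it.
set_auto_failover() {
  case "$1" in
    enable)  job="start-controller-watchdog" ;;
    disable) job="stop-controller-watchdog" ;;
    *) echo "usage: set_auto_failover enable|disable" >&2; return 2 ;;
  esac
  "$PLATFORM_ADMIN" submit-job --job "$job" --service controller
}
```

For example, set_auto_failover disable before planned maintenance and set_auto_failover enable afterwards.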
Performing a Manual Failover and Failback
To fail over from the primary to the secondary manually, click the HA Failover option on the Controller page of the Enterprise Console or run the following command on the Enterprise Console host:
bin/platform-admin.sh submit-job --service controller --job ha-failover --platform-name <name_of_the_platform>
This makes the Appserver on the secondary the primary and the database on the secondary the replication master. It also changes the old primary to secondary.
The process for performing a failback to the old primary is the same as failing over to the secondary. You can run the following command on the Enterprise Console host:
bin/platform-admin.sh submit-job --service controller --job ha-failover --platform-name <name_of_the_platform>
Initiate Controller Database Incremental Replication
Re-enable Broken Replication
Incremental replication (replication via rsync while the primary database is up) is required when database replication on the secondary Controller lags behind the primary Controller by more than three days. This type of replication allows the primary Controller to keep operating while the disk contents are copied to the secondary node.
To initiate incremental replication:
- Run the following command on the Enterprise Console host. This launches a continuously running background job.
bin/platform-admin.sh submit-job --service controller --job incremental-replication
- Make sure replication occurs four or more times by running either one of the following commands:
  - cd <controller_home>/controller-ha
    ./ha_replicate.sh -r status
  - cd <controller_home>/controller-ha/tmp
    cat replication.status
Note: If replication fails, go to the secondary host and stop all rsync and ha-replicate.sh processes. Then try running the incremental-replication job again.
- Finalize the job by running the following command on the Enterprise Console host. This stops the incremental replication loop. The command will restart the primary Controller, resulting in downtime.
bin/platform-admin.sh submit-job --service controller --job finalize-replication
- Make sure replication is working by checking that there is no significant gap between the primary and secondary Controllers. You can run the following command on the Enterprise Console host to check the replication status. It may take a few minutes to display the secondary status.
bin/platform-admin.sh show-service-status --platform-name <platform_name> --service controller
Add a Secondary Controller Using Incremental Replication
You can convert a single Controller with a large amount of data to an HA pair by using incremental replication. This way, you can rsync most of the Controller data while the Controller is still running, limiting the downtime of adding a secondary Controller.
To add a secondary Controller using incremental replication:
- Start the incremental replication, giving host and rsync parameters. This launches a continuously running background job.
bin/platform-admin.sh submit-job --service controller --job incremental-replication --args controllerSecondaryHost=1.1.1.1 rsyncThrottle=40000 rsyncCompress=true
- Make sure replication occurs four or more times by checking <controller_home>/controller-ha/tmp/replication.status on the primary database host.
Sample rsync status file output:
rsync started at Mon Mar 5 11:49:56 PST 2018
rsync completed at Mon Mar 5 11:50:56 PST 2018
rsync started at Mon Mar 5 11:51:01 PST 2018
rsync completed at Mon Mar 5 11:51:11 PST 2018
Note: If replication fails, go to the secondary host and stop all rsync and ha-replicate.sh processes. Then try running the incremental-replication job again.
- Run the add-secondary job. The Enterprise Console will perform a final rsync and add the secondary Controller. The command will restart the primary Controller, resulting in downtime.
bin/platform-admin.sh submit-job --service controller --job add-secondary --args controllerSecondaryHost=secondary mysqlRootPassword='password'
Note: Until you trigger the add-secondary command, the secondary Controller is not added to the Enterprise Console platform. Therefore, the Enterprise Console will not be able to perform any other operations on the secondary Controller.
If you need to stop replication, you can run the following command:
bin/platform-admin.sh submit-job --service controller --job stop-incremental-replication
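Both incremental replication procedures ask you to confirm that replication has occurred at least four times before you continue. That check can be sketched as a small function over replication.status. The function name has_enough_passes is hypothetical, and the sketch assumes each finished pass is recorded as a line beginning with "rsync completed at", as in the sample status file output shown above.

```shell
# Hypothetical helper: succeed when replication.status records at least
# MIN completed rsync passes (default 4).
# Assumes each finished pass is logged as a line starting "rsync completed at".
has_enough_passes() {
  status_file="$1"   # e.g. <controller_home>/controller-ha/tmp/replication.status
  min="${2:-4}"

  [ -f "$status_file" ] || return 1

  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  completed=$(grep -c '^rsync completed at ' "$status_file" || true)
  [ "$completed" -ge "$min" ]
}
```

A script could call has_enough_passes on the status file and submit the finalize-replication or add-secondary job only when it succeeds. Because both jobs restart the primary Controller, you may prefer to keep that final step manual.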
Set Replication Factors for Rsync Threads
Using the Enterprise Console UI or the CLI, you can set the number of parallel rsync threads as a job parameter when you perform incremental or finalize replication.
- From the Enterprise Console UI:
  - Log in to the Enterprise Console and access the Controller page.
  - From the More menu, based on which replication you are performing, select either Incremental Replication or Finalize Replication.
  - Enter a number in the Number of parallel rsync threads field and click Submit. The default value is 1.
- From the CLI, based on which replication you are performing, run either of the following commands from the Enterprise Console host and set the numberThreadForRsync argument.
bin/platform-admin.sh submit-job --job incremental-replication --args numberThreadForRsync=<number>
bin/platform-admin.sh submit-job --job finalize-replication --args numberThreadForRsync=<number>
Enable MySQL Parallel Replication
Using the Enterprise Console UI or the CLI, you can enable MySQL parallel replication (available from MySQL 5.7) when you perform finalize replication.
- From the Enterprise Console UI:
  - Log in to the Enterprise Console and access the Controller page.
  - From the More menu, select Finalize Replication.
  - Select the Database parallel replication check box to enable parallel replication with the MySQL database.
  - Click Submit.
- From the CLI, run the following command from the Enterprise Console host to enable MySQL parallel replication. The default value is true.
bin/platform-admin.sh submit-job --job finalize-replication --args dbParallelReplication=true
Troubleshooting the Incremental Replication Status
If your first incremental replication run is taking longer than usual, you can check the replication status by executing either one of the below commands:
- cd <controller_home>/controller-ha
  ./ha_replicate.sh -r status
- cd <controller_home>/controller-ha/tmp
  cat replication.status
Re-enable Controller Database Replication
The Controller databases can be synchronized using the replicate script if they have been out of sync for more than seven days. Synchronizing a database that is more than seven days behind a master is considered reviving a Controller database. Reviving a database involves the same procedure as adding a new secondary Controller to an existing production Controller, as described in Set Up the Secondary Controller and Initiate Replication. You can also follow these steps in the case of an HA failover that failed at replication.
To re-enable replication or revive a Controller database:
Backing Up and Restoring Controller Data in an HA Pair
An HA deployment makes backing up Controller data relatively straightforward since the secondary Controller offers a complete set of production data on which you can perform a cold backup without disrupting the primary Controller service.
After setting up HA, perform a backup by stopping the Controller on the Enterprise Console and performing a file-level copy of the Splunk AppDynamics home directory (that is, a cold backup). When finished, restart the Controller from the Enterprise Console. The secondary will then bring its data up to date with the primary.
When restoring the database from a backup in an HA or standalone environment, you should check that the primary and secondary server ha.type and ha.mode are set to active and passive, respectively.
Updating the Configuration in an HA Pair
The Enterprise Console copies any file-level configuration customizations made on the primary Controller to the secondary Controller, such as changes in the Jetty XML files and db.cnf.
If you need to modify the Controller configuration later, always make those changes in the Enterprise Console on the Controller Settings page under Configurations. These changes are preserved during upgrades. Changes made outside the Enterprise Console are not preserved after an upgrade.
Troubleshooting HA
Controller Diagnostic Data
The Enterprise Console writes log messages pertaining to HA to the platform-admin-server.log on the Enterprise Console host.
To diagnose the Controller, run the following command:
bin/platform-admin.sh submit-job --platform-name <name_of_the_platform> --job diagnosis --service controller
Refer to the Controller diagnostic data in the platform-admin-server.log.
Sample Controller diagnostic data
Linux
Controller diagnostic data:
123.45.0.1:
controller_database: running
controller_appserver: running
reports_service: running
operating_system: Linux
controller_version: 004-004-001-000
controller_performance_profile: small
controller_ha_type: primary
controller_appserver_mode: active
controller_metric_data_per_min: N/A
slave_io_state: Waiting for master to send event
seconds_behind_master: 0
master_server_id: 567.
master_host: controller-secondary
master_ssl_allowed: No
123.45.0.2:
controller_database: running
controller_appserver: not running
reports_service: running
operating_system: Linux
controller_version: 004-004-001-000
controller_performance_profile: small
controller_ha_type: secondary
controller_appserver_mode: passive
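When reading diagnostic output like the sample above, the first thing to confirm is usually that one host reports primary and the other secondary. The following is a minimal sketch of that check. The function names ha_roles and ha_roles_are_valid are hypothetical, and the sketch assumes you have copied the diagnostic data into a file in the "key: value" layout shown in the sample.

```shell
# Print the controller_ha_type value of every host found in saved diagnostic output.
# Assumes "key: value" lines as in the sample; leading indentation is tolerated.
ha_roles() {
  awk -F': *' '$1 ~ /controller_ha_type$/ {
    role = $2
    sub(/[ \t\r]+$/, "", role)   # drop trailing whitespace
    print role
  }' "$1"
}

# Succeed only when the output shows exactly one primary and one secondary.
ha_roles_are_valid() {
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  primaries=$(ha_roles "$1" | grep -c '^primary$' || true)
  secondaries=$(ha_roles "$1" | grep -c '^secondary$' || true)
  [ "$primaries" -eq 1 ] && [ "$secondaries" -eq 1 ]
}
```

A primary/primary or secondary/secondary pair makes ha_roles_are_valid fail, which is a cue to inspect the Controller roles before running other jobs.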
Invalid HA Controller Roles
If your HA Controller roles in the Controller databases are incorrect, the Enterprise Console will prevent discover and upgrade jobs. An invalid HA Controller state is when both of your Controller role types are identical, such as in a primary/primary or secondary/secondary case.
To fix this issue:
- Identify which server is the primary.
  - Log in to one of the Controller databases by running the following command in the Controller installation directory:
bin/controller.sh login-db
  - Run the following command:
select * from global_configuration_local where name='ha.controller.type';
- Ensure that ha.controller.type is set correctly in the database.
  - Log in to the Controller database you would like to change by running the following command in the Controller installation directory:
bin/controller.sh login-db
  - Run the following commands to set the database to the primary or secondary:
    - Primary
use controller;
update global_configuration_local set value='primary' where name='ha.controller.type';
update global_configuration_local set value='active' where name='appserver.mode';
    - Secondary
use controller;
update global_configuration_local set value='secondary' where name='ha.controller.type';
update global_configuration_local set value='passive' where name='appserver.mode';
- Restart the database for the change to take effect on the Appserver:
bin/platform-admin.sh stop-controller-appserver --with-db
bin/platform-admin.sh start-controller-appserver --with-db
If the secondary Appserver is already in a shutdown state, then there is no need to restart the database.
- Verify the replication is healthy:
show slave status\G
Slave_IO_Running and Slave_SQL_Running should show Yes.
You may now retry the discover and upgrade job.
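The replication health check in the last step can be automated once the output of show slave status\G has been saved to a file. This is a sketch only: replication_is_healthy is a hypothetical name, how you capture the output is left to you, and the sketch assumes the usual vertical "Field: value" layout that \G produces.

```shell
# Hypothetical helper: succeed only when saved "show slave status\G" output
# reports Yes for both Slave_IO_Running and Slave_SQL_Running.
replication_is_healthy() {
  awk -F': *' '
    { value = $2; sub(/[ \t\r]+$/, "", value) }   # trim trailing whitespace
    $1 ~ /Slave_IO_Running$/  { io = value }      # "$" skips longer field names
    $1 ~ /Slave_SQL_Running$/ { sql = value }     # such as Slave_SQL_Running_State
    END { exit (io == "Yes" && sql == "Yes") ? 0 : 1 }
  ' "$1"
}
```

If either field is missing or shows anything other than Yes, the function fails, so it errs on the side of reporting unhealthy replication.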
Failover Prevention
If failover is prevented on your Controller HA configuration, it may be due to one of two scenarios:
- The secondary database is down. Failover cannot occur when the secondary database is not running. To fix this issue, restart the secondary database by running the following command on the secondary host:
bin/controller.sh start-db
If this does not enable failover, then it may be due to the second scenario.
- Database replication is not healthy. Failover is not allowed when the database replication is not healthy. There are various reasons why this may be the case. Contact customer support to correct the issue.