High Availability Considerations

Splunk AppDynamics Self-Hosted Virtual Appliance supports high availability by implementing automatic failover mechanisms.

A high-availability architecture removes single points of failure in critical Virtual Appliance components, allowing you to continuously monitor your applications even during infrastructure failures.

Database Layer

The database layer comprises a MySQL InnoDB Cluster, managed by the Oracle MySQL Operator, with three MySQL server pods, three MySQL Router pods, and TLS enabled, and a PostgreSQL cluster, managed by the Percona Operator (pgv2), with three data pods, pgBouncer connection proxies, and pgBackRest backups, configured for TLS and Patroni-managed failover. The layer is designed to minimize downtime and prevent data loss during failover events.
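
As a quick way to verify the MySQL side of this layer, the following sketch queries Group Replication membership directly; the service host, port, and credentials are placeholders for illustration, not Virtual Appliance defaults.

```python
# Minimal sketch: check MySQL Group Replication membership and roles.
# Host, port, and credentials are hypothetical placeholders.
import mysql.connector

conn = mysql.connector.connect(
    host="mysql.appdynamics.svc.cluster.local",  # assumed in-cluster service name
    port=3306,
    user="monitor",
    password="changeme",
)
cur = conn.cursor()
cur.execute(
    "SELECT member_host, member_role, member_state "
    "FROM performance_schema.replication_group_members"
)
for member_host, member_role, member_state in cur:
    # A healthy cluster shows one PRIMARY and two SECONDARY members, all ONLINE.
    print(member_host, member_role, member_state)
cur.close()
conn.close()
```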

The Virtual Appliance ensures high availability at the database layer through MySQL group replication with automatic primary re-election and MySQL Router failover, and through PostgreSQL failover managed by Patroni, with TLS enabled across components. Multi-node deployments enable continuous availability, and the Virtual Appliance routes database traffic dynamically through the routers and connection proxies so that client connections follow the current primary after a failover.
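
To see which PostgreSQL pod Patroni currently considers primary, you can query the Patroni REST API on each data pod, as in the sketch below; the pod hostnames and port 8008 are assumptions, and certificate verification is skipped for brevity.

```python
# Minimal sketch: ask Patroni on each PostgreSQL pod for its current role.
# Pod hostnames and port 8008 are assumptions for this illustration.
import requests

pods = ["pg-db-0", "pg-db-1", "pg-db-2"]  # hypothetical pod hostnames
for pod in pods:
    try:
        status = requests.get(f"https://{pod}:8008/", timeout=5, verify=False).json()
    except requests.RequestException as exc:
        print(pod, "unreachable:", exc)
        continue
    # Patroni reports the member role (primary or replica) and its running state.
    print(pod, status.get("role"), status.get("state"))
```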

Messages

Kafka is deployed on the Virtual Appliance to provide reliable message streaming and management. The architecture consists of three Kafka broker nodes and three ZooKeeper replicas working together to ensure high availability and fault tolerance. Key configurations include a default replication factor of three and a minimum in-sync replicas setting of two, so messages remain available even if a broker goes down. Secure connections are enabled using SCRAM/TLS listeners.
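
The effect of the minimum in-sync replicas setting is easiest to see from a producer configured with acks="all": a write is acknowledged only after at least two in-sync replicas hold it. The sketch below uses the kafka-python client; the bootstrap address, credentials, and CA path are placeholders.

```python
# Minimal sketch: produce over a SCRAM/TLS listener with acks="all", so a record is
# acknowledged only after the in-sync replica set (at least 2 of 3 brokers) stores it.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka-bootstrap:9093",   # assumed bootstrap address
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username="appd-user",            # placeholder credentials
    sasl_plain_password="changeme",
    ssl_cafile="/etc/ssl/certs/kafka-ca.crt",   # placeholder CA bundle
    acks="all",                                 # wait for all in-sync replicas
    retries=5,                                  # ride out a broker failover
)
producer.send("example-topic", b"payload")
producer.flush()
producer.close()
```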

Most Kafka topics use Strimzi's default of one replica per topic, while critical internal topics, such as those for consumer offsets and transactions, use the cluster-wide replication settings for extra reliability. This Strimzi-managed setup balances workload and helps ensure message delivery even during broker failures.
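
To confirm how a given topic is replicated across the three brokers, the kafka-python admin client can list per-partition replica and in-sync replica assignments, as in this sketch; the connection settings are the same placeholders as above, and the output format assumes the kafka-python metadata representation.

```python
# Minimal sketch: list replica and ISR assignments for an internal topic.
from kafka import KafkaAdminClient

admin = KafkaAdminClient(
    bootstrap_servers="kafka-bootstrap:9093",   # assumed bootstrap address
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username="appd-user",
    sasl_plain_password="changeme",
    ssl_cafile="/etc/ssl/certs/kafka-ca.crt",
)
for topic in admin.describe_topics(["__consumer_offsets"]):
    for partition in topic["partitions"]:
        # Internal topics such as __consumer_offsets should show three replicas each.
        print(topic["topic"], partition["partition"],
              "replicas:", partition["replicas"], "isr:", partition["isr"])
admin.close()
```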

Search and Analytics

The Virtual Appliance runs Elasticsearch as a three-node cluster that provides search and indexing for event data and metadata. Each node stores data, routes queries, and handles ingestion, supporting flexible failover and workload distribution. In this setup, the Events Service configures event and metadata indices with no replicas (number_of_replicas=0), so there are no redundant copies of the data.

If a node becomes unavailable, Elasticsearch marks the cluster health as yellow until the node rejoins, indicating missing redundancy. Additional settings for shards, caches, TLS, and slowlogs are managed by the Events configuration to optimize performance and security. This architecture ensures fast search and indexing, with cluster health clearly reflecting node availability.
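
A quick way to observe this behavior is to check cluster health together with the per-index replica counts, as in the sketch below; the endpoint and credentials are placeholders, and the arguments follow the elasticsearch-py 8.x client style.

```python
# Minimal sketch: report cluster health and confirm indices run with zero replicas.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://events-elasticsearch:9200",   # assumed in-cluster endpoint
    basic_auth=("admin", "changeme"),      # placeholder credentials
    verify_certs=False,                    # verification skipped for brevity
)

health = es.cluster.health()
# Anything other than "green" indicates shards are unallocated while a node is out.
print("status:", health["status"],
      "nodes:", health["number_of_nodes"],
      "unassigned shards:", health["unassigned_shards"])

for row in es.cat.indices(format="json"):
    # "rep" is the configured replica count; event and metadata indices show 0.
    print(row["index"], "health:", row["health"], "replicas:", row["rep"])
```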

Application Services

The Controller is deployed as a single-replica, stateless service that utilizes persistent volumes for storing its database, logs, and custom actions. Kubernetes manages its lifecycle by handling pod restarts in case of failures, ensuring the Controller is quickly rescheduled and persistent storage is reattached so operations can resume with minimal disruption. During any downtime, agents temporarily buffer their metrics and automatically reconnect to deliver them once the Controller is back online, preventing data loss.
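
During a Controller pod restart, downstream tooling can simply poll the Controller until it reports healthy again, much as agents do before they reconnect and flush buffered metrics. The sketch below polls the Controller's serverstatus REST endpoint; the hostname is a placeholder and certificate verification is skipped for brevity.

```python
# Minimal sketch: wait for the Controller to come back after a pod restart.
import time
import requests

URL = "https://controller.example.com/controller/rest/serverstatus"  # placeholder host

while True:
    try:
        response = requests.get(URL, timeout=10, verify=False)
        if response.ok:
            print("Controller is back online")
            break
    except requests.RequestException:
        pass  # Controller still restarting; agents buffer metrics in the meantime
    time.sleep(15)
```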

The Events Service operates as a single stateful pod, protected by liveness and readiness probes, and routes traffic through its dedicated service endpoint. Supporting services such as EUM, Synthetic, Redis, and MinIO follow the replica configurations defined in their deployment charts; most default to a single instance, while Redis and MinIO run as three replicas for greater availability. This architecture is designed to maintain reliable service continuity, minimize downtime, and support seamless recovery from failures.
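
To review which services run single-instance and which run with three replicas, you can list the Deployments and StatefulSets in the appliance namespace with the Kubernetes Python client, as sketched below; the namespace name is an assumption and may differ in your deployment.

```python
# Minimal sketch: report desired vs. ready replicas for appliance workloads.
from kubernetes import client, config

config.load_kube_config()        # or config.load_incluster_config() inside a pod
apps = client.AppsV1Api()
namespace = "appdynamics"        # assumed namespace name

for deploy in apps.list_namespaced_deployment(namespace).items:
    print("deployment", deploy.metadata.name,
          deploy.status.ready_replicas or 0, "/", deploy.spec.replicas)
for sts in apps.list_namespaced_stateful_set(namespace).items:
    print("statefulset", sts.metadata.name,
          sts.status.ready_replicas or 0, "/", sts.spec.replicas)
```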

Infrastructure

The Kubernetes control plane in the Virtual Appliance is designed for high availability, relying on a healthy multi-node setup to ensure that critical services remain operational. The system depends on maintaining API server availability and etcd quorum so that core functions continue without interruption, even if a node fails.
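
A basic check of control-plane and node health is to confirm through the API server that every node reports Ready, as in this sketch using the Kubernetes Python client.

```python
# Minimal sketch: confirm all cluster nodes report Ready via the API server.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for node in core.list_node().items:
    ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"), "Unknown"
    )
    # etcd quorum and API server availability depend on a majority of nodes staying healthy.
    print(node.metadata.name, "Ready:", ready)
```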

Ingress NGINX runs as a DaemonSet on every node and terminates TLS on each instance. An external load balancer directs traffic only to pods that are healthy and ready, ensuring seamless connectivity. To improve resilience, critical components such as the databases and Kafka are spread across different nodes, which prevents a single point of failure. If a node or pod goes down, the load balancer automatically reroutes traffic to the remaining healthy nodes, and Kubernetes reschedules workloads to maintain service continuity and workload balance.
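
To verify that the ingress layer is healthy on every node, compare the DaemonSet's desired and ready pod counts, as sketched below; the DaemonSet name and namespace are assumptions and may differ in your deployment.

```python
# Minimal sketch: check that Ingress NGINX is scheduled and ready on every node.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

ds = apps.read_namespaced_daemon_set("ingress-nginx-controller", "ingress-nginx")  # assumed names
print("desired:", ds.status.desired_number_scheduled,
      "ready:", ds.status.number_ready,
      "available:", ds.status.number_available)
```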