Monitor the health of your Splunk UBA deployment

The Health Monitor dashboard helps you assess the health of your Splunk UBA deployment and verify the quality of data added to Splunk UBA. You can open the Health Monitor by selecting System > Health Monitor.

You can also use the Health Monitor to review system errors. System errors appear as messages in the menu.

  1. The bell icon appears when there are messages. Click the bell icon to view messages.
  2. Click a message or error to open the Health Monitor dashboard.

You can have system errors emailed to you. See Monitor system health with the health check script.

Splunk UBA maintains historical information about the health of your Splunk UBA deployment. You can download the information when you collect diagnostic data from the System Monitor and System Resources Monitor modules for your Splunk UBA deployment. See Collect diagnostic data from your Splunk UBA deployment.

CAUTION: You can create a custom SSH banner message that displays when users log in. Do not configure this SSH banner in /etc/ssh/ssh_config, because it can impact the Health Monitor. Set the custom banner in /etc/ssh/sshd_config instead.
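
As a quick way to confirm where a banner is configured, the following minimal sketch checks a file for an active (uncommented) Banner line. The function name has_banner_directive is hypothetical and is not part of Splunk UBA; the sketch only assumes the standard OpenSSH configuration file syntax.

```shell
# Return 0 if the given file contains an active (uncommented) Banner line.
# Hypothetical helper, not shipped with Splunk UBA.
has_banner_directive() {
  # Keywords in OpenSSH configuration files are case-insensitive, hence -i.
  grep -Eiq '^[[:space:]]*Banner[[:space:]]+[^[:space:]]' "$1"
}
```

For example, running has_banner_directive /etc/ssh/ssh_config && echo "Move the banner to /etc/ssh/sshd_config" prints the reminder only when the client configuration file contains a banner setting.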

Enable test mode for specific health indicators

You can enable test mode for health indicators that are not helpful or that are not producing relevant system monitoring information. The test mode status replaces the OK, BAD, or WARN status for health status indicators.

  1. Log in to the Splunk UBA management server as the caspida user using SSH.
  2. Open the /etc/caspida/local/conf/uba-site.properties file in an editor.
  3. Change the ubaMonitor.<module_id>.<indicator_id>.mutable parameter from true to false to enable test mode. For example:

    ubaMonitor.pipeline.reToOCSNewAnomalyLag.mutable=false

  4. Save your file.
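
If you prefer to script the edit in step 3 instead of using an editor, a sketch along the following lines might help. It assumes the file uses plain key=value lines; the function name set_property is hypothetical and is not a Splunk UBA utility. The function replaces the line if the key already exists and appends it otherwise.

```shell
# Set key=value in a properties file: replace existing lines for the key,
# or append the key if it is not present. Hypothetical helper.
set_property() {
  local file="$1" key="$2" value="$3" tmp rc
  [ -f "$file" ] || { echo "missing file: $file" >&2; return 1; }
  tmp=$(mktemp) || return 1
  # index() compares the key literally, so dots in the key are not
  # treated as regular expression wildcards.
  awk -v k="$key" -v v="$value" '
    index($0, k "=") == 1 { print k "=" v; found = 1; next }
    { print }
    END { if (!found) print k "=" v }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  rc=$?
  rm -f "$tmp"
  return $rc
}
```

For example, set_property /etc/caspida/local/conf/uba-site.properties ubaMonitor.pipeline.reToOCSNewAnomalyLag.mutable false applies the setting from the example in step 3. Consider backing up the file before you change it.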

View system health

To view system health information, perform the following steps:

  1. Select System > Health Monitor.
  2. Click System, if it is not already selected.

This page displays the IP addresses, host names, and deployment types of the server nodes in your environment. This view is mostly informational and displays errors when CPU usage and disk usage are higher than 90%. Click a row in the table to learn more about a specific server.
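
To cross-check the disk usage threshold from the command line of a node, you could use a sketch like the following. It is not how Splunk UBA computes the indicator; it only applies the same 90% threshold to the output of df -P. The function name disks_over_threshold is hypothetical.

```shell
# Print mount points whose usage is higher than a threshold (default 90).
# Reads `df -P` output on standard input. Hypothetical helper.
disks_over_threshold() {
  local threshold="${1:-90}"
  awk -v t="$threshold" '
    NR > 1 {                 # skip the header line
      use = $5               # Capacity column, for example "95%"
      sub(/%/, "", use)
      if (use + 0 > t) print $6, use "%"
    }'
}
```

For example, df -P | disks_over_threshold 90 lists each mount point that is more than 90% full. Mount points that contain spaces are not handled by this sketch.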

View services health

To view services health information, select System > Cluster.

Splunk UBA relies on several services and processes to create anomalies and threats, process events, identify user and device associations, and more. Monitor the health of these services and processes on this page.

In a distributed deployment, the host IP address appears for each service so that you can trace errors to a specific host.

Service name Service process Description
Analytics Aggregator Service analyticsaggregator This service aggregates and compresses data written by Analytics Writer service.
Analytics View Builder Service analyticsviewbuilder This service periodically updates materialized views in the analytics database. Splunk UBA uses these views to render dashboards.
Analytics Writer Service analyticswriter This service writes new aggregated events and models output to the analytics database.
Anomaly Aggregation Task anomalyaggregationmodel This model pre-processes new anomalies by enhancing the entities in the anomalies, then storing the anomalies in the database.
Docker docker This service builds the containerized services, such as the streaming models, data ingestion, and identity resolution.
Hadoop hadoop-hdfs-namenode This service keeps track of where data is stored in the Hadoop Distributed File System.
hadoop-hdfs-datanode This service stores data in the Hadoop Distributed File System.
hadoop-hdfs-secondarynamenode This dedicated node in the HDFS cluster takes checkpoints of the file system metadata present on the hadoop-hdfs-namenode. It is not a backup namenode.
Hive Metastore hive-metastore This service stores the metadata for Hive/Impala tables and partitions in a relational database, and provides clients access to this information using the metastore service API.
Impala impala-catalog This internal service is used by the Apache Impala database engine.
impala-server This service is the Apache Impala database server.
impala-state-store This internal service is used by the Apache Impala database engine.
Job Agent caspida-jobagent This service manages all jobs in a multi-node environment. This service is not applicable for single-node environments.
Job Manager caspida-jobmanager This service runs and manages jobs for Splunk UBA.
Kafka kafka-server This service acts as the message bus for Splunk UBA.
Kubelet kubelet This Kubernetes component makes sure that the containers are running in pods.
Offline Rule Executor caspida-offlineruleexec This service runs custom threats and anomaly rules.
Output Connector Server caspida-outputconnector This service is the outbound connection to external data sources such as Splunk Enterprise Security, Email, or ServiceNow.
PostgreSQL postgresql This service stores the Splunk UBA system metadata.
Realtime Rule Executor caspida-realtimeruleexec This service runs anomaly action rules on generated anomalies.
Redis redis-server This service is the in-memory store that caches model metadata and system-wide configuration parameters.
Spark spark-history This service monitors the Apache Spark history server.
spark-master This service monitors the Apache Spark master server.
spark-server This service submits Spark jobs to the Spark backend.
spark-worker This service monitors the Apache Spark workers running in the cluster.
System Monitor caspida-sysmon This service monitors the Splunk UBA system.
Splunk splunkd This service monitors the status of the Splunk forwarder when enabled. A Splunk forwarder is needed to send data from Splunk UBA to the Splunk UBA Monitoring App.
Time Series DB influxdb This service stores time series data.
UBA ETL Service etl This service parses events and runs identity resolution for devices and users from IR cache. It also runs all active decorators such as geolocation, threat intel, and entity validation.
UBA Identity Resolver Service identityresolver This service processes events to build IR data.
UBA Streaming Models devicetopic-modelgroupxx devicetopic-modelgroupxx This model runs in the streaming workflow. Models reading from the same topic are grouped together for performance reasons.
UBA Streaming Models domaintopic-modelgroupxx domaintopic-modelgroupxx This model runs in the streaming workflow. Models reading from the same topic are grouped together for performance reasons.
UBA Streaming Models eventtopic-modelgroupxx eventtopic-modelgroupxx This model runs in the streaming workflow. Models reading from the same topic are grouped together for performance reasons.
UBA UI caspida-ui This service is the Splunk UBA web interface.
Unusual Per Day Activity Time Model uthourperusermodel-xx This model detects unusual activity time during a day by a user based on the user's normal access profile.
Zookeeper zookeeper-server This service synchronizes services and manages global configurations.

View modules health

To view the health status of Splunk UBA modules, perform the following steps:

  1. Select System > Health Monitor.
  2. Click Modules.

Review the status of various modules that make up the Splunk UBA product. Determine what to do if you see error codes that appear on the Modules health dashboard.

Module name Indicator Description
Analytics Aggregator Service Last Activity Check This service checks for the last activity to determine if the service is working as expected.
Analytics View Builder Service Last Activity Check This service checks for the last activity to determine if the service is working as expected.
Analytics Writer Service Time Lag on AnalyticsTopic This service shows the time lag of the AnalyticsTopic.
Time Lag on IRTopic This service shows the time lag of the IRTopic.
EPS on AnalyticsTopic This service shows the number of events processed per second (EPS) by the service on AnalyticsTopic.
EPS on IRTopic This service shows the number of events processed per second by the service on IRTopic.
Event Lag on AnalyticsTopic This service shows the number of events waiting to be ingested on AnalyticsTopic.
Event Lag on IRTopic This service shows the number of events waiting to be ingested on IRTopic.
Events dropped from Kafka on AnalyticsTopic This service shows the percentage of events dropped from Kafka on AnalyticsTopic.
Events dropped from Kafka on IRTopic This service shows the percentage of events dropped from Kafka on IRTopic.
Anomaly Aggregation Task Time Lag on AnomalyTopic This service shows the time lag of the AnomalyTopic.
EPS on AnomalyTopic This service shows the number of events processed per second by the service on AnomalyTopic.
Event Lag on AnomalyTopic This service shows the number of events waiting to be ingested on AnomalyTopic.
Events dropped from Kafka on AnomalyTopic This service shows the percentage of events dropped from Kafka on AnomalyTopic.
Data Source Assets data retrieval time This service shows the last time assets data was retrieved.
Data Source Processing This service shows the EPS of data sources in the Processing state.
Events Count by Data Format This service shows the number of events processed by data format.
HR data retrieval time This service shows the last time HR data was retrieved.
Overall EPS of all Datasources This service shows the aggregated EPS of all data sources in the Processing state.
Percentage of Events skipped This service shows the percentage of events skipped.
Splunk Data Source Lag This service monitors the data ingestion search lag for all Splunk data sources, including Kafka data ingestion. The lag is defined as the duration between the search submission time and the search's latest time. If the lag exceeds 3600 seconds, a warning message is displayed.
Splunk Data Source Search Status Check This service monitors the health of data ingestion into Splunk UBA by tracking errors in the Splunk data source searches, including Kafka data ingestion.
Kafka Broker
Note: Kafka Broker indicators are only available by adding ?system to the Splunk UBA URL. For example: https://ubaserver1/?system#Y2FzcGlk==.
All topics bytes in The number of bytes per second being received by the Kafka broker.
All topics bytes out The number of bytes per second being read by the consumers.
Request handler idle ratio The percentage of time that the brokers' request handlers are idle.
Request process time - 99th Percentile The time in milliseconds that it takes for the brokers to fully process 99% of requests per request type. Click on the View <number> Values link to see more information.
Request process time - Average The average time in milliseconds that it takes for the brokers to fully process requests per request type. Click on the View <number> Values link to see more information.
Topic bytes in The rate in bytes per second of the message traffic each topic is receiving from producing clients. Click on the View <number> Values link to see more information.
Topic bytes out The rate in bytes per second of the message traffic consumed by clients of each topic. Click on the View <number> Values link to see more information.
Topic partition count The number of partitions per topic. Click on the View <number> Values link to see more information.
Topic size on disk The size each topic occupies on disk. Click on the View <number> Values link to see more information.
Total size on disk The total size of all the Kafka topics.
Model Store
Note: Model Store indicators are only available by adding ?system to the Splunk UBA URL. For example: https://ubaserver1/?system#Y2FzcGlk==.
Average Deserialization Delay Average deserialization delay for each model. Click on the View <number> Values link to see more information.
Average Size of Stored Models Average size of each stored model. Click on the View <number> Values link to see more information.
Maximum Size of Stored Models Maximum size for each model. Click on the View <number> Values link to see more information.
Number of Models Stored Number of models stored. Click on the View <number> Values link to see more information.
Offline Models
Note: All Offline Models indicators except for Last Execution Time per Model are only available by adding ?system to the Splunk UBA URL. For example: https://ubaserver1/?system#Y2FzcGlk==.
Completed Stages Number of completed stages for each offline model in the latest execution. Click on the View <number> Values link to see more information.
Completed Tasks Number of completed tasks for each offline model in the latest execution. Click on the View <number> Values link to see more information.
Disk Bytes Spilled Number of disk bytes spilled by each offline model in the latest execution. Data that does not fit in memory is "spilled" to the disk. Click on the View <number> Values link to see more information.
Execution Duration Amount of time it took for each offline model to run. Click on the View <number> Values link to see more information.
Failed Stages Number of failed stages for each offline model in the latest execution. Click on the View <number> Values link to see more information.
Failed Tasks Number of failed tasks for each offline model in the latest execution. Click on the View <number> Values link to see more information.
Last Execution Time Per Model The last time each offline model was run. Click on the View <number> Values link to see more information.
Longest Stage Duration The longest stage duration for each offline model during its last execution. Click on the View <number> Values link to see more information.
Shuffle Read Bytes Number of shuffle read bytes for each offline model. Click on the View <number> Values link to see more information.
Shuffle Read Records Number of shuffle read records for each offline model. Click on the View <number> Values link to see more information.
Shuffle Write Bytes Number of shuffle write bytes for each offline model. Click on the View <number> Values link to see more information.
Shuffle Write Records Number of shuffle write records for each offline model. Click on the View <number> Values link to see more information.
Skipped Stages Number of stages skipped for each offline model. Click on the View <number> Values link to see more information.
Skipped Tasks Number of tasks skipped for each offline model. Click on the View <number> Values link to see more information.
Total Jobs Total number of jobs for each offline model. Click on the View <number> Values link to see more information.
Total Stages Total number of stages for each offline model. Click on the View <number> Values link to see more information.
Total Tasks Total number of tasks for each offline model. Click on the View <number> Values link to see more information.
Offline Rule Executor Threat Revalidation This service checks whether the average time to revalidate threats since the last system restart is within the normal range.
Output Connector Server Anomalies dropped from Kafka This service shows the percentage of anomalies dropped from Kafka.
Anomalies Time Lag This service shows the time lag of the Output Connector Server on the anomaly input queue.
Audit and control events dropped from Kafka This service shows the percentage of audit and control events dropped from Kafka.
Email Failure Percentage This service shows the percentage of email attempts that failed.
Note: This indicator is only available by adding ?system to the Splunk UBA URL. For example: https://ubaserver1/?system#Y2FzcGlk==.
Events Time Lag This service shows the time lag of the Output Connector Server on the events input queue.
Sending Threats to Enterprise Security is halted This service monitors whether or not Splunk UBA is able to send threats to Splunk ES.
PostgreSQL Number of Suppressed Anomalies This service shows the total number of anomalies in the system that have been suppressed either manually or by anomaly action rules.
Realtime Rule Executor Anomalies dropped from Kafka This service shows the percentage of anomalies that were dropped from Kafka.
Time Lag This service shows the time lag of the anomalies being processed by Kafka.
Rate of Anomaly Generation This service checks whether the average number of anomalies processed per second over the last 10 minutes is within the normal range.
Note: This indicator is only available by adding ?system to the Splunk UBA URL. For example: https://ubaserver1/?system#Y2FzcGlk==.
Threat Computation Task Graph-based Threat Computation This service shows OK if graph-based threat computation is running in a timely manner.
Threat Computation This service shows OK if overall threat computation is completing successfully.
Threat Computation Duration This service shows OK if overall threat computation is completing in the expected amount of time.
UBA ETL Service Time Lag on RawDataTopic This service shows the time lag of the RawDataTopic.
EPS on RawDataTopic This service shows the EPS by the service on RawDataTopic.
Event Lag on RawDataTopic This service shows the number of events waiting to be ingested on RawDataTopic.
Events dropped from Kafka on RawDataTopic This service shows the percentage of events dropped from Kafka on RawDataTopic.
Latest Event Time on RawData Topic This service monitors the time of the last event processed on the RawData topic.
Note: This indicator is only available by adding ?system to the Splunk UBA URL. For example: https://ubaserver1/?system#Y2FzcGlk==.
Time Difference on RawData Topic This service monitors the maximum time difference among the latest processed events of each of the Splunk UBA raw ETL parsers.
Note: This indicator is only available by adding ?system to the Splunk UBA URL. For example: https://ubaserver1/?system#Y2FzcGlk==.
UBA Identity Resolver Service Time Lag on IRTopic This service shows the time lag of the IRTopic.
Time Lag on PreIREventTopic This service shows the time lag of the PreIREventTopic.
EPS on IRTopic This service shows the EPS of the service on IRTopic.
EPS on PreIREventTopic This service shows the EPS of the service on PreIREventTopic.
Event Lag on PreIREventTopic This service shows the number of events waiting to be ingested on PreIREventTopic.
Events dropped from Kafka on IRTopic This service shows the percentage of events dropped from Kafka on IRTopic.
Events dropped from Kafka on PreIREventTopic This service shows the percentage of events dropped from Kafka on PreIREventTopic.
UBA Pipeline NewAnomalyTopic This service shows the status of the NewAnomalyTopic.
UBA Streaming Models devicetopic-modelgroupnn Time Lag on DeviceTopic This service shows the time lag of the DeviceTopic.
EPS on DeviceTopic This service shows the EPS of the service on DeviceTopic.
Event Lag on DeviceTopic This service shows the number of events waiting to be ingested on DeviceTopic.
Events dropped from Kafka on DeviceTopic This service shows the percentage of events dropped from Kafka on DeviceTopic.
UBA Streaming Models domaintopic-modelgroupnn Time Lag on DomainTopic This service shows the time lag of the DomainTopic.
EPS on DomainTopic This service shows the EPS of the service on DomainTopic.
Event Lag on DomainTopic This service shows the number of events waiting to be ingested on DomainTopic.
Events dropped from Kafka on DomainTopic This service shows the percentage of events dropped from Kafka on DomainTopic.
UBA Streaming Models eventtopic-modelgroupnn Time Lag on EventTopic This service shows the time lag of the EventTopic.
EPS on EventTopic This service shows the EPS of the service on EventTopic.
Event Lag on EventTopic This service shows the number of events waiting to be ingested on EventTopic.
Events dropped from Kafka on EventTopic This service shows the percentage of events dropped from Kafka on EventTopic.
Unusual Per Day Activity Time Model Time Lag on EventTopic This service shows the time lag of the EventTopic.
EPS on EventTopic This service shows the EPS of the service on EventTopic.
Event Lag on EventTopic This service shows the number of events waiting to be ingested on EventTopic.
Events dropped from Kafka on EventTopic This service shows the percentage of events dropped from Kafka on EventTopic.
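
The Splunk Data Source Lag indicator in the preceding table defines lag as the duration between the search submission time and the search's latest time, with a warning beyond 3600 seconds. The following sketch illustrates that arithmetic with epoch timestamps. The function name and the output format are hypothetical and do not reflect how Splunk UBA reports the indicator.

```shell
# Compute the lag between a search submission time and the search's latest
# time (both in epoch seconds) and compare it to a limit (default 3600).
# Hypothetical helper for illustration only.
splunk_source_lag_status() {
  local submitted="$1" latest="$2" limit="${3:-3600}"
  local lag=$(( submitted - latest ))
  if [ "$lag" -gt "$limit" ]; then
    echo "WARN lag=${lag}s"
  else
    echo "OK lag=${lag}s"
  fi
}
```

For example, a search submitted at epoch time 10000 with a latest time of 6000 has a lag of 4000 seconds, which is beyond the 3600-second limit.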

View data quality metrics

To view data quality information, perform the following steps:

  1. Select System > Health Monitor.
  2. Click Data Quality.

Review metrics about the quality of data in your system. If system issues affect the quality of data, errors appear on this page.

Module Indicator Description
Data Source Data Source EPS on Splunk This service shows the average number of events processed per second by each data source on Splunk in the last hour.
Percentage of Events dropped by EventFilters This service shows the percentage of events dropped by EventFilters on the UI.
Percentage of Events with no entity This service shows the percentage of events that have no entity.
Percentage of Events with no Relevant Data This service shows the percentage of events that have no relevant data.
Splunk Direct Data Source Enum Check This service monitors the Splunk Direct input enum field data quality and tracks the mismatch rate (percentage) in each data source.
Offline Rule Executor Average Execution Time Per Rule This service shows the average execution time of each custom threat, anomaly action rule, or anomaly rule.
Last Execution End Time per Rule This service shows the last time each custom threat, anomaly action rule, or anomaly rule finished running.
Last Execution Failure per Rule This service shows the last time at which a custom threat, anomaly action rule, or anomaly rule failed to run.
Last Execution Start Time per Rule This service shows the start time of the most recent run of each custom threat, anomaly action rule, or anomaly rule.
Number of Execution Failures per Rule This service shows the number of consecutive times each rule failed to run, or 0 if no failures have occurred.
Number of Executions per Rule This service shows the total number of times that each custom threat, anomaly action rule, or anomaly rule has attempted to run. Both successful and failed attempts are counted.
Output Connector Server Number of Threats Sent to Output Connector This service shows the total number of threats sent to the output connector for forwarding to Splunk Enterprise Security or other external destinations in the time since Splunk UBA was last restarted.
Total New Anomalies This service shows the number of new anomalies received by the output connector server in this session. You can compare this number to the number of anomalies in the receiving system, such as Splunk Enterprise Security, to determine if all anomalies are successfully being processed by the system to which Splunk UBA is sending anomalies.
PostgreSQL Number of Inactive Anomalies This service shows the total number of anomalies currently processing in the system. Contact Splunk Support if the number is consistently above several thousand or the number continues to increase.
Number of Suppressed Anomalies This service shows the total number of anomalies in the system that have been suppressed manually or by anomaly action rules.
Realtime Rule Executor Average New Anomalies Completed This service shows the average number of anomalies processed per second since the last restart of the Realtime Rule Executor.
Dropped New Anomalies This service shows the total number of dropped anomalies that were not duplicates since the last restart of the Realtime Rule Executor.
Duplicate New Anomalies This service shows the total number of duplicate anomalies since the last restart of the Realtime Rule Executor.
New Anomalies Received This service shows the total number of new anomalies created by Splunk UBA since the last restart of the Realtime Rule Executor.
New Anomalies Completed This service shows the total number of new anomalies processed since the last restart of the Realtime Rule Executor.
New Anomalies in Process This service shows the number of new anomalies currently being processed by Splunk UBA.
Number of Active Anomalies This service shows the total number of anomalies activated by the Realtime Rule Executor since the last restart of the Realtime Rule Executor. Activated anomalies are anomalies that were not suppressed or permanently deleted by anomaly action rules.
Threat Computation Task Threat Computation Start Time This service shows the last time the threat computation task was started.

Monitor system health with the health check script

The health check script captures the state of a running system and highlights areas of concern such as event processing lags, system slowness, and errors in services like Apache Kafka.

You can schedule the script to run regularly as a cron job and email the output as an attachment. See Configure email alerts to your Splunk UBA deployment administrators.

The following example adds a crontab entry that runs the script daily at 7:00 AM:

  1. Log in to the management node as the caspida user using SSH.
  2. Run the following commands:
    crontab -e   # edit the crontab and add the entry shown on the next line
    0 7 * * * /opt/caspida/bin/utils/uba_health_check.sh > /dev/null 2>&1
    crontab -l   # list the crontab to confirm that the entry was saved
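
If you want to add the entry without opening an editor, the following sketch shows one possible approach. It reads the current crontab text on standard input and prints it with the entry appended only if an identical line is not already present, so repeated runs do not create duplicates. The function name add_cron_entry is hypothetical.

```shell
# Print the crontab text from standard input, appending the given entry
# only when an identical line is not already present. Hypothetical helper.
add_cron_entry() {
  local entry="$1" current
  current=$(cat)
  if printf '%s\n' "$current" | grep -Fxq "$entry"; then
    printf '%s\n' "$current"
  else
    [ -n "$current" ] && printf '%s\n' "$current"
    printf '%s\n' "$entry"
  fi
}
```

For example, as the caspida user, crontab -l 2>/dev/null | add_cron_entry '0 7 * * * /opt/caspida/bin/utils/uba_health_check.sh > /dev/null 2>&1' | crontab - installs the entry from the preceding steps.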
    

The uba_health_check.sh script is stored in the /opt/caspida/bin/utils directory of Splunk UBA. Log in as the caspida user on the management server using SSH to run the script.

Output from the script is saved in a plain text file in the /var/log/caspida/check/ directory with a file name that includes the host name of the server and the timestamp. You can also collect the health check script output from the Splunk UBA user interface. From Download Diagnostics, select the Scripts module option. See Collect diagnostic data from your Splunk UBA deployment.
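
Because each run writes a new file, it can be handy to locate the most recent output quickly. The following sketch prints the path of the newest file in the output directory, based on modification time. The function name latest_health_check is hypothetical, and the sketch makes no assumption about the exact file name pattern.

```shell
# Print the path of the most recently modified file in the health check
# output directory (default /var/log/caspida/check). Hypothetical helper.
latest_health_check() {
  local dir="${1:-/var/log/caspida/check}" name
  name=$(ls -t "$dir" 2>/dev/null | head -n 1)
  [ -n "$name" ] && printf '%s/%s\n' "$dir" "$name"
}
```

For example, less "$(latest_health_check)" opens the newest report. The function returns a nonzero status if the directory is empty or does not exist.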

Monitor server health with SNMP

You can use an SNMP monitoring tool to track statistics related to CPU usage, memory, and disk utilization on any server that has Splunk UBA installed.

Monitor PostgreSQL WAL archiving errors

If your Splunk UBA deployment is configured for automated incremental backups, you can monitor the logs. On RHEL and OEL systems, the logs are located at /var/log/messages. On Ubuntu systems, the logs are located at /var/log/syslog.

Note: Ensure that the logs do not contain any errors related to PostgreSQL WAL archiving.

If an error occurs, it appears similar to the following example:

PostgreSQL WAL Archiving error: file pg_wal/00000001000000000000004D already exists. Last updated: 2025-01-06 07:29:54.867431375 +0000. File Size: 16384.00 KB please contact Splunk support.
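
To check the logs for this error from the command line, you could use a sketch like the following. It searches for the literal text "PostgreSQL WAL Archiving error" taken from the preceding example, and picks /var/log/messages or /var/log/syslog depending on which file exists. The function name wal_archiving_errors is hypothetical, and reading these logs typically requires elevated privileges.

```shell
# Print log lines that contain a PostgreSQL WAL archiving error.
# With no argument, use /var/log/messages if present (RHEL and OEL),
# otherwise /var/log/syslog (Ubuntu). Hypothetical helper.
wal_archiving_errors() {
  local log="$1"
  if [ -z "$log" ]; then
    if [ -f /var/log/messages ]; then
      log=/var/log/messages
    else
      log=/var/log/syslog
    fi
  fi
  # -F treats the pattern as a fixed string rather than a regular expression.
  grep -F 'PostgreSQL WAL Archiving error' "$log"
}
```

The function returns a nonzero status when no matching lines are found, so wal_archiving_errors || echo "No WAL archiving errors found" works as a quick check.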