GPU Monitoring

Splunk AppDynamics GPU Monitoring provides visibility into the health and performance of NVIDIA GPUs across your infrastructure. It integrates with the AppDynamics Machine Agent and Cluster Agent to collect metrics at both the node level and the cluster level. This helps you use GPU resources efficiently, troubleshoot issues faster, and improve the performance of GPU-enabled workloads.

Monitoring GPUs is essential for maintaining performance, using resources efficiently, and supporting advanced workloads such as AI/ML. Key benefits include:

Optimize Resource Utilization: Monitor GPU compute and memory usage to right-size workloads and plan capacity effectively.
Enhance Thermal and Power Management: Track temperature and power draw to prevent overheating and ensure energy efficiency.
Correlate GPU and Application Metrics: Link GPU performance with application-level metrics to isolate and resolve bottlenecks.
Gain GPU Efficiency Insights: Measure compute, graphics, and throughput metrics to assess GPU performance for AI/ML training and inference tasks.
Accelerate Troubleshooting: Quickly identify and resolve GPU-related hardware or workload issues by correlating telemetry with application traces.

Splunk AppDynamics offers two core components for capturing GPU-specific telemetry:

Machine Agent (Node-Level Monitoring)
Cluster Agent (Cluster-Level Monitoring)

Machine Agent (Node-Level Monitoring)

Collects metrics via NVIDIA-SMI or the DCGM Exporter.
Deployed as a standalone agent on GPU-enabled hosts or as the Infraviz DaemonSet in Kubernetes environments.
Tracks GPU utilization, memory usage, power draw, temperature, and PCIe throughput.
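
To illustrate the kind of node-level data described above, the following is a minimal sketch of reading GPU metrics through NVIDIA-SMI. It is not the Machine Agent's implementation. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output. The function names `parse_gpu_csv` and `query_gpus` and the selection of fields are hypothetical choices for this example. PCIe throughput is left out of the sketch.

```python
import csv
import io
import subprocess

# Field names passed to `nvidia-smi --query-gpu`; this selection is illustrative.
QUERY_FIELDS = [
    "index",
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "power.draw",
    "temperature.gpu",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        row = {}
        for field, value in zip(QUERY_FIELDS, record):
            try:
                row[field] = float(value.strip())
            except ValueError:
                # Fields the GPU does not report come back as text such as "[N/A]".
                row[field] = None
        rows.append(row)
    return rows


def query_gpus():
    """Run nvidia-smi on the local host (requires the NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

On a GPU-enabled host, `query_gpus()` returns one dictionary per GPU. Keeping parsing separate from the subprocess call means the parser can be exercised on captured output from a machine without a GPU.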

Cluster Agent (Cluster-Level Monitoring)

Aggregates GPU metrics at the cluster, pod, and container levels in Kubernetes environments.
Provides insights such as utilization, memory usage, and request vs. limit metrics directly in cluster dashboards.
Supplements node-level data with cluster-wide views for comprehensive monitoring.
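
The roll-up from container to pod to cluster described above can be sketched as follows. This is an illustrative aggregation only, not the Cluster Agent's logic. The `ContainerGpuSample` record, its field names, and the `rollup` function are assumptions made for this example. The request and limit fields stand in for a Kubernetes GPU resource such as `nvidia.com/gpu`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ContainerGpuSample:
    # Hypothetical record; the field names are assumptions for illustration.
    pod: str
    container: str
    gpu_util_pct: float      # GPU utilization attributed to the container
    gpu_mem_used_mib: float  # GPU memory used by the container
    gpu_request: int         # GPUs requested by the container
    gpu_limit: int           # GPU limit set on the container


def _summarize(samples):
    """Average utilization and sum memory, requests, and limits."""
    count = len(samples)
    return {
        "avg_util_pct": sum(s.gpu_util_pct for s in samples) / count if count else None,
        "mem_used_mib": sum(s.gpu_mem_used_mib for s in samples),
        "gpu_request": sum(s.gpu_request for s in samples),
        "gpu_limit": sum(s.gpu_limit for s in samples),
    }


def rollup(samples):
    """Aggregate container samples into per-pod and cluster-wide summaries."""
    by_pod = defaultdict(list)
    for sample in samples:
        by_pod[sample.pod].append(sample)
    per_pod = {pod: _summarize(items) for pod, items in by_pod.items()}
    cluster = _summarize(list(samples))
    return per_pod, cluster
```

In this sketch the cluster-wide utilization is an unweighted mean across containers. A real dashboard might weight by allocated GPUs or aggregate per device instead.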
