GPU Monitoring

Splunk AppDynamics GPU Monitoring provides visibility into the health and performance of NVIDIA GPUs across your infrastructure. By integrating with the AppDynamics Machine Agent and Cluster Agent, it collects metrics at both the node level and the cluster level. This visibility helps you use GPU resources efficiently, troubleshoot issues faster, and improve the performance of GPU-enabled workloads.

Monitoring GPUs is essential for maintaining performance, using resources efficiently, and supporting advanced workloads such as AI/ML. Key benefits include:
  1. Optimize Resource Utilization: Monitor GPU compute and memory usage to right-size workloads and plan capacity effectively.

  2. Enhance Thermal and Power Management: Track temperature and power draw to prevent overheating and ensure energy efficiency.

  3. Correlate GPU and Application Metrics: Link GPU performance with application-level metrics to isolate and resolve bottlenecks.

  4. Gain GPU Efficiency Insights: Measure compute, graphics, and throughput metrics to assess GPU performance for AI/ML training and inference tasks.

  5. Accelerate Troubleshooting: Quickly identify and resolve GPU-related hardware or workload issues by correlating telemetry with application traces.

Splunk AppDynamics offers two core components for capturing GPU-specific telemetry:
  1. Machine Agent (Node-Level Monitoring)

  2. Cluster Agent (Cluster-Level Monitoring)

Machine Agent (Node-Level Monitoring)

  • Collects metrics via NVIDIA-SMI or the DCGM Exporter.

  • Deployed as a standalone agent on GPU-enabled hosts or as the InfraViz DaemonSet in Kubernetes environments.

  • Tracks GPU utilization, memory usage, power draw, temperature, and PCIe throughput.
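To make the node-level metrics above concrete, the following is a minimal sketch of reading a similar set of values directly from NVIDIA-SMI. It is not the Machine Agent's implementation; it only illustrates the kind of data the `nvidia-smi --query-gpu` interface exposes. The function names `parse_gpu_csv` and `query_gpus` are hypothetical, and PCIe throughput is left out of this sketch.

```python
import csv
import io
import subprocess

# Properties requested from nvidia-smi via --query-gpu.
FIELDS = [
    "index",
    "utilization.gpu",   # percent
    "memory.used",       # MiB
    "memory.total",      # MiB
    "power.draw",        # watts
    "temperature.gpu",   # degrees Celsius
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU.

    Values that are not numeric (for example "[N/A]") are stored as None.
    """
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        row = {}
        for name, value in zip(FIELDS, record):
            try:
                row[name] = float(value.strip())
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed per-GPU readings.

    Assumes nvidia-smi is installed and on PATH (hypothetical helper).
    """
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

On a host with the NVIDIA driver installed, calling `query_gpus()` returns a list with one dictionary per GPU. Keeping the parser separate from the subprocess call makes it possible to test the parsing logic on machines without a GPU.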

Cluster Agent (Cluster-Level Monitoring)

  • Aggregates GPU metrics at the cluster, pod, and container levels in Kubernetes environments.

  • Surfaces GPU utilization, memory usage, and request-versus-limit metrics directly in cluster dashboards.

  • Supplements node-level data with cluster-wide views for comprehensive monitoring.
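The idea of rolling per-GPU readings up to the pod level can be sketched as follows. This is an illustration only, not how the Cluster Agent works internally. It assumes metrics in the Prometheus text format that the DCGM Exporter serves, with the `DCGM_FI_DEV_GPU_UTIL` metric and `namespace` and `pod` labels present (these labels depend on the exporter's Kubernetes configuration). The function names are hypothetical, and the simple regular expressions do not handle label values that contain `}`.

```python
import re
from collections import defaultdict

# One sample line: metric_name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (metric_name, labels_dict, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue  # skip lines this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def mean_util_by_pod(samples, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Average a per-GPU metric across all GPUs attached to each pod.

    GPUs with an empty or missing `pod` label are ignored.
    """
    readings = defaultdict(list)
    for name, labels, value in samples:
        if name != metric or not labels.get("pod"):
            continue
        readings[(labels.get("namespace", ""), labels["pod"])].append(value)
    return {key: sum(vals) / len(vals) for key, vals in readings.items()}
```

Given exporter output covering two GPUs assigned to the same pod, `mean_util_by_pod(parse_samples(text))` returns a single entry keyed by `(namespace, pod)`. The same grouping approach extends to container-level or namespace-level views by changing the key.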