NVIDIA GPU Metrics
NVIDIA GPU metrics are collected from the DCGM exporter and mapped into AppDynamics custom metrics for the AI POD GPU dashboards.
Prerequisites
Ensure that:
- NVIDIA DCGM exporter is deployed
- NVIDIA GPU Operator is used by the environment
- GPU nodes are reachable through the Kubernetes service path
- if Infrastructure Visibility is scheduled only on GPU nodes, you use nodeSelector: nvidia.com/gpu.present: "true" and the matching GPU toleration
Enable Prometheus Scraping for NVIDIA GPU
The following are example values from this repo:
- service: nvidia-dcgm-exporter
- namespace: nvidia-gpu-operator
- port: 9400
- path: /metrics
Replace these values with the DCGM service name and namespace used in the target environment.
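As a rough illustration of how the four values combine into a scrape target, the sketch below builds an in-cluster URL. It assumes the conventional <service>.<namespace>.svc cluster DNS name. The helper dcgm_scrape_url is hypothetical and is not part of this repo or of the Machine Agent.

```python
def dcgm_scrape_url(service: str, namespace: str, port: int, path: str = "/metrics") -> str:
    """Build an in-cluster URL for the DCGM exporter Service.

    Assumption: the Service is reachable through the conventional
    <service>.<namespace>.svc cluster DNS name.
    """
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{service}.{namespace}.svc:{port}{path}"


# Example values from this section; replace them with your own.
url = dcgm_scrape_url("nvidia-dcgm-exporter", "nvidia-gpu-operator", 9400)
print(url)
```

Requesting this URL from a pod inside the cluster should return Prometheus text that contains DCGM_FI_ metric names if the exporter is healthy.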
Configure Machine Agent Ingestion
Infrastructure Visibility Prometheus monitoring loads the DCGM exporter definition through prometheus-config-template.yaml.
If GPU metrics are required, set these Infrastructure Visibility pod environment variables:
APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED=true
APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAME=nvidia-dcgm-exporter
APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAMESPACE=nvidia-gpu-operator
APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_PORT=9400
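A small pre-flight check can catch a missing or mistyped variable before the pod is deployed. The sketch below is one possible approach: check_gpu_env is a hypothetical helper, and the rules it applies (all four variables present, the flag equal to true, a numeric port) are assumptions drawn from the values listed above, not documented agent behavior.

```python
from typing import List, Mapping

PREFIX = "APPDYNAMICS_MACHINE_AGENT_"
REQUIRED = (
    "GPU_ENABLED",
    "DCGM_EXPORTER_SERVICE_NAME",
    "DCGM_EXPORTER_SERVICE_NAMESPACE",
    "DCGM_EXPORTER_PORT",
)


def check_gpu_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the settings look sane."""
    problems = []
    for suffix in REQUIRED:
        if not env.get(PREFIX + suffix):
            problems.append(f"{PREFIX}{suffix} is not set")
    enabled = env.get(PREFIX + "GPU_ENABLED")
    if enabled and enabled != "true":
        problems.append(f"{PREFIX}GPU_ENABLED is {enabled!r}, expected 'true'")
    port = env.get(PREFIX + "DCGM_EXPORTER_PORT")
    if port and not port.isdigit():
        problems.append(f"{PREFIX}DCGM_EXPORTER_PORT is {port!r}, expected a number")
    return problems


# Example: the values from this section pass the check.
example_env = {
    PREFIX + "GPU_ENABLED": "true",
    PREFIX + "DCGM_EXPORTER_SERVICE_NAME": "nvidia-dcgm-exporter",
    PREFIX + "DCGM_EXPORTER_SERVICE_NAMESPACE": "nvidia-gpu-operator",
    PREFIX + "DCGM_EXPORTER_PORT": "9400",
}
print(check_gpu_env(example_env))
```

Passing os.environ instead of example_env would run the same check against the current process environment.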
The DCGM exporter runs as a DaemonSet. Set internalTrafficPolicy: Local on the nvidia-dcgm-exporter Service so that each Machine Agent scrapes the DCGM pod on its own node. Otherwise, agents may scrape a different node each interval and corrupt per-GPU counter-derived metrics (framebuffer used/free, tensor-active, DRAM-active deltas). Verify with:
kubectl get svc nvidia-dcgm-exporter -n <gpu-namespace> -o jsonpath='{.spec.internalTrafficPolicy}'
The value must be Local.
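The same verification can be scripted, for example as part of a deployment check. The sketch below wraps the kubectl command shown in this section. It assumes kubectl is on the PATH and already points at the target cluster. Both function names are hypothetical.

```python
import subprocess


def internal_traffic_policy(service: str, namespace: str) -> str:
    """Return the Service's internalTrafficPolicy as reported by kubectl."""
    result = subprocess.run(
        [
            "kubectl", "get", "svc", service,
            "-n", namespace,
            "-o", "jsonpath={.spec.internalTrafficPolicy}",
        ],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if kubectl fails
    )
    return result.stdout.strip()


def is_node_local(policy: str) -> bool:
    """Only the exact value Local keeps scrapes on the agent's own node."""
    return policy == "Local"


# Usage (requires cluster access, so it is not executed here):
#   policy = internal_traffic_policy("nvidia-dcgm-exporter", "nvidia-gpu-operator")
#   print("ok" if is_node_local(policy) else f"unexpected policy: {policy!r}")
```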
Before enabling the scrape, update the exporter YAML service discovery fields to the service name and namespace used by your GPU metrics deployment.
Exporter YAML Contract
exporter-yamls/dcgm-exporter.yaml
- key source metrics:
  - DCGM_FI_DEV_GPU_TEMP
  - DCGM_FI_DEV_GPU_UTIL
  - DCGM_FI_DEV_POWER_USAGE
  - DCGM_FI_DEV_FB_USED
  - DCGM_FI_DEV_FB_FREE
  - DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
  - DCGM_FI_PROF_DRAM_ACTIVE
- computed metrics are used for GPU memory used percent
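The exact formula behind the computed metric is not given here. A plausible reading is used / (used + free) * 100 from the two framebuffer metrics, which approximates total memory as used plus free and ignores any reserved memory. The sketch below applies that assumed formula to Prometheus text output. The function name, the gpu label name, and the sample values are illustrative only.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches lines such as: DCGM_FI_DEV_FB_USED{gpu="0",UUID="..."} 1024
SAMPLE_RE = re.compile(r'^(DCGM_FI_DEV_FB_(?:USED|FREE))\{([^}]*)\}\s+(\S+)$')
GPU_LABEL_RE = re.compile(r'(?:^|,)gpu="([^"]*)"')


def fb_used_percent(exposition: str) -> Dict[str, float]:
    """Compute framebuffer used percent per GPU (assumed formula, see above)."""
    values: Dict[str, Dict[str, float]] = defaultdict(dict)
    for line in exposition.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        name, labels, raw = match.groups()
        gpu = GPU_LABEL_RE.search(labels)
        if gpu:
            values[gpu.group(1)][name] = float(raw)

    result = {}
    for gpu_id, metrics in values.items():
        used = metrics.get("DCGM_FI_DEV_FB_USED")
        free = metrics.get("DCGM_FI_DEV_FB_FREE")
        if used is None or free is None or used + free == 0:
            continue  # incomplete pair or no memory reported
        result[gpu_id] = 100.0 * used / (used + free)
    return result


# Made-up sample values for illustration, not real exporter output.
sample = """\
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-aaaa"} 10240
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-aaaa"} 30720
DCGM_FI_DEV_FB_USED{gpu="1",UUID="GPU-bbbb"} 0
DCGM_FI_DEV_FB_FREE{gpu="1",UUID="GPU-bbbb"} 40960
"""
print(fb_used_percent(sample))  # {'0': 25.0, '1': 0.0}
```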
Expected AppDynamics Custom Metric Paths
- Custom Metrics|AI Pod|GPUs|{gpu}|GPU Temperature (C)
- Custom Metrics|AI Pod|GPUs|{gpu}|GPU Utilization (%)
- Custom Metrics|AI Pod|GPUs|{gpu}|GPU Power (W)
- Custom Metrics|AI Pod|GPUs|{gpu}|Framebuffer Memory Used (MiB)
- Custom Metrics|AI Pod|GPUs|{gpu}|Framebuffer Memory Free (MiB)
- Custom Metrics|AI Pod|GPUs|{gpu}|Tensor Core Utilization (%)
- Custom Metrics|AI Pod|GPUs|{gpu}|DRAM Utilization (%)
- Custom Metrics|AI Pod|GPUs|gpu{gpu}|GPU Memory Used (%)
- Custom Metrics|AI Pod|GPUs|Average GPU Utilization (%)
- Custom Metrics|AI Pod|GPUs|Average GPU Memory Used (%)
- Custom Metrics|AI Pod|GPUs|Total GPU Power Usage (W)
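The three aggregate paths have no {gpu} segment, which suggests they roll up the per-GPU values. The sketch below shows one way such roll-ups could be derived, assuming the averages are plain arithmetic means and the total is a sum. That reading comes from the metric names, not from the exporter definition, and aggregate_gpu_metrics is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, Mapping

BASE = "Custom Metrics|AI Pod|GPUs|"


def aggregate_gpu_metrics(
    utilization: Mapping[str, float],
    memory_used_pct: Mapping[str, float],
    power_watts: Mapping[str, float],
) -> Dict[str, float]:
    """Roll per-GPU values (keyed by GPU id) up into the aggregate paths."""
    if not (utilization and memory_used_pct and power_watts):
        return {}  # nothing to aggregate yet
    return {
        BASE + "Average GPU Utilization (%)": mean(utilization.values()),
        BASE + "Average GPU Memory Used (%)": mean(memory_used_pct.values()),
        BASE + "Total GPU Power Usage (W)": sum(power_watts.values()),
    }


# Made-up values for two GPUs.
aggregates = aggregate_gpu_metrics(
    utilization={"0": 80.0, "1": 40.0},
    memory_used_pct={"0": 25.0, "1": 75.0},
    power_watts={"0": 250.0, "1": 150.0},
)
for path, value in aggregates.items():
    print(path, value)
```

Comparing values computed this way against the Metric Browser can help confirm that every GPU on a node is being reported.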
Create Custom Dashboard
The custom dashboard script generates ready-to-import AppDynamics dashboard JSON files from a set of templates. You supply your environment's node names and, optionally, the custom metric path prefixes. The script substitutes them into the templates and writes the JSON files. See Create Custom Dashboards for AI Pods.
Troubleshooting
Connection refused usually indicates a bad service or exporter pod state, not a scrape interval issue.
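To tell a refused connection apart from a timeout or a DNS failure, a plain TCP probe run from a pod in the cluster can help. This is a minimal sketch using only the Python standard library. The function name probe_tcp and the host name in the usage comment are assumptions based on the example values in this document.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify the result of a single TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is listening: check the Service endpoints and exporter pod state.
        return "refused"
    except socket.timeout:
        # Packets are dropped: more likely a network policy or routing issue.
        return "timeout"
    except OSError as exc:
        # Includes DNS resolution failures (socket.gaierror).
        return f"error: {exc}"


# Usage from inside the cluster (not executed here):
#   print(probe_tcp("nvidia-dcgm-exporter.nvidia-gpu-operator.svc", 9400))
```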