NVIDIA GPU Metrics

NVIDIA GPU metrics are collected from the DCGM exporter and mapped into AppDynamics custom metrics for the AI POD GPU dashboards.

Prerequisites

Ensure that:

  • NVIDIA DCGM exporter is deployed
  • NVIDIA GPU Operator is installed in the environment
  • GPU nodes are reachable through the Kubernetes service path
  • if Infrastructure Visibility is scheduled only on GPU nodes, its pod spec sets nodeSelector nvidia.com/gpu.present: "true" and the matching GPU toleration

Enable Prometheus Scraping for NVIDIA GPU

The following are example values from this repo:

  • service: nvidia-dcgm-exporter
  • namespace: nvidia-gpu-operator
  • port: 9400
  • path: /metrics

Replace these values with the DCGM service name and namespace used in the target environment.
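
As a quick sanity check of the four values above, the following minimal Python sketch assembles them into the in-cluster URL that a scrape would hit. The function name dcgm_metrics_url and the cluster.local cluster domain are assumptions for illustration; they are not taken from the repo, and your cluster may use a different domain.

```python
from urllib.parse import urlunsplit


def dcgm_metrics_url(service: str, namespace: str, port: int = 9400,
                     path: str = "/metrics",
                     cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster URL for a DCGM exporter Service.

    cluster_domain is an assumption: "cluster.local" is the common
    Kubernetes default, but a cluster can be configured differently.
    """
    if not path.startswith("/"):
        path = "/" + path
    # Standard Service DNS form: <service>.<namespace>.svc.<cluster-domain>
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))


if __name__ == "__main__":
    # Example values from this page; replace them with your own.
    print(dcgm_metrics_url("nvidia-dcgm-exporter", "nvidia-gpu-operator"))
```

Fetching the printed URL from a pod inside the cluster should return Prometheus text that contains DCGM_FI_* series.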

Configure Machine Agent Ingestion

Infrastructure Visibility Prometheus monitoring loads the DCGM exporter definition through prometheus-config-template.yaml.

If GPU metrics are required, set these Infrastructure Visibility pod environment variables:

  • APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED=true
  • APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAME=nvidia-dcgm-exporter
  • APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAMESPACE=nvidia-gpu-operator
  • APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_PORT=9400
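
The four variables above can be checked before the pod is rolled out. The sketch below is a hypothetical pre-flight helper, not part of the Machine Agent; the name check_gpu_env and the validation rules (a literal "true", a numeric TCP port) are assumptions about what sensible values look like, not a description of how the agent parses them.

```python
import os
from typing import Mapping

# Variable names as listed on this page.
REQUIRED_VARS = (
    "APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED",
    "APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAME",
    "APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAMESPACE",
    "APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_PORT",
)


def check_gpu_env(env: Mapping[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the values look sane."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    enabled = env.get("APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED", "")
    # Assumption: the flag is expected to be the literal string "true".
    if enabled and enabled.lower() != "true":
        problems.append("APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED is not 'true'")

    port = env.get("APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_PORT", "")
    if port:
        try:
            valid = 1 <= int(port) <= 65535
        except ValueError:
            valid = False
        if not valid:
            problems.append(
                f"APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_PORT is not a valid port: {port!r}"
            )
    return problems


if __name__ == "__main__":
    # Run inside the Infrastructure Visibility pod to inspect its environment.
    for problem in check_gpu_env(os.environ):
        print(problem)
```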

The DCGM exporter runs as a DaemonSet, so each GPU node has its own exporter pod. Set internalTrafficPolicy: Local on the nvidia-dcgm-exporter Service so that each Machine Agent scrapes the DCGM pod on its own node. Otherwise, an agent may scrape a different node on each interval and corrupt the per-GPU counter-derived metrics (framebuffer used/free, tensor-active, and DRAM-active deltas). To verify, run kubectl get svc nvidia-dcgm-exporter -n <gpu-namespace> -o jsonpath='{.spec.internalTrafficPolicy}'. The value must be Local.
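
The verification step can also be scripted. The sketch below wraps the same kubectl query in Python; the helper names are hypothetical, and it assumes kubectl is on the PATH and already pointed at the target cluster.

```python
import subprocess


def policy_check_command(service: str, namespace: str) -> list[str]:
    """Build the kubectl query for a Service's internalTrafficPolicy."""
    # Passed as an argv list, so the jsonpath expression needs no shell quoting.
    return [
        "kubectl", "get", "svc", service, "-n", namespace,
        "-o", "jsonpath={.spec.internalTrafficPolicy}",
    ]


def policy_is_local(kubectl_output: str) -> bool:
    """True only when the reported policy is exactly 'Local'."""
    return kubectl_output.strip() == "Local"


def service_is_node_local(service: str, namespace: str) -> bool:
    """Run the query against the current kubectl context."""
    result = subprocess.run(
        policy_check_command(service, namespace),
        capture_output=True, text=True, check=True,
    )
    return policy_is_local(result.stdout)


# Example call (requires cluster access):
#   service_is_node_local("nvidia-dcgm-exporter", "nvidia-gpu-operator")
```

A False result means agents can be routed to an exporter pod on another node.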

Before enabling the scrape, update the exporter YAML service discovery fields to the service name and namespace used by your GPU metrics deployment.

Exporter YAML Contract

  • exporter definition: exporter-yamls/dcgm-exporter.yaml
  • key source metrics:
    • DCGM_FI_DEV_GPU_TEMP
    • DCGM_FI_DEV_GPU_UTIL
    • DCGM_FI_DEV_POWER_USAGE
    • DCGM_FI_DEV_FB_USED
    • DCGM_FI_DEV_FB_FREE
    • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
    • DCGM_FI_PROF_DRAM_ACTIVE
  • GPU memory used percent is a computed metric, derived from the source metrics rather than scraped directly
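
To illustrate how a memory-used percentage can be derived from the framebuffer metrics listed above, here is a minimal Python sketch. The formula used / (used + free) is an assumption, not the confirmed computation in dcgm-exporter.yaml, and the parser is deliberately simplified (it assumes label values contain no "}" character).

```python
import re

# Matches one exposition line such as: DCGM_FI_DEV_FB_USED{gpu="0",...} 1024
SAMPLE_LINE = re.compile(r'^(DCGM_FI_DEV_FB_(?:USED|FREE))\{([^}]*)\}\s+(\S+)')
# Extracts the value of the "gpu" label from the label set.
GPU_LABEL = re.compile(r'(?:^|,)\s*gpu="([^"]*)"')


def memory_used_percent(metrics_text: str) -> dict[str, float]:
    """Map each GPU id to used / (used + free) * 100 (assumed formula)."""
    used: dict[str, float] = {}
    free: dict[str, float] = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_LINE.match(line)
        if not match:
            continue  # comments, blank lines, and other metrics
        name, labels, value = match.groups()
        gpu = GPU_LABEL.search(labels)
        if gpu is None:
            continue
        target = used if name.endswith("USED") else free
        target[gpu.group(1)] = float(value)

    result: dict[str, float] = {}
    # Only GPUs that report both series can be computed.
    for gpu_id in used.keys() & free.keys():
        total = used[gpu_id] + free[gpu_id]
        if total > 0:
            result[gpu_id] = 100.0 * used[gpu_id] / total
    return result
```

Feeding it the text returned by the exporter's /metrics endpoint yields one percentage per GPU, which can be compared against the value shown in the dashboard.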

Expected AppDynamics Custom Metric Paths

  • Custom Metrics|AI Pod|GPUs|{gpu}|GPU Temperature (C)
  • Custom Metrics|AI Pod|GPUs|{gpu}|GPU Utilization (%)
  • Custom Metrics|AI Pod|GPUs|{gpu}|GPU Power (W)
  • Custom Metrics|AI Pod|GPUs|{gpu}|Framebuffer Memory Used (MiB)
  • Custom Metrics|AI Pod|GPUs|{gpu}|Framebuffer Memory Free (MiB)
  • Custom Metrics|AI Pod|GPUs|{gpu}|Tensor Core Utilization (%)
  • Custom Metrics|AI Pod|GPUs|{gpu}|DRAM Utilization (%)
  • Custom Metrics|AI Pod|GPUs|gpu{gpu}|GPU Memory Used (%)
  • Custom Metrics|AI Pod|GPUs|Average GPU Utilization (%)
  • Custom Metrics|AI Pod|GPUs|Average GPU Memory Used (%)
  • Custom Metrics|AI Pod|GPUs|Total GPU Power Usage (W)
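
The path layout above can be sketched in a few lines of Python, which is handy when checking the Metric Browser for missing series. The helper names are hypothetical, and the aggregation rules (arithmetic mean for the average, plain sum for total power) are assumptions about how the rollup metrics are produced.

```python
from statistics import fmean

PREFIX = "Custom Metrics|AI Pod|GPUs"


def per_gpu_path(gpu: str, metric: str) -> str:
    """Build a per-GPU path in the {gpu} form used by most entries above."""
    return f"{PREFIX}|{gpu}|{metric}"


def aggregate_metrics(utilization: dict[str, float],
                      power_watts: dict[str, float]) -> dict[str, float]:
    """Compute rollup values keyed by their expected metric path.

    Assumption: the average is an arithmetic mean over GPUs and the
    total is a plain sum; the real computation may differ.
    """
    out: dict[str, float] = {}
    if utilization:
        out[f"{PREFIX}|Average GPU Utilization (%)"] = fmean(utilization.values())
    if power_watts:
        out[f"{PREFIX}|Total GPU Power Usage (W)"] = sum(power_watts.values())
    return out
```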

Create Custom Dashboard

The custom dashboard script generates ready-to-import AppDynamics dashboard JSON files from a set of templates. You supply your environment's node names and, optionally, the custom metric path prefixes. The script substitutes them into the templates and writes the JSON files. See Create Custom Dashboards for AI Pods.

Troubleshooting

A Connection refused error usually indicates a misconfigured service or an unhealthy exporter pod, not a scrape interval issue.
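
To separate a refused connection from other failure modes, a small TCP probe run from inside the cluster can help. The sketch below is a hypothetical helper; the suggested causes in the returned strings are common interpretations, not a definitive diagnosis.

```python
import socket


def classify_probe_error(exc: BaseException) -> str:
    """Translate a connection error into a likely cause (heuristic)."""
    if isinstance(exc, ConnectionRefusedError):
        return "refused: check the Service port and the exporter pod state"
    if isinstance(exc, socket.gaierror):
        return "dns: check the service name and namespace"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout: check network policies and node reachability"
    return f"other: {exc!r}"


def probe(host: str, port: int = 9400, timeout: float = 3.0) -> str:
    """Attempt a TCP connection to the exporter and report the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok: TCP connection accepted"
    except OSError as exc:
        return classify_probe_error(exc)


# Example call (from a pod in the cluster; the host is an example value):
#   probe("nvidia-dcgm-exporter.nvidia-gpu-operator.svc.cluster.local")
```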