NVIDIA GPU Metrics

NVIDIA GPU metrics are collected from the DCGM exporter and mapped into AppDynamics custom metrics for the AI POD GPU dashboards.

Prerequisites

Ensure that:

The NVIDIA DCGM exporter is deployed
The environment uses the NVIDIA GPU Operator
GPU nodes are reachable through the Kubernetes service path
If Infrastructure Visibility is scheduled only on GPU nodes, it uses nodeSelector nvidia.com/gpu.present: "true" and the matching GPU toleration

Enable Prometheus Scraping for NVIDIA GPU

The following are example values from this repo:

service: nvidia-dcgm-exporter
namespace: nvidia-gpu-operator
port: 9400
path: /metrics

Replace these values with the DCGM service name and namespace used in the target environment.
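As a rough illustration of how the four values above combine, the sketch below builds an in-cluster scrape URL from them. The function name dcgm_scrape_url is hypothetical, and the plain-HTTP scheme and the cluster.local cluster domain are assumptions. The way the Machine Agent itself assembles the URL is not described here.

```python
def dcgm_scrape_url(service, namespace, port=9400, path="/metrics",
                    cluster_domain="cluster.local"):
    """Build an in-cluster URL for a DCGM exporter Service.

    Assumptions: plain HTTP and the standard <service>.<namespace>.svc
    DNS name; cluster_domain defaults to the common "cluster.local".
    """
    if not path.startswith("/"):
        path = "/" + path
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    return f"http://{host}:{port}{path}"


# Example values from this page; replace them with your own.
print(dcgm_scrape_url("nvidia-dcgm-exporter", "nvidia-gpu-operator"))
```

Fetching this URL from a pod inside the cluster is a quick way to confirm that the service name, namespace, port, and path are consistent before the scrape is enabled.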

Configure Machine Agent Ingestion

Infrastructure Visibility Prometheus monitoring loads the DCGM exporter definition through prometheus-config-template.yaml.

If GPU metrics are required, set these Infrastructure Visibility pod environment variables:

APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED=true
APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAME=nvidia-dcgm-exporter
APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAMESPACE=nvidia-gpu-operator
APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_PORT=9400
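Before rolling out the pod, it can help to sanity-check these variables. The following is a minimal sketch, not part of the product: check_gpu_env is a hypothetical helper, and its rules (all four variables set, the enabled flag exactly "true", a numeric port) are assumptions that may be stricter than what the Machine Agent accepts.

```python
REQUIRED_VARS = (
    "APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED",
    "APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAME",
    "APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAMESPACE",
    "APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_PORT",
)


def check_gpu_env(env):
    """Return a list of problems found in a mapping of environment variables."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    # Assumed rule: the flag must be the literal string "true".
    enabled = env.get("APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED")
    if enabled and enabled != "true":
        problems.append("APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED should be 'true'")

    # Assumed rule: the port must be a decimal number in the TCP port range.
    port = env.get("APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_PORT")
    if port and not (port.isascii() and port.isdigit() and 1 <= int(port) <= 65535):
        problems.append("APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_PORT is not a valid port")

    return problems
```

Calling check_gpu_env(dict(os.environ)) inside the pod (after importing os) returns an empty list when the four variables look consistent.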

The DCGM exporter runs as a DaemonSet. Set internalTrafficPolicy: Local on the nvidia-dcgm-exporter Service so that each Machine Agent scrapes the DCGM pod on its own node. Otherwise, an agent may scrape a different node at each interval, which corrupts per-GPU counter-derived metrics (framebuffer used/free, tensor-active, and DRAM-active deltas). To verify, run kubectl get svc nvidia-dcgm-exporter -n <gpu-namespace> -o jsonpath='{.spec.internalTrafficPolicy}'. The value must be Local.
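The same check can be scripted against the full Service object. The sketch below assumes the JSON comes from kubectl get svc nvidia-dcgm-exporter -n <gpu-namespace> -o json. The function names are hypothetical, and a missing field is treated as Cluster, which is the Kubernetes default.

```python
import json


def internal_traffic_policy(service_json):
    """Return spec.internalTrafficPolicy from a Service object in JSON form.

    Kubernetes defaults the field to "Cluster", so an absent field is
    reported as "Cluster" here.
    """
    spec = json.loads(service_json).get("spec", {})
    return spec.get("internalTrafficPolicy", "Cluster")


def scrapes_stay_on_node(service_json):
    """True only when the Service routes in-cluster traffic to node-local pods."""
    return internal_traffic_policy(service_json) == "Local"


# Illustrative input only; real output contains many more fields.
sample = '{"kind": "Service", "spec": {"internalTrafficPolicy": "Local"}}'
print(scrapes_stay_on_node(sample))
```

A result of False means that scrapes may be load-balanced across nodes, which is the situation the paragraph above warns against.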

Before enabling the scrape, update the exporter YAML service discovery fields to the service name and namespace used by your GPU metrics deployment.

Exporter YAML Contract

exporter-yamls/dcgm-exporter.yaml
Key source metrics:
- DCGM_FI_DEV_GPU_TEMP
- DCGM_FI_DEV_GPU_UTIL
- DCGM_FI_DEV_POWER_USAGE
- DCGM_FI_DEV_FB_USED
- DCGM_FI_DEV_FB_FREE
- DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
- DCGM_FI_PROF_DRAM_ACTIVE
Computed metrics are used for GPU memory used percent.
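To illustrate what a computed metric of this kind might look like, the sketch below derives a per-GPU memory used percentage from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE in Prometheus text format. The formula used / (used + free) is an assumption, since the exact expression in dcgm-exporter.yaml is not shown here. The sample values, the node-a host name, and the function name are made up for the example.

```python
import re

# Made-up sample in Prometheus text exposition format.
SAMPLE = """\
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 20480
DCGM_FI_DEV_FB_FREE{gpu="0",Hostname="node-a"} 61440
DCGM_FI_DEV_FB_USED{gpu="1",Hostname="node-a"} 8192
DCGM_FI_DEV_FB_FREE{gpu="1",Hostname="node-a"} 73728
"""

SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)\{([^}]*)\}\s+(\S+)')
GPU_LABEL = re.compile(r'(?:^|,)gpu="([^"]*)"')


def fb_memory_used_percent(text):
    """Return {gpu: percent}, assuming percent = used / (used + free) * 100."""
    used, free = {}, {}
    for line in text.splitlines():
        sample = SAMPLE_LINE.match(line)
        if not sample:
            continue  # comment, blank line, or a sample without labels
        name, labels, value = sample.groups()
        gpu = GPU_LABEL.search(labels)
        if not gpu:
            continue
        if name == "DCGM_FI_DEV_FB_USED":
            used[gpu.group(1)] = float(value)
        elif name == "DCGM_FI_DEV_FB_FREE":
            free[gpu.group(1)] = float(value)

    result = {}
    for gpu_id, used_mib in used.items():
        total = used_mib + free.get(gpu_id, 0.0)
        if total > 0:  # skip GPUs that report no framebuffer at all
            result[gpu_id] = 100.0 * used_mib / total
    return result


print(fb_memory_used_percent(SAMPLE))
```

Both source metrics are reported in MiB, so the ratio is unit-free.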

Expected AppDynamics Custom Metric Paths

Custom Metrics|AI Pod|GPUs|{gpu}|GPU Temperature (C)
Custom Metrics|AI Pod|GPUs|{gpu}|GPU Utilization (%)
Custom Metrics|AI Pod|GPUs|{gpu}|GPU Power (W)
Custom Metrics|AI Pod|GPUs|{gpu}|Framebuffer Memory Used (MiB)
Custom Metrics|AI Pod|GPUs|{gpu}|Framebuffer Memory Free (MiB)
Custom Metrics|AI Pod|GPUs|{gpu}|Tensor Core Utilization (%)
Custom Metrics|AI Pod|GPUs|{gpu}|DRAM Utilization (%)
Custom Metrics|AI Pod|GPUs|gpu{gpu}|GPU Memory Used (%)
Custom Metrics|AI Pod|GPUs|Average GPU Utilization (%)
Custom Metrics|AI Pod|GPUs|Average GPU Memory Used (%)
Custom Metrics|AI Pod|GPUs|Total GPU Power Usage (W)
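The relationship between the per-GPU paths and the fleet-level paths can be sketched as follows. This assumes that the average is an arithmetic mean across GPUs and that the total is a plain sum. The actual aggregation is defined by the exporter YAML, and the helper names here are hypothetical.

```python
PREFIX = "Custom Metrics|AI Pod|GPUs"


def per_gpu_path(gpu, metric):
    """Build a per-GPU path for entries that follow the {gpu} pattern."""
    return f"{PREFIX}|{gpu}|{metric}"


def fleet_metrics(utilization_by_gpu, power_by_gpu):
    """Aggregate per-GPU readings (assumed: mean for utilization, sum for power)."""
    metrics = {}
    if utilization_by_gpu:
        mean = sum(utilization_by_gpu.values()) / len(utilization_by_gpu)
        metrics[f"{PREFIX}|Average GPU Utilization (%)"] = mean
    if power_by_gpu:
        metrics[f"{PREFIX}|Total GPU Power Usage (W)"] = sum(power_by_gpu.values())
    return metrics


print(per_gpu_path("0", "GPU Utilization (%)"))
print(fleet_metrics({"0": 40.0, "1": 60.0}, {"0": 150.0, "1": 250.0}))
```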

Create Custom Dashboard

The custom dashboard script generates ready-to-import AppDynamics dashboard JSON files from a set of templates. You supply your environment's node names and, optionally, the custom metric path prefixes. The script substitutes them into the templates and writes the JSON files. See Create Custom Dashboards for AI Pods.

Troubleshooting

Connection refused usually indicates a bad service or exporter pod state, not a scrape interval issue.
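To tell a refused connection apart from a timeout or a DNS problem, a small TCP probe run from inside the cluster can help. This is a minimal sketch using only the Python standard library. The function name probe and the returned labels are illustrative, not part of any AppDynamics tooling.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The address answered, but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Includes name resolution failures and unreachable networks.
        return f"error: {exc}"
```

For example, probe("nvidia-dcgm-exporter.nvidia-gpu-operator.svc", 9400) uses the example service values from this page. A "refused" result is consistent with a Service or exporter pod problem, while "timeout" more often points to network policy or routing.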
