Configure the DCGM Exporter (Kubernetes)
Use the following configuration to deploy the NVIDIA DCGM Exporter as a DaemonSet on GPU nodes within a Kubernetes cluster. You can configure the Machine Agent to collect GPU metrics and enable cluster-wide GPU monitoring using the Cluster Agent.
- Kubernetes Environment:
  - Kubernetes Flavor: Vanilla Kubernetes
  - Kubernetes Version: 1.28 or later (verify using kubectl version)
- GPU Nodes:
  - NVIDIA drivers (version 550.x or later)
  - NVIDIA Container Toolkit (installable via the NVIDIA GPU Operator version 24.9.x or later)
- Cluster Agent Configuration: Enable GPU monitoring in the Cluster Agent spec:
    gpuMonitoringEnabled: true
- Controller Configuration: Enable GPU monitoring at the account level using the following controller flag:
    sim.cluster.gpu.enabled=true
- Machine Agent DaemonSet: Use the Machine Agent Docker image with GPU support. Enable GPU monitoring on the Machine Agent using one of the following methods:
  - System Property:
      -Dappdynamics.machine.agent.gpu.enabled=true
  - Controller Configuration File (controller-info.xml):
      <gpu-enabled>true</gpu-enabled>
  - Environment Variable:
      APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED=true
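The Machine Agent settings above can be sketched in a DaemonSet manifest. This is a minimal sketch using the environment-variable method; the image tag, namespace, resource names, and node-selector label are illustrative placeholders, not values prescribed by this document:

```yaml
# Minimal sketch of a Machine Agent DaemonSet using the environment-variable
# method described above. Names, namespace, and image tag are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: appdynamics-machine-agent   # hypothetical name
  namespace: appdynamics            # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: appdynamics-machine-agent
  template:
    metadata:
      labels:
        app: appdynamics-machine-agent
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"   # restrict scheduling to GPU nodes
      containers:
        - name: machine-agent
          image: appdynamics/machine-agent:latest   # substitute the GPU-enabled image for your deployment
          env:
            - name: APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED
              value: "true"
```

The system-property or controller-info.xml methods can be substituted by passing the property in the container arguments or mounting the configuration file instead.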
Disable the Built-in DCGM Exporter in the GPU Operator. By default, the NVIDIA GPU Operator deploys its own DCGM Exporter. However, it lacks support for hostPID: true and internalTrafficPolicy: Local, which are necessary for proper configuration. Disable the built-in DCGM Exporter using one of the following commands:

- If the GPU Operator is not installed:
    helm install gpu-operator nvidia/gpu-operator \
      -n gpu-operator --create-namespace \
      --set dcgmExporter.enabled=false \
      --wait
- If the GPU Operator is already installed:
    helm upgrade --install gpu-operator nvidia/gpu-operator \
      -n gpu-operator \
      --set dcgmExporter.enabled=false \
      --reuse-values \
      --wait
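With the built-in exporter disabled, a standalone DCGM Exporter can then be deployed with the two settings the Operator's version lacks. The following is a minimal sketch, not a complete production manifest: the resource names, namespace, image tag, and node-selector label are assumptions to adjust for your cluster.

```yaml
# Sketch of a standalone DCGM Exporter DaemonSet and Service showing the two
# settings the built-in exporter lacks. Names and image tag are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      hostPID: true                 # setting called out above as required
      nodeSelector:
        nvidia.com/gpu.present: "true"   # restrict scheduling to GPU nodes
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # pin to a tested version in practice
          ports:
            - name: metrics
              containerPort: 9400   # default DCGM Exporter metrics port
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  internalTrafficPolicy: Local      # setting called out above as required
  selector:
    app: dcgm-exporter
  ports:
    - name: metrics
      port: 9400
      targetPort: 9400
```

internalTrafficPolicy: Local keeps in-cluster traffic to the Service on the same node, so each scrape reaches the exporter for that node's GPUs rather than an arbitrary replica.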