Configure the DCGM Exporter (Kubernetes)

Use the following configuration to deploy the NVIDIA DCGM Exporter as a DaemonSet on GPU nodes within a Kubernetes cluster. You can then configure the Machine Agent to collect GPU metrics and enable cluster-wide GPU monitoring using the Cluster Agent.

Before deploying the DCGM Exporter, ensure the following requirements are met:
  • Kubernetes Environment:
    • Kubernetes Flavor: Vanilla Kubernetes.

    • Kubernetes Version: 1.28 or later (verify using kubectl version).

  • GPU Nodes:

    • NVIDIA drivers (version 550.x or later)

    • NVIDIA Container Toolkit (installable via the NVIDIA GPU Operator version ≥ 24.9.x).

  • Cluster Agent Configuration: Enable GPU monitoring in the Cluster Agent spec:
    gpuMonitoringEnabled: true
  • Controller Configuration: Enable GPU monitoring at the account level using the following controller flag:
    sim.cluster.gpu.enabled=true
  • Machine Agent DaemonSet: Use the Machine Agent Docker image with GPU support. Enable GPU Monitoring on the Machine Agent using one of the following methods (a sketch showing the environment-variable method in a DaemonSet spec follows this list):
    • System Property:
      -Dappdynamics.machine.agent.gpu.enabled=true
    • Controller Configuration File (controller-info.xml):
      <gpu-enabled>true</gpu-enabled>
    • Environment Variable:
      APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED=true
  • Disable Built-in DCGM Exporter in the GPU Operator: By default, the NVIDIA GPU Operator deploys its own DCGM Exporter. However, that exporter does not support hostPID: true and internalTrafficPolicy: Local, which this integration requires. Disable the built-in DCGM Exporter using one of the following commands:
    • If the GPU Operator is not installed:
      helm install gpu-operator nvidia/gpu-operator \
        -n gpu-operator --create-namespace \
        --set dcgmExporter.enabled=false \
        --wait
    • If the GPU Operator is already installed:
      helm upgrade --install gpu-operator nvidia/gpu-operator \
        -n gpu-operator \
        --set dcgmExporter.enabled=false \
        --reuse-values \
        --wait
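If you use the environment-variable method to enable GPU monitoring, the setting goes on the Machine Agent container in its DaemonSet spec. The following excerpt is a minimal sketch; the DaemonSet name, labels, and image are placeholders that you should replace with the values from your existing Machine Agent deployment:
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: appdynamics-machine-agent
    spec:
      selector:
        matchLabels:
          name: appdynamics-machine-agent
      template:
        metadata:
          labels:
            name: appdynamics-machine-agent
        spec:
          containers:
          - name: machine-agent
            # Placeholder: use the Machine Agent image with GPU support
            image: <machine-agent-gpu-image>
            env:
            # Environment-variable method for enabling GPU monitoring
            - name: APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED
              value: "true"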
  1. Deploy the DCGM Exporter as a DaemonSet with the required customizations. Use the following YAML specification:
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: "dcgm-exporter"
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "4.1.1"
    spec:
      updateStrategy:
        type: RollingUpdate
      selector:
        matchLabels:
          app.kubernetes.io/name: "dcgm-exporter"
          app.kubernetes.io/version: "4.1.1"
      template:
        metadata:
          labels:
            app.kubernetes.io/name: "dcgm-exporter"
            app.kubernetes.io/version: "4.1.1"
          name: "dcgm-exporter"
        spec:
          hostPID: true
          containers:
          - image: "nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.1-ubuntu22.04"
            name: "dcgm-exporter"
            env:
            - name: "DCGM_EXPORTER_LISTEN"
              value: ":9400"
            - name: "DCGM_EXPORTER_KUBERNETES"
              value: "true"
            ports:
            - name: "metrics"
              containerPort: 9400
            securityContext:
              runAsNonRoot: false
              runAsUser: 0
              capabilities:
                add: ["SYS_ADMIN"]
            volumeMounts:
            - name: "pod-gpu-resources"
              readOnly: true
              mountPath: "/var/lib/kubelet/pod-resources"
          volumes:
          - name: "pod-gpu-resources"
            hostPath:
              path: "/var/lib/kubelet/pod-resources"
     
    ---
     
    apiVersion: v1
    kind: Service
    metadata:
      name: "dcgm-exporter"
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "4.1.1"
    spec:
      selector:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "4.1.1"
      ports:
      - name: "metrics"
        port: 9400
      internalTrafficPolicy: Local
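    Save the specification to a file and apply it to the cluster. The filename below is an example; the namespace matches the one used in the later verification steps:
      kubectl apply -f dcgm-exporter.yaml -n gpu-operator
      kubectl rollout status daemonset/dcgm-exporter -n gpu-operator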
  2. Alternatively, you can modify an existing DCGM Exporter deployment to include the following (see the patch sketch below):
    • Set hostPID: true in the DaemonSet spec.

    • Set internalTrafficPolicy: Local in the Service spec.

    Note: The internalTrafficPolicy: Local setting ensures that scraping from a pod on Node X only queries the exporter on Node X, avoiding cross-node traffic.
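    For example, both changes can be applied in place with kubectl patch. The DaemonSet name, Service name, and namespace below are assumptions; adjust them to match your deployment:
      # Enable hostPID on the exporter DaemonSet (triggers a rolling restart of its pods)
      kubectl patch daemonset dcgm-exporter -n gpu-operator \
        --type merge -p '{"spec":{"template":{"spec":{"hostPID":true}}}}'
      # Route scrapes only to the exporter running on the same node
      kubectl patch service dcgm-exporter -n gpu-operator \
        --type merge -p '{"spec":{"internalTrafficPolicy":"Local"}}'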
  3. Update the Machine Agent DaemonSet environment variables to enable integration with the DCGM Exporter:
    - name: APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAMESPACE
      value: "gpu-operator"
    - name: APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAME
      value: "dcgm-exporter"
    - name: APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_PORT
      value: "9400"
    These environment variables are described below:
    • APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAME: Name of the DCGM Exporter Kubernetes Service.
    • APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAMESPACE: Kubernetes namespace in which the DCGM Exporter Service runs.
    • APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_PORT: Port on which the DCGM Exporter Service listens.
  4. Run the following command to ensure that the DCGM Exporter pods are running:
    kubectl get pods -n gpu-operator
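    Optionally, confirm that the exporter is serving metrics by port-forwarding its Service and querying the /metrics endpoint. Run the two commands in separate terminals; DCGM metric names start with DCGM_:
      kubectl port-forward -n gpu-operator svc/dcgm-exporter 9400:9400
      curl -s http://localhost:9400/metrics | grep DCGM_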
  5. Verify that the Machine Agent is scraping GPU metrics:
    kubectl exec -it -n gpu-operator <Infraviz pod> -- cat /opt/appdynamics/logs/machine-agent.log
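    To narrow the output, filter the log for GPU- and DCGM-related entries (exact log messages vary by Machine Agent version):
      kubectl exec -it -n gpu-operator <Infraviz pod> -- grep -iE 'dcgm|gpu' /opt/appdynamics/logs/machine-agent.log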
  6. Verify that GPU metrics are available in the Controller UI under the Server and Cluster Agent dashboards.