Configure the DCGM Exporter (Kubernetes)

Use the following configuration to deploy the NVIDIA DCGM Exporter as a DaemonSet on GPU nodes within a Kubernetes cluster. You can then configure the Machine Agent to collect GPU metrics and enable cluster-wide GPU monitoring using the Cluster Agent.

Before deploying the DCGM Exporter, ensure the following requirements are met:
  • Kubernetes Environment:
    • Kubernetes Flavor: Vanilla Kubernetes.

    • Kubernetes Version: 1.28 or later (verify using kubectl version).

  • GPU Nodes:

    • NVIDIA drivers (version 550.x or later)

    • NVIDIA Container Toolkit (installable via the NVIDIA GPU Operator version ≥ 24.9.x).

  • Cluster Agent Configuration: Enable GPU monitoring in the Cluster Agent spec:
    gpuMonitoringEnabled: true
  • Controller Configuration: Enable GPU monitoring at the account level using the following controller flag:
    sim.cluster.gpu.enabled=true
  • Machine Agent DaemonSet: Use the Machine Agent Docker image with GPU support. Enable GPU Monitoring on the Machine Agent using one of the following methods (a sketch showing the environment-variable method in a DaemonSet spec follows this list):
    • System Property:
      -Dappdynamics.machine.agent.gpu.enabled=true
    • Controller Configuration File (controller-info.xml):
      <gpu-enabled>true</gpu-enabled>
    • Environment Variable:
      APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED=true
  • Disable Built-in DCGM Exporter in the GPU Operator: By default, the NVIDIA GPU Operator deploys its own DCGM Exporter. However, that exporter does not support hostPID: true and internalTrafficPolicy: Local, which this integration requires. Disable the built-in DCGM Exporter using one of the following commands:
    • If the GPU Operator is not installed:
      helm install gpu-operator nvidia/gpu-operator \
        -n gpu-operator --create-namespace \
        --set dcgmExporter.enabled=false \
        --wait
    • If the GPU Operator is already installed:
      helm upgrade --install gpu-operator nvidia/gpu-operator \
        -n gpu-operator \
        --set dcgmExporter.enabled=false \
        --reuse-values \
        --wait
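If you use the environment-variable method to enable GPU monitoring, the setting goes on the Machine Agent container in its DaemonSet spec. The following excerpt is a minimal sketch; the DaemonSet name, labels, and image are placeholders that you should replace with the values from your existing Machine Agent deployment:
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: appdynamics-machine-agent
    spec:
      selector:
        matchLabels:
          name: appdynamics-machine-agent
      template:
        metadata:
          labels:
            name: appdynamics-machine-agent
        spec:
          containers:
          - name: machine-agent
            # Placeholder: use the Machine Agent image with GPU support
            image: <machine-agent-gpu-image>
            env:
            # Environment-variable method for enabling GPU monitoring
            - name: APPDYNAMICS_MACHINE_AGENT_GPU_ENABLED
              value: "true"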
  1. Deploy the DCGM Exporter as a DaemonSet with the required customizations. Use the following YAML specification:
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: "dcgm-exporter"
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "4.1.1"
    spec:
      updateStrategy:
        type: RollingUpdate
      selector:
        matchLabels:
          app.kubernetes.io/name: "dcgm-exporter"
          app.kubernetes.io/version: "4.1.1"
      template:
        metadata:
          labels:
            app.kubernetes.io/name: "dcgm-exporter"
            app.kubernetes.io/version: "4.1.1"
          name: "dcgm-exporter"
        spec:
          hostPID: true
          containers:
          - image: "nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.1-ubuntu22.04"
            name: "dcgm-exporter"
            env:
            - name: "DCGM_EXPORTER_LISTEN"
              value: ":9400"
            - name: "DCGM_EXPORTER_KUBERNETES"
              value: "true"
            ports:
            - name: "metrics"
              containerPort: 9400
            securityContext:
              runAsNonRoot: false
              runAsUser: 0
              capabilities:
                add: ["SYS_ADMIN"]
            volumeMounts:
            - name: "pod-gpu-resources"
              readOnly: true
              mountPath: "/var/lib/kubelet/pod-resources"
          volumes:
          - name: "pod-gpu-resources"
            hostPath:
              path: "/var/lib/kubelet/pod-resources"
     
    ---
     
    apiVersion: v1
    kind: Service
    metadata:
      name: "dcgm-exporter"
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "4.1.1"
    spec:
      selector:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "4.1.1"
      ports:
      - name: "metrics"
        port: 9400
      internalTrafficPolicy: Local
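    Save the specification to a file and apply it to the cluster. The filename below is an example; the namespace matches the one used in the later verification steps:
      kubectl apply -f dcgm-exporter.yaml -n gpu-operator
      kubectl rollout status daemonset/dcgm-exporter -n gpu-operator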
  2. Alternatively, you can modify an existing DCGM Exporter deployment to include the following (see the patch sketch below):
    • Set hostPID: true in the DaemonSet spec.

    • Set internalTrafficPolicy: Local in the Service spec.

    Note: The internalTrafficPolicy: Local setting ensures that scraping from a pod on Node X only queries the exporter on Node X, avoiding cross-node traffic.
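    For example, both changes can be applied in place with kubectl patch. The DaemonSet name, Service name, and namespace below are assumptions; adjust them to match your deployment:
      # Enable hostPID on the exporter DaemonSet (triggers a rolling restart of its pods)
      kubectl patch daemonset dcgm-exporter -n gpu-operator \
        --type merge -p '{"spec":{"template":{"spec":{"hostPID":true}}}}'
      # Route scrapes only to the exporter running on the same node
      kubectl patch service dcgm-exporter -n gpu-operator \
        --type merge -p '{"spec":{"internalTrafficPolicy":"Local"}}'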
  3. Update the Machine Agent DaemonSet environment variables to enable integration with the DCGM Exporter:
    - name: APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAMESPACE
      value: "gpu-operator"
    - name: APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAME
      value: "dcgm-exporter"
    - name: APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_PORT
      value: "9400"
    These environment variables are described below:
    • APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAME: Name of the DCGM Exporter Kubernetes Service.
    • APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_NAMESPACE: Kubernetes namespace in which the DCGM Exporter Service runs.
    • APPDYNAMICS_MACHINE_AGENT_DCGM_EXPORTER_SERVICE_PORT: Port on which the DCGM Exporter Service listens.
  4. Run the following command to ensure that the DCGM Exporter pods are running:
    kubectl get pods -n gpu-operator
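    Optionally, confirm that the exporter is serving metrics by port-forwarding its Service and querying the /metrics endpoint. Run the two commands in separate terminals; DCGM metric names start with DCGM_:
      kubectl port-forward -n gpu-operator svc/dcgm-exporter 9400:9400
      curl -s http://localhost:9400/metrics | grep DCGM_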
  5. Verify that the Machine Agent is scraping GPU metrics:
    kubectl exec -it -n gpu-operator <Infraviz pod> -- cat /opt/appdynamics/logs/machine-agent.log
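    To narrow the output, filter the log for GPU- and DCGM-related entries (exact log messages vary by Machine Agent version):
      kubectl exec -it -n gpu-operator <Infraviz pod> -- grep -iE 'dcgm|gpu' /opt/appdynamics/logs/machine-agent.log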
  6. Verify that GPU metrics are available in the Controller UI under the Server and Cluster Agent dashboards.