Configure the Prometheus receiver to collect NVIDIA GPU metrics

Learn how to configure the Prometheus receiver to collect NVIDIA GPU metrics.

You can monitor the performance of NVIDIA GPUs by configuring your Kubernetes cluster to send NVIDIA GPU metrics to Splunk Observability Cloud. This solution uses the Prometheus receiver to collect metrics from the NVIDIA DCGM Exporter, which can be installed independently or as part of the NVIDIA GPU Operator.

For more information on these NVIDIA components, see the NVIDIA DCGM Exporter GitHub repository and About the NVIDIA GPU Operator in the NVIDIA documentation. The NVIDIA DCGM Exporter exposes a /metrics endpoint that publishes Prometheus-compatible metrics.
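
For example, scraping the /metrics endpoint returns plain-text Prometheus exposition data. The following lines are an illustrative sketch only; the exact metrics, labels, and values depend on your GPUs and your exporter configuration:

    # HELP DCGM_FI_DEV_GPU_UTIL GPU utilization.
    # TYPE DCGM_FI_DEV_GPU_UTIL gauge
    DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33",device="nvidia0",modelName="NVIDIA A10G"} 45
    # HELP DCGM_FI_DEV_FB_USED Framebuffer memory used.
    # TYPE DCGM_FI_DEV_FB_USED gauge
    DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33",device="nvidia0",modelName="NVIDIA A10G"} 1024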

Complete the following steps to collect metrics from and monitor the performance of NVIDIA GPUs.

  1. (Prerequisite) Install the NVIDIA DCGM Exporter on your Kubernetes cluster.
  2. Configure and activate the component for NVIDIA GPU.
  3. Use the NVIDIA GPU navigator to monitor the performance of NVIDIA GPUs.

Prerequisite

Learn how to install the NVIDIA DCGM Exporter on your Kubernetes cluster, a prerequisite for configuring the Prometheus receiver to collect NVIDIA GPU metrics.

To configure the Prometheus receiver to collect metrics from NVIDIA GPUs, you must install the NVIDIA DCGM Exporter on your Kubernetes cluster. You can use one of the following methods:

  • Install the NVIDIA DCGM Exporter on its own. For instructions, see the NVIDIA DCGM Exporter GitHub repository.
  • Install the NVIDIA DCGM Exporter as part of the NVIDIA GPU Operator. For instructions, see About the NVIDIA GPU Operator in the NVIDIA documentation.
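
Whichever method you use, note the labels that your installation applies to the exporter pods. The example Collector configuration in the next section discovers exporter pods with a rule that matches the label app: nvidia-dcgm-exporter and scrapes them on port 9400. The following pod metadata is a hypothetical sketch for illustration only; if your chart or operator applies different labels, adjust the discovery rule instead:

    # Hypothetical excerpt of an NVIDIA DCGM Exporter pod, shown only to illustrate
    # the label and port that the example discovery rule and scrape target rely on.
    apiVersion: v1
    kind: Pod
    metadata:
      name: nvidia-dcgm-exporter-example   # placeholder name
      labels:
        app: nvidia-dcgm-exporter          # matched by labels["app"] == "nvidia-dcgm-exporter"
    spec:
      containers:
        - name: nvidia-dcgm-exporter
          ports:
            - containerPort: 9400          # default DCGM Exporter metrics port used in the scrape target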

Configure and activate the component for NVIDIA GPU

Learn how to configure and activate the component for NVIDIA GPU.

Complete the following steps to configure and activate the component for NVIDIA GPU.

  1. Install the Splunk Distribution of the OpenTelemetry Collector for Kubernetes using Helm.
  2. To activate the Prometheus receiver for the NVIDIA DCGM Exporter manually in the Collector configuration, make the following changes to your configuration file:
    1. Add receiver_creator to the receivers section. For more information on using the receiver creator receiver, see Receiver creator receiver.
    2. Add receiver_creator to the metrics pipeline of the service section.
    Example configuration file:
    agent:
      config:
        receivers:
          receiver_creator:
            watch_observers: [ k8s_observer ]  # discover pods through the Kubernetes observer
            receivers:
              prometheus:
                config:
                  config:
                    scrape_configs:
                      - job_name: gpu-metrics
                        static_configs:
                          - targets:
                              # `endpoint` is replaced with the address of each discovered pod.
                              # 9400 is the default NVIDIA DCGM Exporter metrics port.
                              - '`endpoint`:9400'
                # Start the receiver only for pods labeled app: nvidia-dcgm-exporter.
                rule: type == "pod" && labels["app"] == "nvidia-dcgm-exporter"
        service:
          pipelines:
            metrics/nvidia-gpu-metrics:
              exporters:
                - signalfx
              processors:
                - memory_limiter
                - batch
                - resourcedetection
                - resource
              receivers:
                - receiver_creator
    
  3. Restart the Splunk Distribution of the OpenTelemetry Collector.

Monitor the performance of NVIDIA GPUs

Learn how to navigate to the NVIDIA GPU navigator, which you can use to monitor the performance of NVIDIA GPUs.

Complete the following steps to access the NVIDIA GPU navigator and monitor the performance of NVIDIA GPUs. For more information on navigators, see Use navigators.

  1. From the Splunk Observability Cloud main menu, select Infrastructure.
  2. Under AI/ML, select AI Frameworks.
  3. Select the NVIDIA GPU summary card.

Configuration settings

Learn about the configuration settings for the Prometheus receiver.

To view the configuration options for the Prometheus receiver, see Settings.
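
For example, you can set standard Prometheus scrape settings directly on the scrape_configs entry used in the earlier example. The following is a minimal sketch of that portion of the configuration only; the interval value is illustrative:

    scrape_configs:
      - job_name: gpu-metrics
        scrape_interval: 10s      # illustrative value; how often the exporter is scraped
        metrics_path: /metrics    # default path published by the NVIDIA DCGM Exporter
        static_configs:
          - targets:
              - '`endpoint`:9400'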

Metrics

Learn about the available monitoring metrics for NVIDIA GPUs.

The following metrics are available for NVIDIA GPUs. For more information on these metrics, see metrics-configmap.yaml in the NVIDIA DCGM Exporter GitHub repository.
Metric name                          | Type    | Unit    | Description
DCGM_FI_DEV_SM_CLOCK                 | gauge   | MHz     | SM clock frequency.
DCGM_FI_DEV_MEM_CLOCK                | gauge   | MHz     | Memory clock frequency.
DCGM_FI_DEV_MEMORY_TEMP              | gauge   | °C      | Memory temperature.
DCGM_FI_DEV_GPU_TEMP                 | gauge   | °C      | GPU temperature.
DCGM_FI_DEV_POWER_USAGE              | gauge   | W       | Power draw.
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | mJ      | Total energy consumption since boot.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER      | counter | count   | Total number of PCIe retries.
DCGM_FI_DEV_GPU_UTIL                 | gauge   | percent | GPU utilization.
DCGM_FI_DEV_MEM_COPY_UTIL            | gauge   | percent | Memory utilization.
DCGM_FI_DEV_ENC_UTIL                 | gauge   | percent | Encoder utilization.
DCGM_FI_DEV_DEC_UTIL                 | gauge   | percent | Decoder utilization.
DCGM_FI_DEV_FB_FREE                  | gauge   | MiB     | Framebuffer memory free.
DCGM_FI_DEV_FB_USED                  | gauge   | MiB     | Framebuffer memory used.
DCGM_FI_PROF_PCIE_TX_BYTES           | counter | bytes   | Number of bytes of active PCIe TX data, including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES           | counter | bytes   | Number of bytes of active PCIe RX data, including both header and payload.

Attributes

Learn about the available attributes for NVIDIA GPUs.

The following attributes are available for NVIDIA GPUs.

Attribute name         | Type   | Description                                                     | Example value
app                    | string | The name of the application attached to the GPU.               | nvidia-dcgm-exporter
DCGM_FI_DRIVER_VERSION | string | The version of the NVIDIA DCGM driver installed on the system. | 570.124.06
device                 | string | The identifier for the specific NVIDIA device or GPU instance. | nvidia0
gpu                    | number | The index number of the GPU within the system.                 | 0
modelName              | string | The commercial model of the NVIDIA GPU.                        | NVIDIA A10G
UUID                   | string | A unique identifier assigned to the GPU.                       | GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33

Troubleshoot

Learn how to get help if you can't see your data in Splunk Observability Cloud.

If you can't see your data in Splunk Observability Cloud, you can get help in the following ways:

  • Splunk Observability Cloud customers can submit a case in the Splunk Support Portal or contact Splunk Support.
  • Prospective customers and free trial users can ask a question and get answers through community support in the Splunk Community.