Configure the Prometheus receiver to collect NVIDIA GPU metrics
Learn how to configure and activate the component for NVIDIA GPUs.
You can monitor the performance of NVIDIA GPUs by configuring your Kubernetes cluster to send NVIDIA GPU metrics to Splunk Observability Cloud. This solution uses the Prometheus receiver to collect metrics from the NVIDIA DCGM Exporter, which can be installed independently or as part of the NVIDIA GPU Operator.
For more information on these NVIDIA components, see the NVIDIA DCGM Exporter GitHub repository and About the NVIDIA GPU Operator in the NVIDIA documentation. The NVIDIA DCGM Exporter exposes a /metrics endpoint that publishes Prometheus-compatible metrics.
Complete the following steps to collect metrics from NVIDIA GPUs.
- To install the NVIDIA DCGM Exporter using Helm, see Quickstart on Kubernetes in the NVIDIA DCGM Exporter GitHub repository.
- To install the NVIDIA DCGM Exporter as part of the NVIDIA GPU Operator, see Installing the NVIDIA GPU Operator in the NVIDIA documentation.
Configuration settings
To view the configuration options for the Prometheus receiver, see Settings.
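As a starting point, the following sketch shows one way to configure the Prometheus receiver to scrape the DCGM Exporter. The job name, scrape interval, and target address are assumptions that you must adjust to match your deployment; 9400 is the DCGM Exporter's default port.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # Job name and target are illustrative; adjust to your deployment.
        - job_name: nvidia-dcgm-exporter
          scrape_interval: 30s
          static_configs:
            # 9400 is the DCGM Exporter default port. Replace the service
            # name and namespace with those of your DCGM Exporter service.
            - targets: ["nvidia-dcgm-exporter.default.svc.cluster.local:9400"]
```

Add the receiver to a metrics pipeline in your Collector configuration so the scraped metrics are forwarded to Splunk Observability Cloud.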
Metrics
Learn about the available monitoring metrics for NVIDIA GPUs.
The following metrics are available for NVIDIA GPUs. For more information on these metrics, see metrics-configmap.yaml in the NVIDIA DCGM Exporter GitHub repository.
These metrics are considered custom metrics in Splunk Observability Cloud.
| Metric name | Type | Unit | Description |
|---|---|---|---|
| DCGM_FI_DEV_SM_CLOCK | gauge | MHz | SM clock frequency. |
| DCGM_FI_DEV_MEM_CLOCK | gauge | MHz | Memory clock frequency. |
| DCGM_FI_DEV_MEMORY_TEMP | gauge | C | Memory temperature. |
| DCGM_FI_DEV_GPU_TEMP | gauge | C | GPU temperature. |
| DCGM_FI_DEV_POWER_USAGE | gauge | W | Power draw. |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | mJ | Total energy consumption since boot. |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | count | Total number of PCIe retries. |
| DCGM_FI_DEV_GPU_UTIL | gauge | percent | GPU utilization. |
| DCGM_FI_DEV_MEM_COPY_UTIL | gauge | percent | Memory utilization. |
| DCGM_FI_DEV_ENC_UTIL | gauge | percent | Encoder utilization. |
| DCGM_FI_DEV_DEC_UTIL | gauge | percent | Decoder utilization. |
| DCGM_FI_DEV_FB_FREE | gauge | MiB | Framebuffer memory free. |
| DCGM_FI_DEV_FB_USED | gauge | MiB | Framebuffer memory used. |
| DCGM_FI_PROF_PCIE_TX_BYTES | counter | bytes | Number of bytes of active PCIe TX data, including both header and payload. |
| DCGM_FI_PROF_PCIE_RX_BYTES | counter | bytes | Number of bytes of active PCIe RX data, including both header and payload. |
Attributes
Learn about the available attributes for NVIDIA GPUs.
The following attributes are available for NVIDIA GPUs.
| Attribute name | Type | Description | Example value |
|---|---|---|---|
| app | string | The name of the application attached to the GPU. | nvidia-dcgm-exporter |
| DCGM_FI_DRIVER_VERSION | string | The version of the NVIDIA DCGM driver installed on the system. | 570.124.06 |
| device | string | The identifier for the specific NVIDIA device or GPU instance. | nvidia0 |
| gpu | number | The index number of the GPU within the system. | 0 |
| modelName | string | The commercial model of the NVIDIA GPU. | NVIDIA A10G |
| UUID | string | A unique identifier assigned to the GPU. | GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33 |
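For reference, the DCGM Exporter publishes these attributes as Prometheus labels on each sample from the /metrics endpoint. The following lines are illustrative only; the exact label set and values depend on your hardware and exporter version:

```text
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33",device="nvidia0",modelName="NVIDIA A10G",app="nvidia-dcgm-exporter"} 34
```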
Next steps
If needed, set up data collection for your other Cisco AI PODs components. For instructions, see Collect metrics and metadata from Cisco AI PODs.
After you set up data collection for Cisco AI PODs components, you can monitor their performance using built-in experiences in Splunk Observability Cloud. For more information, see Monitor the performance of your Cisco AI PODs.