Configure the Prometheus receiver to collect NVIDIA GPU metrics
Learn how to configure and activate the component for NVIDIA GPUs.
You can monitor the performance of NVIDIA GPUs by configuring your Kubernetes cluster to send NVIDIA GPU metrics to Splunk Observability Cloud. This solution uses the Prometheus receiver to collect metrics from the NVIDIA DCGM Exporter, which can be installed independently or as part of the NVIDIA GPU Operator.
For more information on these NVIDIA components, see the NVIDIA DCGM Exporter GitHub repository and About the NVIDIA GPU Operator in the NVIDIA documentation. The NVIDIA DCGM Exporter exposes a /metrics endpoint that publishes Prometheus-compatible metrics.
Complete the following steps to collect metrics from NVIDIA GPUs.
- To install the NVIDIA DCGM Exporter using Helm, see Quickstart on Kubernetes in the NVIDIA DCGM Exporter GitHub repository.
- To install the NVIDIA DCGM Exporter as part of the NVIDIA GPU Operator, see Installing the NVIDIA GPU Operator in the NVIDIA documentation.
Configuration settings
To view the configuration options for the Prometheus receiver, see Settings.
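As a starting point, the following sketch shows one way to configure the Prometheus receiver to scrape the DCGM Exporter. The job name, scrape interval, and target address are assumptions that you must adjust to match your deployment; 9400 is the DCGM Exporter's default port.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # Job name and target are illustrative; adjust to your deployment.
        - job_name: nvidia-dcgm-exporter
          scrape_interval: 30s
          static_configs:
            # 9400 is the DCGM Exporter default port. Replace the service
            # name and namespace with those of your DCGM Exporter service.
            - targets: ["nvidia-dcgm-exporter.default.svc.cluster.local:9400"]
```

Add the receiver to a metrics pipeline in your Collector configuration so the scraped metrics are forwarded to Splunk Observability Cloud.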
Metrics
Learn about the available monitoring metrics for NVIDIA GPUs.
The following metrics are available for NVIDIA GPUs. For more information on these metrics, see metrics-configmap.yaml in the NVIDIA DCGM Exporter GitHub repository.
These metrics are considered custom metrics in Splunk Observability Cloud.
| Metric name | Type | Unit | Description |
|---|---|---|---|
| DCGM_FI_DEV_SM_CLOCK | gauge | MHz | SM clock frequency. |
| DCGM_FI_DEV_MEM_CLOCK | gauge | MHz | Memory clock frequency. |
| DCGM_FI_DEV_MEMORY_TEMP | gauge | C | Memory temperature. |
| DCGM_FI_DEV_GPU_TEMP | gauge | C | GPU temperature. |
| DCGM_FI_DEV_POWER_USAGE | gauge | W | Power draw. |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | mJ | Total energy consumption since boot. |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | count | Total number of PCIe retries. |
| DCGM_FI_DEV_GPU_UTIL | gauge | percent | GPU utilization. |
| DCGM_FI_DEV_MEM_COPY_UTIL | gauge | percent | Memory utilization. |
| DCGM_FI_DEV_ENC_UTIL | gauge | percent | Encoder utilization. |
| DCGM_FI_DEV_DEC_UTIL | gauge | percent | Decoder utilization. |
| DCGM_FI_DEV_FB_FREE | gauge | MiB | Framebuffer memory free. |
| DCGM_FI_DEV_FB_USED | gauge | MiB | Framebuffer memory used. |
| DCGM_FI_PROF_PCIE_TX_BYTES | counter | bytes | Number of bytes of active PCIe TX data, including both header and payload. |
| DCGM_FI_PROF_PCIE_RX_BYTES | counter | bytes | Number of bytes of active PCIe RX data, including both header and payload. |
Attributes
Learn about the available attributes for NVIDIA GPUs.
The following attributes are available for NVIDIA GPUs.
| Attribute name | Type | Description | Example value |
|---|---|---|---|
| app | string | The name of the application attached to the GPU. | nvidia-dcgm-exporter |
| DCGM_FI_DRIVER_VERSION | string | The version of the NVIDIA DCGM driver installed on the system. | 570.124.06 |
| device | string | The identifier for the specific NVIDIA device or GPU instance. | nvidia0 |
| gpu | number | The index number of the GPU within the system. | 0 |
| modelName | string | The commercial model of the NVIDIA GPU. | NVIDIA A10G |
| UUID | string | A unique identifier assigned to the GPU. | GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33 |
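For reference, the DCGM Exporter publishes these attributes as Prometheus labels on each sample from the /metrics endpoint. The following lines are illustrative only; the exact label set and values depend on your hardware and exporter version:

```text
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33",device="nvidia0",modelName="NVIDIA A10G",app="nvidia-dcgm-exporter"} 34
```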
Next steps
If needed, set up data collection for your other Cisco AI PODs components. For instructions, see Collect metrics and metadata from Cisco AI PODs.
After you set up data collection for Cisco AI PODs components, you can monitor their performance using built-in experiences in Splunk Observability Cloud. For more information, see Monitor the performance of your Cisco AI PODs.