Configure the Prometheus receiver to collect NVIDIA GPU metrics

Learn how to configure and activate the component for NVIDIA GPUs.

You can monitor the performance of NVIDIA GPUs by configuring your Kubernetes cluster to send NVIDIA GPU metrics to Splunk Observability Cloud. This solution uses the Prometheus receiver to collect metrics from the NVIDIA DCGM Exporter, which can be installed independently or as part of the NVIDIA GPU Operator.

For more information on these NVIDIA components, see the NVIDIA DCGM Exporter GitHub repository and About the NVIDIA GPU Operator in the NVIDIA documentation. The NVIDIA DCGM Exporter exposes a /metrics endpoint that publishes Prometheus-compatible metrics.
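
For illustration, the following output sketch shows what the Prometheus-compatible metrics published on the exporter's /metrics endpoint can look like. The metric names match the reference table later in this topic; the specific label values and sample readings here are hypothetical, not output from a real system.

    # HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
    # TYPE DCGM_FI_DEV_GPU_UTIL gauge
    DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33",device="nvidia0",modelName="NVIDIA A10G"} 37
    # HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
    # TYPE DCGM_FI_DEV_FB_USED gauge
    DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33",device="nvidia0",modelName="NVIDIA A10G"} 5120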

Complete the following steps to collect metrics from NVIDIA GPUs.

To configure the Prometheus receiver to collect metrics from NVIDIA GPUs, you must first install the NVIDIA DCGM Exporter on your Kubernetes cluster. Then complete the following steps:
  1. Install the Splunk Distribution of the OpenTelemetry Collector for Kubernetes using Helm. A sketch of the corresponding Helm values file appears after these steps.
  2. To activate the Prometheus receiver for the NVIDIA DCGM Exporter manually in the Collector configuration, make the following changes to your configuration file:
    1. Add receiver_creator to the receivers section. For more information on using the receiver creator receiver, see Receiver creator receiver.
    2. Add receiver_creator to the metrics pipeline of the service section.
    Example configuration file:
    agent:
      config:
        receivers:
          receiver_creator:
            watch_observers: [ k8s_observer ]
            receivers:
              prometheus:
                config:
                  config:
                    scrape_configs:
                      - job_name: gpu-metrics
                        static_configs:
                          - targets:
                              # The receiver creator replaces `endpoint` with the
                              # discovered pod endpoint at runtime.
                              - '`endpoint`:9400'
                # Start this scraper only for pods labeled app=nvidia-dcgm-exporter.
                rule: type == "pod" && labels["app"] == "nvidia-dcgm-exporter"
        service:
          pipelines:
            metrics/nvidia-gpu-metrics:
              exporters:
                - signalfx
              processors:
                - memory_limiter
                - batch
                - resourcedetection
                - resource
              receivers:
                - receiver_creator
    
  3. Restart the Splunk Distribution of the OpenTelemetry Collector.
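
If you use Helm for step 1, the agent section shown in step 2 goes in the same Helm values file as your Splunk Observability Cloud connection settings. The following is a minimal sketch under that assumption; the realm, access token, and cluster name are placeholders, and your chart version might require additional settings.

    # Sketch of a Helm values file for the Splunk Distribution of the
    # OpenTelemetry Collector for Kubernetes. Placeholder values only.
    clusterName: <cluster_name>
    splunkObservability:
      realm: <realm>              # for example, us0
      accessToken: <access_token>
    agent:
      config:
        # Paste the receivers and service sections from step 2 here, under
        # agent.config, exactly as shown in the example configuration file.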

Configuration settings

Learn about the configuration settings for the Prometheus receiver.

To view the configuration options for the Prometheus receiver, see Settings.

Metrics

Learn about the monitoring metrics available for NVIDIA GPUs.

The following metrics are available for NVIDIA GPUs. For more information on these metrics, see metrics-configmap.yaml in the NVIDIA DCGM Exporter GitHub repository.

Metric name | Type | Unit | Description
DCGM_FI_DEV_SM_CLOCK | gauge | MHz | SM clock frequency.
DCGM_FI_DEV_MEM_CLOCK | gauge | MHz | Memory clock frequency.
DCGM_FI_DEV_MEMORY_TEMP | gauge | C | Memory temperature.
DCGM_FI_DEV_GPU_TEMP | gauge | C | GPU temperature.
DCGM_FI_DEV_POWER_USAGE | gauge | W | Power usage.
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | mJ | Total energy consumption since boot.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | count | Total number of PCIe retries.
DCGM_FI_DEV_GPU_UTIL | gauge | percent | GPU utilization.
DCGM_FI_DEV_MEM_COPY_UTIL | gauge | percent | Memory utilization.
DCGM_FI_DEV_ENC_UTIL | gauge | percent | Encoder utilization.
DCGM_FI_DEV_DEC_UTIL | gauge | percent | Decoder utilization.
DCGM_FI_DEV_FB_FREE | gauge | MiB | Frame buffer memory free.
DCGM_FI_DEV_FB_USED | gauge | MiB | Frame buffer memory used.
DCGM_FI_PROF_PCIE_TX_BYTES | counter | bytes | Number of bytes of active PCIe TX data, including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES | counter | bytes | Number of bytes of active PCIe RX data, including both header and payload.
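
The DCGM Exporter decides which fields to publish from a CSV counter list, and metrics-configmap.yaml in the exporter repository wraps such a list in a Kubernetes ConfigMap. The following sketch illustrates the idea only: the ConfigMap name and data key are assumptions, so treat the referenced file as the authoritative format.

    # Illustrative ConfigMap carrying a DCGM counter list. The metadata name and
    # data key are hypothetical; each CSV line is
    # "<DCGM field>, <Prometheus metric type>, <help message>".
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: dcgm-exporter-metrics   # hypothetical name
    data:
      dcgm-metrics.csv: |           # hypothetical key
        DCGM_FI_DEV_GPU_UTIL,  gauge, GPU utilization (in %).
        DCGM_FI_DEV_FB_USED,   gauge, Framebuffer memory used (in MiB).
        DCGM_FI_DEV_GPU_TEMP,  gauge, GPU temperature (in C).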

Attributes

Learn about the attributes available for NVIDIA GPUs.

The following attributes are available for NVIDIA GPUs.

Attribute name | Type | Description | Example value
app | string | Name of the application attached to the GPU. | nvidia-dcgm-exporter
DCGM_FI_DRIVER_VERSION | string | Version of the NVIDIA DCGM driver installed on the system. | 570.124.06
device | string | Identifier of a specific NVIDIA device or GPU instance. | nvidia0
gpu | | Index number of the GPU in the system. | 0
modelName | string | Commercial model of the NVIDIA GPU. | NVIDIA A10G
UUID | string | Unique identifier assigned to the GPU. | GPU-3ca2f6af-10d6-30a5-b45b-158fc83e6d33

Next steps

Learn how to monitor your AI components after you set up Observability for AI.

After you set up data collection from supported AI components to Splunk Observability Cloud, the data populates built-in experiences that you can use to monitor and troubleshoot your AI components.

The following table describes the tools you can use to monitor and troubleshoot your AI components.
Monitoring tool | Use this tool to | Link to documentation
Built-in navigators | Orient and explore different layers of your AI tech stack. |
Built-in dashboards | Assess service, endpoint, and system health at a glance. |
Splunk Application Performance Monitoring (APM) service map and trace view | View all of your LLM service dependency graphs and user interactions in the service map or trace view. | Monitor LLM services using Splunk APM