Configure the Prometheus receiver to collect NVIDIA NIM metrics

Learn how to configure the Prometheus receiver to collect NVIDIA NIM metrics.

You can monitor the performance of NVIDIA NIMs by configuring your Kubernetes cluster to send NVIDIA NIM metrics to Splunk Observability Cloud.

This solution uses the Prometheus receiver to collect metrics from NVIDIA NIM, which can be installed on its own or as part of the NVIDIA NIM Operator. For more information on the NVIDIA NIM Operator, see About the Operator in the NVIDIA documentation. NVIDIA NIM exposes a :8000/metrics endpoint that publishes Prometheus-compatible metrics.
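If the Collector does not run on the same host as NVIDIA NIM, the metrics endpoint must be reachable over the cluster network. The following is a minimal sketch of a Kubernetes Service that exposes the metrics port; the service name, namespace, and selector label (nim-llm-metrics, nim, app: nim-llm) are placeholders and depend on how you deployed NVIDIA NIM:

  apiVersion: v1
  kind: Service
  metadata:
    name: nim-llm-metrics        # placeholder name
    namespace: nim               # placeholder namespace
  spec:
    selector:
      app: nim-llm               # placeholder label; match your NIM pods
    ports:
      - name: metrics
        port: 8000               # port that serves /metrics
        targetPort: 8000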

Complete the following steps to collect metrics from NVIDIA NIMs.

To use the Prometheus receiver to collect metrics from NVIDIA NIMs, you must meet the following requirements.
  • You have installed NVIDIA NIM, either on its own or as part of the NVIDIA NIM Operator.

  • You have installed Prometheus for scraping metrics from NVIDIA NIM. For instructions, see Prometheus in the NVIDIA NIM documentation.

  1. Install the Splunk Distribution of the OpenTelemetry Collector for Kubernetes using Helm.
  2. To activate the Prometheus receiver for NVIDIA NIM manually in the Collector configuration, make the following changes to your configuration file (a combined example follows these steps):
    1. Add prometheus/nvidianim to the receivers section. For example:
      prometheus/nvidianim:
        config:
          scrape_configs:
            - job_name: 'nvidianim-metrics'
              metrics_path: /metrics
              scrape_interval: 15s
              static_configs:
                - targets: ["localhost:8000"]
      
    2. Add prometheus/nvidianim to the metrics pipeline of the service section. For example:
      service:
        pipelines:
          metrics:
            receivers: [prometheus/nvidianim]
  3. Restart the Splunk Distribution of the OpenTelemetry Collector.
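
For reference, the following is a combined sketch of the receiver and pipeline entries from the previous steps, with an exporter added so that the pipeline is complete. The target address, access token, and realm are placeholders for your environment; if NVIDIA NIM does not run on the same host as the Collector, replace localhost:8000 with the address of your NIM service:

  receivers:
    prometheus/nvidianim:
      config:
        scrape_configs:
          - job_name: 'nvidianim-metrics'
            metrics_path: /metrics
            scrape_interval: 15s
            static_configs:
              # Placeholder target; point this at your NIM metrics endpoint
              - targets: ["localhost:8000"]

  exporters:
    signalfx:
      # Placeholders; use the access token and realm for your organization
      access_token: "${SPLUNK_ACCESS_TOKEN}"
      realm: us0

  service:
    pipelines:
      metrics:
        receivers: [prometheus/nvidianim]
        exporters: [signalfx]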

Configuration settings

Learn about the configuration options for the Prometheus receiver.

To view the configuration options for the Prometheus receiver, see Settings.

Metrics

Learn about the available metrics for NVIDIA NIM.

For more information on the metrics available for NVIDIA NIM, see Observability for NVIDIA NIM for LLMs in the NVIDIA documentation. These metrics fall under the default metric category in Splunk Observability Cloud.
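
If you want to collect only a subset of the available metrics, you can filter them at scrape time with a metric_relabel_configs entry in the same scrape job. The following is a minimal sketch; the metric name pattern is illustrative only, so replace it with the names of the NVIDIA NIM metrics you want to keep:

  prometheus/nvidianim:
    config:
      scrape_configs:
        - job_name: 'nvidianim-metrics'
          metrics_path: /metrics
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8000"]
          metric_relabel_configs:
            # Illustrative pattern; keep only metrics whose names match it
            - source_labels: [__name__]
              regex: "(gpu_.*|request_.*)"
              action: keep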

Attributes

Learn about the resource attributes available for NVIDIA NIM.

The following resource attributes are available for NVIDIA NIM.
Resource attribute name | Type   | Description                                 | Example value
model_name              | string | The name of the deployed model.             | meta/llama-3.1-8b-instruct
computationId           | string | The unique identifier for the computation.  | comp-5678xyz
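
If you want to attach additional resource attributes to these metrics, for example to distinguish environments or clusters, you can use the Collector's resource processor. The following is a minimal sketch; the attribute key and value (deployment.environment, production) are examples you choose, not attributes emitted by NVIDIA NIM:

  processors:
    resource/nvidianim:
      attributes:
        # Example attribute; choose keys and values for your environment
        - key: deployment.environment
          value: production
          action: upsert

  service:
    pipelines:
      metrics:
        receivers: [prometheus/nvidianim]
        processors: [resource/nvidianim]
        exporters: [signalfx]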

Next steps

Learn how to monitor your AI components after you set up Observability for AI.

After you set up data collection from supported AI components to Splunk Observability Cloud, the data populates built-in experiences that you can use to monitor and troubleshoot your AI components.

The following table describes the tools you can use to monitor and troubleshoot your AI components.
Monitoring tool | Use this tool to | Link to documentation
Built-in navigators | Orient and explore different layers of your AI tech stack. |
Built-in dashboards | Assess service, endpoint, and system health at a glance. |
Splunk Application Performance Monitoring (APM) service map and trace view | View all of your LLM service dependency graphs and user interactions in the service map or trace view. | Monitor LLM services with Splunk APM