Configure the Prometheus receiver to collect NVIDIA NIM metrics

Learn how to configure the Prometheus receiver to collect NVIDIA NIM metrics.

You can monitor the performance of NVIDIA NIMs by configuring your Kubernetes cluster to send NVIDIA NIM metrics to Splunk Observability Cloud.

This solution uses the Prometheus receiver to collect metrics from NVIDIA NIM, which you can install on its own or as part of the NVIDIA NIM Operator. For more information on the NVIDIA NIM Operator, see About the Operator in the NVIDIA documentation. NVIDIA NIM publishes Prometheus-compatible metrics on port 8000 at the /v1/metrics endpoint.
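Because the endpoint publishes plain-text Prometheus exposition data, one way to see what a scrape returns is to parse a sample payload yourself. The following is a minimal, simplified sketch (it ignores escaping and label values that contain commas); the sample payload, metric names, and labels are hypothetical placeholders, not guaranteed NIM output.

```python
def parse_prometheus_text(text):
    """Parse Prometheus exposition text into {name: {labels: value}}.

    Simplified sketch: ignores escaped quotes and commas inside label values.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        name_part, _, value_part = line.rpartition(" ")
        labels = frozenset()
        if "{" in name_part:
            name, label_str = name_part.split("{", 1)
            labels = frozenset(
                p.strip() for p in label_str.rstrip("}").split(",") if p.strip()
            )
        else:
            name = name_part
        metrics.setdefault(name, {})[labels] = float(value_part)
    return metrics

# Hypothetical sample payload in Prometheus exposition format:
sample = """\
# HELP num_requests_running Number of requests currently running.
# TYPE num_requests_running gauge
num_requests_running{model_name="example-model"} 2
gpu_cache_usage_perc{model_name="example-model"} 0.35
"""

parsed = parse_prometheus_text(sample)
print(parsed["num_requests_running"])
```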

Complete the following steps to collect metrics from NVIDIA NIMs.

To use the Prometheus receiver to collect metrics from NVIDIA NIMs, you must meet the following requirements:
  • You have installed NVIDIA NIM, either on its own or as part of the NVIDIA NIM Operator.

  • You have installed Prometheus for scraping metrics from NVIDIA NIM. For instructions, see Prometheus in the NVIDIA NIM documentation.

  1. Install the Splunk Distribution of the OpenTelemetry Collector for Kubernetes using Helm.
  2. To activate the Prometheus receiver for NVIDIA NIM manually in the Collector configuration, make the following changes to your configuration file:
    1. Add receiver_creator/nvidia to the receivers section. For more information on using the receiver creator receiver, see Receiver creator receiver.
    2. Add a Prometheus receiver to the receiver_creator/nvidia section for each NVIDIA NIM LLM that you want to monitor. For example, you can add prometheus/nim-llm, prometheus/nim-embedding, or prometheus/nim-reranking.
    3. Add receiver_creator/nvidia to the metrics pipeline of the service section.
      Example configuration file:
      agent:
        config:
          receivers:
            receiver_creator/nvidia:
              # Name of the extensions to watch for endpoints to start and stop.
              watch_observers: [ k8s_observer ]
              receivers:
                prometheus/nim-llm:
                  config:
                    config:
                      scrape_configs:
                        - job_name: nim-for-llm-metrics
                          scrape_interval: 10s
                          metrics_path: /v1/metrics
                          static_configs:
                            - targets:
                                - '`endpoint`:8000'
                  rule: type == "pod" && labels["app"] == "llm"
      
                prometheus/nim-embedding:
                  config:
                    config:
                      scrape_configs:
                        - job_name: nim-for-embedqallm-metrics
                          scrape_interval: 10s
                          metrics_path: /v1/metrics
                          static_configs:
                            - targets:
                                - '`endpoint`:8000'
                  rule: type == "pod" && labels["app"] == "embedqa"
      
                prometheus/nim-reranking:
                  config:
                    config:
                      scrape_configs:
                        - job_name: nim-for-rerankqallm-metrics
                          scrape_interval: 10s
                          metrics_path: /v1/metrics
                          static_configs:
                            - targets:
                                - '`endpoint`:8000'
                  rule: type == "pod" && labels["app"] == "rerankqa"
          service:
            pipelines:
              metrics/nvidianim-metrics:
                exporters:
                  - signalfx
                processors:
                  - memory_limiter
                  - batch
                  - resourcedetection
                  - resource
                receivers:
                  - receiver_creator/nvidia
  3. Restart the Splunk Distribution of the OpenTelemetry Collector.
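Restarting the Collector agent and spot-checking a NIM pod's metrics endpoint can be sketched as follows. The daemonset, namespace, and pod names are hypothetical placeholders; substitute the names from your own deployment.

```shell
# Hypothetical resource names; replace with your release's daemonset,
# namespace, and NIM pod names.

# Restart the Collector agent so it reloads the new configuration:
kubectl rollout restart daemonset/splunk-otel-collector-agent -n monitoring

# Optionally confirm that a NIM pod publishes Prometheus metrics on
# port 8000 at /v1/metrics before checking for data in Splunk Observability Cloud:
kubectl port-forward pod/nim-llm-0 8000:8000 -n nim &
curl -s http://localhost:8000/v1/metrics | head -n 20
```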

Configuration settings

Learn about the configuration options for the Prometheus receiver.

To view the configuration options for the Prometheus receiver, see Settings.

Metrics

Learn about the available metrics for NVIDIA NIM.

For more information on the metrics available for NVIDIA NIM, see Observability for NVIDIA NIM for LLMs in the NVIDIA documentation.
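Some of the published metrics are Prometheus histograms, which you typically read by dividing the metric's `_sum` series by its `_count` series to get an average. A minimal sketch, assuming a hypothetical metric name and sample values (check the NVIDIA documentation for the actual metric names):

```python
def histogram_average(samples, base_name):
    """Return _sum / _count for a Prometheus histogram metric.

    samples: {metric_name: float} as scraped at one point in time.
    """
    total = samples[base_name + "_sum"]
    count = samples[base_name + "_count"]
    return total / count if count else 0.0

# Hypothetical scraped values for a time-to-first-token histogram:
scraped = {
    "time_to_first_token_seconds_sum": 12.5,
    "time_to_first_token_seconds_count": 50.0,
}
avg = histogram_average(scraped, "time_to_first_token_seconds")
print(avg)  # 0.25
```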

These metrics are considered custom metrics in Splunk Observability Cloud.

Attributes

Learn about the resource attributes available for NVIDIA NIM.

The following resource attributes are available for NVIDIA NIM.
Resource attribute name | Type | Description | Example value
model_name | string | The name of the deployed model. | meta/llama-3.1-8b-instruct
computationId | string | The unique identifier for the computation. | comp-5678xyz