Configure the Prometheus receiver to collect TensorFlow Serving metrics

You can monitor the performance of your TensorFlow Serving system by configuring the Splunk Distribution of the OpenTelemetry Collector to send TensorFlow Serving metrics to Splunk Observability Cloud.

This solution uses the Prometheus receiver to collect metrics from TensorFlow Serving, which exposes a customizable API endpoint that publishes Prometheus-compatible metrics.

To use this data integration, you must have a running TensorFlow Serving instance.
  1. Deploy the Splunk Distribution of the OpenTelemetry Collector to your host or container platform.
  2. Provide a monitoring configuration to the server by using the --monitoring_config_file flag to specify a file that contains a MonitoringConfig protocol buffer. For example:
    Protobuf text format
    prometheus_config { 
      enable: true, 
      path: "/monitoring/prometheus/metrics" 
    }
    To read metrics from the /monitoring/prometheus/metrics URL, you must first enable the HTTP server by setting the --rest_api_port flag. For more information on specifying a MonitoringConfig protocol buffer, see TensorFlow Serving in the Google Cloud documentation.
  3. To manually activate the Prometheus receiver for TensorFlow Serving, make the following changes to your Collector values.yaml configuration file.
    1. Add prometheus/tensorflow-serving to the receivers section. For example:
      YAML
      prometheus/tensorflow-serving:
        config:
          scrape_configs:
            - job_name: tensorflow-serving
              scrape_interval: 10s
              kubernetes_sd_configs:
                - role: service
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_label_app]
                  action: keep
                  regex: tensorflow-serving
                - source_labels: [__meta_kubernetes_pod_container_port_number]
                  action: keep
                  regex: 8501
                - target_label: __metrics_path__
                  replacement: /monitoring/prometheus/metrics
      TensorFlow Serving pods must have a label. For example:
      YAML
      labels:
        app: tensorflow-serving
    2. Add prometheus/tensorflow-serving to the metrics pipeline of the service section. For example:
      YAML
      service:   
        pipelines:   
          metrics:   
            receivers: [prometheus/tensorflow-serving]
  4. Restart the Splunk Distribution of the OpenTelemetry Collector.
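
The TensorFlow Serving side of the steps above can be exercised locally with Docker. This is a sketch only: the model name, host paths, and port mapping are placeholders, not values from this integration, and must be adapted to your deployment.

```shell
# Sketch: run TensorFlow Serving with the REST API enabled (required for the
# metrics endpoint) and the monitoring configuration file from step 2.
# Model name and host paths below are placeholders.
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/models/my_model,target=/models/my_model \
  --mount type=bind,source=/path/to/monitoring_config.txt,target=/config/monitoring_config.txt \
  -e MODEL_NAME=my_model \
  tensorflow/serving \
  --monitoring_config_file=/config/monitoring_config.txt

# The metrics endpoint should then respond at:
# curl http://localhost:8501/monitoring/prometheus/metrics
```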

Configuration settings

To view the configuration options for the Prometheus receiver, see Settings.
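
Outside Kubernetes, the same receiver can scrape a fixed address instead of using service discovery. The following fragment is a sketch; the target host name and port are assumptions to replace with your own.

```yaml
# Sketch: static-target variant of the receiver configuration.
# The target address is a placeholder.
receivers:
  prometheus/tensorflow-serving:
    config:
      scrape_configs:
        - job_name: tensorflow-serving
          scrape_interval: 10s
          metrics_path: /monitoring/prometheus/metrics
          static_configs:
            - targets: ["tensorflow-serving.example.com:8501"]
```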

Metrics

The following metrics are available for TensorFlow Serving.

These metrics are considered custom metrics in Splunk Observability Cloud.

Metric name Type Description
:tensorflow:cc:saved_model:load_attempt_count counter Tensorflow C++ API saved model load attempt count.
:tensorflow:cc:saved_model:load_latency counter Tensorflow C++ API saved model load latency.
:tensorflow:cc:saved_model:load_latency_by_stage histogram Tensorflow C++ API saved model load latency by stage.
:tensorflow:core:direct_session_runs counter Tensorflow Core direct session run count.
:tensorflow:core:eager_client_error_count counter Tensorflow Core eager client error count.
:tensorflow:core:graph_build_calls counter Tensorflow Core graph build call count.
:tensorflow:core:graph_build_time_usecs counter Tensorflow Core graph build time in usecs.
:tensorflow:core:graph_run_time_usecs counter Tensorflow Core graph run time in usecs.
:tensorflow:core:graph_run_time_usecs_histogram histogram Tensorflow Core graph run time in usecs as a histogram metric.
:tensorflow:core:graph_runs counter Tensorflow Core graph run count.
:tensorflow:core:saved_model:read:api counter Tensorflow Core saved model read API count.
:tensorflow:core:saved_model:read:count counter Tensorflow Core saved model read count.
:tensorflow:core:session_created gauge Current number of created and alive Tensorflow Core session objects.
:tensorflow:serving:batching_session:queuing_latency histogram Tensorflow Serving batching session queuing latency.
:tensorflow:serving:batching_session:wrapped_run_count counter Tensorflow Serving batching session wrapped run count.
:tensorflow:serving:mlmd_map gauge Tensorflow Serving Mlmd map count.
:tensorflow:serving:model_warmup_latency histogram Tensorflow Serving model warmup latency.
:tensorflow:serving:request_count counter Tensorflow Serving request count.
:tensorflow:serving:request_example_count_total counter Tensorflow Serving request example count total.
:tensorflow:serving:request_example_counts histogram Tensorflow Serving request example counts.
:tensorflow:serving:request_latency histogram Tensorflow Serving request latency.
:tensorflow:serving:request_log_count counter Tensorflow Serving request log count.
:tensorflow:serving:runtime_latency histogram Tensorflow Serving runtime latency.
:tensorflow:tpu:op_error_count counter Tensorflow TPU Op error count.
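
To check which of the metrics above your endpoint actually exposes, you can fetch /monitoring/prometheus/metrics and extract the metric names from the response. The sketch below parses a small illustrative payload using only the Python standard library; the sample text and label values are made up.

```python
import re

# Illustrative Prometheus exposition text, in the format TensorFlow Serving
# publishes at /monitoring/prometheus/metrics (values here are made up).
sample = """\
# TYPE :tensorflow:serving:request_count counter
:tensorflow:serving:request_count{model_name="demo",status="OK"} 42
# TYPE :tensorflow:core:graph_runs counter
:tensorflow:core:graph_runs 7
"""

def metric_names(exposition_text):
    """Return the unique metric names found in Prometheus exposition text."""
    names = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "<name>{labels} <value>"; take the leading name.
        match = re.match(r"([^{\s]+)", line)
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names

print(metric_names(sample))
```

In a live check, the `sample` string would be replaced by the body of an HTTP GET against the metrics endpoint.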

Attributes

The following resource attributes for TensorFlow Serving are derived from the metric source.

Attribute name Description
API Indicates the type of API operation performed by the model. For example, prediction, inference, training, or embedding.
model_name Name of the model being used to process the request or generate results.
model_path File system or storage location where the model is stored or loaded from.

The following resource attributes for TensorFlow Serving may be added by the Splunk Distribution of the OpenTelemetry Collector, depending on where the Prometheus receiver is defined.

Attribute name Description
AWSUniqueId Unique identifier associated with the AWS resource or account.
cloud.account.id Identifier of the cloud account where the resource is running.
cloud.availability_zone Availability Zone in which the resource is deployed.
cloud.platform Cloud platform in use. For example, AWS, GCP, or Azure.
cloud.provider Name of the cloud service provider.
cloud.region Geographic region of the cloud deployment.
host.id Unique identifier of the host machine.
host.image.id Identifier of the machine image used to create the host.
host.name Hostname of the machine where the service is running.
host.type Type or instance class of the host machine.
k8s.cluster.name Name of the Kubernetes cluster.
k8s.node.name Name of the Kubernetes node.
metric_source Origin or source from which the metric is generated.
os.type Operating system type of the host. For example: linux, windows.
server.address Network address (IP or hostname) of the server.
server.port Network port on which the server is listening.
service.instance.id Unique identifier of the running service instance.
service.name Logical name of the service emitting the metric.
url.scheme URL scheme used for communication. For example: http, https.