Configure the Prometheus receiver to collect TensorFlow Serving metrics

You can monitor the performance of your TensorFlow Serving system by configuring the Splunk Distribution of the OpenTelemetry Collector to send TensorFlow Serving metrics to Splunk Observability Cloud.

This solution uses the Prometheus receiver to collect metrics from TensorFlow Serving, which exposes a customizable API endpoint that publishes Prometheus-compatible metrics.

To use this data integration, you must have a running TensorFlow Serving instance.
  1. Deploy the Splunk Distribution of the OpenTelemetry Collector to your host or container platform.
  2. Provide a monitoring configuration to the server by using the --monitoring_config_file flag to specify a file that contains a MonitoringConfig protocol buffer. For example:
    Protobuf text format
    prometheus_config { 
      enable: true, 
      path: "/monitoring/prometheus/metrics" 
    }
    To read metrics from the /monitoring/prometheus/metrics URL, you must first enable the HTTP server by setting the --rest_api_port flag. For more information on specifying a MonitoringConfig protocol buffer, see TensorFlow Serving in the Google Cloud documentation.
  3. To manually activate the Prometheus receiver for TensorFlow Serving, make the following changes to your Collector values.yaml configuration file.
    1. Add prometheus/tensorflow-serving to the receivers section. For example:
      YAML
      prometheus/tensorflow-serving:
        config:
          scrape_configs:
            - job_name: tensorflow-serving
              scrape_interval: 10s
              kubernetes_sd_configs:
                - role: service
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_label_app]
                  action: keep
                  regex: tensorflow-serving
                - source_labels: [__meta_kubernetes_pod_container_port_number]
                  action: keep
                  regex: 8501
                - target_label: __metrics_path__
                  replacement: /monitoring/prometheus/metrics
      TensorFlow Serving pods must have a label. For example:
      YAML
      labels:
        app: tensorflow-serving
    2. Add prometheus/tensorflow-serving to the metrics pipeline of the service section. For example:
      YAML
      service:   
        pipelines:   
          metrics:   
            receivers: [prometheus/tensorflow-serving]
  4. Restart the Splunk Distribution of the OpenTelemetry Collector.
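
The TensorFlow Serving side of the steps above can be exercised locally with Docker. This is a sketch only: the model name, host paths, and port mapping are placeholders, not values from this integration, and must be adapted to your deployment.

```shell
# Sketch: run TensorFlow Serving with the REST API enabled (required for the
# metrics endpoint) and the monitoring configuration file from step 2.
# Model name and host paths below are placeholders.
docker run -p 8501:8501 \
  --mount type=bind,source=/path/to/models/my_model,target=/models/my_model \
  --mount type=bind,source=/path/to/monitoring_config.txt,target=/config/monitoring_config.txt \
  -e MODEL_NAME=my_model \
  tensorflow/serving \
  --monitoring_config_file=/config/monitoring_config.txt

# The metrics endpoint should then respond at:
# curl http://localhost:8501/monitoring/prometheus/metrics
```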

Configuration settings

To view the configuration options for the Prometheus receiver, see Settings.
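
Outside Kubernetes, the same receiver can scrape a fixed address instead of using service discovery. The following fragment is a sketch; the target host name and port are assumptions to replace with your own.

```yaml
# Sketch: static-target variant of the receiver configuration.
# The target address is a placeholder.
receivers:
  prometheus/tensorflow-serving:
    config:
      scrape_configs:
        - job_name: tensorflow-serving
          scrape_interval: 10s
          metrics_path: /monitoring/prometheus/metrics
          static_configs:
            - targets: ["tensorflow-serving.example.com:8501"]
```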

Metrics

The following metrics are available for TensorFlow Serving.

These metrics are considered custom metrics in Splunk Observability Cloud.

Metric name Type Description
:tensorflow:cc:saved_model:load_attempt_count counter Tensorflow C++ API saved model load attempt count.
:tensorflow:cc:saved_model:load_latency counter Tensorflow C++ API saved model load latency.
:tensorflow:cc:saved_model:load_latency_by_stage histogram Tensorflow C++ API saved model load latency by stage.
:tensorflow:core:direct_session_runs counter Tensorflow Core direct session run count.
:tensorflow:core:eager_client_error_count counter Tensorflow Core eager client error count.
:tensorflow:core:graph_build_calls counter Tensorflow Core graph build call count.
:tensorflow:core:graph_build_time_usecs counter Tensorflow Core graph build time in usecs.
:tensorflow:core:graph_run_time_usecs counter Tensorflow Core graph run time in usecs.
:tensorflow:core:graph_run_time_usecs_histogram histogram Tensorflow Core graph run time in usecs as a histogram metric.
:tensorflow:core:graph_runs counter Tensorflow Core graph run count.
:tensorflow:core:saved_model:read:api counter Tensorflow Core saved model read API count.
:tensorflow:core:saved_model:read:count counter Tensorflow Core saved model read count.
:tensorflow:core:session_created gauge Current number of created and alive Tensorflow Core session objects.
:tensorflow:serving:batching_session:queuing_latency histogram Tensorflow Serving batching session queuing latency.
:tensorflow:serving:batching_session:wrapped_run_count counter Tensorflow Serving batching session wrapped run count.
:tensorflow:serving:mlmd_map gauge Tensorflow Serving Mlmd map count.
:tensorflow:serving:model_warmup_latency histogram Tensorflow Serving model warmup latency.
:tensorflow:serving:request_count counter Tensorflow Serving request count.
:tensorflow:serving:request_example_count_total counter Tensorflow Serving request example count total.
:tensorflow:serving:request_example_counts histogram Tensorflow Serving request example counts.
:tensorflow:serving:request_latency histogram Tensorflow Serving request latency.
:tensorflow:serving:request_log_count counter Tensorflow Serving request log count.
:tensorflow:serving:runtime_latency histogram Tensorflow Serving runtime latency.
:tensorflow:tpu:op_error_count counter Tensorflow TPU Op error count.
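
To check which of the metrics above your endpoint actually exposes, you can fetch /monitoring/prometheus/metrics and extract the metric names from the response. The sketch below parses a small illustrative payload using only the Python standard library; the sample text and label values are made up.

```python
import re

# Illustrative Prometheus exposition text, in the format TensorFlow Serving
# publishes at /monitoring/prometheus/metrics (values here are made up).
sample = """\
# TYPE :tensorflow:serving:request_count counter
:tensorflow:serving:request_count{model_name="demo",status="OK"} 42
# TYPE :tensorflow:core:graph_runs counter
:tensorflow:core:graph_runs 7
"""

def metric_names(exposition_text):
    """Return the unique metric names found in Prometheus exposition text."""
    names = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "<name>{labels} <value>"; take the leading name.
        match = re.match(r"([^{\s]+)", line)
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names

print(metric_names(sample))
```

In a live check, the `sample` string would be replaced by the body of an HTTP GET against the metrics endpoint.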

Attributes

The following resource attributes for TensorFlow Serving are derived from the metric source.

Attribute name Description
API Indicates the type of API operation performed by the model. For example, prediction, inference, training, or embedding.
model_name Name of the model being used to process the request or generate results.
model_path File system or storage location where the model is stored or loaded from.

The following resource attributes for TensorFlow Serving may be added by the Splunk Distribution of the OpenTelemetry Collector, depending on where the Prometheus receiver is defined.

Attribute name Description
AWSUniqueId Unique identifier associated with the AWS resource or account.
cloud.account.id Identifier of the cloud account where the resource is running.
cloud.availability_zone Availability Zone in which the resource is deployed.
cloud.platform Cloud platform in use. For example, AWS, GCP, or Azure.
cloud.provider Name of the cloud service provider.
cloud.region Geographic region of the cloud deployment.
host.id Unique identifier of the host machine.
host.image.id Identifier of the machine image used to create the host.
host.name Hostname of the machine where the service is running.
host.type Type or instance class of the host machine.
k8s.cluster.name Name of the Kubernetes cluster.
k8s.node.name Name of the Kubernetes node.
metric_source Origin or source from which the metric is generated.
os.type Operating system type of the host. For example: linux, windows.
server.address Network address (IP or hostname) of the server.
server.port Network port on which the server is listening.
service.instance.id Unique identifier of the running service instance.
service.name Logical name of the service emitting the metric.
url.scheme URL scheme used for communication. For example: http, https.