Configure the Prometheus receiver to collect TensorFlow Serving metrics
You can monitor the performance of your TensorFlow Serving system by configuring the Splunk Distribution of the OpenTelemetry Collector to send TensorFlow Serving metrics to Splunk Observability Cloud.
This solution uses the Prometheus receiver to collect metrics from TensorFlow Serving, which exposes a customizable API endpoint that publishes Prometheus-compatible metrics.
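TensorFlow Serving does not publish Prometheus metrics by default; you turn the endpoint on through a monitoring configuration file passed to the model server with the `--monitoring_config_file` flag. The following is a minimal sketch of that file; the path shown is the commonly used value, but it is customizable:

```
prometheus_config {
  # Enable the Prometheus-compatible metrics endpoint.
  enable: true
  # Path at which metrics are published on the REST API port.
  path: "/monitoring/prometheus/metrics"
}
```

With a configuration like this, TensorFlow Serving publishes metrics on its REST API port (8501 by default) at the configured path.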
Configuration settings
To view the configuration options for the Prometheus receiver, see Settings.
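As a sketch, a Prometheus receiver entry that scrapes TensorFlow Serving might look like the following. The job name, scrape interval, target address, metrics path, and exporter name are illustrative assumptions; adjust them to match your deployment and your existing Collector configuration:

```yaml
receivers:
  prometheus/tensorflow:
    config:
      scrape_configs:
        - job_name: tensorflow-serving     # illustrative job name
          scrape_interval: 10s             # adjust to your needs
          metrics_path: /monitoring/prometheus/metrics
          static_configs:
            - targets: ["localhost:8501"]  # TensorFlow Serving REST API port

service:
  pipelines:
    metrics:
      receivers: [prometheus/tensorflow]
      exporters: [signalfx]  # assumes the default Splunk exporter in your pipeline
```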
Metrics
The following metrics are available for TensorFlow Serving.
These metrics are considered custom metrics in Splunk Observability Cloud.
| Metric name | Type | Description |
|---|---|---|
| :tensorflow:cc:saved_model:load_attempt_count | counter | TensorFlow C++ API saved model load attempt count. |
| :tensorflow:cc:saved_model:load_latency | counter | TensorFlow C++ API saved model load latency. |
| :tensorflow:cc:saved_model:load_latency_by_stage | histogram | TensorFlow C++ API saved model load latency by stage. |
| :tensorflow:core:direct_session_runs | counter | TensorFlow Core direct session run count. |
| :tensorflow:core:eager_client_error_count | counter | TensorFlow Core eager client error count. |
| :tensorflow:core:graph_build_calls | counter | TensorFlow Core graph build call count. |
| :tensorflow:core:graph_build_time_usecs | counter | TensorFlow Core graph build time in usecs. |
| :tensorflow:core:graph_run_time_usecs | counter | TensorFlow Core graph run time in usecs. |
| :tensorflow:core:graph_run_time_usecs_histogram | histogram | TensorFlow Core graph run time in usecs as a histogram metric. |
| :tensorflow:core:graph_runs | counter | TensorFlow Core graph run count. |
| :tensorflow:core:saved_model:read:api | counter | TensorFlow Core saved model read API count. |
| :tensorflow:core:saved_model:read:count | counter | TensorFlow Core saved model read count. |
| :tensorflow:core:session_created | gauge | Current number of created and alive TensorFlow Core session objects. |
| :tensorflow:serving:batching_session:queuing_latency | histogram | TensorFlow Serving batching session queuing latency. |
| :tensorflow:serving:batching_session:wrapped_run_count | counter | TensorFlow Serving batching session wrapped run count. |
| :tensorflow:serving:mlmd_map | gauge | TensorFlow Serving MLMD map count. |
| :tensorflow:serving:model_warmup_latency | histogram | TensorFlow Serving model warmup latency. |
| :tensorflow:serving:request_count | counter | TensorFlow Serving request count. |
| :tensorflow:serving:request_example_count_total | counter | TensorFlow Serving request example count total. |
| :tensorflow:serving:request_example_counts | histogram | TensorFlow Serving request example counts. |
| :tensorflow:serving:request_latency | histogram | TensorFlow Serving request latency. |
| :tensorflow:serving:request_log_count | counter | TensorFlow Serving request log count. |
| :tensorflow:serving:runtime_latency | histogram | TensorFlow Serving runtime latency. |
| | counter | TensorFlow TPU Op error count. |
Attributes
The following resource attributes for TensorFlow Serving are derived from the metric source.
| Attribute name | Description |
|---|---|
| API | Indicates the type of API operation performed by the model. For example, prediction, inference, training, or embedding. |
| model_name | Name of the model being used to process the request or generate results. |
| model_path | File system or storage location where the model is stored or loaded from. |
The following resource attributes for TensorFlow Serving may be added by the Splunk Distribution of the OpenTelemetry Collector, depending on where the Prometheus receiver is defined.
| Attribute name | Description |
|---|---|
| AWSUniqueId | Unique identifier associated with the AWS resource or account. |
| cloud.account.id | Identifier of the cloud account where the resource is running. |
| cloud.availability_zone | Availability Zone in which the resource is deployed. |
| cloud.platform | Cloud platform in use. For example, AWS, GCP, or Azure. |
| cloud.provider | Name of the cloud service provider. |
| cloud.region | Geographic region of the cloud deployment. |
| host.id | Unique identifier of the host machine. |
| host.image.id | Identifier of the machine image used to create the host. |
| host.name | Hostname of the machine where the service is running. |
| host.type | Type or instance class of the host machine. |
| k8s.cluster.name | Name of the Kubernetes cluster. |
| k8s.node.name | Name of the Kubernetes node. |
| metric_source | Origin or source from which the metric is generated. |
| os.type | Operating system type of the host. For example, linux or windows. |
| server.address | Network address (IP or hostname) of the server. |
| server.port | Network port on which the server is listening. |
| service.instance.id | Unique identifier of the running service instance. |
| service.name | Logical name of the service emitting the metric. |
| url.scheme | URL scheme used for communication. For example, http or https. |
Next steps
After you set up data collection, the data populates built-in dashboards that you can use to monitor and troubleshoot your instances.
For more information, see the documentation on using built-in dashboards in Splunk Observability Cloud.