Configure the Prometheus receiver to collect Kubeflow Pipelines metrics
You can monitor the performance of Kubeflow by configuring the Splunk Distribution of the OpenTelemetry Collector to send Kubeflow metrics to Splunk Observability Cloud.
This solution uses the Prometheus receiver to collect metrics from the Kubeflow ML Pipeline service, which exposes Prometheus-compatible endpoints for monitoring pipeline runs, workflow status, and performance metrics. For example, metrics can be accessed at endpoints such as http://<kubeflow-pipeline-service>:8888/metrics.
- Deploy the Splunk Distribution of the OpenTelemetry Collector to your host or container platform.
- To manually activate the Prometheus receiver for Kubeflow, make the following changes to your Collector values.yaml configuration file.
- Restart the Splunk Distribution of the OpenTelemetry Collector.
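The activation step above can be sketched as a values.yaml fragment for the Collector's Helm chart. This is a minimal sketch, not a complete configuration: the `prometheus/kubeflow` receiver name and the scrape interval are illustrative choices, and the target reuses the placeholder service name and port from the example endpoint above, so adjust them to match your deployment. Note that overriding the `metrics` pipeline replaces the chart's default receiver list, so keep any receivers you still need.

```yaml
agent:
  config:
    receivers:
      # Illustrative receiver name; any prometheus/<suffix> name works.
      prometheus/kubeflow:
        config:
          scrape_configs:
            - job_name: kubeflow-pipelines
              scrape_interval: 10s
              static_configs:
                # Placeholder target from the example endpoint above.
                - targets: ['<kubeflow-pipeline-service>:8888']
    service:
      pipelines:
        metrics:
          # Add the new receiver to the metrics pipeline.
          receivers: [prometheus/kubeflow]
```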
Configuration settings
To view the configuration options for the Prometheus receiver, see Settings.
Metrics
The following metrics are available for Kubeflow.
These metrics are considered custom metrics in Splunk Observability Cloud.
| Metric name | Type | Description |
|---|---|---|
| experiment_server_archive_requests | counter | The total number of ArchiveExperiment requests. |
| experiment_server_create_requests | counter | The total number of CreateExperiment requests. |
| experiment_server_delete_requests | counter | The total number of DeleteExperiment requests. |
| experiment_server_get_requests | counter | The total number of GetExperiment requests. |
| experiment_server_list_requests | counter | The total number of ListExperimentsV1 requests. |
| experiment_server_run_count | gauge | The current number of experiments in the Kubeflow Pipelines instance. |
| experiment_server_unarchive_requests | counter | The total number of UnarchiveExperiment requests. |
| job_server_create_requests | counter | The total number of CreateJob requests. |
| job_server_delete_requests | counter | The total number of DeleteJob requests. |
| job_server_disable_requests | counter | The total number of DisableJob requests. |
| job_server_enable_requests | counter | The total number of EnableJob requests. |
| job_server_get_requests | counter | The total number of GetJob requests. |
| job_server_job_count | gauge | The current number of jobs in the Kubeflow Pipelines instance. |
| job_server_list_requests | counter | The total number of ListJobs requests. |
| pipeline_server_create_requests | counter | The total number of CreatePipeline requests. |
| pipeline_server_create_version_requests | counter | The total number of CreatePipelineVersion requests. |
| pipeline_server_delete_requests | counter | The total number of DeletePipeline requests. |
| pipeline_server_delete_version_requests | counter | The total number of DeletePipelineVersion requests. |
| pipeline_server_get_requests | counter | The total number of GetPipeline requests. |
| pipeline_server_get_version_requests | counter | The total number of GetPipelineVersion requests. |
| pipeline_server_list_requests | counter | The total number of ListPipelines requests. |
| pipeline_server_pipeline_count | gauge | The current number of pipelines in the Kubeflow Pipelines instance. |
| pipeline_server_pipeline_version_count | gauge | The current number of pipeline versions in the Kubeflow Pipelines instance. |
| pipeline_server_update_default_version_requests | counter | The total number of UpdatePipelineDefaultVersion requests. |
| process_cpu_seconds_total | counter | Total user and system CPU time spent, in seconds. |
| process_max_fds | gauge | Maximum number of open file descriptors. |
| process_network_receive_bytes_total | counter | Number of bytes received by the process over the network. |
| process_network_transmit_bytes_total | counter | Number of bytes sent by the process over the network. |
| process_open_fds | gauge | Number of open file descriptors. |
| process_resident_memory_bytes | gauge | Resident memory size in bytes. |
| run_server_archive_requests | counter | The total number of ArchiveRun requests. |
| run_server_create_requests | counter | The total number of CreateRun requests. |
| run_server_delete_requests | counter | The total number of DeleteRun requests. |
| run_server_get_requests | counter | The total number of GetRun requests. |
| run_server_list_requests | counter | The total number of ListRuns requests. |
| run_server_read_artifact_requests | counter | The total number of ReadArtifact requests. |
| run_server_report_metrics_requests | counter | The total number of ReportRunMetrics requests. |
| run_server_retry_requests | counter | The total number of RetryRun requests. |
| run_server_run_count | gauge | The current number of runs in the Kubeflow Pipelines instance. |
| run_server_terminate_requests | counter | The total number of TerminateRun requests. |
| run_server_unarchive_requests | counter | The total number of UnarchiveRun requests. |
| resource_manager_workflow_runs_success | gauge | The current number of successful workflow runs. |
| resource_manager_workflow_runs_failed | gauge | The current number of failed workflow runs. |
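The metric types in the table follow Prometheus conventions: counters only increase (compute rates over them), while gauges report a current value. As an illustration, the following sketch parses a small, hypothetical sample of the Prometheus text exposition format that the Kubeflow Pipelines service serves at its metrics endpoint. The metric values are made up; the parser is a simplified illustration, not part of the Collector.

```python
# Hypothetical sample of Prometheus text exposition format; the values
# are illustrative, not real Kubeflow output.
SAMPLE = """\
# TYPE experiment_server_run_count gauge
experiment_server_run_count 12
# TYPE run_server_create_requests counter
run_server_create_requests 340
"""

def parse_metrics(text):
    """Return (types, values) dicts from Prometheus text format.

    Simplified: ignores labels, timestamps, and HELP lines.
    """
    types, values = {}, {}
    for line in text.splitlines():
        if line.startswith("# TYPE"):
            _, _, name, mtype = line.split()
            types[name] = mtype
        elif line and not line.startswith("#"):
            name, value = line.split()
            values[name] = float(value)
    return types, values

types, values = parse_metrics(SAMPLE)
print(types["run_server_create_requests"])   # counter
print(values["experiment_server_run_count"])  # 12.0
```

A counter such as run_server_create_requests is typically charted as a rate of change, while a gauge such as experiment_server_run_count is charted as-is.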
Attributes
The following resource attributes are available for Kubeflow.
| Attribute name | Description |
|---|---|
| host.name | Hostname of the machine where the service is running. |
| k8s.cluster.name | Name of the Kubernetes cluster. |
| k8s.node.name | Name of the Kubernetes node. |
| os.type | Operating system type of the host. For example: linux, windows. |
| profile | Tenant or ownership boundary under which the workload is running. |
| server.address | Network address (IP or hostname) of the server. |
| server.port | Network port number on which the server is listening. |
| service.instance.id | Unique identifier of the running service instance. |
| service.name | Logical name of the service emitting the metric. |
| url.scheme | URL scheme used to access the service. |
| workflow | The pipeline or workflow execution that produced the metric. |
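Some of these attributes are attached by the Collector rather than reported by Kubeflow itself. For example, in the Collector's Helm chart the required `clusterName` value populates the k8s.cluster.name resource attribute on exported telemetry; the cluster name below is a hypothetical example.

```yaml
# values.yaml excerpt: clusterName is required by the Helm chart and
# becomes the k8s.cluster.name resource attribute.
clusterName: my-cluster  # hypothetical cluster name
```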
Next steps
After you set up data collection, the data populates built-in dashboards that you can use to monitor and troubleshoot your instances.
For more information on using built-in dashboards in Splunk Observability Cloud, see: