Configure the Prometheus receiver to collect Kubeflow Pipelines metrics

You can monitor the performance of Kubeflow by configuring the Splunk Distribution of the OpenTelemetry Collector to send Kubeflow metrics to Splunk Observability Cloud.

This solution uses the Prometheus receiver to collect metrics from the Kubeflow ML Pipeline service, which exposes Prometheus-compatible endpoints for monitoring pipeline runs, workflow status, and performance metrics. For example, metrics can be accessed at endpoints such as http://<kubeflow-pipeline-service>:8888/metrics.

Kubeflow namespace isolation, NetworkPolicies, or service meshes like Istio may block cross-namespace traffic by default. If your Kubeflow instance uses any of these features, verify that the target endpoint is reachable from the Splunk Distribution of the OpenTelemetry Collector instance.
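
If a NetworkPolicy is what blocks the scrape traffic, a policy along the following lines might open the metrics port to the Collector. This is a minimal sketch, not a tested manifest: the policy name, the kubeflow namespace, and the splunk-otel Collector namespace are assumptions to replace with your own values. The app: ml-pipeline label and port 8888 are taken from the examples in this topic.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otel-collector-scrape    # hypothetical policy name
  namespace: kubeflow                  # assumed namespace of the ml-pipeline pods
spec:
  # Apply the policy to the pods that expose the metrics endpoint.
  podSelector:
    matchLabels:
      app: ml-pipeline
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Allow traffic from every pod in the Collector namespace.
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: splunk-otel   # assumed Collector namespace
    ports:
    - protocol: TCP
      port: 8888
```

This covers NetworkPolicies only. Service mesh rules, such as Istio authorization policies, are configured separately.
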
  1. Deploy the Splunk Distribution of the OpenTelemetry Collector to your host or container platform.
  2. To manually activate the Prometheus receiver for Kubeflow, make the following changes to your Collector values.yaml configuration file.
    1. Add prometheus/kubeflow to the receivers section. For example:
      YAML
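      prometheus/kubeflow:
        config:
          scrape_configs:
          - job_name: 'kubeflow-ml-pipeline'
            scrape_interval: 10s
            metrics_path: /metrics
            kubernetes_sd_configs:
            - role: pod
            relabel_configs:
            # Keep only pods labeled app=ml-pipeline.
            - source_labels: [__meta_kubernetes_pod_label_app]
              action: keep
              regex: ml-pipeline
            # Keep only pods annotated with prometheus.io/scrape: "true".
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
            # Use the prometheus.io/path annotation, if set, as the metrics path.
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
            # Rewrite the scrape address to use the prometheus.io/port annotation.
            # The Collector treats "$" as the start of a variable, so "$$" yields a literal "$".
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              target_label: __address__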
    2. Add prometheus/kubeflow to the metrics pipeline of the service section. For example:
      YAML
      service:   
        pipelines:   
          metrics:   
            receivers: [prometheus/kubeflow]
  3. Restart the Splunk Distribution of the OpenTelemetry Collector.
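
The relabel rules in the receiver example keep only pods that carry the app: ml-pipeline label and the prometheus.io/scrape annotation, and they read the metrics path and port from the prometheus.io/path and prometheus.io/port annotations. If your ml-pipeline pods don't carry these annotations, the scrape job finds no targets. The following excerpt is a hypothetical sketch of the pod template metadata the rules expect. The values assume the default /metrics path and port 8888 mentioned earlier, so check them against your own deployment.

```yaml
# Hypothetical excerpt of the ml-pipeline Deployment pod template.
spec:
  template:
    metadata:
      labels:
        app: ml-pipeline
      annotations:
        prometheus.io/scrape: "true"    # matched by the "keep" rule
        prometheus.io/path: "/metrics"  # used as the metrics path
        prometheus.io/port: "8888"      # used in the rewritten scrape address
```

Annotation values must be strings, which is why the boolean and the port number are quoted.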

Configuration settings

To view the configuration options for the Prometheus receiver, see Settings.

Metrics

The following metrics are available for Kubeflow.

These metrics are considered custom metrics in Splunk Observability Cloud.

Metric name Type Description
experiment_server_archive_requests counter The total number of ArchiveExperiment requests.
experiment_server_create_requests counter The total number of CreateExperiment requests.
experiment_server_delete_requests counter The total number of DeleteExperiment requests.
experiment_server_get_requests counter The total number of GetExperiment requests.
experiment_server_list_requests counter The total number of ListExperimentsV1 requests.
experiment_server_run_count gauge The current number of experiments in the Kubeflow Pipelines instance.
experiment_server_unarchive_requests counter The total number of UnarchiveExperiment requests.
job_server_create_requests counter The total number of CreateJob requests.
job_server_delete_requests counter The total number of DeleteJob requests.
job_server_disable_requests counter The total number of DisableJob requests.
job_server_enable_requests counter The total number of EnableJob requests.
job_server_get_requests counter The total number of GetJob requests.
job_server_job_count gauge The current number of jobs in the Kubeflow Pipelines instance.
job_server_list_requests counter The total number of ListJobs requests.
pipeline_server_create_requests counter The total number of CreatePipeline requests.
pipeline_server_create_version_requests counter The total number of CreatePipelineVersion requests.
pipeline_server_delete_requests counter The total number of DeletePipeline requests.
pipeline_server_delete_version_requests counter The total number of DeletePipelineVersion requests.
pipeline_server_get_requests counter The total number of GetPipeline requests.
pipeline_server_get_version_requests counter The total number of GetPipelineVersion requests.
pipeline_server_list_requests counter The total number of ListPipelines requests.
pipeline_server_pipeline_count gauge The current number of pipelines in the Kubeflow Pipelines instance.
pipeline_server_pipeline_version_count gauge The current number of pipeline versions in the Kubeflow Pipelines instance.
pipeline_server_update_default_version_requests counter The total number of UpdatePipelineDefaultVersion requests.
process_cpu_seconds_total counter Total user and system CPU time spent in seconds.
process_max_fds gauge Maximum number of open file descriptors.
process_network_receive_bytes_total counter Number of bytes received by the process over the network.
process_network_transmit_bytes_total counter Number of bytes sent by the process over the network.
process_open_fds gauge Number of open file descriptors.
process_resident_memory_bytes gauge Resident memory size in bytes.
run_server_archive_requests counter The total number of ArchiveRun requests.
run_server_create_requests counter The total number of CreateRun requests.
run_server_delete_requests counter The total number of DeleteRun requests.
run_server_get_requests counter The total number of GetRun requests.
run_server_list_requests counter The total number of ListRuns requests.
run_server_read_artifact_requests counter The total number of ReadArtifact requests.
run_server_report_metrics_requests counter The total number of ReportRunMetrics requests.
run_server_retry_requests counter The total number of RetryRun requests.
run_server_run_count gauge The current number of runs in the Kubeflow Pipelines instance.
run_server_terminate_requests counter The total number of TerminateRun requests.
run_server_unarchive_requests counter The total number of UnarchiveRun requests.
resource_manager_workflow_runs_success gauge The current number of successful workflow runs.
resource_manager_workflow_runs_failed gauge The current number of failed workflow runs.
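
Because these metrics count as custom metrics, you might want to scrape only the ones you need. One option is the standard Prometheus metric_relabel_configs setting, which drops series after the scrape and before they enter the pipeline. The fragment below is a sketch that assumes you don't need the process_* metrics. Adjust the regular expression to match your own requirements.

```yaml
# Fragment to add under the kubeflow-ml-pipeline scrape job, at the same
# level as relabel_configs. Assumption: process_* metrics are not needed.
metric_relabel_configs:
- source_labels: [__name__]
  action: drop
  regex: process_.*
```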

Attributes

The following resource attributes are available for Kubeflow.

Attribute name Description
host.name Hostname of the machine where the service is running.
k8s.cluster.name Name of the Kubernetes cluster.
k8s.node.name Name of the Kubernetes node.
os.type Operating system type of the host. For example: linux, windows.
profile Tenant or ownership boundary under which the workload is running.
server.address Network address (IP or hostname) of the server.
service.instance.id Unique identifier of the running service instance.
service.name Logical name of the service emitting the metric.
server.port Network port number on which the server is listening.
url.scheme URL scheme used to access the service.
workflow The pipeline or workflow execution that produced the metric.