Configure the Prometheus receiver to collect Ray cluster metrics

Learn how to configure the Prometheus receiver to collect Ray cluster metrics.

You can monitor the performance of large language model (LLM) applications running on Ray clusters by configuring your Ray cluster applications to send metrics to Splunk Observability Cloud. This solution uses the Prometheus receiver to collect metrics from Ray, which exposes a /metrics endpoint that publishes Prometheus-compatible metrics.
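
For example, scraping the Ray head node's /metrics endpoint returns plain-text data in the Prometheus exposition format, similar to the following illustrative output. The label values shown are placeholders and vary by deployment:

  # HELP ray_node_cpu_utilization Total CPU usage on a Ray node.
  # TYPE ray_node_cpu_utilization gauge
  ray_node_cpu_utilization{SessionName="session_abc",Version="2.9.0"} 12.5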

Complete the following high-level steps to collect metrics from and monitor your Ray cluster applications.

  1. Ensure that you meet the prerequisites.
  2. Configure and activate the component for Ray.
  3. Use the Ray navigator to monitor the performance of your Ray cluster applications.

Prerequisites

Learn about the prerequisites for configuring the Prometheus receiver to collect Ray cluster metrics.

To configure the Prometheus receiver to collect metrics from a Ray cluster, you must first deploy Ray locally or on a cloud server.

Configure and activate the component for Ray

Learn how to configure and activate the component for Ray.

Complete the following steps to configure and activate the component for Ray.
  1. Deploy the Splunk Distribution of the OpenTelemetry Collector to your host or container platform.
  2. To activate the Prometheus receiver for your Ray cluster manually in the Collector configuration, make the following changes to your configuration file (a complete example appears after these steps):
    1. Add prometheus/ray to the receivers section. For example:
      prometheus/ray:
        config:
          scrape_configs:
            - job_name: ray-metrics
              metrics_path: /metrics
              static_configs:
                - targets: ['localhost:8080']
    2. Add prometheus/ray to the metrics pipeline of the service section. For example:
      service:
        pipelines:
          metrics:
            receivers: [prometheus/ray]
  3. Restart the Splunk Distribution of the OpenTelemetry Collector.
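
For reference, a complete minimal configuration that ties these pieces together might look like the following sketch. The signalfx exporter and the SPLUNK_ACCESS_TOKEN and SPLUNK_REALM environment variables are assumptions based on a typical Splunk Distribution of the OpenTelemetry Collector setup; substitute the exporter, processor, and pipeline entries your deployment already uses.

  receivers:
    prometheus/ray:
      config:
        scrape_configs:
          - job_name: ray-metrics
            metrics_path: /metrics
            static_configs:
              # Ray head node metrics endpoint; adjust host and port to your deployment
              - targets: ['localhost:8080']

  exporters:
    # Assumed exporter: sends metrics to Splunk Observability Cloud
    signalfx:
      access_token: ${SPLUNK_ACCESS_TOKEN}
      realm: ${SPLUNK_REALM}

  service:
    pipelines:
      metrics:
        receivers: [prometheus/ray]
        exporters: [signalfx]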

Monitor the performance of Ray cluster applications

Learn how to access the Ray navigator, which you can use to monitor the performance of Ray cluster applications.

Complete the following steps to access the Ray navigator and monitor the performance of Ray cluster applications. For more information on navigators, see Use navigators.

  1. From the Splunk Observability Cloud main menu, select Infrastructure.
  2. Under AI/ML, select AI Frameworks.
  3. Select the Ray summary card.

Configuration settings

Learn about the configuration settings for the Prometheus receiver.

To view the configuration options for the Prometheus receiver, see Settings.
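
The receiver accepts standard Prometheus scrape configuration options. For example, the following sketch sets an explicit scrape interval and timeout; the values shown are illustrative, not recommendations:

  prometheus/ray:
    config:
      scrape_configs:
        - job_name: ray-metrics
          # How often to scrape the endpoint, and how long to wait for a response
          scrape_interval: 10s
          scrape_timeout: 5s
          metrics_path: /metrics
          static_configs:
            - targets: ['localhost:8080']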

Metrics

Learn about the available metrics for Ray cluster applications.

The following metrics are available for Ray cluster applications. For more information on these metrics, see System metrics in the Ray documentation.
Metric name | Type | Unit | Description
ray_actors | gauge | count | The number of actor processes.
ray_cluster_active_nodes | gauge | count | The number of active Ray nodes in the cluster.
ray_component_cpu_percentage | gauge | percent | Total CPU usage of the components on a node.
ray_component_mem_shared_bytes | gauge | bytes | SHM usage of all components of the node. Equivalent to the top command's SHR column.
ray_component_rss_mb | gauge | megabytes | RSS usage of all components on the node.
ray_gcs_placement_group_count | gauge | count | Number of placement groups broken down by state {Registered, Pending, Infeasible}.
ray_gcs_storage_operation_count_total | cumulative counter | counter | Number of operations invoked on GCS (Global Control Service) storage.
ray_gcs_storage_operation_latency_ms | histogram | ms | Time to invoke an operation on GCS (Global Control Service) storage.
ray_gcs_task_manager_task_events_dropped | gauge | count | Number of task events dropped per type {PROFILEEVENT, STATUSEVENT}.
ray_grpc_server_req_finished_total | cumulative counter | counter | Number of finished requests in the gRPC server.
ray_grpc_server_req_handling_total | cumulative counter | counter | Number of handling requests in the gRPC server.
ray_grpc_server_req_new_total | count | count | Number of new requests in the gRPC server.
ray_grpc_server_req_process_time_ms | histogram | ms | Request latency in the gRPC server.
ray_internal_num_infeasible_scheduling_classes | gauge | count | The number of unique scheduling classes that are infeasible.
ray_internal_num_processes_skipped_job_mismatch | gauge | count | The total number of cached workers skipped due to job mismatch.
ray_internal_num_processes_skipped_runtime_environment_mismatch | gauge | count | The total number of cached workers skipped due to runtime environment mismatch.
ray_internal_num_processes_started | count | count | The number of Ray worker processes started.
ray_internal_num_processes_started_from_cache | gauge | count | The total number of workers started from a cached worker process.
ray_internal_num_spilled_tasks | gauge | count | The cumulative number of lease requests that this raylet has spilled to other raylets.
ray_node_cpu_count | gauge | cpu | Total CPUs available on a Ray node.
ray_node_cpu_utilization | gauge | percent | Total CPU usage on a Ray node.
ray_node_disk_io_read_speed | gauge | bytes/s | Disk read speed.
ray_node_disk_io_write_count | gauge | operations | Total write operations to disk.
ray_node_disk_io_write_speed | gauge | bytes/s | Disk write speed.
ray_node_disk_read_iops | gauge | operations/s | Disk read IOPS.
ray_node_disk_utilization_percentage | gauge | percent | Total disk utilization (percentage) on a Ray node.
ray_node_disk_write_iops | gauge | operations/s | Disk write IOPS.
ray_node_mem_total | gauge | bytes | Total memory on a Ray node.
ray_node_mem_used | gauge | bytes | Memory usage on a Ray node.
ray_node_network_received | gauge | bytes | Total network received.
ray_node_network_send_speed | gauge | bytes/s | Network send speed.
ray_node_network_sent | gauge | bytes | Total network sent.
ray_object_directory_added_locations | gauge | location | Number of object locations added per second. If this is high, a lot of objects have been added on this node.
ray_object_directory_lookups | gauge | count | Number of object location lookups per second. If this is high, the raylet is waiting on a high number of objects.
ray_object_directory_removed_locations | gauge | count | Number of object locations removed per second. If this is high, a high number of objects have been removed from this node.
ray_object_directory_subscriptions | gauge | count | Number of object location subscriptions. If this is high, the raylet is attempting to pull a high number of objects.
ray_object_directory_updates | gauge | count | Number of object location updates per second. If this is high, the raylet is attempting to pull a high number of objects, or the locations of objects are frequently changing (for example, due to many object copies or evictions).
ray_object_manager_bytes | gauge | bytes | Number of bytes pushed or received by type {PushedFromLocalPlasma, PushedFromLocalDisk, Received}.
ray_object_manager_num_pull_requests | gauge | count | Number of active pull requests for objects.
ray_object_manager_received_chunks | gauge | count | Number of object chunks received by type {Total, FailedTotal, FailedCancelled, FailedPlasmaFull}.
ray_object_store_available_memory | gauge | bytes | Amount of memory currently available in the object store.
ray_object_store_fallback_memory | gauge | bytes | Amount of memory in fallback allocations in the filesystem.
ray_object_store_memory | gauge | bytes | Object store memory by various sub-kinds on this node.
ray_object_store_used_memory | gauge | bytes | Used memory in the Ray object store.
ray_object_store_num_local_objects | gauge | count | Number of objects currently in the object store.
ray_pull_manager_active_bundles | gauge | count | Number of active bundle requests.
ray_pull_manager_num_object_pins | count | count | Number of object pin attempts by the pull manager. Can be {Success, Failure}.
ray_pull_manager_requests | count | count | Number of requested bundles per type {Get, Wait, TaskArgs}.
ray_pull_manager_retries_total | gauge | count | Number of cumulative pull retries.
ray_pull_manager_usage_bytes | gauge | bytes | The total number of bytes in use per type {Available, BeingPulled, Pinned}.
ray_push_manager_chunks | count | count | Number of data chunks pushed by the push manager.
ray_push_manager_in_flight_pushes | gauge | count | Number of in-flight object push requests.
ray_resources | gauge | resource | Logical Ray resources by state {AVAILABLE, USED}.
ray_scheduler_failed_worker_startup_total | gauge | count | Number of tasks that fail to be scheduled because workers were not available. Labels are broken up per reason {JobConfigMissing, RegistrationTimedOut, RateLimited}.
ray_scheduler_tasks | gauge | count | Number of tasks waiting for scheduling by state {Cancelled, Executing, Waiting, Dispatched, Received}.
ray_scheduler_unscheduleable_tasks | gauge | count | Number of pending tasks (not schedulable tasks) by reason {Infeasible, WaitingForResources, WaitingForPlasmaMemory, WaitingForRemoteResources, WaitingForWorkers}.
ray_serve_controller_control_loop_duration_s | gauge | seconds | The duration of the Serve controller's control loop.
ray_serve_deployment_queued_queries | gauge | count | The current number of queries to this deployment waiting to be assigned to a replica.
ray_serve_deployment_replica_healthy | gauge | boolean | Tracks whether this deployment replica is healthy. 1 means healthy, 0 means unhealthy.
ray_serve_num_deployment_http_error_requests | gauge | count | The number of non-200 HTTP responses returned by each deployment.
ray_serve_num_http_error_requests | gauge | count | The number of non-200 HTTP responses.
ray_serve_num_http_requests | gauge | count | The number of HTTP requests processed.
ray_spill_manager_objects | gauge | count | Number of local objects by state {Pinned, PendingRestore, PendingSpill}.
ray_spill_manager_objects_bytes | gauge | bytes | Byte size of local objects by state {Pinned, PendingSpill}.
ray_spill_manager_request_total | gauge | count | Number of {spill, restore} requests.
ray_tasks | gauge | count | Current number of tasks in a particular state, by state.
ray_worker_register_time_ms | histogram | ms | End-to-end latency of worker registration.
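
If you need only a subset of these metrics, you can reduce ingest volume with Prometheus relabeling rules in the scrape configuration. The following is a minimal sketch, assuming you want to keep only node-level metrics; the regular expression is illustrative:

  prometheus/ray:
    config:
      scrape_configs:
        - job_name: ray-metrics
          metrics_path: /metrics
          static_configs:
            - targets: ['localhost:8080']
          metric_relabel_configs:
            # Keep only metrics whose names start with ray_node_ (illustrative filter)
            - source_labels: [__name__]
              regex: 'ray_node_.*'
              action: keep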

Attributes

Learn about the available attributes for Ray clusters.

The following attributes are available for all supported Ray metrics:

  • host_kernel_release

  • host_physical_cpus

  • server.address

  • host_cpu_cores

  • host_kernel_version

  • SessionName

  • net.host.port

  • http.scheme

  • url.scheme

  • service.instance.id

  • server.port

  • k8s.node.name

  • sf_environment

  • node_type

  • Version

  • host_kernel_name

  • host.name

  • deployment.environment

  • host_machine

  • net.host.name

  • k8s.cluster.name

  • os.type

  • host_mem_total

  • host_logical_cpus

  • sf_service

  • host_processor

  • service.name
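
Some of these attributes, such as deployment.environment, can be set on the Collector side if your Ray deployment doesn't supply them. The following is a minimal sketch using the OpenTelemetry Collector's resource processor; the processor name, the attribute value, and the pipeline entries are illustrative assumptions:

  processors:
    resource/ray:
      attributes:
        # Hypothetical example: tag all Ray metrics with an environment
        - key: deployment.environment
          value: ray-demo
          action: upsert

  service:
    pipelines:
      metrics:
        receivers: [prometheus/ray]
        processors: [resource/ray]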

Troubleshoot

Learn how to get help if you can't see your data in Splunk Observability Cloud.

If you are not able to see your data in Splunk Observability Cloud, you can get help in the following way:

  • Prospective customers and free trial users can ask a question and get answers through community support in the Splunk Community.