Configure the Prometheus receiver to collect Ray cluster metrics
Learn how to configure the Prometheus receiver to collect Ray cluster metrics.
You can monitor the performance of large language model (LLM) applications that run on a Ray cluster by configuring the cluster to send metrics to Splunk Observability Cloud. This solution uses the Prometheus receiver to collect metrics from Ray, which exposes a /metrics endpoint that publishes Prometheus-compatible metrics.
Complete the following high-level steps to collect metrics from and monitor your Ray cluster applications.
Prerequisites
Learn about the prerequisites for configuring the Prometheus receiver to collect Ray cluster metrics.
To configure the Prometheus receiver to collect metrics from a Ray cluster, you must first deploy Ray locally or on a cloud server.
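If you only need a quick local deployment for testing, the following Python sketch starts Ray with an explicit metrics export port. The port value 8080 is an assumption for this example; use any free port and make sure it matches the scrape target you later configure in the Collector.

```python
# Minimal local Ray deployment sketch; assumes Ray is installed,
# for example with `pip install "ray[default]"`.
import ray

# Start a local Ray instance and expose its Prometheus-compatible metrics.
# The port below is an example value and must match the Collector's scrape target.
ray.init(_metrics_export_port=8080)
```

You can also start a head node from the command line with `ray start --head --metrics-export-port=8080`, which exposes the same endpoint without going through Python.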
Configure and activate the component for Ray
Learn how to configure and activate the component for Ray.
- Deploy the Splunk Distribution of the OpenTelemetry Collector to your host or container platform.
- To activate the Prometheus receiver for your Ray cluster manually in the Collector configuration, add the receiver to your configuration file, as shown in the example sketch after this list.
- Restart the Splunk Distribution of the OpenTelemetry Collector.
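The following configuration fragment is a minimal sketch rather than a definitive setup. It assumes that Ray exposes its metrics on port 8080 of the same host as the Collector and that the signalfx exporter from the default Splunk Distribution of the OpenTelemetry Collector configuration is available; adjust the scrape target, interval, and exporter for your environment.

```yaml
receivers:
  prometheus/ray:
    config:
      scrape_configs:
        - job_name: ray
          scrape_interval: 10s
          static_configs:
            # Assumption: Ray publishes metrics on localhost:8080.
            # Point this at the metrics export port of each Ray node you want to scrape.
            - targets: ["localhost:8080"]

service:
  pipelines:
    metrics:
      receivers: [prometheus/ray]
      # Assumes the signalfx exporter is already defined in the default agent configuration.
      exporters: [signalfx]
```

Using a named receiver instance such as prometheus/ray keeps the Ray scrape job separate from any other Prometheus scrape configuration the Collector already runs.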
Monitor the performance of Ray cluster applications
Learn how to navigate to the Ray navigator, which you can use to monitor the performance of Ray cluster applications.
Complete the following steps to access the Ray navigator and monitor the performance of Ray cluster applications. For more information on navigators, see Use navigators.
- From the Splunk Observability Cloud main menu, select Infrastructure.
- Under AI/ML, select AI Frameworks.
- Select the Ray summary card.
Configuration settings
Learn about the configuration settings for the Prometheus receiver.
To view the configuration options for the Prometheus receiver, see Settings.
Metrics
Learn about the available metrics for Ray cluster applications.
Metric name | Type | Unit | Description |
---|---|---|---|
ray_actors | gauge | count | The number of actor processes. |
ray_cluster_active_nodes | gauge | count | The number of active Ray nodes in the cluster. |
ray_component_cpu_percentage | gauge | percent | Total CPU usage of the components on a node. |
ray_component_mem_shared_bytes | gauge | bytes | SHM usage of all components of the node. Equivalent to the top command's SHR column. |
ray_component_rss_mb | gauge | megabytes | RSS usage of all components on the node. |
ray_gcs_placement_group_count | gauge | count | Number of placement groups broken down by state {Registered, Pending, Infeasible}. |
ray_gcs_storage_operation_count_total | cumulative counter | counter | Number of operations invoked on the Ray Global Control Store (GCS) storage. |
ray_gcs_storage_operation_latency_ms | histogram | ms | Time to invoke an operation on the Ray Global Control Store (GCS) storage. |
ray_gcs_task_manager_task_events_dropped | gauge | count | Number of task events dropped per type {PROFILEEVENT, STATUSEVENT}. |
ray_grpc_server_req_finished_total | cumulative counter | counter | Number of finished requests in the grpc server. |
ray_grpc_server_req_handling_total | cumulative counter | counter | Number of requests being handled in the grpc server. |
ray_grpc_server_req_new_total | cumulative counter | counter | Number of new requests in the grpc server. |
ray_grpc_server_req_process_time_ms | histogram | ms | Request latency in grpc server. |
ray_internal_num_infeasible_scheduling_classes | gauge | count | The number of unique scheduling classes that are infeasible. |
ray_internal_num_processes_skipped_job_mismatch | gauge | count | The total number of cached workers skipped due to job mismatch. |
ray_internal_num_processes_skipped_runtime_environment_mismatch | gauge | count | The total number of cached workers skipped due to runtime environment mismatch. |
ray_internal_num_processes_started | count | count | The number of Ray worker processes started. |
ray_internal_num_processes_started_from_cache | gauge | count | The total number of workers started from a cached worker process. |
ray_internal_num_spilled_tasks | gauge | count | The cumulative number of lease requests that this raylet has spilled to other raylets. |
ray_node_cpu_count | gauge | cpu | Total CPUs available on a Ray node. |
ray_node_cpu_utilization | gauge | percent | Total CPU usage on a Ray node. |
ray_node_disk_io_read_speed | gauge | bytes/s | Disk read speed. |
ray_node_disk_io_write_count | gauge | operations | Total write ops to disk. |
ray_node_disk_io_write_speed | gauge | bytes/s | Disk write speed. |
ray_node_disk_read_iops | gauge | operations/s | Disk read IOPS. |
ray_node_disk_utilization_percentage | gauge | percent | Total disk utilization (percentage) on a Ray node. |
ray_node_disk_write_iops | gauge | operations/s | Disk write IOPS. |
ray_node_mem_total | gauge | bytes | Total memory on a Ray node. |
ray_node_mem_used | gauge | bytes | Memory usage on a Ray node. |
ray_node_network_received | gauge | bytes | Total network received. |
ray_node_network_send_speed | gauge | bytes/s | Network send speed. |
ray_node_network_sent | gauge | bytes | Total network sent. |
ray_object_directory_added_locations | gauge | location | Number of object locations added per second. If this is high, a lot of objects have been added on this node. |
ray_object_directory_lookups | gauge | count | Number of object location lookups per second. If this is high, the raylet is waiting on a high number of objects. |
ray_object_directory_removed_locations | gauge | count | Number of object locations removed per second. If this is high, a high number of objects have been removed from this node. |
ray_object_directory_subscriptions | gauge | count | Number of object location subscriptions. If this is high, the raylet is attempting to pull a high number of objects. |
ray_object_directory_updates | gauge | count | Number of object location updates per second. If this is high, the raylet is attempting to pull a high number of objects and/or the locations for objects are frequently changing (e.g. due to many object copies or evictions). |
ray_object_manager_bytes | gauge | bytes | Number of bytes pushed or received by type {PushedFromLocalPlasma, PushedFromLocalDisk, Received}. |
ray_object_manager_num_pull_requests | gauge | count | Number of active pull requests for objects. |
ray_object_manager_received_chunks | gauge | count | Number of object chunks received by type {Total, FailedTotal, FailedCancelled, FailedPlasmaFull}. |
ray_object_store_available_memory | gauge | bytes | Amount of memory currently available in the object store. |
ray_object_store_fallback_memory | gauge | bytes | Amount of memory in fallback allocations in the filesystem. |
ray_object_store_memory | gauge | bytes | Object store memory by various sub-kinds on this node. |
ray_object_store_used_memory | gauge | bytes | Used memory in the Ray object store. |
ray_object_store_num_local_objects | gauge | count | Number of objects currently in the object store. |
ray_pull_manager_active_bundles | gauge | count | Number of active bundle requests. |
ray_pull_manager_num_object_pins | count | count | Number of object pin attempts by the pull manager. Can be {Success, Failure}. |
ray_pull_manager_requests | count | count | Number of requested bundles per type {Get, Wait, TaskArgs}. |
ray_pull_manager_retries_total | gauge | count | Number of cumulative pull retries. |
ray_pull_manager_usage_bytes | gauge | bytes | The total number of bytes usage per type {Available, BeingPulled, Pinned}. |
ray_push_manager_chunks | count | count | Number of data chunks pushed by the push manager. |
ray_push_manager_in_flight_pushes | gauge | count | Number of in-flight object push requests. |
ray_resources | gauge | resource | Logical Ray resources by state {AVAILABLE, USED}. |
ray_scheduler_failed_worker_startup_total | gauge | count | Number of tasks that fail to be scheduled because workers were not available. Labels are broken up per reason {JobConfigMissing, RegistrationTimedOut, RateLimited}. |
ray_scheduler_tasks | gauge | count | Number of tasks waiting for scheduling by state {Cancelled, Executing, Waiting, Dispatched, Received}. |
ray_scheduler_unscheduleable_tasks | gauge | count | Number of pending tasks (not scheduleable tasks) by reason {Infeasible, WaitingForResources, WaitingForPlasmaMemory, WaitingForRemoteResources, WaitingForWorkers}. |
ray_serve_controller_control_loop_duration_s | gauge | s | The duration of the last control loop, in seconds. |
ray_serve_deployment_queued_queries | gauge | count | The current number of queries to this deployment waiting to be assigned to a replica. |
ray_serve_deployment_replica_healthy | gauge | boolean | Tracks whether this deployment replica is healthy. 1 means healthy, 0 means unhealthy. |
ray_serve_num_deployment_http_error_requests | gauge | count | The number of non-200 HTTP responses returned by each deployment. |
ray_serve_num_http_error_requests | gauge | count | The number of non-200 HTTP responses. |
ray_serve_num_http_requests | gauge | count | The number of HTTP requests processed. |
ray_spill_manager_objects | gauge | count | Number of local objects by state {Pinned, PendingRestore, PendingSpill}. |
ray_spill_manager_objects_bytes | gauge | bytes | Byte size of local objects by state {Pinned, PendingSpill}. |
ray_spill_manager_request_total | gauge | count | Number of {spill, restore} requests. |
ray_tasks | gauge | count | Number of tasks currently in a particular state. |
ray_worker_register_time_ms | histogram | ms | End-to-end latency of registering worker processes. |
Attributes
Learn about the available attributes for Ray clusters.
The following attributes are available for all supported Ray metrics:
- host_kernel_release
- host_physical_cpus
- server.address
- host_cpu_cores
- host_kernel_version
- SessionName
- net.host.port
- http.scheme
- url.scheme
- service.instance.id
- server.port
- k8s.node.name
- sf_environment
- node_type
- Version
- host_kernel_name
- host.name
- deployment.environment
- host_machine
- net.host.name
- k8s.cluster.name
- os.type
- host_mem_total
- host_logical_cpus
- sf_service
- host_processor
- service.name
Troubleshoot
Learn how to get help if you can't see your data in Splunk Observability Cloud.
If you are a Splunk Observability Cloud customer and are not able to see your data in Splunk Observability Cloud, you can get help in the following ways:
- Splunk Observability Cloud customers can submit a case in the Splunk Support Portal or contact Splunk Support.
- Prospective customers and free trial users can ask a question and get answers through community support in the Splunk Community.