Configure the Prometheus receiver to collect Ray cluster metrics
Learn how to configure the Prometheus receiver to collect Ray cluster metrics.
You can monitor the performance of large language model (LLM) applications that run on a Ray cluster by configuring the cluster to send metrics to Splunk Observability Cloud. This solution uses the Prometheus receiver to collect metrics from Ray, which exposes a /metrics endpoint that publishes Prometheus-compatible metrics.
Complete the following high-level steps to collect metrics from and monitor your Ray cluster applications.
Prerequisites
Learn about the prerequisites for configuring the Prometheus receiver to collect Ray cluster metrics.
To configure the Prometheus receiver to collect metrics from a Ray cluster, you must first deploy Ray locally or on a cloud server.
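If you only need a quick local deployment for testing, the following Python sketch starts Ray with an explicit metrics export port. The port value 8080 is an assumption for this example; use any free port and make sure it matches the scrape target you later configure in the Collector.

```python
# Minimal local Ray deployment sketch; assumes Ray is installed,
# for example with `pip install "ray[default]"`.
import ray

# Start a local Ray instance and expose its Prometheus-compatible metrics.
# The port below is an example value and must match the Collector's scrape target.
ray.init(_metrics_export_port=8080)
```

You can also start a head node from the command line with `ray start --head --metrics-export-port=8080`, which exposes the same endpoint without going through Python.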
Configure and activate the component for Ray
Learn how to configure and activate the component for Ray.
- Deploy the Splunk Distribution of the OpenTelemetry Collector to your host or container platform.
- To activate the Prometheus receiver for your Ray cluster manually in the Collector configuration, add the receiver to your configuration file, as shown in the example sketch after this list.
- Restart the Splunk Distribution of the OpenTelemetry Collector.
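The following configuration fragment is a minimal sketch rather than a definitive setup. It assumes that Ray exposes its metrics on port 8080 of the same host as the Collector and that the signalfx exporter from the default Splunk Distribution of the OpenTelemetry Collector configuration is available; adjust the scrape target, interval, and exporter for your environment.

```yaml
receivers:
  prometheus/ray:
    config:
      scrape_configs:
        - job_name: ray
          scrape_interval: 10s
          static_configs:
            # Assumption: Ray publishes metrics on localhost:8080.
            # Point this at the metrics export port of each Ray node you want to scrape.
            - targets: ["localhost:8080"]

service:
  pipelines:
    metrics:
      receivers: [prometheus/ray]
      # Assumes the signalfx exporter is already defined in the default agent configuration.
      exporters: [signalfx]
```

Using a named receiver instance such as prometheus/ray keeps the Ray scrape job separate from any other Prometheus scrape configuration the Collector already runs.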
Monitor the performance of Ray cluster applications
Learn how to navigate to the Ray navigator, which you can use to monitor the performance of Ray cluster applications.
Complete the following steps to access the Ray navigator and monitor the performance of Ray cluster applications. For more information on navigators, see Use navigators.
- From the Splunk Observability Cloud main menu, select Infrastructure.
- Under AI/ML, select AI Frameworks.
- Select the Ray summary card.
Configuration settings
Learn about the configuration settings for the Prometheus receiver.
To view the configuration options for the Prometheus receiver, see Settings.
Metrics
Learn about the available metrics for Ray cluster applications.
Metric name | Type | Unit | Description |
---|---|---|---|
ray_actors | gauge | count | The number of actor processes. |
ray_cluster_active_nodes | gauge | count | The number of active Ray nodes in the cluster. |
ray_component_cpu_percentage | gauge | percent | Total CPU usage of the components on a node. |
ray_component_mem_shared_bytes | gauge | bytes | SHM usage of all components of the node. Equivalent to the top command's SHR column. |
ray_component_rss_mb | gauge | megabytes | RSS usage of all components on the node. |
ray_gcs_placement_group_count | gauge | count | Number of placement groups broken down by state {Registered, Pending, Infeasible}. |
ray_gcs_storage_operation_count_total | cumulative counter | counter | Number of operations invoked on the Ray Global Control Store (GCS) storage. |
ray_gcs_storage_operation_latency_ms | histogram | ms | Time to invoke an operation on the Ray Global Control Store (GCS) storage. |
ray_gcs_task_manager_task_events_dropped | gauge | count | Number of task events dropped per type {PROFILEEVENT, STATUSEVENT}. |
ray_grpc_server_req_finished_total | cumulative counter | counter | Number of finished requests in the grpc server. |
ray_grpc_server_req_handling_total | cumulative counter | counter | Number of requests being handled in the grpc server. |
ray_grpc_server_req_new_total | cumulative counter | counter | Number of new requests in the grpc server. |
ray_grpc_server_req_process_time_ms | histogram | ms | Request latency in grpc server. |
ray_internal_num_infeasible_scheduling_classes | gauge | count | The number of unique scheduling classes that are infeasible. |
ray_internal_num_processes_skipped_job_mismatch | gauge | count | The total number of cached workers skipped due to job mismatch. |
ray_internal_num_processes_skipped_runtime_environment_mismatch | gauge | count | The total number of cached workers skipped due to runtime environment mismatch. |
ray_internal_num_processes_started | count | count | The number of Ray worker processes started. |
ray_internal_num_processes_started_from_cache | gauge | count | The total number of workers started from a cached worker process. |
ray_internal_num_spilled_tasks | gauge | count | The cumulative number of lease requests that this raylet has spilled to other raylets. |
ray_node_cpu_count | gauge | cpu | Total CPUs available on a Ray node. |
ray_node_cpu_utilization | gauge | percent | Total CPU usage on a Ray node. |
ray_node_disk_io_read_speed | gauge | bytes/s | Disk read speed. |
ray_node_disk_io_write_count | gauge | operations | Total write ops to disk. |
ray_node_disk_io_write_speed | gauge | bytes/s | Disk write speed. |
ray_node_disk_read_iops | gauge | operations/s | Disk read IOPS. |
ray_node_disk_utilization_percentage | gauge | percent | Total disk utilization (percentage) on a Ray node. |
ray_node_disk_write_iops | gauge | operations/s | Disk write IOPS. |
ray_node_mem_total | gauge | bytes | Total memory on a Ray node. |
ray_node_mem_used | gauge | bytes | Memory usage on a Ray node. |
ray_node_network_received | gauge | bytes | Total network received. |
ray_node_network_send_speed | gauge | bytes/s | Network send speed. |
ray_node_network_sent | gauge | bytes | Total network sent. |
ray_object_directory_added_locations | gauge | location | Number of object locations added per second. If this is high, a lot of objects have been added on this node. |
ray_object_directory_lookups | gauge | count | Number of object location lookups per second. If this is high, the raylet is waiting on a high number of objects. |
ray_object_directory_removed_locations | gauge | count | Number of object locations removed per second. If this is high, a high number of objects have been removed from this node. |
ray_object_directory_subscriptions | gauge | count | Number of object location subscriptions. If this is high, the raylet is attempting to pull a high number of objects. |
ray_object_directory_updates | gauge | count | Number of object location updates per second. If this is high, the raylet is attempting to pull a high number of objects and/or the locations for objects are frequently changing (e.g. due to many object copies or evictions). |
ray_object_manager_bytes | gauge | bytes | Number of bytes pushed or received by type {PushedFromLocalPlasma, PushedFromLocalDisk, Received}. |
ray_object_manager_num_pull_requests | gauge | count | Number of active pull requests for objects. |
ray_object_manager_received_chunks | gauge | count | Number of object chunks received by type {Total, FailedTotal, FailedCancelled, FailedPlasmaFull}. |
ray_object_store_available_memory | gauge | bytes | Amount of memory currently available in the object store. |
ray_object_store_fallback_memory | gauge | bytes | Amount of memory in fallback allocations in the filesystem. |
ray_object_store_memory | gauge | bytes | Object store memory by various sub-kinds on this node. |
ray_object_store_used_memory | gauge | bytes | Used memory in the Ray object store. |
ray_object_store_num_local_objects | gauge | count | Number of objects currently in the object store. |
ray_pull_manager_active_bundles | gauge | count | Number of active bundle requests. |
ray_pull_manager_num_object_pins | count | count | Number of object pin attempts by the pull manager. Can be {Success, Failure}. |
ray_pull_manager_requests | count | count | Number of requested bundles per type {Get, Wait, TaskArgs}. |
ray_pull_manager_retries_total | gauge | count | Number of cumulative pull retries. |
ray_pull_manager_usage_bytes | gauge | bytes | The total number of bytes usage per type {Available, BeingPulled, Pinned}. |
ray_push_manager_chunks | count | count | Number of data chunks pushed by the push manager. |
ray_push_manager_in_flight_pushes | gauge | count | Number of in-flight object push requests. |
ray_resources | gauge | resource | Logical Ray resources by state {AVAILABLE, USED}. |
ray_scheduler_failed_worker_startup_total | gauge | count | Number of tasks that fail to be scheduled because workers were not available. Labels are broken up per reason {JobConfigMissing, RegistrationTimedOut, RateLimited}. |
ray_scheduler_tasks | gauge | count | Number of tasks waiting for scheduling by state {Cancelled, Executing, Waiting, Dispatched, Received}. |
ray_scheduler_unscheduleable_tasks | gauge | count | Number of pending tasks (not scheduleable tasks) by reason {Infeasible, WaitingForResources, WaitingForPlasmaMemory, WaitingForRemoteResources, WaitingForWorkers}. |
ray_serve_controller_control_loop_duration_s | gauge | s | The duration of the last control loop, in seconds. |
ray_serve_deployment_queued_queries | gauge | count | The current number of queries to this deployment waiting to be assigned to a replica. |
ray_serve_deployment_replica_healthy | gauge | boolean | Tracks whether this deployment replica is healthy. 1 means healthy, 0 means unhealthy. |
ray_serve_num_deployment_http_error_requests | gauge | count | The number of non-200 HTTP responses returned by each deployment. |
ray_serve_num_http_error_requests | gauge | count | The number of non-200 HTTP responses. |
ray_serve_num_http_requests | gauge | count | The number of HTTP requests processed. |
ray_spill_manager_objects | gauge | count | Number of local objects by state {Pinned, PendingRestore, PendingSpill}. |
ray_spill_manager_objects_bytes | gauge | bytes | Byte size of local objects by state {Pinned, PendingSpill}. |
ray_spill_manager_request_total | gauge | count | Number of {spill, restore} requests. |
ray_tasks | gauge | count | Number of tasks currently in a particular state. |
ray_worker_register_time_ms | histogram | ms | End-to-end latency of registering worker processes. |
Attributes
Learn about the available attributes for Ray clusters.
The following attributes are available for all supported Ray metrics:
- host_kernel_release
- host_physical_cpus
- server.address
- host_cpu_cores
- host_kernel_version
- SessionName
- net.host.port
- http.scheme
- url.scheme
- service.instance.id
- server.port
- k8s.node.name
- sf_environment
- node_type
- Version
- host_kernel_name
- host.name
- deployment.environment
- host_machine
- net.host.name
- k8s.cluster.name
- os.type
- host_mem_total
- host_logical_cpus
- sf_service
- host_processor
- service.name
Troubleshoot
Learn how to get help if you can't see your data in Splunk Observability Cloud.
If you are a Splunk Observability Cloud customer and are not able to see your data in Splunk Observability Cloud, you can get help in the following ways:
- Splunk Observability Cloud customers can submit a case in the Splunk Support Portal or contact Splunk Support.
- Prospective customers and free trial users can ask a question and get answers through community support in the Splunk Community.