GPU Metrics
Once configured, the DCGM Exporter automatically collects GPU metrics and sends them to the Server Dashboard. Key metrics include:
Metric | Description |
---|---|
GPU Utilization (%) | Percentage of time the GPU actively executes compute kernels. |
GPU Memory Utilization (%) | Percentage of GPU memory in use (used / total × 100). |
GPU PCIe Tx Throughput | Outbound PCIe bandwidth from GPU to the host. |
GPU Power Usage (W) | Instantaneous power draw of the GPU. |
GPU PCIe Rx Throughput | Inbound PCIe bandwidth from host to GPU. |
GPU Temperature (°C) | Current core temperature of the GPU. |
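
The memory utilization formula in the table above (used / total × 100) can be sketched as follows. This is a minimal illustration, not the exporter's own code: the function name `memory_utilization_percent` and the MiB inputs are assumptions made for the example.

```python
def memory_utilization_percent(used_mib: float, total_mib: float) -> float:
    """Return GPU memory in use as a percentage of total GPU memory.

    Hypothetical helper: both arguments are assumed to be in the same
    unit (MiB here), so the unit cancels out in the ratio.
    """
    if total_mib <= 0:
        raise ValueError("total_mib must be positive")
    return used_mib / total_mib * 100.0


# A GPU with 40960 MiB of memory, of which 20480 MiB is in use.
print(memory_utilization_percent(20480, 40960))  # 50.0
```

Rejecting a non-positive total avoids a division by zero when a device reports no memory, for example while it is still initializing.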
The following GPU metrics are available on the Cluster, Pods, and Container dashboards:
Metric | Description | Scope |
---|---|---|
Total GPUs | Total count of GPU devices detected across all nodes in the cluster. | CLUSTER |
Active GPUs | Number of GPUs currently processing workloads. | CLUSTER |
Idle GPUs | Number of GPUs in the cluster that have shown no utilization for a threshold period of time. | CLUSTER |
GPU Limit (%) | Total GPU compute capacity of the cluster (all cluster GPUs), expressed as a percentage. | CLUSTER |
GPU Used (%) | Sum of actual GPU usage across pods, as a percentage of total cluster GPUs. | CLUSTER |
GPU Request (%) | Sum of GPU resource requests by pods, as a percentage of total cluster GPUs. | CLUSTER |
GPU Memory Limit (%) | Total GPU memory capacity of the cluster, expressed as a percentage. | CLUSTER |
GPU Memory Used (%) | Sum of actual GPU memory usage across pods, as a percentage of total cluster GPU memory. | CLUSTER |
GPU Memory Request (%) | Sum of GPU memory resource requests by pods, as a percentage of total cluster GPU memory. | CLUSTER |
GPU % | Percentage of available GPU compute capacity currently used by a pod (with respect to total node capacity). | POD |
GPU Memory % | Percentage of total GPU memory in use by a pod (with respect to total node capacity). | POD |
GPU Utilization (%) | Percentage of time a container’s GPU was actively processing compute work (with respect to total node capacity). | CONTAINER |
GPU Memory Utilization (%) | Percentage of a container’s GPU memory in use (with respect to total node capacity). | CONTAINER |
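
To make the cluster-scope definitions above more concrete, the sketch below derives Total, Active, and Idle GPU counts and a GPU Used (%) value from per-GPU utilization samples. It is only an illustration under stated assumptions: the function `summarize_cluster`, the `idle_threshold` parameter, the rule that a GPU is idle when every sample in the window is at or below the threshold, and the rule that every non-idle GPU counts as active are all hypothetical. The dashboards' actual threshold and aggregation logic are not described here.

```python
from typing import Dict, List


def summarize_cluster(
    samples: Dict[str, List[float]],
    idle_threshold: float = 0.0,
) -> Dict[str, float]:
    """Summarize GPU counts and usage for a cluster.

    samples maps a GPU identifier to its utilization readings (0-100)
    over an observation window, oldest first. All names and rules here
    are assumptions made for illustration.
    """
    total = len(samples)
    # Assumed idle rule: every reading in the window is at or below the
    # threshold. A GPU with no readings is also counted as idle.
    idle = sum(
        1
        for readings in samples.values()
        if all(u <= idle_threshold for u in readings)
    )
    # Assumed active rule: any GPU that is not idle.
    active = total - idle
    # Assumed usage rule: sum of the latest readings, divided by the
    # number of GPUs, giving a percentage of total cluster GPUs.
    latest_sum = sum(readings[-1] for readings in samples.values() if readings)
    used_percent = latest_sum / total if total else 0.0
    return {
        "total_gpus": total,
        "active_gpus": active,
        "idle_gpus": idle,
        "gpu_used_percent": used_percent,
    }


window = {
    "gpu-0": [0.0, 0.0, 0.0],
    "gpu-1": [35.0, 80.0, 60.0],
    "gpu-2": [0.0, 5.0, 10.0],
    "gpu-3": [0.0, 0.0, 0.0],
}
print(summarize_cluster(window))
```

With this sample window, two of the four GPUs have non-zero readings, so the sketch reports two active and two idle GPUs.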
For the full list of available metrics, see Metrics Browser.