Troubleshooting GPU Monitoring
Issue | Possible Cause |
---|---|
No GPU metrics in the UI | `sim.cluster.gpu.enabled=true` is not set on the Controller. |
No GPU metrics in the UI | `gpuMonitoringEnabled: true` is missing in the Cluster Agent spec. |
No GPU metrics in the UI | Machine Agent environment variables are misconfigured. |
Cross-node metric mixing | The Service is missing `internalTrafficPolicy: Local`. |
DNS resolution failure | Machine Agent pods cannot resolve `nvidia-dcgm-exporter.gpu-operator.svc.cluster.local`. Verify DNS settings. |
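
Two of the causes in the table can be checked with a small script. The following is a minimal sketch, not an official diagnostic tool: the helper names `can_resolve` and `has_local_traffic_policy` are hypothetical, and it assumes Python is available wherever the check runs. The Service check expects a manifest already parsed into a dictionary, for example the JSON printed by `kubectl get service <name> -o json`.

```python
import socket

# Hostname taken from the DNS row of the table above.
DCGM_EXPORTER_HOST = "nvidia-dcgm-exporter.gpu-operator.svc.cluster.local"


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves to at least one address.

    The resolver argument is a hypothetical hook so the check can be
    exercised without a live cluster DNS.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except OSError:
        # socket.gaierror (a subclass of OSError) signals a lookup failure.
        return False


def has_local_traffic_policy(service_manifest):
    """Return True if a parsed Service manifest sets internalTrafficPolicy: Local."""
    spec = service_manifest.get("spec") or {}
    return spec.get("internalTrafficPolicy") == "Local"
```

Calling `can_resolve(DCGM_EXPORTER_HOST)` from inside a Machine Agent pod should return `True` when cluster DNS is healthy. A `False` result from `has_local_traffic_policy` on the exporter Service manifest points to the cross-node metric mixing cause.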