AI PODs Custom Dashboard

After you configure the AI components to collect metrics, the dashboards are populated with those metrics. The dashboards provide real-time visibility into the operational efficiency of the AI components.

The dashboards chart the following metrics for each AI component:

Components and Metrics
Cisco Nexus data center switches
  • Interfaces configured on each switch
  • Transmit errors on each switch
  • Transmit drops on each switch
  • Receive errors on each switch
  • Receive drops on each switch
  • Receive Multicast & Broadcast packets on each switch
  • Transmitted bytes by Pure Storage and compute node interfaces
  • Received bytes by Pure Storage and compute node interfaces
  • Transmitted bytes by selected interfaces
  • Received bytes by selected interfaces
NVIDIA GPU
  • GPU Utilization
  • GPU Power (W)
  • GPU Memory Used
  • GPU Temperature
  • GPU Power (W)
  • Average GPU Utilization
  • Average GPU Memory Used
  • GPU Memory % Utilization
NVIDIA NIM
  • Running Requests
  • Total Input Tokens
  • Total Output Tokens
  • KV Cache Utilization
  • Time Per Output Token (TPOT)
  • KV Cache Utilization
  • Input vs Output Tokens
  • Time To First Token
  • Finished Requests
  • Time Per Output Token
  • E2E Request Latency
  • Inter-Token Latency (ITL)
  • Time to First Token (TTFT)
  • E2E Request Latency vs Running Requests
  • Throughput (tok/sec)
  • KV Cache Utilization
  • Prompts token/sec
  • Generation tokens/sec
Vector Database
  • Proxy - Collections
  • Proxy - Total Operations
  • Milvus Proxy - Operations by Status
  • Milvus Proxy : Cache Hits
  • Milvus Proxy - Request Latency by Function
  • Milvus Proxy - Operations by Function
  • RootCoord - DDL Operations
  • RootCoord - DDL Channels
  • Milvus Coord - DDL Operations in queue latency
  • RootCoord - Count of ID allocated
Storage
  • PX Cluster Total Storage Capacity
  • PX Cluster Storage Utilization
  • Storage Nodes Online
  • Storage Nodes Offline
  • Total Read & Write IOPS
  • PX Total Read & Write Throughput
  • PX CPU Utilization
  • PX Avg Read & Write Latency
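
The NVIDIA NIM latency metrics listed above (TTFT, ITL, TPOT, E2E Request Latency, and throughput) are related to one another. The following Python sketch shows one commonly used set of definitions, computed from the timestamps of a single request. It is an illustration only: this document does not state the exact formulas behind the dashboard widgets, and the `RequestTrace` record and all function names are hypothetical, not part of NIM or the dashboards.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical timing record for one request (all times in seconds)."""
    arrival: float            # when the request was received
    token_times: List[float]  # when each output token was emitted, ascending


def ttft(trace: RequestTrace) -> float:
    """Time To First Token: arrival until the first output token."""
    return trace.token_times[0] - trace.arrival


def e2e_latency(trace: RequestTrace) -> float:
    """End-to-end latency: arrival until the last output token."""
    return trace.token_times[-1] - trace.arrival


def inter_token_latencies(trace: RequestTrace) -> List[float]:
    """Inter-Token Latency: gap between each pair of consecutive tokens."""
    t = trace.token_times
    return [later - earlier for earlier, later in zip(t, t[1:])]


def tpot(trace: RequestTrace) -> float:
    """Time Per Output Token: average time per token after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # assumption: no per-token time for single-token output
    return (e2e_latency(trace) - ttft(trace)) / (n - 1)


def generation_throughput(trace: RequestTrace) -> float:
    """Output tokens per second over the whole request."""
    return len(trace.token_times) / e2e_latency(trace)


# Example: request arrives at t=0 and emits four tokens.
example = RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
```

Under these definitions, E2E latency equals TTFT plus the sum of the inter-token latencies, so a rising TTFT with a flat TPOT points at queueing or prompt processing rather than at token generation.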

Limitations

The following are current limitations of dashboards based on Infrastructure Visibility metrics:

  • Infrastructure Visibility gathers metrics from each individual node, so the same data may be reported more than once at the cluster level.
  • Metrics such as storage metrics do not include cluster-level aggregates. Infrastructure Visibility reports metrics only for individual hosts.
  • Widgets cannot use metrics that span all pods in a cluster. As a result, custom dashboard widgets for such metrics cannot be created without reverting to the default Cluster Agent dashboard UI to map these Splunk widgets.
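
To make the first two limitations concrete, the following Python sketch shows one way per-node readings could be combined into a cluster-level view outside the dashboards. It assumes a hypothetical `Sample` record and hypothetical metric names; none of these are Infrastructure Visibility types or real metric paths. It also assumes that you know which metrics every node reports identically for the whole cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Sample(NamedTuple):
    """Hypothetical shape of one per-node reading."""
    node: str
    metric: str
    value: float


def cluster_view(samples: Iterable[Sample],
                 cluster_scoped: Set[str]) -> Dict[str, float]:
    """Combine per-node samples into one value per metric.

    Metrics named in `cluster_scoped` are assumed to be repeated by every
    node, so they are counted once. All other metrics are summed across nodes.
    """
    per_metric: Dict[str, Dict[str, float]] = defaultdict(dict)
    for s in samples:
        # Keep only the latest reading per node for each metric.
        per_metric[s.metric][s.node] = s.value

    result: Dict[str, float] = {}
    for metric, by_node in per_metric.items():
        values = list(by_node.values())
        if metric in cluster_scoped:
            # Duplicate cluster-wide figure: take it once instead of summing.
            result[metric] = max(values)
        else:
            # Node-local figure: the cluster aggregate is the sum.
            result[metric] = sum(values)
    return result


# Hypothetical readings from two nodes.
readings = [
    Sample("node-a", "cluster_total_capacity_tib", 100.0),
    Sample("node-b", "cluster_total_capacity_tib", 100.0),
    Sample("node-a", "node_read_iops", 10.0),
    Sample("node-b", "node_read_iops", 15.0),
]
view = cluster_view(readings, cluster_scoped={"cluster_total_capacity_tib"})
```

Summing is only appropriate for additive metrics such as IOPS or throughput. Latency or utilization percentages would need an average or a weighted average instead.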