3.3 GPU Monitoring Key Measures
What the exam tests
DCGM’s role, the key GPU metrics operators watch, and how the monitoring stack (DCGM → Prometheus → Grafana) is assembled.
DCGM — Data Center GPU Manager
DCGM (NVIDIA Data Center GPU Manager) is the primary tool for monitoring and managing GPU health in data center environments.
What DCGM provides
- Health monitoring: Continuously checks GPU health and reports errors
- Diagnostics: Run active diagnostics (memory tests, compute stress tests) to validate GPUs
- Telemetry collection: Collect hundreds of metrics per GPU in real time
- Job statistics: Per-job GPU utilization, memory, power (when integrated with Slurm/Kubernetes)
- Policy management: Set GPU operating modes, power limits, clock settings
- NVML binding: Built on the NVIDIA Management Library (NVML), the same library nvidia-smi uses
DCGM deployment
┌─────────────────────────────────────────────────────────┐
│ Each GPU node │
│ │
│ ┌──────────────┐ ┌───────────────────────────────┐ │
│ │ GPU Driver │◄───│ DCGM daemon │ │
│ │ (NVML) │ │ (datacenter-gpu-manager.service)│ │
│ └──────────────┘ └───────────────┬───────────────┘ │
│ │ metrics │
│ ┌──────┴──────┐ │
│ │ DCGM Exporter│ │
│ │ (Prometheus │ │
│ │ format) │ │
│ └──────┬──────┘ │
└──────────────────────────────────────┼──────────────────┘
│ scrape
┌────┴────┐
│Prometheus│
└────┬────┘
│
┌────┴────┐
│ Grafana │
└─────────┘
DCGM Exporter
- Runs as a sidecar container or DaemonSet on Kubernetes GPU nodes
- Translates DCGM metrics into Prometheus exposition format
- Prometheus scrapes the exporter endpoint; Grafana visualizes
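To make the "Prometheus exposition format" concrete, here is a minimal sketch of reading exporter-style text in Python. The sample payload, its values, and the `gpu` and `Hostname` labels are illustrative assumptions rather than captured exporter output, and the parser ignores optional timestamps and escaped label values.

```python
import re

# Illustrative sample only: values and labels are made up for this sketch.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 42
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 71
"""

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_exposition(SAMPLE)
```

In a real stack Prometheus does this parsing itself; the sketch only shows what the scraped text carries: a metric name, a label set identifying the GPU, and a numeric value.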
Key GPU metrics to monitor
Utilization metrics
| Metric | What it measures | Healthy value |
|---|---|---|
| GPU Utilization (%) | Percentage of time any kernel is executing on the GPU (SM active) | > 80% during training |
| SM Active (%) | Percentage of SMs executing a warp — finer grain than GPU utilization | > 80% for compute-bound |
| SM Occupancy | Ratio of active warps to maximum possible warps | Workload-dependent |
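| Memory Utilization (%) | Fraction of GPU VRAM in use (distinct from the nvidia-smi "memory utilization" counter, which reports the share of time device memory was being read or written) | < 95% (leave headroom) |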
| Tensor Active (%) | Fraction of time Tensor Cores are executing | Key for AI workloads |
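The difference between GPU Utilization and SM Active is easiest to see with numbers. The sketch below uses a simplified model (an assumption for illustration, not how the driver samples counters): each sample records how many SMs were busy, and the GPU is taken to have 128 SMs.

```python
def gpu_utilization(kernel_running_samples):
    """Percent of samples in which at least one kernel was running."""
    return 100.0 * sum(kernel_running_samples) / len(kernel_running_samples)


def sm_active(busy_sms_per_sample, total_sms):
    """Average percent of SMs that had work, across all samples."""
    return 100.0 * sum(busy_sms_per_sample) / (len(busy_sms_per_sample) * total_sms)


# Hypothetical workload: a single small kernel keeps 1 SM busy in every sample.
busy = [1] * 10
util = gpu_utilization([n > 0 for n in busy])
active = sm_active(busy, total_sms=128)
```

Here `util` is 100.0 while `active` is under 1, which is why a GPU can look fully utilized while most of its SMs are idle.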
Common causes of GPU utilization below 80%:
- CPU is bottlenecking data preprocessing (DALI helps)
- Network is bottlenecking gradient exchange (bandwidth/latency issue)
- Storage is bottlenecking data loading (need faster parallel FS)
- Job is memory-bound or issues too many small operations (launch overhead leaves the GPU idle between kernels)
Memory and bandwidth metrics
| Metric | Description |
|---|---|
| GPU Memory Used (MB) | Absolute VRAM consumption |
| Memory Bandwidth Utilization (%) | HBM read/write bandwidth vs peak |
| NVLink Bandwidth (GB/s) | Throughput on each NVLink port (Tx and Rx) |
| PCIe Bandwidth | Data transfer between CPU and GPU via PCIe |
Power and thermal metrics
| Metric | Description | Alert threshold |
|---|---|---|
| Power Draw (W) | Current GPU power consumption | Near TDP (e.g., 700W for H100 SXM) |
| GPU Temperature (°C) | Die temperature | > 85°C is warning; > 90°C is critical |
| Memory Temperature (°C) | HBM temperature | Typically follows die temp |
| Fan Speed (RPM) | Active fan speed (PCIe cards) | Rising fan = rising temperature |
Thermal throttling: GPUs automatically reduce clock speeds when the die temperature exceeds a model-dependent threshold (roughly 83–90°C). This silently reduces training performance, so temperature monitoring is critical for detecting cooling issues early.
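The temperature thresholds in the table can be expressed as a small classifier. This is a sketch: the 85°C and 90°C defaults come from the table above, and the function name and return labels are hypothetical. Actual slowdown thresholds vary by GPU model.

```python
def classify_temperature(temp_c, warning_c=85.0, critical_c=90.0):
    """Map a GPU die temperature to an alert level using the table's thresholds."""
    if temp_c > critical_c:
        return "critical"
    if temp_c > warning_c:
        return "warning"
    return "ok"


# Hypothetical readings from four GPUs in one node.
levels = [classify_temperature(t) for t in (62.0, 84.5, 87.0, 93.0)]
```

Keeping the thresholds as parameters makes it easy to tighten them for a specific GPU model or cooling design.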
Error metrics
| Metric | Description |
|---|---|
| ECC Single-bit Errors (SBE) | Corrected memory errors — high rates indicate potential hardware failure |
| ECC Double-bit Errors (DBE) | Uncorrectable memory errors — page retirement required; escalate to hardware replacement |
| XID Errors | GPU driver-level error codes; XID 79 = GPU fallen off the bus; XID 48 = ECC DBE |
| NVLink Errors | Errors on NVLink lanes — may indicate cable or connector issue |
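XID events are written to the kernel log as `NVRM: Xid` messages, so a short script can pull the code out and attach a meaning. The sketch below maps only the two codes named in the table; the sample log lines (PCI address, pid, wording) are illustrative assumptions, and real messages vary by driver version.

```python
import re

# Only the two codes named in the table above; other codes are reported as unknown.
XID_MEANINGS = {
    48: "ECC double-bit error",
    79: "GPU has fallen off the bus",
}

XID_RE = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def parse_xid(line):
    """Return (device, code, meaning) for an Xid log line, or None if no match."""
    match = XID_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    meaning = XID_MEANINGS.get(code, "unknown (consult NVIDIA's XID documentation)")
    return match.group("device"), code, meaning


# Illustrative log lines, not captured from a real system.
event = parse_xid("NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.")
other = parse_xid("NVRM: Xid (PCI:0000:af:00): 31, pid=4321")
```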
NVLink bandwidth monitoring
For multi-GPU nodes, NVLink bandwidth is a critical metric during distributed training:
During all-reduce (gradient sync):
- Expected: NVLink traffic approaching the NVLink 4 peak of 900 GB/s per GPU (total bidirectional, about 450 GB/s in each direction)
- Actual < 80% of peak → investigation needed:
  - Is the job using tensor parallelism correctly?
  - Are there NVLink health errors?
  - Is gradient compression applied?
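The 80%-of-peak rule above is simple arithmetic. In this sketch the peak is passed in explicitly, so the caller must use a figure that matches how the measurement was taken (per direction or bidirectional). The function name and the example numbers are hypothetical.

```python
def nvlink_check(measured_gbps, peak_gbps, min_fraction=0.80):
    """Return (fraction_of_peak, needs_investigation) for one GPU's NVLink traffic."""
    if peak_gbps <= 0:
        raise ValueError("peak_gbps must be positive")
    fraction = measured_gbps / peak_gbps
    return fraction, fraction < min_fraction


# Hypothetical per-direction measurements against a 450 GB/s per-direction peak.
slow_fraction, slow_flag = nvlink_check(300.0, 450.0)
good_fraction, good_flag = nvlink_check(400.0, 450.0)
```

Real all-reduce traffic rarely sits at the theoretical peak for the whole step, so the flag is a prompt to look closer, not proof of a fault.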
Common monitoring scenarios
Scenario 1: Training job slower than expected
Check in order:
1. GPU Utilization: is it < 70%? → CPU/data pipeline bottleneck
2. NVLink Bandwidth: is it < expected? → communication bottleneck
3. SM Active %: high utilization but slow? → memory-bandwidth bound workload
4. GPU Temperature: is throttling reducing clocks?
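The checklist above can be walked in code, in the same order. This is a rough sketch under stated assumptions: the 70% utilization and 85°C thresholds come from this section, while the 80% NVLink fraction and the 50% SM Active cutoff are hypothetical defaults chosen for illustration.

```python
def triage_slow_training(gpu_util, nvlink_fraction, sm_active, temp_c,
                         util_min=70.0, nvlink_min=0.80,
                         sm_active_min=50.0, temp_max=85.0):
    """Return likely causes, in the order the checklist examines them."""
    findings = []
    if gpu_util < util_min:
        findings.append("CPU/data pipeline bottleneck")
    if nvlink_fraction < nvlink_min:
        findings.append("communication bottleneck")
    # High GPU utilization with low SM activity: the checklist's
    # "high utilization but slow" case.
    if gpu_util >= util_min and sm_active < sm_active_min:
        findings.append("possible memory-bandwidth-bound workload")
    if temp_c > temp_max:
        findings.append("possible thermal throttling")
    return findings


# Hypothetical readings for a job that is slow despite high GPU utilization.
report = triage_slow_training(gpu_util=95.0, nvlink_fraction=0.6,
                              sm_active=30.0, temp_c=88.0)
```

The output is a list of leads to investigate, not a diagnosis; several causes can apply at once.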
Scenario 2: Unexpected GPU errors
- Check XID errors in DCGM or the kernel log (NVRM: Xid messages), and use nvidia-smi -q for ECC and retired-page counters
- Run DCGM diagnostics: dcgmi diag -r 3 (level 3 = long GPU stress test)
- Check ECC double-bit errors → page retirement or GPU replacement
- Review NVLink error counters
Scenario 3: Power/cooling alert
- Per-GPU power draw near TDP across all GPUs → data center power limit may be reached
- Temperature > 85°C → check cooling system; verify airflow; check DLC flow
- Throttling events in DCGM → temporary clock reduction → performance degradation
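The power check in this scenario amounts to summing per-GPU draw and comparing it with a budget. In the sketch below, the node budget, the 95% "near TDP" cutoff, and the sample readings are all hypothetical; the 700 W TDP is the H100 SXM figure quoted earlier in this section.

```python
def power_summary(per_gpu_watts, tdp_watts, budget_watts, near_tdp_fraction=0.95):
    """Summarize node power: total draw, GPUs near TDP, and budget status."""
    total = sum(per_gpu_watts)
    near_tdp = [
        index for index, watts in enumerate(per_gpu_watts)
        if watts >= near_tdp_fraction * tdp_watts
    ]
    return {
        "total_watts": total,
        "gpus_near_tdp": near_tdp,
        "over_budget": total > budget_watts,
    }


# Hypothetical 4-GPU node with a 2400 W GPU power budget.
summary = power_summary([690.0, 695.0, 400.0, 700.0], tdp_watts=700.0, budget_watts=2400.0)
```

If every GPU appears in `gpus_near_tdp` at once, the constraint is more likely the facility power or cooling envelope than any single device.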
Key DCGM field IDs (exam reference)
| DCGM Field | Metric name |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization (% of time device memory was being read or written) |
| DCGM_FI_DEV_FB_USED | Framebuffer (GPU memory) used (MiB) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W) |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Total NVLink bandwidth |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | Single-bit ECC errors (volatile) |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Double-bit ECC errors (volatile) |
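When the exporter publishes these fields, they become Prometheus metric names that can be queried through the Prometheus HTTP API (`/api/v1/query`). The sketch below only builds the query URL and makes no network call. The server address is a placeholder, the helper name is hypothetical, and whether a given field is actually exported depends on the exporter's configuration.

```python
from urllib.parse import urlencode

# Field names from the table above.
KNOWN_FIELDS = (
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_MEM_COPY_UTIL",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_FI_DEV_GPU_TEMP",
    "DCGM_FI_DEV_POWER_USAGE",
    "DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL",
    "DCGM_FI_DEV_ECC_SBE_VOL_TOTAL",
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL",
)


def instant_query_url(base_url, field, aggregate="avg"):
    """Build a Prometheus instant-query URL such as avg(DCGM_FI_DEV_GPU_UTIL)."""
    if field not in KNOWN_FIELDS:
        raise KeyError(f"unknown DCGM field: {field}")
    query = f"{aggregate}({field})"
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


# Placeholder Prometheus address, for illustration only.
url = instant_query_url("http://prometheus.example:9090/", "DCGM_FI_DEV_GPU_UTIL")
```

The same PromQL expression can be pasted into a Grafana panel, which is how these field names usually end up on a dashboard.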
Self-check questions
1. What does DCGM stand for and what are its four main capabilities?
2. What is the role of the DCGM Exporter in a Kubernetes monitoring stack?
3. What does GPU Utilization (%) actually measure?
4. What is the significance of ECC double-bit errors?
5. At what GPU temperature should an operator investigate cooling issues?
Answers
1. Data Center GPU Manager. Capabilities: health monitoring, active diagnostics, telemetry collection, and policy management (clock/power limit configuration).
2. DCGM Exporter translates DCGM metrics into Prometheus exposition format and serves them on an HTTP endpoint. Prometheus scrapes this endpoint on each node; Grafana queries Prometheus to create dashboards and alerts.
3. GPU Utilization (%) measures the percentage of time over a sample period that at least one kernel is executing on the GPU (SM is active). It does NOT measure how fully utilized the SMs are — a single kernel running on 1 of 128 SMs still shows 100% GPU utilization. SM Active % provides finer granularity.
4. ECC double-bit errors (DBE) are uncorrectable memory errors. The GPU hardware cannot fix them (unlike single-bit errors). Affected memory pages must be retired, and persistent DBE errors indicate hardware failure requiring GPU replacement.
5. > 85°C is warning level; > 90°C is critical — investigate immediately. Thermal throttling begins around 83–87°C (GPU-dependent) and will silently reduce performance by reducing GPU clocks.