3.1 AI Data Center Management and Monitoring Essentials

What the exam tests

Out-of-band management, BMC/IPMI, NVIDIA Base Command Platform, and the tools used to manage large-scale AI infrastructure day-to-day.

Two management planes

┌────────────────────────────────────────────────────────────────┐
│  In-Band Management                                             │
│  • Uses production network (same cables as workload traffic)    │
│  • Requires the server OS to be running                         │
│  • Tools: SSH, SNMP, Kubernetes API, Slurm CLI                  │
│                                                                  │
│  Out-of-Band (OOB) Management                                   │
│  • Separate dedicated management network                        │
│  • Works regardless of server OS state                          │
│  • Can power cycle, image, diagnose before OS loads             │
│  • Tools: BMC, IPMI, Redfish, iDRAC (Dell), iLO (HPE)          │
└────────────────────────────────────────────────────────────────┘

BMC — Baseboard Management Controller

BMC is a dedicated microcontroller on every server motherboard that provides out-of-band management capability.

What BMC enables

Power control: Power on/off, hard reset, graceful shutdown — even when OS is hung
Serial console access (SOL): See server boot messages and OS console remotely
Hardware sensor monitoring: CPU/memory/fan/power sensor readings
Remote firmware updates: BIOS/UEFI updates without physical access
Hardware event logging: SEL (System Event Log) records hardware errors
Remote media mounting: Boot from ISO image over network

BMC protocols

OOB network design

Dedicated 1GbE management port per server (separate physical NIC from production)
Connected to a separate management switch (VLAN-isolated or physically separate)
Accessible from management VLAN only — not routable from production
Provides access even during kernel panic, BIOS POST, or physical maintenance

NVIDIA Base Command Platform

NVIDIA Base Command Platform is the AI infrastructure management software for DGX-based clusters.

What it provides

Cluster management: Provision, configure, and monitor DGX systems at scale
Job scheduling: Integrated workload scheduler for AI training jobs
Software lifecycle: Deploy and update OS, drivers, CUDA, NGC containers across the cluster
Monitoring: GPU utilization, health, alerts, capacity planning dashboards
Multi-cluster support: Manage multiple DGX clusters from a single pane of glass

Relationship to NGC

Base Command Platform orchestrates jobs that run NGC containers. You define a training job pointing at an NGC container and dataset; Base Command handles scheduling, resource allocation, and monitoring.

Who uses it

Enterprise customers with DGX H100/B200 clusters. It’s NVIDIA’s answer to the operational complexity of managing hundreds of DGX systems running AI workloads 24/7.

Data Center Management Tools

DCIM — Data Center Infrastructure Management

DCIM platforms (e.g., Sunbird, Nlyte, Vertiv) provide:

Asset inventory (physical rack location, connections)
Power monitoring per rack/server
Cooling/thermal management
Capacity planning
Change management

For AI data centers, DCIM integration with GPU-level telemetry (via DCGM) creates a full stack from physical power consumption to per-GPU compute utilization.

Monitoring stack

GPU Hardware Metrics
       │
       ▼
DCGM (Data Center GPU Manager)
       │
       ▼
DCGM Exporter (Prometheus format)
       │
       ▼
Prometheus (time-series collection)
       │
       ▼
Grafana (dashboards & alerting)

See 3.3 GPU Monitoring for full DCGM details.

Management network topology

                ┌─────────────────┐
                │  Management     │
                │  Workstation    │
                └────────┬────────┘
                         │
                ┌────────┴────────┐
                │   OOB Switch    │  ← 1GbE dedicated
                │  (mgmt-only)    │
                └────────┬────────┘
              ┌──────────┴───────────┐
    ┌─────────┴──┐            ┌──────┴──────┐
    │  BMC (DGX1) │  ...      │  BMC (DGX N) │
    └─────────────┘            └─────────────┘
    (powered, always            (powered, always
     accessible)                 accessible)

Even during BIOS POST, OS crash, or network card failure — the BMC is reachable via the dedicated OOB network.

Self-check questions

What is the difference between in-band and out-of-band management?
What can you do with a BMC that you cannot do via SSH into the OS?
What modern protocol is replacing IPMI for BMC communication?
Which NVIDIA platform manages DGX clusters including job scheduling and software lifecycle?
Why is a dedicated OOB network required rather than using the production network for BMC access?

Answers

1. In-band management uses the production network and requires the server OS to be running. Out-of-band management uses a separate dedicated network (connected to the BMC) and works regardless of OS state — even if the server is powered off, crashed, or in BIOS.
2. With BMC you can: power cycle/reset the server even if the OS is hung; access the serial console during boot before the OS loads; read hardware sensors (temperature, fan, power) independently of OS; update firmware remotely; mount virtual media over the network for OS reinstallation.
3. Redfish — a modern REST/JSON-based API defined by DMTF that replaces the older binary IPMI protocol.
4. NVIDIA Base Command Platform.
5. If the production network is used for both workload traffic and management, a network failure or misconfiguration could cut off management access entirely. A dedicated OOB network ensures management access is always available regardless of production network state.