2.3 Power and Cooling Requirements

What the exam tests

GPU thermal design power (TDP), rack power density tiers, cooling method trade-offs, and PUE as a data center efficiency metric.

GPU power (TDP) by product

GPU	Form Factor	TDP
NVIDIA L4	PCIe	72 W
NVIDIA L40S	PCIe	350 W
NVIDIA H100	PCIe	350 W
NVIDIA H100	SXM5	700 W
NVIDIA B200	SXM	~1,000 W
NVIDIA GB300 NVL72 (full rack)	Rack	~120 kW total rack

Key insight: A DGX H100 (8× H100 SXM5 at 700 W each) draws ~5.6 kW from the GPUs alone (8 × 700 W). Adding CPUs, memory, storage, networking, and fans brings total DGX H100 system power to ~10.2 kW. A GB300 NVL72 full-rack system draws ~120 kW.
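The GPU share of a DGX H100 can be checked with a few lines of arithmetic. This is a minimal sketch using the 700 W TDP from the table and the ~10.2 kW system figure; the variable and function names are illustrative only.

```python
GPU_TDP_W = 700         # H100 SXM5 TDP from the table above
GPUS_PER_SYSTEM = 8     # GPUs in one DGX H100
SYSTEM_MAX_W = 10_200   # approximate DGX H100 system power quoted above


def gpu_power_w(tdp_w: float, count: int) -> float:
    """Power drawn by the GPUs alone, assuming each runs at full TDP."""
    return tdp_w * count


gpu_w = gpu_power_w(GPU_TDP_W, GPUS_PER_SYSTEM)
other_w = SYSTEM_MAX_W - gpu_w  # CPUs, memory, storage, networking, fans

print(f"GPUs: {gpu_w / 1000:.1f} kW, everything else: {other_w / 1000:.1f} kW")
```

Under these figures the GPUs account for a little over half of the system's power budget.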

This is why AI data centers require purpose-built power infrastructure — a traditional enterprise server rack runs 3–10 kW; an AI rack runs 30–120+ kW.

Rack power density evolution

Traditional enterprise IT:    3–10 kW / rack
AI-ready data center:        20–40 kW / rack
Modern GPU cluster:          40–80 kW / rack
DGX B200/GB300 dense:       80–120+ kW / rack
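The tiers above can be turned into a simple lookup. This is a sketch only: the tier names and upper bounds come from the list above, while the handling of the unlisted 10–20 kW gap (lumped into the "AI-ready" tier here) is an assumption.

```python
# (upper bound in kW per rack, tier name), taken from the list above.
# Assumption: the unlisted 10-20 kW range is treated as "AI-ready".
TIER_UPPER_BOUNDS = [
    (10, "Traditional enterprise IT"),
    (40, "AI-ready data center"),
    (80, "Modern GPU cluster"),
]
DENSEST_TIER = "DGX B200/GB300 dense"


def density_tier(kw_per_rack: float) -> str:
    """Return the density tier for a given rack power in kW."""
    if kw_per_rack <= 0:
        raise ValueError("rack power must be positive")
    for upper_kw, name in TIER_UPPER_BOUNDS:
        if kw_per_rack <= upper_kw:
            return name
    return DENSEST_TIER


for kw in (8, 35, 60, 120):
    print(f"{kw:>4} kW/rack -> {density_tier(kw)}")
```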

Practical implication: Most existing data centers cannot physically host modern GPU clusters without power and cooling upgrades. Colocation providers increasingly offer “AI-ready” halls with direct liquid cooling (DLC) and 40–80 kW/rack power circuits.

Cooling methods

Air cooling

Uses CRAC/CRAH units to circulate cold air through server front intakes and exhaust hot air from rear
Hot-aisle/cold-aisle containment improves efficiency
Practical limit: ~30 kW/rack before thermal density becomes unmanageable
Still used for L40S-class (350W PCIe) servers in standard rack deployments

Direct Liquid Cooling (DLC)

Water or coolant flows directly to cold plates mounted on the GPU and CPU packages
Removes heat from the chip before it enters the room air
Enables 50–80 kW/rack densities
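Typically needed once racks of H100 SXM-class systems exceed air-cooling limits (a single DGX H100 chassis is itself air-cooled); rack-scale systems such as GB300 NVL72 are liquid-cooled by design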
Rear-door heat exchangers: a related liquid-assisted option in which a liquid-cooled coil at the back of the rack cools hot exhaust air before it enters the room (the servers themselves remain air-cooled)
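The advantage of liquid over air comes down to how much heat a given volume of fluid can carry. The sketch below uses the steady-state heat balance Q = ρ · V̇ · c_p · ΔT to compare the flow needed to remove 30 kW. The fluid properties are approximate room-temperature values, and the 10 K coolant temperature rise is an illustrative assumption rather than a figure from this guide.

```python
# Approximate fluid properties near room temperature (assumed values).
AIR_DENSITY_KG_M3 = 1.2
AIR_CP_J_KG_K = 1005.0
WATER_DENSITY_KG_M3 = 997.0
WATER_CP_J_KG_K = 4180.0


def volumetric_flow_m3_s(heat_w: float, density_kg_m3: float,
                         cp_j_kg_k: float, delta_t_k: float) -> float:
    """Flow needed to carry heat_w away with a coolant temperature rise of delta_t_k."""
    return heat_w / (density_kg_m3 * cp_j_kg_k * delta_t_k)


HEAT_W = 30_000     # a rack at the ~30 kW air-cooling limit
DELTA_T_K = 10.0    # illustrative inlet-to-outlet temperature rise

air_flow = volumetric_flow_m3_s(HEAT_W, AIR_DENSITY_KG_M3, AIR_CP_J_KG_K, DELTA_T_K)
water_flow = volumetric_flow_m3_s(HEAT_W, WATER_DENSITY_KG_M3, WATER_CP_J_KG_K, DELTA_T_K)

print(f"Air:   {air_flow:.2f} m^3/s")
print(f"Water: {water_flow * 60_000:.1f} L/min")
print(f"Air needs roughly {air_flow / water_flow:,.0f}x the volume flow of water")
```

Under these assumptions air needs several thousand times the volume flow of water for the same heat load, which is why airflow becomes the bottleneck as rack density climbs.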

Immersion cooling

Servers submerged in dielectric fluid (mineral oil or synthetic fluid)
Single-phase: fluid stays liquid, captures heat via heat exchanger
Two-phase: fluid boils at the hot components, and the vapor condenses on a cooled condenser coil and returns to the tank; extremely efficient
Enables 100+ kW/rack, virtually silent
Higher upfront cost; specialized facilities required
Aimed at ultra-dense deployments at B200/GB300-class rack densities

Comparison

Method	Max density	Cost	Retrofit ease
Air (hot/cold aisle)	~30 kW/rack	Low	High (standard CRAC)
Direct Liquid Cooling	50–80 kW/rack	Medium	Medium (plumbing required)
Rear-door HX	~40 kW/rack	Medium	Medium
Immersion	100+ kW/rack	High	Low (specialized facility)
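The comparison table can be read as a selection rule: given a target rack density, which methods remain candidates? This is a minimal sketch that encodes the table's maximum densities; treating immersion's "100+" as having no upper limit is an assumption.

```python
# Maximum practical density per method in kW/rack, from the comparison table.
# Assumption: immersion ("100+") is modelled as having no upper bound.
COOLING_METHODS = [
    ("Air (hot/cold aisle)", 30.0),
    ("Rear-door HX", 40.0),
    ("Direct Liquid Cooling", 80.0),
    ("Immersion", float("inf")),
]


def candidate_methods(kw_per_rack: float) -> list[str]:
    """Return cooling methods whose maximum density covers the given rack power."""
    return [name for name, max_kw in COOLING_METHODS if kw_per_rack <= max_kw]


for kw in (25, 60, 110):
    print(f"{kw} kW/rack: {', '.join(candidate_methods(kw))}")
```

Cost and retrofit ease from the table would then narrow the remaining candidates.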

PUE — Power Usage Effectiveness

PUE measures how efficiently a data center uses energy:

PUE = Total Facility Power / IT Equipment Power

Where:
  Total Facility Power = power consumed by ALL equipment (IT + cooling + lighting + UPS losses)
  IT Equipment Power = power consumed by servers, storage, networking only
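The formula translates directly into code. This is a small sketch with illustrative function names and example readings; the 1,300 kW and 1,000 kW inputs are made up for the example.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power cannot be less than IT power")
    return total_facility_kw / it_equipment_kw


def overhead_kw(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power spent on cooling, lighting, UPS losses, and other non-IT loads."""
    return total_facility_kw - it_equipment_kw


# Hypothetical readings: 1,300 kW at the utility meter, 1,000 kW at the IT load.
print(f"PUE = {pue(1300, 1000):.2f}")
print(f"Overhead = {overhead_kw(1300, 1000):.0f} kW")
```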

PUE	Rating
1.0	Perfect (theoretical) — 100% of power goes to IT
1.1–1.2	Excellent — hyperscale data centers (Google, Meta)
1.3–1.4	Good — purpose-built colocation
1.5–1.7	Average — older enterprise data centers
2.0+	Poor — inefficient legacy facilities

AI data centers target PUE < 1.2 with direct liquid cooling. Air-cooled facilities typically achieve 1.3–1.5.

Exam trap: Lower PUE = better efficiency. PUE of 1.0 is perfect; higher numbers mean more energy spent on non-IT overhead such as cooling, lighting, and UPS losses.

Power infrastructure requirements

Power delivery

3-phase power is standard for rack-scale AI deployments (more efficient than single-phase for high current)
UPS (Uninterruptible Power Supply): Provides short-term battery backup during utility power transitions; typically 10–15 minutes to allow graceful shutdown
PDUs (Power Distribution Units): Intelligent PDUs enable per-outlet power monitoring and remote switching
Redundancy: N+1 or 2N power paths for mission-critical training clusters
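The case for 3-phase power shows up in the current each conductor must carry. The sketch below uses the standard balanced three-phase relation P = √3 · V_LL · I · PF and compares it with a single-phase feed. The 415 V line-to-line voltage, 240 V single-phase voltage, and 0.95 power factor are illustrative assumptions, not values from this guide.

```python
import math


def three_phase_current_a(power_w: float, line_voltage_v: float,
                          power_factor: float) -> float:
    """Line current for a balanced three-phase load: I = P / (sqrt(3) * V_LL * PF)."""
    return power_w / (math.sqrt(3) * line_voltage_v * power_factor)


def single_phase_current_a(power_w: float, voltage_v: float,
                           power_factor: float) -> float:
    """Current for a single-phase load: I = P / (V * PF)."""
    return power_w / (voltage_v * power_factor)


RACK_POWER_W = 40_000   # a 40 kW AI rack
POWER_FACTOR = 0.95     # assumed

i_three = three_phase_current_a(RACK_POWER_W, 415.0, POWER_FACTOR)
i_single = single_phase_current_a(RACK_POWER_W, 240.0, POWER_FACTOR)

print(f"3-phase 415 V:      {i_three:.1f} A per line")
print(f"Single-phase 240 V: {i_single:.1f} A")
```

With these assumed voltages the three-phase feed carries about a third of the current per conductor, which keeps cable and breaker sizes practical at high rack densities.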

Power capacity planning

Total AI rack power = GPU TDP × number of GPUs
                    + CPU TDP × number of CPUs  
                    + Storage (NVMe: ~10W per drive)
                    + Networking (ConnectX-7: ~20W per NIC)
                    + Memory, fans, board overhead

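Rule of thumb: multiply the calculated IT load by the facility's PUE to estimate total facility power (for example, × 1.25 at PUE 1.25); the extra 25% covers cooling and other non-IT overhead.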
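The planning formula and rule of thumb can be combined into a small calculator. This is a sketch under stated assumptions: the per-drive and per-NIC wattages come from the formula above, while the `ServerSpec` fields, the 350 W CPU TDP, the 1,500 W overhead figure, and the four-servers-per-rack layout are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

NVME_W = 10   # per NVMe drive, from the formula above
NIC_W = 20    # per ConnectX-7 NIC, from the formula above


@dataclass
class ServerSpec:
    gpu_tdp_w: float
    gpu_count: int
    cpu_tdp_w: float
    cpu_count: int
    nvme_drives: int
    nics: int
    overhead_w: float  # memory, fans, board (hypothetical lump sum)


def server_power_w(s: ServerSpec) -> float:
    """Sum the component power terms for one server."""
    return (s.gpu_tdp_w * s.gpu_count
            + s.cpu_tdp_w * s.cpu_count
            + NVME_W * s.nvme_drives
            + NIC_W * s.nics
            + s.overhead_w)


def facility_power_w(it_load_w: float, pue: float = 1.25) -> float:
    """Total facility power implied by an IT load at a given PUE."""
    return it_load_w * pue


# Hypothetical 8-GPU H100 SXM5 server; non-GPU figures are illustrative.
spec = ServerSpec(gpu_tdp_w=700, gpu_count=8, cpu_tdp_w=350, cpu_count=2,
                  nvme_drives=8, nics=8, overhead_w=1500)
rack_w = server_power_w(spec) * 4   # assuming four such servers per rack

print(f"Server: {server_power_w(spec) / 1000:.2f} kW")
print(f"Rack:   {rack_w / 1000:.2f} kW")
print(f"Facility share at PUE 1.25: {facility_power_w(rack_w) / 1000:.2f} kW")
```

Real planning would use measured or vendor-published maximum power figures in place of the assumed overhead term.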

Self-check questions

What is the TDP of the H100 in SXM5 form factor?
What does PUE stand for and what value indicates a highly efficient data center?
What cooling method is required for racks exceeding 40 kW?
Why can’t most traditional enterprise data centers host GPU clusters without upgrades?
A data center has PUE of 1.6. If IT equipment consumes 1 MW, how much total power does the facility use?

Answers

1. 700W TDP (H100 SXM5).
2. Power Usage Effectiveness. A PUE of 1.1–1.2 indicates excellent efficiency (hyperscale class). Lower is better; 1.0 is the theoretical ideal.
3. Direct Liquid Cooling (DLC), with cold plates direct to chip. Air cooling becomes impractical above ~30 kW/rack, and rear-door heat exchangers top out around ~40 kW/rack.
4. Traditional data centers are designed for 3–10 kW/rack with standard air cooling. GPU racks require 30–120+ kW/rack, which exceeds both the power circuit capacity (PDU/breaker ratings) and the cooling capacity of standard CRAC units.
5. 1.6 × 1 MW = 1.6 MW total facility power (0.6 MW goes to cooling, lighting, UPS losses).