2.6 Facility Requirements for AI Data Centers

What the exam tests

The physical infrastructure requirements — power, cooling, space, connectivity, and security — needed to host GPU-dense AI workloads.

Why AI changes facility requirements

Traditional enterprise data centers were designed for 3–10 kW/rack with air cooling. Modern AI racks draw 30–120+ kW. That is not just “more of the same”: it requires a fundamentally different facility design.

Traditional enterprise rack:  3–10 kW      → standard air cooling, standard PDU
AI-optimized rack (H100):    ~30 kW         → high-density air or rear-door heat exchanger
AI rack (B200/DGX):          60–80 kW       → direct liquid cooling (typically required)
GB300 NVL72 rack:            ~120 kW        → purpose-built liquid infrastructure

Power infrastructure

Power feed

3-phase power is standard for high-density racks (more efficient for high current loads)
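Per-rack circuits: a single 30 A @ 208 V 3-phase circuit supplies only about 10.8 kVA (roughly 8.6 kW under the common 80% continuous-load derating), so a 30 kW rack needs several such circuits or higher-capacity feeds (for example 60 A at 415 V 3-phase); denser racks need more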
Transformer proximity: Step-down transformers should be close to the row to minimize voltage drop
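
As a rough sizing aid, the usable power of a balanced 3-phase circuit is √3 × line-to-line voltage × current × power factor. The Python sketch below applies that formula under two assumptions that are not from this guide: an 80% continuous-load derating and a power factor of 1.0. The function names and the candidate circuits are illustrative only; actual circuit sizing follows local electrical code and the PDU vendor's ratings.

```python
import math

def usable_circuit_kw(volts_line_to_line, breaker_amps,
                      power_factor=1.0, continuous_derate=0.8):
    """Usable real power (kW) of one balanced 3-phase circuit.

    power_factor and continuous_derate are assumed defaults, not facility data.
    """
    apparent_kva = math.sqrt(3) * volts_line_to_line * breaker_amps / 1000
    return apparent_kva * power_factor * continuous_derate

def circuits_needed(rack_kw, volts_line_to_line, breaker_amps):
    """Minimum identical circuits to carry rack_kw, before adding A/B redundancy."""
    per_circuit = usable_circuit_kw(volts_line_to_line, breaker_amps)
    return math.ceil(rack_kw / per_circuit)

# Illustrative circuit options for a 30 kW rack
for volts, amps in [(208, 30), (208, 60), (415, 60)]:
    kw = usable_circuit_kw(volts, amps)
    n = circuits_needed(30, volts, amps)
    print(f"{volts} V / {amps} A: {kw:.1f} kW usable, {n} circuit(s) for 30 kW")
```

Under these assumptions a 208 V / 30 A circuit yields about 8.6 kW, so a 30 kW rack would need four of them, while a single 415 V / 60 A circuit would cover it. Redundant A/B feeds double the circuit count.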

UPS (Uninterruptible Power Supply)

Provides battery backup during utility outages or generator switchover
AI training jobs run for hours to weeks, far longer than any battery can carry the load. UPS is not a substitute for a generator; it only bridges the gap until the generator takes over
Typical UPS runtime: 10–15 minutes at full load
Types: Double-conversion online UPS (clean power, zero transfer time) preferred for GPU clusters

Redundancy tiers (Uptime Institute)

| Tier | Redundancy | Expected availability |
|---|---|---|
| Tier I | None (N) | 99.671% |
| Tier II | N+1 | 99.741% |
| Tier III | N+1 concurrently maintainable | 99.982% |
| Tier IV | 2N fully fault-tolerant | 99.995% |

AI training clusters commonly target Tier III — maintainable without downtime. Full Tier IV adds significant cost that’s often not justified if training jobs can checkpoint and resume.
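
The availability percentages are easier to compare when converted to expected downtime per year. The short Python sketch below does that conversion; it assumes a 365-day year, and the dictionary and function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # assumes a non-leap year

# Availability figures as listed in the tier table above
TIER_AVAILABILITY_PCT = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}

def annual_downtime_hours(availability_pct):
    """Expected hours of downtime per year for a given availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for tier, pct in TIER_AVAILABILITY_PCT.items():
    print(f"{tier}: {annual_downtime_hours(pct):.1f} h/year")
```

This works out to roughly 28.8 hours per year for Tier I, 22.7 for Tier II, 1.6 for Tier III, and 0.4 (about 26 minutes) for Tier IV.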

Generators

Diesel or natural gas generators provide sustained power beyond UPS runtime
Transfer time: manual (> 30s) or automatic transfer switch (ATS, < 10s)
Generator sizing: must cover full IT load plus cooling equipment
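
The relationship between UPS runtime, transfer time, and generator sizing can be sketched with some simple arithmetic. The Python example below is a first-pass estimate only: the 30% cooling overhead, 20% headroom, and 5x safety factor are hypothetical placeholders rather than values from this guide, and it ignores battery aging, inverter losses, and motor start-up surges.

```python
def generator_kw(it_load_kw, cooling_overhead=0.3, headroom=0.2):
    """Generator rating that covers IT load plus cooling, with headroom.

    cooling_overhead and headroom are assumed fractions, not measured data.
    """
    return it_load_kw * (1 + cooling_overhead) * (1 + headroom)

def ups_energy_kwh(it_load_kw, runtime_min):
    """Energy the UPS must deliver to carry the load for runtime_min minutes."""
    return it_load_kw * runtime_min / 60

def ups_bridges_transfer(runtime_min, transfer_s, safety_factor=5):
    """True if UPS runtime covers the transfer time with an assumed safety margin."""
    return runtime_min * 60 >= transfer_s * safety_factor

it_load = 1000  # kW, hypothetical cluster
print(f"Generator rating: {generator_kw(it_load):.0f} kW")
print(f"UPS energy for 10 min: {ups_energy_kwh(it_load, 10):.0f} kWh")
print(f"10 min UPS covers a 10 s ATS transfer: {ups_bridges_transfer(10, 10)}")
```

For a hypothetical 1,000 kW IT load, these assumptions give a generator of about 1,560 kW and about 167 kWh of delivered UPS energy for a 10-minute runtime.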

Cooling infrastructure

(See 2.3 Power and Cooling for full detail)

Facility-level cooling components

CRAC (Computer Room Air Conditioning): Traditional raised-floor units; air-based; practical to ~30 kW/rack
CRAH (Computer Room Air Handler): Works with building chilled water plant; more efficient than self-contained CRAC
Chilled Water Plant: Chillers, cooling towers, pumps; backbone of large data center cooling
Direct Liquid Cooling infrastructure: CDUs (Coolant Distribution Units), manifolds, facility chilled water supply/return
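
To get a feel for what a liquid loop must deliver, the required coolant flow follows from the heat balance Q = ṁ × c_p × ΔT. The Python sketch below assumes plain water (specific heat about 4.186 kJ/kg·K, density about 1 kg/L) and a supply/return temperature rise chosen by the caller; real loops often use glycol mixtures with different properties, so treat the result as an order-of-magnitude estimate.

```python
WATER_CP_KJ_PER_KG_K = 4.186   # specific heat of water (assumed coolant)
WATER_DENSITY_KG_PER_L = 1.0   # approximate density of water

def coolant_flow_lpm(heat_kw, delta_t_k):
    """Litres per minute of water needed to remove heat_kw at a delta_t_k rise."""
    mass_flow_kg_s = heat_kw / (WATER_CP_KJ_PER_KG_K * delta_t_k)
    return mass_flow_kg_s / WATER_DENSITY_KG_PER_L * 60

# Example: a ~120 kW rack with an assumed 10 K coolant temperature rise
print(f"{coolant_flow_lpm(120, 10):.0f} L/min")
```

With these assumptions, a 120 kW rack needs roughly 172 L/min; halving the temperature rise doubles the required flow.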

Hot/cold aisle containment

Separates cold intake air from hot exhaust air
Cold-aisle containment: encloses cold aisles with doors/roof; hot air freely exits to room return
Hot-aisle containment: encloses hot aisles and ducts exhaust back to the cooling units; cold air floods the room

Space and physical layout

Raised floor

Traditional: raised floor provides plenum for cold air distribution and cable management
Not required with overhead cabling and row-based cooling (in-row coolers)
AI clusters often use overhead cable management with floor drains (for liquid cooling leak management)

Floor load capacity

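Standard data center floor: typically on the order of 1,000–1,500 kg/m² (roughly 200–300 lb/ft²); ratings vary by facility and floor type
Dense GPU racks can approach or exceed this: a single DGX H100 system weighs about 130 kg, so a rack of four plus PDUs, switches, and the rack itself runs well over 500 kg, and a fully populated liquid-cooled NVL72-class rack is heavier still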
Structural assessment is required before installing AI racks in existing facilities
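
A quick screening check before calling in a structural engineer is to divide rack mass by rack footprint and compare against the floor rating. The Python sketch below does exactly that. The 0.6 m × 1.2 m footprint, the 1,000 kg rack, and the 1,200 kg/m² rating are hypothetical inputs, and the calculation ignores point loads at casters or feet and any load spreading across adjacent tiles, so it does not replace a real assessment.

```python
def floor_load_kg_per_m2(rack_mass_kg, width_m=0.6, depth_m=1.2):
    """Uniform load over the rack footprint (default footprint is an assumption)."""
    return rack_mass_kg / (width_m * depth_m)

def within_rating(rack_mass_kg, floor_rating_kg_per_m2, width_m=0.6, depth_m=1.2):
    """True if the footprint load is at or below the floor's rated uniform load."""
    load = floor_load_kg_per_m2(rack_mass_kg, width_m, depth_m)
    return load <= floor_rating_kg_per_m2

rack_mass = 1000      # kg, hypothetical fully loaded rack
floor_rating = 1200   # kg/m^2, hypothetical facility rating
print(f"Footprint load: {floor_load_kg_per_m2(rack_mass):.0f} kg/m^2")
print(f"Within rating: {within_rating(rack_mass, floor_rating)}")
```

Here the hypothetical 1,000 kg rack loads its footprint at about 1,389 kg/m², which exceeds the assumed 1,200 kg/m² rating and would flag the location for review.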

Rack unit (U) planning

1U = 1.75 inches; standard rack is 42U
DGX H100 = 8U
GB300 NVL72 = full rack (purpose-built)
Leave space for top-of-rack switches, patch panels, and cable management
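
Rack planning is usually limited by whichever runs out first: rack units or the power budget. The Python sketch below illustrates that check for a hypothetical 8U system drawing 10 kW, with 4U reserved for switches and cable management; all of these figures are placeholders to be replaced with vendor specifications.

```python
def systems_per_rack(rack_u, system_u, reserved_u, rack_power_kw, system_kw):
    """Systems that fit in a rack, limited by both space and power."""
    by_space = (rack_u - reserved_u) // system_u
    by_power = int(rack_power_kw // system_kw)
    return min(by_space, by_power)

# Hypothetical inputs: 42U rack, 8U systems, 4U reserved, 35 kW budget, 10 kW each
count = systems_per_rack(rack_u=42, system_u=8, reserved_u=4,
                         rack_power_kw=35, system_kw=10)
print(f"Systems per rack: {count}")
```

In this example four systems fit by space, but the 35 kW budget allows only three, so power is the binding constraint.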

Connectivity

External network connectivity

Multiple diverse fiber paths from different providers — no single point of failure at the building
Cross-connects in the Meet-Me Room (MMR) to carriers and internet exchanges
Bandwidth: 100G, 400G uplinks for clusters with high external data transfer needs

Internal network runs

InfiniBand links use passive copper (DAC) for short runs (roughly up to 3 m, depending on link speed) and active optical cables (AOC) or transceivers over low-loss fiber for longer runs
Fat-tree topologies may require structured cabling from compute to spine switches — plan cable path before rack installation

Physical security

Layer	Measures
Perimeter	Fenced facility, vehicle barriers, security guard
Building access	Biometric or badge reader entry, mantrap/airlock
Data center floor	Separate badge zone; camera coverage (CCTV); visitor log
Rack level	Lockable rack doors; cage-based isolation for multi-tenant
OOB management	Dedicated IPMI/BMC network physically separated from production

Summary checklist before deploying AI racks

Confirm rack power circuit capacity (30–120 kW per rack as required)
Verify floor load rating supports GPU rack weights
Determine cooling method (air / rear-door heat exchanger / direct liquid cooling / immersion)
Confirm chilled water supply is available (if DLC)
Plan cable management for InfiniBand high-density connections
Confirm redundant power feeds and backup power (UPS + generator)
Assess external bandwidth for training data ingestion and model uploads
Verify physical security meets compliance requirements

Self-check questions

1. What facility rating (Uptime Tier) is most commonly targeted for AI training clusters?
2. Why does floor load capacity matter for GPU rack deployments?
3. What is the purpose of hot-aisle/cold-aisle containment?
4. What bridges the gap between utility power failure and generator startup?
5. For racks above 40 kW, what cooling method is typically required?

Answers

1. Tier III — concurrently maintainable, 99.982% availability. Provides redundancy without the full cost of Tier IV, and training jobs can checkpoint/resume if needed.
2. GPU racks are heavy: a single DGX H100 system is about 130 kg, and a fully loaded rack, especially a liquid-cooled one, can approach or exceed the structural floor load rating of a standard data center floor (often on the order of 1,000–1,500 kg/m²). Exceeding this risks structural damage. A load assessment and possibly structural reinforcement are required.
3. Hot/cold aisle containment prevents cold supply air from mixing with hot exhaust before it cools servers. Without containment, hot and cold air mix, reducing cooling effectiveness and forcing over-provisioning of cooling capacity.
4. UPS (Uninterruptible Power Supply) — provides battery-backed power for 10–15 minutes while the generator starts up and stabilizes.
5. Direct Liquid Cooling (DLC) — cold plates on GPUs/CPUs, fed by facility chilled water. Rear-door heat exchangers are also used in the 30–50 kW range.