1.2 Training vs Inference Architecture Requirements
What the exam tests
The different compute, memory, and latency profiles of training and inference workloads — and which NVIDIA hardware and software targets each.
The two phases of AI deployment
┌──────────────────────────────────────────────────────────────┐
│ TRAINING INFERENCE │
│ │
│ Objective: Find model weights Objective: Use weights to │
│ that minimize loss function answer a query fast │
│ │
│ Runs: Once (or iteratively) Runs: Millions of times │
│ Duration: Hours to weeks Duration: Milliseconds │
│ Users: Data scientists Users: End-users/apps │
└──────────────────────────────────────────────────────────────┘
Training requirements
Compute profile
- Massive FLOPS — billions to trillions of multiply-accumulate operations per forward + backward pass
- FP16/BF16 training — most modern training runs in BF16 (better dynamic range than FP16) with Tensor Core acceleration
- Gradient computation — backpropagation requires storing all intermediate activations; memory-hungry
Memory requirements
- Large model memory — GPT-4 class models require tens of TB of GPU memory in aggregate
- High memory bandwidth — weight loading and gradient accumulation stress HBM bandwidth
- Distributed training requires NVLink + InfiniBand — model parallelism, data parallelism, and pipeline parallelism spread across many GPUs
Parallelism strategies
| Strategy | What it distributes | When to use |
|---|---|---|
| Data parallelism | Mini-batches across GPUs | Fits model on one GPU; scale throughput |
| Model/tensor parallelism | Model layers/tensors across GPUs | Model too large for one GPU |
| Pipeline parallelism | Layers in pipeline stages | Very deep models with micro-batch pipelining |
| Sequence parallelism | Long context attention | Extremely long sequences |
NCCL handles the all-reduce operations that synchronize gradients across GPUs/nodes.
Best hardware for training
- H100 SXM5 / B200 SXM — high HBM bandwidth, 4th/5th gen NVLink, highest-density compute
- DGX H100 / DGX B200 — 8 GPUs all-to-all connected via NVSwitch; optimal for large model training
- InfiniBand (HDR/NDR/800G) — required for multi-node training at scale; RoCE is an alternative
Inference requirements
Compute profile
- Low latency — user-facing applications require sub-100ms response times
- High throughput — serving many concurrent users simultaneously
- INT8 / FP8 / FP4 quantization — reduced precision maintains accuracy while doubling or quadrupling throughput vs FP16
Memory requirements
- KV cache — transformer inference stores key-value tensors per token per layer; grows with context length
- Model fits in GPU memory — inference typically runs on a single GPU or small GPU cluster
- Memory bandwidth matters more than FLOPS at small batch sizes (memory-bandwidth bound)
Inference optimization techniques
| Technique | Description |
|---|---|
| Quantization | FP32 → FP16 → INT8 → FP8 → FP4; reduces weight size and increases throughput |
| Layer fusion | Combine multiple operations into one kernel launch (TensorRT) |
| Continuous batching | Process tokens from multiple requests in the same forward pass as they arrive |
| Paged KV cache | Store KV cache in non-contiguous pages, like OS virtual memory (TensorRT-LLM) |
| Speculative decoding | Draft model generates candidates; target model verifies in parallel |
Best hardware for inference
- H100 / B200 — for large LLM inference requiring maximum throughput
- L40S — for inference + graphics combined; cost-efficient for medium models
- L4 — edge/cloud inference for smaller models; very power-efficient (72W TDP)
- TensorRT — always use to optimize models before deploying to any NVIDIA GPU
Side-by-side comparison
| Dimension | Training | Inference |
|---|---|---|
| Primary metric | Throughput (samples/sec, tokens/sec training) | Latency (TTFT, tokens/sec generation) |
| Precision | FP32, BF16, FP16 | INT8, FP8, FP4, FP16 |
| Batch size | Large (improves GPU utilization) | Small to large (dynamic batching) |
| Memory need | Very high (activations + gradients + weights) | Moderate (weights + KV cache only) |
| Communication | All-reduce across many GPUs (NCCL) | Tensor parallel across a few GPUs |
| Key NVIDIA tool | NeMo, Megatron-LM, NCCL | TensorRT, Triton, TensorRT-LLM |
| Preferred interconnect | InfiniBand / NVLink | NVLink (within node), Ethernet (scale-out) |
Key terms
- TTFT (Time to First Token): Latency from request to first token returned — critical for interactive applications
- Tokens/sec: Throughput metric for generation workloads
- MFU (Model FLOP Utilization): Fraction of theoretical peak FLOPS actually used; good training runs achieve 40–60% MFU
- Batch size: Number of samples processed in parallel; larger = better GPU utilization but higher latency
Self-check questions
- Why does training require storing all intermediate activations?
- What is the purpose of quantizing a model from FP16 to INT8 for inference?
- Which NCCL operation synchronizes gradients across GPUs during distributed training?
- What is the difference between model parallelism and data parallelism?
- Which NVIDIA GPU is designed for inference + professional visualization on a single card?
Answers
1. Backpropagation computes gradients layer-by-layer from the output back to the input. Each layer needs its input activation (computed during the forward pass) to calculate its gradient. So all activations must be stored until the backward pass completes.2. INT8 uses 8-bit integers instead of 16-bit floats, halving memory usage and doubling throughput on Tensor Cores that support INT8, at minor accuracy cost (calibrated quantization minimizes this).
3. All-reduce — each GPU starts with its local gradient; after all-reduce, every GPU has the sum of all gradients, enabling synchronized weight updates.
4. Data parallelism: each GPU has a full copy of the model but processes different data batches; gradients are averaged at the end. Model parallelism: the model itself is split across GPUs, each holding different layers or tensor slices.
5. L40S (Ada Lovelace) — handles AI inference + rendering/visualization on one card.