3.2 Cluster Orchestration and Job Scheduling

What the exam tests

Kubernetes vs Slurm for AI workloads, the NVIDIA GPU Operator, Run:ai, and how GPU resources are requested and allocated in production clusters.

Why AI scheduling is different

Traditional compute schedulers allocate CPU cores. AI schedulers must:

Allocate entire GPUs (or MIG partitions) to jobs
Enforce GPU-aware gang scheduling — a training job needs all its GPUs to start simultaneously, or none (partial allocation wastes GPUs and causes deadlock)
Handle heterogeneous resources — GPU type, memory, NVLink topology, InfiniBand locality
Support preemption — kill lower-priority jobs to free GPUs for high-priority ones
Track GPU utilization and quota across teams/projects

Slurm — HPC Job Scheduler

Slurm (Simple Linux Utility for Resource Management) is the dominant HPC job scheduler, widely used for AI training clusters.

Architecture

┌─────────────────┐     ┌─────────────────────────────────┐
│  Submit host     │────▶│           Slurm Controller       │
│  (sbatch, srun)  │     │  (slurmctld — tracks jobs,       │
└─────────────────┘     │   nodes, resource state)          │
                         └──────────────┬──────────────────┘
                                        │
                    ┌───────────────────┼───────────────────┐
                    │                   │                   │
              ┌─────┴────┐       ┌──────┴───┐       ┌──────┴────┐
              │  Node 1   │       │  Node 2   │       │  Node N   │
              │ (slurmd)  │       │ (slurmd)  │       │ (slurmd)  │
              └─────┬────┘       └──────┬───┘       └──────┬────┘
                    │                   │                   │
                GPU, CPU, RAM      GPU, CPU, RAM       GPU, CPU, RAM

Key Slurm concepts

Concept	Description
Partition	Queue of nodes with specific policies (priority, time limit, GPU type)
Job	A unit of work submitted with resource requirements
sbatch	Submit a batch script (non-interactive)
srun	Run a parallel application interactively or as job step
GRES (Generic Resource)	How Slurm tracks GPUs: `--gres=gpu:8` requests 8 GPUs
Gang scheduling	All nodes of a job allocated simultaneously

Why Slurm for AI

Mature, battle-tested in HPC environments
Excellent multi-node job support
Integrates with InfiniBand topology-aware scheduling (allocate jobs on same IB switch for lowest latency)
Used by most HPC centers running AI workloads alongside traditional simulation

Kubernetes — Container Orchestration

Kubernetes (K8s) is the dominant platform for containerized workloads and is increasingly used for AI inference and even training.

GPU support in Kubernetes

Without GPU awareness, Kubernetes cannot schedule GPU workloads. Three layers are needed:

1. NVIDIA GPU Operator Automates the deployment of everything required to run GPU workloads on Kubernetes:

NVIDIA drivers (kernel module)
CUDA runtime
DCGM (monitoring)
Node Feature Discovery (labels nodes with GPU capabilities)
Device Plugin (exposes GPUs as schedulable Kubernetes resources)
MIG manager (configures MIG partitions)
Validator (checks the full stack is working)

The GPU Operator is a Kubernetes Operator that manages all of these components as a single unit — one helm install installs and manages the entire GPU stack on every node.

2. NVIDIA Device Plugin Exposes NVIDIA GPUs as a Kubernetes resource:

resources:
  requests:
    nvidia.com/gpu: "8"  # Request 8 GPUs
  limits:
    nvidia.com/gpu: "8"

3. Container runtime NVIDIA Container Toolkit (nvidia-docker2) allows containers to access GPU hardware. Required on every node.

AI workload types on Kubernetes

Workload	Kubernetes mechanism
Batch training	Job or MPIJob (Kubeflow MPI Operator)
Distributed training	PyTorchJob, TFJob (Kubeflow Training Operator)
Inference serving	Deployment + HPA (autoscaling on GPU utilization)
Jupyter notebooks	StatefulSet with GPU request

Slurm vs Kubernetes

Dimension	Slurm	Kubernetes
Primary use	HPC batch jobs, large training	Containerized apps, inference, mixed workloads
Scheduling model	Job queue with priorities	Declarative pod scheduling
GPU support	Native GRES plugin	Via NVIDIA GPU Operator + Device Plugin
Multi-node MPI jobs	Excellent (native)	Good (MPI Operator add-on)
Inference serving	Not designed for it	Excellent
Container support	Yes (with enroot/pyxis)	Native
Real-time scaling	Manual	Automatic (HPA, KEDA)
Typical environment	Research HPC, academic, enterprise AI	Cloud-native enterprise, inference platforms

Most large AI clusters run both: Slurm for large training jobs (researchers familiar with it); Kubernetes for inference services and MLOps pipelines.

Run:ai — Kubernetes-Based GPU Scheduling

Run:ai is a commercial Kubernetes-native GPU scheduling platform that adds AI-specific capabilities on top of standard Kubernetes:

GPU fractions: Share one GPU between multiple jobs (not full MIG — software-level time-slicing)
Dynamic GPU allocation: Allocate GPUs to jobs as needed; return them when idle
Guaranteed quotas + fair-share: Teams get guaranteed GPU quotas; idle quota is shared with others
Preemption with checkpointing: Preempt lower-priority jobs; if the job checkpoints, it resumes from last checkpoint
GPU utilization analytics: Per-project, per-user utilization reports

Run:ai sits on top of Kubernetes — it adds a scheduler that replaces the default Kubernetes scheduler for GPU workloads.

Docker — Containerization

Docker provides:

Image packaging: Bundle the application + libraries + CUDA + framework into a portable image
Runtime isolation: GPU workloads run in containers without interfering with host OS or other containers
NGC integration: NVIDIA NGC images are Docker images; docker pull nvcr.io/nvidia/pytorch:24.01-py3

NVIDIA Container Toolkit (nvidia-ctk) enables Docker containers to access GPU hardware through the container runtime.

Self-check questions

What does the NVIDIA GPU Operator install on a Kubernetes cluster?
How does Slurm request GPUs for a job?
What is gang scheduling and why is it critical for AI training?
What does Run:ai add on top of Kubernetes that standard K8s lacks?
Which scheduler is typically preferred for large multi-node distributed training vs inference serving?

Answers

1. The GPU Operator automatically deploys: NVIDIA drivers, CUDA runtime, DCGM (monitoring), NVIDIA Device Plugin (exposes GPUs to K8s scheduler), Node Feature Discovery, MIG manager, and the NVIDIA Container Toolkit — the entire GPU software stack on every node.
2. Via GRES (Generic Resource Scheduling) flags: `--gres=gpu:8` requests 8 GPUs. SBATCH scripts or srun commands include this flag.
3. Gang scheduling allocates all nodes/GPUs a distributed job needs simultaneously before the job starts. Without it, a job might get 7 of 8 required nodes and hold those GPUs idle while waiting for the 8th — causing GPU waste and potential deadlock.
4. GPU fractions (software time-slicing of a single GPU), dynamic quota allocation, guaranteed-plus-fair-share scheduling between teams, and preemption with checkpoint-aware resumption — capabilities beyond what the default Kubernetes scheduler provides for GPU workloads.
5. Slurm for large multi-node distributed training (HPC origins, excellent MPI support, topology-aware placement); Kubernetes for inference serving (declarative deployment, autoscaling, service mesh integration, rolling updates).