1.1 NVIDIA Software Stack in an AI Environment

What the exam tests

The names, purposes, and relationships of the major layers in the NVIDIA software ecosystem — from bare-metal drivers up to application frameworks.

Stack overview

┌─────────────────────────────────────────────────────────────┐
│              AI Applications & Services                      │
│   (LLM inference, recommendation, computer vision, etc.)    │
├──────────────┬──────────────┬──────────────┬────────────────┤
│   TensorRT   │     NeMo     │    RAPIDS    │   Triton IS    │
│  (inference  │  (LLM/ASR/  │  (data sci)  │  (inference    │
│   optim.)    │   TTS fw.)   │              │   serving)     │
├──────────────┴──────────────┴──────────────┴────────────────┤
│              NVIDIA AI Enterprise (software suite)           │
├──────────────────────────────────────────────────────────────┤
│                    NGC (Container Registry)                   │
├──────────────────────────────────────────────────────────────┤
│           CUDA / cuDNN / cuBLAS / NCCL / cuSPARSE           │
│                   (compute primitives)                        │
├──────────────────────────────────────────────────────────────┤
│              GPU Driver + CUDA Runtime                        │
├──────────────────────────────────────────────────────────────┤
│                   GPU Hardware                                │
└──────────────────────────────────────────────────────────────┘

Layer by layer

CUDA — the foundation

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model. It exposes the GPU’s thousands of cores to general-purpose programs via an extension to C/C++/Python.

Introduced 2007; now the de-facto standard for GPU compute
Every NVIDIA AI framework (PyTorch, TensorFlow, JAX) executes on CUDA underneath
CUDA toolkit includes: compiler (nvcc), runtime libraries, debugger, profiler (Nsight)

Key libraries built on CUDA:

Library	Purpose
cuDNN	Optimized primitives for deep neural networks (convolutions, pooling, normalization)
cuBLAS	GPU-accelerated BLAS (matrix/vector operations)
NCCL	Collective communications for multi-GPU/multi-node (all-reduce, broadcast) — critical for distributed training
cuSPARSE	Sparse matrix operations
RAPIDS	GPU-accelerated data science (see below)

NGC — NVIDIA GPU Cloud

NGC is NVIDIA’s catalog of GPU-optimized containers, pre-trained models, and SDKs.

Contains PyTorch, TensorFlow, JAX, TensorRT, NeMo, and others — all pre-configured and performance-tuned
Containers are versioned and tested against specific GPU driver + CUDA combinations
Available on-prem (pulling to your server) and directly on major cloud GPU instances
Free to access — NGC is a catalog, not a cloud service

Why it matters for enterprise AI: Instead of manually installing and version-managing 20 interdependent libraries, you pull a single NGC container with everything pre-configured. This is the standard starting point for NVIDIA-certified AI deployments.

NVIDIA AI Enterprise

NVIDIA AI Enterprise is a full-stack, enterprise-grade software suite that runs on NVIDIA-certified hardware (bare metal, VMware vSphere, Red Hat OpenShift, etc.).

Includes: NVIDIA Triton Inference Server, TensorRT, NeMo, RAPIDS, CUDA-X libraries
Comes with enterprise support SLA from NVIDIA (critical for production deployments)
Licenses per GPU — annual subscription
Enables running AI on VMs (vGPU) with the same performance guarantees as bare metal

TensorRT — inference optimization

TensorRT is NVIDIA’s SDK for high-performance deep learning inference.

Takes a trained model (ONNX, TensorFlow, PyTorch) and optimizes it for a target GPU
Optimizations: layer fusion, precision calibration (FP32 → FP16 → INT8 → FP4), kernel auto-tuning, memory layout optimization
Results in 2–10× lower latency and higher throughput vs running the raw framework
TensorRT-LLM: extension specifically for large language model inference (paged KV cache, in-flight batching, continuous batching)

Triton Inference Server

NVIDIA Triton Inference Server (open source) provides a standardized HTTP/gRPC inference serving framework.

Serves any model format: TensorRT, ONNX, PyTorch TorchScript, TensorFlow SavedModel, Python custom backends
Dynamic batching: automatically batches concurrent requests to maximize GPU utilization
Concurrent model execution: run multiple models simultaneously on one GPU
Model management: load/unload models at runtime without server restart
Integrates with Kubernetes for horizontal scaling

NeMo — conversational AI framework

NVIDIA NeMo is an open-source framework for building, training, and fine-tuning large language models, automatic speech recognition (ASR), and text-to-speech (TTS) models.

Built on PyTorch Lightning; distributed training via Megatron-LM
Supports LLM fine-tuning (SFT, RLHF, PEFT/LoRA)
Pre-trained model collections available on NGC
NeMo Guardrails: adds safety/topic control for LLM applications in production

RAPIDS — GPU-accelerated data science

RAPIDS is a suite of open-source libraries that accelerates data science pipelines entirely on GPU:

Library	Equivalent	Purpose
cuDF	pandas	DataFrame operations on GPU
cuML	scikit-learn	ML algorithms on GPU
cuGraph	NetworkX	Graph analytics on GPU
cuSpatial	GeoPandas	Geospatial analytics

RAPIDS integrates with PyTorch and TensorFlow — GPU-accelerated preprocessing feeds directly into GPU training without CPU↔GPU round-trips.

Software stack relationships (exam summary)

Need to run inference fast?         → TensorRT
Need to serve inference at scale?   → Triton Inference Server
Need to train/fine-tune an LLM?     → NeMo (+ NCCL for multi-GPU)
Need GPU-accelerated data prep?     → RAPIDS
Need enterprise support + vGPU?     → NVIDIA AI Enterprise
Need optimized containers/models?   → NGC
All of the above built on:          → CUDA

Self-check questions

What does NCCL stand for and why is it essential for distributed training?
What does TensorRT do to a trained model before deployment?
What is the difference between NGC and NVIDIA AI Enterprise?
Which NVIDIA framework is used for training large language models and conversational AI?
An organization wants to run AI models on VMware VMs with enterprise support. Which NVIDIA product enables this?

Answers

1. NVIDIA Collective Communications Library. It provides all-reduce, broadcast, and other collective operations across GPUs on multiple nodes — essential for synchronizing gradients during distributed training.
2. TensorRT optimizes the model for a specific GPU: fuses layers, calibrates precision (FP32→INT8), and auto-tunes kernels. The result is a TensorRT Engine with significantly lower latency.
3. NGC is a free catalog of GPU-optimized containers and models. NVIDIA AI Enterprise is a paid software suite with enterprise SLA support that includes NGC content plus vGPU support and production-grade tools.
4. NeMo (NVIDIA NeMo)
5. NVIDIA AI Enterprise — it's certified on VMware vSphere (with vGPU) and provides enterprise support.