Order: Fifth. After graph, kernels, compiler, and quantization (01–04), you deploy and measure in production-like settings.
Role target: DL Inference Optimization Engineer · MTS Kernels (deployment, production reliability, measurable outcomes).
Your kernels and optimizations only matter if they run correctly in a runtime and meet latency/throughput goals. This unit covers the main inference runtimes and how to measure and compare them.
- TensorRT — Engine build, plugins, dynamic shapes, DLA. How your kernels and graph optimizations show up in the engine.
- ONNX Runtime — Execution providers (CUDA, TensorRT, OpenVINO). Graph optimizations and provider selection (see the provider-selection sketch after this list).
- Triton Inference Server — Batching, model concurrency, metrics. Serving multiple models and dynamic batching.
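To make provider selection concrete, here is a minimal ONNX Runtime sketch: providers are tried in priority order, and the runtime falls back to the next one for any node the earlier provider cannot run. The model path, input shape, and the TensorRT provider option shown are placeholders/assumptions, not a prescribed configuration.

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in order; nodes unsupported by an earlier provider
# fall back to the next one in the list.
providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),  # example option
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

sess = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path
print(sess.get_providers())  # which providers were actually applied

# Placeholder input: adjust name/shape/dtype to your model.
inp = {sess.get_inputs()[0].name: np.zeros((1, 3, 224, 224), dtype=np.float32)}
out = sess.run(None, inp)
```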
TensorRT-LLM is NVIDIA's open-source library purpose-built for LLM inference on NVIDIA GPUs. It extends TensorRT with LLM-specific optimizations that general-purpose runtimes cannot match. If you deploy large language models on NVIDIA hardware, this is the production path.
Standard TensorRT handles vision and small models well, but LLMs have unique challenges:
- KV-cache management — grows linearly with context length, must be paged and reused across requests.
- Autoregressive decoding — each token depends on all previous tokens; batching is complex.
- Multi-GPU serving — models that don't fit on one GPU need tensor/pipeline parallelism.
- Mixed workloads — prefill (compute-bound) and decode (memory-bound) phases have opposite bottlenecks.
TensorRT-LLM addresses all of these: its Python API compiles LLMs into optimized TensorRT engines and pairs them with an LLM-aware runtime (batching, KV-cache management, multi-GPU execution).
In-flight batching (continuous batching):
- Batch requests as they arrive — don't wait for a full batch. New requests join while others are mid-generation.
- Maximizes GPU utilization by mixing prefill and decode phases across requests.
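A toy scheduler loop makes the idea concrete. This is not TRT-LLM's actual scheduler; `step_fn` is a hypothetical stand-in for one engine iteration that prefills newly admitted requests and decodes one token for everyone else.

```python
from collections import deque

EOS_ID = 2  # placeholder end-of-sequence token id

class Request:
    def __init__(self, req_id, prompt_tokens, max_new_tokens):
        self.req_id = req_id
        self.tokens = list(prompt_tokens)
        self.remaining = max_new_tokens

def serve(waiting: deque, step_fn, max_batch: int = 64):
    """Toy in-flight batching loop: requests are admitted every iteration
    instead of waiting for a full batch, and finished sequences free their
    slot immediately. step_fn(batch) returns one new token per request."""
    running = []
    while waiting or running:
        # Admit new requests up to the batch limit; they are prefilled in the
        # same engine step that decodes the already-running sequences.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        new_tokens = step_fn(running)

        still_running = []
        for req, tok in zip(running, new_tokens):
            req.tokens.append(tok)
            req.remaining -= 1
            if tok != EOS_ID and req.remaining > 0:
                still_running.append(req)  # keeps its slot next iteration
        running = still_running
```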
Paged KV-cache:
- Inspired by vLLM's PagedAttention — allocates KV-cache in fixed-size blocks, not contiguous per-sequence.
- Eliminates memory fragmentation; enables serving more concurrent sequences.
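The bookkeeping is small enough to sketch. This is an illustrative allocator, not vLLM's or TRT-LLM's code; the block size and the preemption behavior are assumptions.

```python
class PagedKVCache:
    """Toy paged KV-cache bookkeeping: each sequence maps to a list of
    fixed-size blocks, so no contiguous region is reserved up front and
    freed blocks are immediately reusable by any other sequence."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                 # tokens per block
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> [block ids]

    def ensure_capacity(self, seq_id: int, seq_len: int):
        """Grow seq_id's block table so it can hold seq_len tokens of K/V."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-seq_len // self.block_size)      # ceil division
        while len(table) < needed:
            if not self.free_blocks:
                # A real runtime would preempt or queue the request here.
                raise MemoryError("KV-cache pool exhausted")
            table.append(self.free_blocks.pop())

    def release(self, seq_id: int):
        # Blocks return to the pool; uniform block size means no fragmentation.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```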
Quantization:
- FP8 (Hopper+), INT8 (SmoothQuant), INT4 (AWQ, GPTQ) — all with fused dequantize in GEMM kernels.
- FP4 on Blackwell for maximum throughput.
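A numpy sketch of the weight-only idea behind these modes: quantize per output channel offline, then apply the scales during the matmul. The symmetric INT8 scheme and the shapes are illustrative; the production kernels fuse the dequantize into the GEMM instead of materializing FP16 weights in global memory.

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Symmetric per-output-channel INT8 weight quantization (illustration)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per output row
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)  # [out, in] weight
x = rng.standard_normal((8, 4096)).astype(np.float32)     # activations

w_q, scale = quantize_int8_per_channel(w)

# Conceptually y = x @ dequant(w_q).T; here the scales are applied after the
# matmul, which is what a fused dequantize-in-GEMM kernel does on the fly.
y_ref = x @ w.T
y_q = (x @ w_q.astype(np.float32).T) * scale.T
print("relative error:", np.abs(y_q - y_ref).mean() / np.abs(y_ref).mean())
```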
Tensor parallelism and pipeline parallelism:
- Split model across GPUs: tensor parallel (split within layers) or pipeline parallel (split across layers).
- NCCL-based communication, overlapped with compute.
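A numpy sketch of column-parallel tensor parallelism: each rank owns a slice of the weight's output dimension, computes its partial result with a local GEMM, and the shards are gathered afterward (the NCCL all-gather in a real deployment). Sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024)).astype(np.float32)     # activations, replicated on every rank
w = rng.standard_normal((1024, 4096)).astype(np.float32)  # full weight of one linear layer

tp_size = 2
shards = np.split(w, tp_size, axis=1)        # each "GPU" owns a 1024 x 2048 slice
partials = [x @ shard for shard in shards]   # local GEMMs, no communication yet
y_tp = np.concatenate(partials, axis=1)      # the all-gather NCCL would perform

assert np.allclose(y_tp, x @ w, atol=1e-3)
# A row-parallel split (axis=0) instead yields partial sums that need an
# all-reduce; alternating the two (Megatron-style) keeps collectives per block to a minimum.
```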
Speculative decoding:
- Draft model generates candidate tokens; main model verifies in one forward pass.
- Reduces inter-token latency and overall generation time when draft tokens are accepted at a high rate (prefill, and thus time-to-first-token, is unaffected).
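A greedy-matching sketch of one draft-and-verify round. Real implementations verify against the target distribution with rejection sampling; `draft_model` and `target_model` here are hypothetical callables, not a real API.

```python
def speculative_step(draft_model, target_model, prompt, k=4):
    """One round of draft-and-verify (greedy variant, illustration only).
    draft_model(tokens) -> next token; target_model(tokens, draft) -> the
    token the target would emit at each of the k draft positions plus one
    bonus position, all from a single forward pass."""
    # 1. Cheap draft model proposes k tokens autoregressively.
    ctx = list(prompt)
    draft = []
    for _ in range(k):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Expensive target model verifies all k positions in ONE forward pass.
    target = target_model(list(prompt), draft)   # length k + 1

    # 3. Keep the longest agreeing prefix; the first disagreement is replaced
    #    by the target's token, so every round emits at least one token.
    accepted = []
    for d, t in zip(draft, target):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)
            break
    else:
        accepted.append(target[k])               # all accepted: free bonus token
    return accepted
```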
Custom Hopper/Blackwell kernels:
- CUTLASS-based GEMM kernels optimized for each GPU generation.
- Warp specialization, persistent kernels, Transformer Engine integration.
CUDA Graphs:
- Captures the decode loop as a CUDA graph — eliminates per-token kernel launch overhead.
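A minimal PyTorch sketch of the pattern, assuming fixed shapes; `decode_step` is just a stand-in linear layer, not a real decoder. Capture once, then copy fresh data into the static buffers and replay.

```python
import torch

device = "cuda"
static_input = torch.zeros(64, 4096, device=device)       # placeholder decode-batch buffer
decode_step = torch.nn.Linear(4096, 4096).to(device)      # stand-in for one decode iteration

# Warm up on a side stream so allocations settle before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = decode_step(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: every kernel launch inside the context is recorded once.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = decode_step(static_input)

# Per token: copy new data into the static input and replay, which costs a
# single graph launch instead of one launch per kernel in the decode step.
for _ in range(10):
    static_input.copy_(torch.randn(64, 4096, device=device))
    graph.replay()
```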
```bash
# Install
pip install tensorrt-llm
# Step 1: Convert model checkpoint to TRT-LLM format
python convert_checkpoint.py \
--model_dir ./llama-3-8b \
--output_dir ./trt_ckpt \
--dtype float16 \
--tp_size 2 # tensor parallel across 2 GPUs
# Step 2: Build TRT-LLM engine
trtllm-build \
--checkpoint_dir ./trt_ckpt \
--output_dir ./trt_engine \
--gemm_plugin float16 \
--max_batch_size 64 \
--max_input_len 2048 \
--max_seq_len 4096 \
--paged_kv_cache enable \
--use_fused_mlp enable
# Step 3: Run inference
python run.py \
--engine_dir ./trt_engine \
--tokenizer_dir ./llama-3-8b \
--input_text "Explain how a systolic array works"
# Step 4: Serve with Triton Inference Server
# TRT-LLM integrates with Triton via the TRT-LLM backend
# → in-flight batching, streaming, multi-model serving
```

| | TensorRT-LLM | vLLM |
|---|---|---|
| Approach | Compile model to optimized engine (ahead-of-time) | JIT with PyTorch + custom CUDA kernels |
| Performance | Highest throughput on NVIDIA GPUs (custom Hopper/Blackwell kernels) | Very good; slightly lower peak but faster iteration |
| Quantization | FP8, INT8, INT4, FP4 with fused kernels | GPTQ, AWQ, FP8 via external libraries |
| Multi-GPU | TP + PP via NCCL | TP via NCCL |
| Setup complexity | Higher — build step required | Lower — load and serve |
| Model support | Major LLMs (Llama, Mistral, GPT, Falcon, etc.) | Broader model support via HuggingFace |
| Hardware | NVIDIA only | NVIDIA + AMD (ROCm) |
| Best for | Maximum throughput in production on NVIDIA | Rapid prototyping, AMD support, flexibility |
- Prefill vs decode phases — Prefill processes the entire prompt in one pass (compute-bound, high arithmetic intensity). Decode generates one token at a time (memory-bound, low arithmetic intensity). TRT-LLM optimizes both with different kernel strategies.
- KV-cache sizing — For a 7B model at FP16 with 4096 context: ~2 GB of KV-cache per sequence. With paged KV-cache, an 80 GB H100 can serve ~30 concurrent full-context sequences. Understanding this math is essential (worked through in the sketch after this list).
- Engine build trade-offs — `max_batch_size`, `max_input_len`, and `max_seq_len` are baked into the engine. Larger values reserve more memory and leave room for fewer concurrent engines; size them for your actual workload.
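The arithmetic behind the ~2 GB figure, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head_dim 128, full multi-head attention). GQA models such as Llama-3-8B keep only 8 KV heads, which shrinks the per-sequence cache by roughly 4x.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values (factor 2), one vector per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

per_seq = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
weights = 7e9 * 2          # ~14 GB of FP16 weights
hbm = 80e9                 # H100 capacity

print(f"KV-cache per sequence: {per_seq / 1e9:.2f} GB")                 # ~2.15 GB
print(f"Max full-context sequences: {(hbm - weights) // per_seq:.0f}")  # ~30
```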
- Latency — p50, p99; what to measure (single request, batch). For LLMs: time-to-first-token (TTFT) and inter-token latency (ITL).
- Throughput — QPS, tokens/s; how batch size and concurrency affect it. For LLMs: output tokens/s across all concurrent requests.
- Memory footprint — Peak GPU/system memory; impact of batching and precision. For LLMs: model weights + KV-cache + activation memory.
- Methodology — Reproducible benchmarks; A/B comparison (e.g. before/after kernel change, or TensorRT-LLM vs vLLM).
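A small harness for the single-request latency numbers. The runner function and the warmup/iteration counts are placeholders; for GPU backends the call must block until the result is ready, and for LLMs you would additionally time the first streamed token (TTFT) and the gaps between subsequent tokens (ITL).

```python
import time
import numpy as np

def benchmark(run_once, warmup: int = 10, iters: int = 200):
    """Measure per-request latency percentiles for a single-request workload.
    run_once is a stand-in for one inference call (e.g. a session.run or a
    generate() call) that returns only after the result is available."""
    for _ in range(warmup):
        run_once()
    latencies = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_once()
        latencies.append((time.perf_counter() - t0) * 1e3)  # ms
    lat = np.array(latencies)
    return {
        "p50_ms": float(np.percentile(lat, 50)),
        "p99_ms": float(np.percentile(lat, 99)),
        "qps": 1e3 / lat.mean(),  # single-stream only; batch/concurrent QPS differs
    }
```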
- TensorRT Best Practices — NVIDIA's guide to engine building and performance tuning.
- TensorRT-LLM GitHub — Source, examples, model support matrix.
- TensorRT-LLM Documentation — Build, deploy, and optimize guides.
- vLLM — Alternative LLM serving engine for comparison.
- Triton Inference Server — Production serving with TRT-LLM backend.
- MLPerf Inference — Reference benchmarks and methodology.
- Runtime comparison — Deploy the same model with ONNX Runtime and TensorRT (same hardware). Compare latency and throughput; document configuration and measurement method.
- Triton server — Set up a minimal Triton server with dynamic batching. Measure QPS vs batch size and document how batching affects latency and throughput.
- Benchmark report — For one model and one runtime, produce a one-page benchmark report: latency (p50/p99), throughput, memory, and exact environment (GPU, driver, runtime version).
- TensorRT-LLM engine build — Build a TensorRT-LLM engine for Llama-3-8B with FP16 and INT8 quantization. Measure tokens/s, TTFT, and memory usage. Compare with vLLM on the same hardware.
- Multi-GPU LLM serving — Deploy a 70B model across 2+ GPUs with tensor parallelism using TensorRT-LLM. Measure scaling efficiency (tokens/s per GPU) vs single-GPU with a smaller model.
- TRT-LLM + Triton — Deploy TensorRT-LLM engine behind Triton Inference Server with in-flight batching enabled. Load test with concurrent clients and measure p50/p99 latency under load.
→ 06 — tinygrad Deep Dive (optional) — Hands-on compiler/kernel interface: IR, scheduler, backends, and adding a simple optimization.