This guide is a comprehensive prep toolkit for a Machine Learning Engineer – Inference role (such as at Together AI). It focuses on the system-level architectural patterns of large-scale LLM inference system design and the supporting LLM training infrastructure.
It covers the essential concepts mentioned … (continuous batching, parallelism strategies, caching, offloading/eviction policies) and links to the most relevant open-source frameworks (vLLM, TGI, TensorRT-LLM, etc.), whitepapers, videos, and documentation.
Summary: Continuous batching (also called dynamic batching or iteration-level scheduling) is a technique to maximize GPU utilization by scheduling requests at the token level rather than one request at a time. In static or request-level batching, all requests in a batch must finish before new ones start, leaving the GPU underutilized when shorter sequences finish early. Continuous batching instead fills those “gaps” immediately with new incoming requests, processing many requests in an interleaved fashion token by token. This yields much higher throughput (often an order of magnitude or more) at minimal latency cost, especially under real-world multi-user loads. Key challenges include scheduling policies (when to admit new requests vs. when to wait), handling timeouts and sequence padding efficiently, and managing memory for attention key/value caches as sequences of different lengths are mixed. Modern LLM inference servers like Hugging Face’s Text Generation Inference (TGI) and UC Berkeley’s vLLM implement continuous batching to achieve significantly better throughput and cost-efficiency than naive batching.
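As a rough illustration of iteration-level scheduling (not the actual TGI or vLLM scheduler), the toy Python loop below admits waiting requests into the active batch at every decoding step and retires sequences as soon as they finish; the `Request` class, the `model_step` callable, and the `max_batch_size` limit are assumptions made for this sketch.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list          # already-tokenized prompt
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def is_finished(self) -> bool:
        # Real servers also stop on an EOS token; length-only here for brevity.
        return len(self.generated) >= self.max_new_tokens

def continuous_batching_loop(model_step, waiting: deque, max_batch_size: int = 32):
    """Iteration-level scheduling sketch: one batched forward pass per loop
    iteration, with admissions and retirements between token steps."""
    active = []
    while waiting or active:
        # 1) Fill free batch slots immediately instead of waiting for the
        #    whole batch to drain (the core idea of continuous batching).
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())

        # 2) Generate exactly one token for every active sequence.
        #    `model_step` stands in for a batched forward pass that
        #    reads/writes each sequence's KV cache.
        next_tokens = model_step(active)
        for req, tok in zip(active, next_tokens):
            req.generated.append(tok)

        # 3) Retire finished sequences right away, freeing their slot
        #    (and, in a real server, their KV-cache blocks).
        active = [r for r in active if not r.is_finished()]
```

In a production scheduler the admission step also checks whether enough KV-cache memory is free and may apply priority or fairness policies before adding a request.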
Resources:
Summary: Large LLMs often require splitting work across multiple GPUs or machines. Model parallelism refers to partitioning a single model’s execution across GPUs. It comes in two main flavors: tensor parallelism (TP) splits the tensors within each layer (e.g. a weight matrix is divided among GPUs), while pipeline parallelism (PP) assigns different layers (or groups of layers) to different GPUs, passing intermediate activations along a pipeline. Many large-model training setups (e.g. NVIDIA Megatron-LM) use a combination of TP + PP to fit and accelerate models across dozens of GPUs. Data parallelism (DP), on the other hand, replicates the full model on each GPU and splits the input data among them – this is more common in training (where gradients are synchronized) and offers limited benefit for a single inference query. For inference serving, DP can still scale throughput by running multiple model replicas to handle more requests in parallel. GPU multi-streaming refers to using CUDA streams to execute multiple inference kernels concurrently on one GPU (useful when one model alone does not saturate it). This can increase utilization for smaller models, though for large LLMs a single stream usually keeps the GPU busy, and context-switching overhead or memory bandwidth can bottleneck multi-stream performance. In practice, high-throughput LLM inference typically uses batching with a single stream per model instance, and scales out to multiple GPUs via model parallelism or multiple replicas rather than concurrent streams on one device. The key is understanding the trade-offs: tensor/model parallelism adds communication overhead (especially across nodes), pipeline parallelism adds latency from pipeline fill/drain bubbles, and data parallelism is memory-intensive (multiple copies of the weights) – so efficient combinations and hardware-aware tuning (NVLink, InfiniBand) are required for multi-GPU/multi-node LLM serving.
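To make the tensor-parallel idea concrete, here is a minimal NumPy sketch of a Megatron-style column-parallel linear layer: each "GPU" owns a slice of the weight matrix's columns, computes a partial output independently, and the shards are gathered at the end (in a real system this is an all-gather over NVLink/InfiniBand, or the gather is deferred and paired with a row-parallel layer combined by an all-reduce). The toy shapes and two-way split are illustrative assumptions, not real model dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff, n_shards = 8, 16, 2      # toy sizes; real layers are far larger
x = rng.standard_normal((4, d_model))   # a batch of 4 token activations
W = rng.standard_normal((d_model, d_ff))

# Column-parallel split: each "GPU" owns d_ff / n_shards output columns.
shards = np.split(W, n_shards, axis=1)

# Each device computes its partial result independently; the matmul itself
# needs no communication.
partial_outputs = [x @ W_i for W_i in shards]

# Gathering the shards reproduces the full output of the unsplit layer.
y_parallel = np.concatenate(partial_outputs, axis=1)
y_reference = x @ W
assert np.allclose(y_parallel, y_reference)
```

Serving frameworks expose this directly; for example, vLLM's `tensor_parallel_size` argument shards the model's weights across that many GPUs in this fashion.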
Resources: