This guide is a comprehensive prep toolkit for a Machine Learning Engineer – Inference role (such as at Together AI). It focuses on the system-level architectural patterns of large-scale LLM inference system design and the supporting LLM training infrastructure.
It covers the essential concepts mentioned … (continuous batching, parallelism strategies, caching, offloading/eviction policies) and links to the most relevant open-source frameworks (vLLM, TGI, TensorRT-LLM, etc.), whitepapers, videos, and documentation.
Summary: Continuous batching (also called dynamic batching or iteration-level scheduling) is a technique to maximize GPU utilization by scheduling requests at the token level rather than one request at a time. In static or request-level batching, all requests in a batch must finish before new ones start, leaving the GPU underutilized when shorter sequences finish early. Continuous batching instead fills those “gaps” immediately with new incoming requests, processing many requests in an interleaved fashion token by token. This yields much higher throughput (often an order of magnitude or more) at minimal latency cost, especially under real-world multi-user loads. Key challenges include scheduling policies (when to admit new requests vs. when to wait), handling timeouts and sequence padding efficiently, and managing memory for attention key/value caches as sequences of different lengths are mixed. Modern LLM inference servers like Hugging Face’s Text Generation Inference (TGI) and UC Berkeley’s vLLM implement continuous batching to achieve significantly better throughput and cost-efficiency than naive batching.
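As a rough illustration of iteration-level scheduling (not the actual TGI or vLLM scheduler), the toy Python loop below admits waiting requests into the active batch at every decoding step and retires sequences as soon as they finish; the `Request` class, the `model_step` callable, and the `max_batch_size` limit are assumptions made for this sketch.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list          # already-tokenized prompt
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def is_finished(self) -> bool:
        # Real servers also stop on an EOS token; length-only here for brevity.
        return len(self.generated) >= self.max_new_tokens

def continuous_batching_loop(model_step, waiting: deque, max_batch_size: int = 32):
    """Iteration-level scheduling sketch: one batched forward pass per loop
    iteration, with admissions and retirements between token steps."""
    active = []
    while waiting or active:
        # 1) Fill free batch slots immediately instead of waiting for the
        #    whole batch to drain (the core idea of continuous batching).
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())

        # 2) Generate exactly one token for every active sequence.
        #    `model_step` stands in for a batched forward pass that
        #    reads/writes each sequence's KV cache.
        next_tokens = model_step(active)
        for req, tok in zip(active, next_tokens):
            req.generated.append(tok)

        # 3) Retire finished sequences right away, freeing their slot
        #    (and, in a real server, their KV-cache blocks).
        active = [r for r in active if not r.is_finished()]
```

In a production scheduler the admission step also checks whether enough KV-cache memory is free and may apply priority or fairness policies before adding a request.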
Resources:
Summary: Large LLMs often require splitting work across multiple GPUs or machines. Model parallelism refers to partitioning a single model’s execution across GPUs. It comes in two main flavors: tensor parallelism (TP) splits the tensors within each layer (e.g. a weight matrix is divided among GPUs), while pipeline parallelism (PP) assigns different layers (or groups of layers) to different GPUs, passing intermediate activations along a pipeline. Many large-model training setups (e.g. NVIDIA Megatron-LM) use a combination of TP + PP to fit and accelerate models across dozens of GPUs. Data parallelism (DP), on the other hand, replicates the full model on each GPU and splits the input data among them – this is more common in training (where gradients are synchronized) and offers limited benefit for a single inference query. For inference serving, DP can still scale throughput by running multiple model replicas to handle more requests in parallel. GPU multi-streaming refers to using CUDA streams to execute multiple inference kernels concurrently on one GPU (useful when one model alone does not saturate it). This can increase utilization for smaller models, though for large LLMs a single stream usually keeps the GPU busy, and context-switching overhead or memory bandwidth can bottleneck multi-stream performance. In practice, high-throughput LLM inference typically uses batching with a single stream per model instance, and scales out to multiple GPUs via model parallelism or multiple replicas rather than concurrent streams on one device. The key is understanding the trade-offs: tensor/model parallelism adds communication overhead (especially across nodes), pipeline parallelism adds latency from pipeline fill/drain bubbles, and data parallelism is memory-intensive (multiple copies of the weights) – so efficient combinations and hardware-aware tuning (NVLink, InfiniBand) are required for multi-GPU/multi-node LLM serving.
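To make the tensor-parallel idea concrete, here is a minimal NumPy sketch of a Megatron-style column-parallel linear layer: each "GPU" owns a slice of the weight matrix's columns, computes a partial output independently, and the shards are gathered at the end (in a real system this is an all-gather over NVLink/InfiniBand, or the gather is deferred and paired with a row-parallel layer combined by an all-reduce). The toy shapes and two-way split are illustrative assumptions, not real model dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff, n_shards = 8, 16, 2      # toy sizes; real layers are far larger
x = rng.standard_normal((4, d_model))   # a batch of 4 token activations
W = rng.standard_normal((d_model, d_ff))

# Column-parallel split: each "GPU" owns d_ff / n_shards output columns.
shards = np.split(W, n_shards, axis=1)

# Each device computes its partial result independently; the matmul itself
# needs no communication.
partial_outputs = [x @ W_i for W_i in shards]

# Gathering the shards reproduces the full output of the unsplit layer.
y_parallel = np.concatenate(partial_outputs, axis=1)
y_reference = x @ W
assert np.allclose(y_parallel, y_reference)
```

Serving frameworks expose this directly; for example, vLLM's `tensor_parallel_size` argument shards the model's weights across that many GPUs in this fashion.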
Resources: