Data Parallelism
Data parallelism replicates the entire model on each GPU to process different input data in parallel.
How it works: Data parallelism involves deploying multiple copies of the model on different GPUs or nodes. Each model replica handles a different portion of the incoming requests or batch independently, much like having multiple instances of the same microservice handling separate users. All GPUs have the full model loaded, so this technique does not split the model itself; instead it splits the data. For example, if you have 4 identical GPUs each with a copy of a 7B parameter model, you can send different input sequences to each GPU concurrently and get 4 times the throughput (in ideal conditions). Importantly, this does not help if the model is too large to fit in one GPU’s memory – data parallelism assumes the model fits on a single device.
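As a concrete illustration, here is a minimal sketch of request-level data parallelism: one full copy of the model per GPU, with incoming prompts dispatched round-robin to the replicas and run concurrently. The model name, GPU count, and prompts are placeholders (any model that fits on one device works the same way), and a production server would use a proper scheduler and batching rather than a bare thread pool.

```python
# Minimal sketch of request-level data parallelism (assumptions: at least one
# CUDA GPU, the model fits on a single GPU, transformers + PyTorch installed).
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"          # placeholder model ID
NUM_REPLICAS = torch.cuda.device_count()         # one full replica per GPU

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
replicas = [
    AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16).to(f"cuda:{i}")
    for i in range(NUM_REPLICAS)
]

def generate(replica_and_prompt):
    replica, prompt = replica_and_prompt
    device = next(replica.parameters()).device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = replica.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

prompts = ["Explain NVLink.", "Summarize data parallelism.",
           "What is an all-reduce?", "Define throughput."]

# Each prompt is paired with a replica round-robin; threads let the replicas
# run their forward passes on different GPUs concurrently.
with ThreadPoolExecutor(max_workers=max(NUM_REPLICAS, 1)) as pool:
    results = list(pool.map(generate, zip(cycle(replicas), prompts)))
```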
- Benefits:
- Linear throughput scaling: Increases overall serving throughput almost linearly with the number of replicas, since each GPU can handle a separate request in parallel. This is great for serving many independent queries simultaneously (e.g. many users at once).
- No model coordination latency: Each inference runs in isolation on one GPU, so there’s no cross-GPU communication needed during forward passes. This keeps per-request latency low (each request is just like running on a single GPU) and avoids networking overheads during inference.
- Simplicity: It’s straightforward to implement – essentially running multiple model instances – and is compatible with existing inference servers. Scaling to more GPUs or nodes is as simple as adding more replica instances.
- Trade-offs:
- Memory duplication: Every GPU must hold a complete copy of the model weights. This is wasteful for very large models and can become impossible if the model is larger than a single GPU’s memory (e.g. a 175B model cannot be served on one 24GB GPU by data parallelism alone). It also means using N GPUs requires N times the memory for weights (see the back-of-the-envelope sketch after this list).
- No single-query speedup: Data parallelism does not reduce the latency for a single input. One query can only run on one replica, so if you need to accelerate one long or complex inference (rather than many in parallel), data parallelism doesn’t help. It’s geared toward throughput over latency.
- Diminishing returns: If the workload (number of concurrent requests) is low, many model replicas will sit idle – scaling out beyond the demand just wastes resources. And in multi-node scenarios, distributing requests adds slight overhead (e.g. load balancing, network transfer of input/output), though typically minor compared to the inference compute.
- Scalability: Data parallelism scales horizontally across GPUs and even across multiple machines. In theory, if you have 100 identical GPUs, you can serve ~100x more requests per second. In practice, near-linear scaling is achievable as long as the external factors (like request dispatch and network I/O) are not bottlenecks. Unlike model-sharding techniques, data parallel replicas don’t need high-speed interconnects between GPUs because they run independently – this makes it suitable for scaling across nodes with just standard networking. However, each replica still needs to fit the model in memory, so extremely large models might require model parallelism instead of or in addition to data parallelism.
- Use cases: Data parallelism shines for multi-user inference serving and high-throughput APIs. For example, a deployment of a 13B model that easily fits on a GPU can be replicated 8 times to handle 8 concurrent requests with minimal latency impact, which is common in production systems. It’s also used in batch inference: a large batch can be split so each GPU processes a chunk of the batch in parallel, then results are combined. However, data parallelism fails when the model is too big for one GPU or when one needs to speed up a single inference beyond the capability of one device – in those cases, model or pipeline parallelism is needed.
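To put numbers on the memory-duplication trade-off, here is a quick back-of-the-envelope calculation for the 30B-parameter scenario raised in the interview questions below, assuming fp16 weights and ignoring KV cache and activation memory:

```python
# Weight memory under data parallelism: every replica stores the full model.
# Assumes fp16 (2 bytes per parameter); KV cache and activations come on top.
params = 30e9                                   # 30B-parameter model
bytes_per_param = 2                             # fp16
weights_gb = params * bytes_per_param / 1e9

print(f"weights per replica:     {weights_gb:.0f} GB")      # ~60 GB on every GPU
print(f"total across 4 replicas: {4 * weights_gb:.0f} GB")  # ~240 GB of duplicated weights
```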
Interview Questions – Data Parallelism
- Throughput vs. latency: If you needed to increase the throughput of an LLM service (serving many queries per second), why is data parallelism a good choice? And conversely, why does it fail to improve the latency of a single query?
- Memory overhead: In an inference deployment, what are the memory implications of using data parallelism with a 30B parameter model across 4 GPUs? How might this influence your decision to use data parallelism or not?
- Scaling across nodes: When scaling data parallel inference to multiple machines, what network considerations arise (for input/output handling or model updates) even though the model replicas don’t directly communicate during inference?
- Combining parallelism: If a model barely fits on one GPU, can you simply use data parallelism to leverage two GPUs for one request? Why not, and which parallelism technique would you consider in that scenario?
- Use-case judgment: Imagine a scenario with sporadic, heavy single-user queries (long, expensive prompts) rather than many concurrent users. Would you invest in data parallel replicas or another approach to handle this load? Explain your reasoning.
Tensor (Model) Parallelism
Tensor parallelism (a type of model parallelism) splits each weight matrix or tensor across multiple devices. In this illustration, matrix B is split into two parts and multiplied with A in parallel; the partial results are then combined (all-gather) to form the final output C.
How it works: Tensor parallelism (also called intra-layer model parallelism) partitions the computations within each model layer across multiple GPUs. Instead of replicating the whole model, different GPUs hold different slices of the model’s weight tensors. During inference, a single forward pass is distributed: each GPU computes its fragment of the layer and the partial results are then aggregated to produce the same output as the full layer. For example, if a transformer’s feed-forward layer has a large weight matrix, one can split that matrix into 2 or 4 chunks and place each chunk on a different GPU. Each GPU multiplies its chunk by the input simultaneously, and then the results are summed or concatenated to get the final output. Similarly, for multi-head attention, different heads (or groups of heads) can be assigned to different GPUs to be computed in parallel. In essence, tensor parallelism “slices” the tensor operations along a dimension and uses an all-reduce or gather operation to combine outputs at the end of the layer. This allows a single huge model to be spread across multiple devices within each layer.
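The arithmetic is easy to verify. Below is a minimal sketch of the column-parallel and row-parallel splits described above, with the two shards simulated as plain tensors in a single process so it runs anywhere; in a real deployment each shard would live on its own GPU, and the final summation would be an NCCL all-reduce (or the concatenation an all-gather).

```python
import torch

torch.manual_seed(0)
x  = torch.randn(1, 1024)            # activation entering the block
W1 = torch.randn(1024, 4096)         # feed-forward up-projection weight
W2 = torch.randn(4096, 1024)         # feed-forward down-projection weight

# Column-parallel: each shard holds half of W1's output columns and computes
# its partial activation independently.
W1_a, W1_b = W1.chunk(2, dim=1)      # (1024, 2048) each
h_a = torch.relu(x @ W1_a)           # would run on GPU 0
h_b = torch.relu(x @ W1_b)           # would run on GPU 1

# Row-parallel: each shard holds the matching half of W2's input rows; the
# partial outputs are summed (an all-reduce in a real multi-GPU setup).
W2_a, W2_b = W2.chunk(2, dim=0)      # (2048, 1024) each
y_parallel = h_a @ W2_a + h_b @ W2_b

# Reference: the same computation without any sharding.
y_full = torch.relu(x @ W1) @ W2
print(torch.allclose(y_parallel, y_full, rtol=1e-4, atol=1e-4))  # True (up to float rounding)
```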
- Benefits:
- Enables larger models: Tensor parallelism makes it possible to serve models that are too large for one GPU’s memory by dividing the model’s parameters. Each GPU only stores a fraction of the weights (e.g. 1/4 of each large matrix if using 4-way parallelism), reducing per-GPU memory usage proportionally. This was key to deploying models like GPT-3 on clusters of GPUs.
- Potential speedups: By sharing the compute of a layer across multiple GPUs, you can shorten the wall-clock time for that layer’s forward pass. Each GPU handles a smaller matrix multiplication, which might complete faster, and the results are merged. In an ideal scenario with fast interconnect, splitting a compute-heavy layer across N GPUs could approach an N× speedup for that layer. This can reduce latency for a single large inference if the GPUs are efficiently utilized in parallel.
- Transparent model output: The outputs are the same as if computed on a single device (just assembled from parts), so accuracy isn’t affected. It parallelizes the math without approximating it. From the outside, it’s still one model – just partitioned under the hood.
- Mix-and-match with other parallelism: Tensor parallelism can be combined with data parallelism or pipeline parallelism. For example, you might shard the model across 2 GPUs (tensor parallel) and also have 2 replicas of that setup (data parallel) to serve more queries. It’s a flexible building block for scaling.
- Trade-offs:
- Communication overhead: Splitting a layer means that at some point GPUs must exchange data (e.g. send their partial outputs to each other) to produce the final result. Typically this involves high-speed GPU-to-GPU communication, such as an all-reduce or all-gather after every layer (a minimal sketch of this collective appears after this list). This overhead can significantly hurt latency if the interconnect is slow. In fact, the network bandwidth between GPUs often becomes the bottleneck for tensor parallel scaling. Using NVLink or NVSwitch (on-node) or InfiniBand (across nodes) is usually required to make it performant.
- Synchronization: All parallel GPUs must wait for each other at the synchronization points in each layer. The inference can only move at the speed of the slowest participant (e.g., if one GPU is momentarily slower or has more load, it delays the others). This means latency doesn’t always scale linearly – adding more GPUs might give diminishing returns if communication and synchronization costs grow.
- Memory overheads: While each GPU stores only a slice of the big weight matrices, some parts of the model might still be replicated on all GPUs (for example, small layers like layernorm or embeddings might be kept in full on every GPU for simplicity), and memory is also needed for communication buffers. There’s further overhead in storing partial activations that must be exchanged. So memory per GPU is reduced, but not by exactly the factor of parallelism in all cases.
- Complex implementation: Using tensor parallel inference often requires a custom runtime or libraries (Megatron-LM, FasterTransformer, etc.) that know how to split the computations and do all-reduce. It adds engineering complexity compared to running a model on one device. Load balancing the split (deciding how to partition tensors) can also be complex for different layer types.
- Scalability: Tensor parallelism is effective up to a point. Typically, we see good scaling for moderate numbers of GPUs (e.g. 2, 4, 8), especially with excellent interconnects. For instance, splitting an attention or MLP layer across 8 GPUs can work well if they share NVSwitch, but beyond that, the communication overhead can dominate. Each additional GPU contributes less speedup if the network traffic (which grows with more shards) becomes the limiting factor. Also, some operations can’t be split arbitrarily – if a layer is very small, splitting it might not be worthwhile. In practice, large LLMs often use a combination of tensor parallel groups (e.g. groups of 8 GPUs) to shard the model, possibly combined with pipeline parallel groups for further scaling. The key scalability consideration is network bandwidth: a high-bandwidth, low-latency interconnect (such as NVLink, or at minimum PCIe 5.0) is needed to make tensor parallel inference across GPUs efficient. Multi-node tensor parallelism is possible but requires an extremely fast network (e.g. InfiniBand), and even then it may incur significant latency per token.
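To make the per-layer collective referenced in the trade-offs above concrete, here is a minimal sketch of the all-reduce that tensor-parallel ranks perform after each sharded layer. It uses PyTorch’s torch.distributed with the CPU gloo backend so it runs on any machine; a real deployment would use the nccl backend over NVLink or InfiniBand, and this exchange is precisely the step whose bandwidth tends to limit scaling.

```python
# Sketch of the per-layer synchronization point in tensor parallelism.
# Assumptions: 2 ranks, CPU-only "gloo" backend as a stand-in for NCCL on GPUs.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each rank holds a partial result of the same layer (row-parallel style).
    partial = torch.full((4,), float(rank + 1))

    # Every rank blocks here until all partial sums have been exchanged;
    # this is the communication/synchronization cost paid once per layer.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {partial.tolist()}")   # identical summed vector on all ranks

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```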