Data Parallelism

Data parallelism replicates the entire model on each GPU (each colored box) to process different input data in parallel.

How it works: Data parallelism involves deploying multiple copies of the model on different GPUs or nodes. Each model replica handles a different portion of the incoming requests or batch independently, much like having multiple instances of the same microservice handling separate users. Every GPU holds the full model, so this technique does not split the model itself; it splits the data. For example, if you have 4 identical GPUs, each with a copy of a 7B parameter model, you can send different input sequences to each GPU concurrently and get roughly 4 times the throughput under ideal conditions. Importantly, data parallelism does not help if the model is too large to fit in a single GPU's memory; it assumes each replica fits entirely on one device.
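
To make this concrete, here is a minimal sketch of data-parallel inference using PyTorch and Hugging Face Transformers. The model id, prompts, and helper function are illustrative placeholders, and the thread-based round-robin dispatch stands in for the request router a real serving system would use.

```python
# Minimal data-parallel inference sketch (illustrative, not production code).
# Assumes a machine with at least one CUDA GPU and a model that fits on a single GPU.
import torch
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder model id
NUM_GPUS = torch.cuda.device_count()
assert NUM_GPUS >= 1, "data parallelism here assumes at least one CUDA device"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# One full copy of the model per GPU: total memory cost is NUM_GPUS x model size.
replicas = [
    AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16).to(f"cuda:{i}")
    for i in range(NUM_GPUS)
]

def generate_on_replica(replica_id: int, prompt: str) -> str:
    """Run one request end-to-end on one replica; replicas never communicate."""
    model = replicas[replica_id]
    inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{replica_id}")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Round-robin incoming prompts across replicas: throughput scales with NUM_GPUS,
# but the latency of any single request is unchanged.
prompts = ["Explain KV caching.", "Summarize RLHF.", "What is speculative decoding?", "Define MoE."]
with ThreadPoolExecutor(max_workers=NUM_GPUS) as pool:
    results = list(pool.map(lambda args: generate_on_replica(*args),
                            [(i % NUM_GPUS, p) for i, p in enumerate(prompts)]))
```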

Interview Questions – Data Parallelism

  1. Throughput vs. latency: If you needed to increase the throughput of an LLM service (serving many queries per second), why is data parallelism a good choice? And conversely, why does it fail to improve the latency of a single query?
  2. Memory overhead: In an inference deployment, what are the memory implications of using data parallelism with a 30B parameter model across 4 GPUs? How might this influence your decision to use data parallelism or not?
  3. Scaling across nodes: When scaling data parallel inference to multiple machines, what network considerations arise (for input/output handling or model updates) even though the model replicas don’t directly communicate during inference?
  4. Combining parallelism: If a model barely fits on one GPU, can you simply use data parallelism to leverage two GPUs for one request? Why not, and which parallelism technique would you consider in that scenario?
  5. Use-case judgment: Imagine a scenario with sporadic, heavy single-user queries (long, expensive prompts) rather than many concurrent users. Would you invest in data parallel replicas or another approach to handle this load? Explain your reasoning.

Tensor (Model) Parallelism

Tensor parallelism (a type of model parallelism) splits each weight matrix or tensor across multiple devices. In this illustration, matrix B is split into two parts (green and blue) and multiplied with A in parallel; partial results are then combined (all-gather) to form the final output C.

How it works: Tensor parallelism (also called intra-layer model parallelism) partitions the computations within each model layer across multiple GPUs. Instead of replicating the whole model, different GPUs hold different slices of the model’s weight tensors. During inference, a single forward pass is distributed: each GPU computes its fragment of the layer, and the partial results are then aggregated to produce the same output as the full layer. For example, if a transformer’s feed-forward layer has a large weight matrix, one can split that matrix into 2 or 4 chunks and place each chunk on a different GPU. Each GPU multiplies its chunk by the input simultaneously, and then the results are summed or concatenated to get the final output. Similarly, for multi-head attention, different heads (or groups of heads) can be assigned to different GPUs to be computed in parallel. In essence, tensor parallelism “slices” the tensor operations along a dimension and uses an all-reduce or gather operation to combine outputs at the end of the layer. This allows a single huge model to be spread across multiple devices within each layer.
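
A minimal sketch of this split-and-gather pattern for one linear layer, using plain PyTorch tensor operations. The layer dimensions are example values, and the explicit .to() copies stand in for the torch.distributed collectives (all-gather / all-reduce) a real tensor-parallel implementation would use.

```python
# Tensor-parallel sketch: split one weight matrix by columns across two devices.
import torch

devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]

hidden, ffn = 1024, 4096            # example layer dimensions
x = torch.randn(1, hidden)          # input activation, replicated on every device

# Full feed-forward weight, then sharded along the output (column) dimension:
# each device holds only half of the columns of W.
W = torch.randn(hidden, ffn)
W_shards = torch.chunk(W, chunks=len(devices), dim=1)

# Each device multiplies the same input by its own shard, in parallel.
partial_outputs = [
    x.to(dev) @ shard.to(dev) for dev, shard in zip(devices, W_shards)
]

# "All-gather" step: concatenating the partial outputs reconstructs the full
# activation, identical (up to floating-point error) to x @ W on one device.
y = torch.cat([p.to(devices[0]) for p in partial_outputs], dim=1)
assert torch.allclose(y, x.to(devices[0]) @ W.to(devices[0]), atol=1e-3)
```

Note the design choice this mirrors: splitting along the output dimension (column-parallel) ends with a gather/concatenation, whereas splitting along the input dimension (row-parallel) would instead produce partial sums that must be combined with an all-reduce.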