1. Model Serving (Local & Global)
- Local Serving
- A single data center (or even a single node) handles all inference requests.
- Lower latency (no cross-region hops), simpler infrastructure.
- Works best if all your users/traffic are geographically close.
- Global Serving
- Multiple data centers or cloud regions, each with replicas of the model.
- Load balancer directs requests based on proximity or resource availability (see the routing sketch after this list).
- Minimizes latency for globally distributed users, but cost and complexity go up.
- Must consider cross-region load balancing, network overhead, and possible replication of large model artifacts.
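As a minimal sketch of proximity-based routing, the snippet below picks the nearest healthy regional replica for a client, with a simple fallback order for outages. All region names, URLs, and the fallback table are hypothetical placeholders, not a real deployment or any particular cloud's API.

```python
# Hypothetical region -> serving endpoint table; names and URLs are placeholders.
REGION_ENDPOINTS = {
    "us-east": "https://us-east.example.com/v1/generate",
    "eu-west": "https://eu-west.example.com/v1/generate",
    "ap-south": "https://ap-south.example.com/v1/generate",
}

# Rough preference order per client region if the nearest replica is unhealthy.
FALLBACKS = {
    "us-east": ["eu-west", "ap-south"],
    "eu-west": ["us-east", "ap-south"],
    "ap-south": ["eu-west", "us-east"],
}

def pick_endpoint(client_region: str, healthy: set) -> str:
    """Return the closest healthy endpoint for a client's region."""
    for region in [client_region] + FALLBACKS.get(client_region, []):
        if region in healthy and region in REGION_ENDPOINTS:
            return REGION_ENDPOINTS[region]
    raise RuntimeError("no healthy region available")

# Example: a user in eu-west while us-east is down.
print(pick_endpoint("eu-west", healthy={"eu-west", "ap-south"}))
```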
- Key Considerations
- Latency: Real-time user experience requires local-ish compute.
- Concurrency: More replicas → handle more requests but higher overall cost (rough sizing sketch after this list).
- Resilience: Multi-region can handle outages, but orchestration gets complex.
- Cost: GPU hours in multiple regions can be pricey; you might scale down in regions with low usage.
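For the concurrency/cost trade-off, a rough back-of-envelope using Little's law (in-flight requests ≈ arrival rate × average latency) can size replicas per region. The traffic numbers below are made up for illustration.

```python
import math

def replicas_needed(peak_rps: float, avg_latency_s: float,
                    max_concurrent_per_replica: int) -> int:
    """Little's law: in-flight requests ~= arrival rate * average latency."""
    in_flight = peak_rps * avg_latency_s
    return max(1, math.ceil(in_flight / max_concurrent_per_replica))

# Made-up numbers: 40 req/s at peak, 2 s average generation time,
# 16 concurrent sequences per GPU replica -> 5 replicas in that region.
print(replicas_needed(peak_rps=40, avg_latency_s=2.0,
                      max_concurrent_per_replica=16))
```

Low-traffic regions fall out of the same formula: if peak in-flight load is below one replica's capacity, a single (or scaled-down) replica is enough.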
2. Continuous Batching
- Core Idea
- During autoregressive decoding, requests that are at the same token step are merged into one forward pass.
- Greatly boosts GPU utilization by combining small workloads into a single big kernel launch.
- Benefits
- Throughput: Fewer, larger GPU kernels instead of many small ones.
- Latency: Small added “batch wait” (a few ms) can be worth the big throughput gain.
- Easy Integration: Tools like vLLM and Hugging Face TGI do this automatically (minimal example below).
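A minimal vLLM sketch of that "it just works" integration: the engine schedules and batches the prompts internally, so no batching code appears in user space. The model name and sampling values are illustrative, and the exact API surface can vary between vLLM versions.

```python
from vllm import LLM, SamplingParams

# Small model chosen only to keep the example light; swap in your own.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Explain continuous batching in one sentence.",
    "Write a haiku about GPUs.",
    "List three uses of a load balancer.",
]

# The engine batches these requests at each decode step on its own.
for out in llm.generate(prompts, params):
    print(out.prompt, "->", out.outputs[0].text.strip())
```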
- Mechanics
- Orchestrator checks which sessions are at decode step n, merges them, runs one forward pass for that token → merges again for n+1, etc. (toy loop after this sub-list).
- Works even if requests arrive asynchronously or have different prompt lengths.
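A toy illustration of that merge-at-each-step mechanic (not how production engines implement it): sessions that still need tokens are gathered into one batched "forward pass" per step, even though they have different lengths and would finish at different times. fake_forward is a stand-in for the model.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Session:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.generated) >= self.max_new_tokens

def fake_forward(batch):
    """Stand-in for one batched model call: one next token per active session."""
    return [f"tok{random.randint(0, 9)}" for _ in batch]

def decode_loop(sessions):
    steps = 0
    while True:
        # Continuous batching: gather every session that still needs a token.
        active = [s for s in sessions if not s.done]
        if not active:
            break
        # One merged forward pass yields the next token for all of them.
        for s, tok in zip(active, fake_forward(active)):
            s.generated.append(tok)
        steps += 1
    print(f"finished in {steps} batched steps")

sessions = [Session("short prompt", max_new_tokens=3),
            Session("longer prompt", max_new_tokens=6)]
decode_loop(sessions)
for s in sessions:
    print(s.prompt, "->", " ".join(s.generated))
```

In a real engine, new requests can join the active set between steps (after their prefill pass) and finished ones leave it, which is what keeps the GPU busy.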
- Trade-Off
- If concurrency is very low (few in-flight requests), the effective batch size may be 1 and the gains shrink.
- Must cap the batching wait so users aren't stalled too long while a batch fills.
3. Parallelism Strategies (Pros, Cons, Uniqueness)
A. Data Parallelism
- Concept: Replicate the entire model on each GPU (or node); see the dispatch sketch at the end of this subsection.
- Pros
- Simple: Each GPU runs a full copy of the model, no cross-GPU sync for forward passes.
- Throughput scales roughly linearly with replica count when there are many concurrent requests.
- Cons
- Doesn’t reduce per-request latency. One large inference still runs on one GPU.
- Memory duplication for all replicas → might be expensive if model is huge.
- Use Case
- High concurrency (many user requests), and the model fits on one GPU.
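A minimal PyTorch sketch of the data-parallel serving pattern under that use case: one full replica per device, with requests dispatched round-robin. The tiny nn.Linear is a stand-in for a model that fits on one GPU; a serving framework would normally handle the dispatch, but the structure is the same.

```python
import itertools
import torch
import torch.nn as nn

def make_model() -> nn.Module:
    # Stand-in "model"; a real deployment loads the same LLM weights per GPU.
    return nn.Linear(16, 16)

# One full replica per available GPU (falls back to CPU so the demo still runs).
devices = ([torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
           or [torch.device("cpu")])
replicas = [make_model().to(d).eval() for d in devices]

# Round-robin dispatch: each request runs entirely on one replica, so throughput
# scales with replica count but per-request latency does not improve.
rr = itertools.cycle(range(len(replicas)))

@torch.no_grad()
def handle_request(x: torch.Tensor) -> torch.Tensor:
    i = next(rr)
    return replicas[i](x.to(devices[i]))

for _ in range(4):  # four independent "requests"
    print(handle_request(torch.randn(1, 16)).shape)
```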
B. Tensor (Model) Parallelism
- Concept: Split each weight matrix (or attention heads) across multiple GPUs.
- Pros
- Enables serving models larger than a single GPU’s memory.
- Can reduce per-request latency when per-layer compute is heavy and the inter-GPU interconnect is fast.