50 Essential PyTorch Coding Interview Questions (LLM Inference & Optimization)

Easy Questions (Fundamentals & Basics)

  1. (Easy): Set up a model for inference properly: put it in evaluation mode and disable gradient tracking. Write a code snippet that calls model.eval() and wraps the forward pass in torch.no_grad() so no gradients are tracked. (Questions 1 and 2 share a sketch after this list.)
  2. (Easy): Perform device management for inference: given a PyTorch model and input tensor, write code to move them to CUDA GPU for faster inference, then transfer the result back to CPU (e.g., using model.to('cuda') and tensor.to('cuda'), then .cpu() on the output).
  3. (Easy): Implement the softmax function from scratch for a given logits tensor and use it to get a prediction. For example, compute probabilities with exponentiation and normalization (without using torch.softmax), then use torch.argmax to find the index of the highest probability. (Sketch after this list.)
  4. (Easy): Define a simple neural network module in PyTorch and run a forward pass. For instance, implement an nn.Module with one nn.Linear layer followed by a ReLU activation. Show how to instantiate this model and feed a sample input through it. (Sketch after this list.)
  5. (Easy): Use an embedding layer to map token IDs to vectors. For example, given a batch of token indices, create an nn.Embedding of appropriate size and show how to retrieve the embedding tensor for the batch (by calling the embedding layer on the input indices). (Questions 5 and 6 share a sketch after this list.)
  6. (Easy): Combine embedding vectors with positional encodings. Suppose you have a tensor of word embeddings and a tensor of positional encodings of the same shape; write code to add them together elementwise to form the final input for a transformer model.
  7. (Easy): Pad sequences for batching: write a function that takes a list of sequences (lists of token IDs of varying lengths) and pads them with a PAD token (e.g., 0) to the same length. Also return an attention mask indicating which positions are real tokens (1) and which are padding (0). (Sketch after this list.)
  8. (Easy): Implement a basic greedy decoding loop for text generation. Starting from an initial prompt (sequence of input IDs), iteratively feed it into the model to get next-token logits, pick the token with the highest probability (argmax), append it to the sequence, and repeat until an end-of-sequence token is produced or a maximum number of new tokens is reached. (Sketch after this list.)
  9. (Easy): Calculate model size: write code to compute the total number of parameters in a given PyTorch model and estimate its memory footprint. (Hint: sum up param.numel() * param.element_size() for each parameter to get total bytes, and convert to MB or GB.) (Sketch after this list.)
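
Hedged solution sketches for the easy questions above follow; all layer sizes, tensor shapes, and helper names are illustrative assumptions, not the only acceptable answers. First, a minimal sketch for questions 1 and 2 (evaluation mode, disabled gradient tracking, and device management), using a small stand-in model:

```python
import torch
import torch.nn as nn

# Stand-in model and input; in the interview these would be given.
model = nn.Linear(16, 4)
inputs = torch.randn(8, 16)

# Move model and data to the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = inputs.to(device)

# eval() disables dropout and uses running batch-norm statistics;
# no_grad() turns off autograd tracking for the forward pass.
model.eval()
with torch.no_grad():
    outputs = model(inputs)

# Bring the result back to the CPU, e.g., for NumPy post-processing.
outputs = outputs.cpu()
```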
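
For question 3, one way to write softmax by hand; subtracting the row maximum is an optional numerical-stability step:

```python
import torch

def softmax_from_scratch(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Softmax via exponentiation and normalization, without torch.softmax."""
    shifted = logits - logits.max(dim=dim, keepdim=True).values  # for stability
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=dim, keepdim=True)

logits = torch.tensor([[2.0, 0.5, -1.0, 3.0]])
probs = softmax_from_scratch(logits)        # rows sum to 1
prediction = torch.argmax(probs, dim=-1)    # index of the most probable class
```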
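
For question 4, a sketch with arbitrary layer sizes:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """One Linear layer followed by ReLU; sizes are placeholders."""
    def __init__(self, in_features: int = 32, out_features: int = 8):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.fc(x))

model = TinyNet()
sample = torch.randn(4, 32)   # batch of 4 inputs
out = model(sample)           # shape (4, 8)
```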
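
For questions 5 and 6, a sketch that uses a learned positional table; precomputed sinusoidal encodings would be added the same way:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64               # illustrative sizes

token_emb = nn.Embedding(vocab_size, d_model)              # token IDs -> vectors
pos_emb = nn.Embedding(max_len, d_model)                   # positions -> vectors

token_ids = torch.randint(0, vocab_size, (2, 10))          # (batch=2, seq_len=10)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # (1, seq_len), broadcasts

# Elementwise sum of word embeddings and positional encodings.
x = token_emb(token_ids) + pos_emb(positions)              # (2, 10, d_model)
```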
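
For question 7, a sketch in which the PAD token is assumed to be 0:

```python
import torch

def pad_batch(sequences, pad_id=0):
    """Pad variable-length token-ID lists to a common length and build a mask."""
    max_len = max(len(seq) for seq in sequences)
    batch = torch.full((len(sequences), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros(len(sequences), max_len, dtype=torch.long)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = torch.tensor(seq, dtype=torch.long)
        mask[i, :len(seq)] = 1                    # 1 = real token, 0 = padding
    return batch, mask

input_ids, attention_mask = pad_batch([[5, 6, 7], [8, 9], [10]])
```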
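
For question 8, a sketch that assumes the model maps input_ids of shape (1, seq_len) directly to logits of shape (1, seq_len, vocab_size); the max_new_tokens cap guards against a model that never emits EOS:

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, eos_id, max_new_tokens=50):
    """input_ids: (1, prompt_len). Assumes model(ids) -> logits (1, len, vocab)."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                     # (1, cur_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1)  # greedy pick, shape (1,)
        input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=1)
        if next_token.item() == eos_id:               # stop at end-of-sequence
            break
    return input_ids
```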
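
For question 9 (buffers such as batch-norm running statistics could be counted the same way via model.buffers()):

```python
def model_size(model):
    """Return (parameter count, approximate memory footprint in MB)."""
    n_params = sum(p.numel() for p in model.parameters())
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return n_params, n_bytes / (1024 ** 2)
```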

Intermediate Questions (Moderate Difficulty)

  1. (Medium): Implement scaled dot-product attention. Given query, key, and value tensors (Q, K, V) of shape (batch, seq_len, dim), compute the attention output = softmax$(QK^T / \sqrt{d})$ · V, where d is the key dimension. Include support for an attention mask (e.g., ignore masked positions by adding -inf, or a very large negative value, to their logits before the softmax). (Sketch after this list.)
  2. (Medium): Implement the Transformer's feed-forward network block. Given an input tensor of shape (batch, seq_len, dim), pass it through a two-layer MLP: first Linear(dim → hidden_dim), apply an activation (e.g., GELU), then Linear(hidden_dim → dim). Show this in PyTorch code (you can assume some hidden_dim value). (Sketch after this list.)
  3. (Medium): Implement top-k sampling for one step of language model decoding. Given a tensor of logits for the next token, filter it to the top k highest values (use torch.topk), then sample a token from those top-k probabilities (e.g., with torch.multinomial). The code should output an index for the sampled token. (Questions 3 and 4 share a sketch after this list.)
  4. (Medium): Implement nucleus (top-p) sampling for one decoding step. Given logits and a probability threshold p, sort the token probabilities, compute their cumulative sum, and select the smallest set of tokens whose cumulative probability ≥ p. Then sample the next token from that set. Provide code to perform this selection and sampling.
  5. (Medium): Add caching to an autoregressive transformer decoding loop. Modify a naive generation function so that it passes a “past key-values” cache to the model. Show how you would store the K and V from each timestep (e.g., in lists or a preallocated tensor) and reuse them in subsequent model calls to avoid recomputing attention on previous tokens. (Sketch after this list.)
  6. (Medium): Batch by sequence length for efficiency. Given a list of input sequences of different lengths, write code to sort them by length, batch those of similar lengths together, pad within each batch, and then run the model on each batch. (This minimizes padding and idle compute, improving throughput on variable-length inputs.) (Sketch after this list.)
  7. (Medium): Implement micro-batching for inference. If a batch of N inputs is too large to process at once on the GPU, show how to split it into smaller sub-batches, run the model on each sub-batch sequentially (accumulating outputs), and then concatenate the results. Ensure the final outputs preserve the original input order. (Sketch after this list.)
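
Hedged sketches for the intermediate questions above; interfaces, shapes, and default values are assumptions rather than fixed answers. For question 1, one way to write scaled dot-product attention with an optional boolean mask (True = attend, False = masked out):

```python
import math
import torch
import torch.nn.functional as F

def sdp_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, dim). mask: bool, broadcastable to the scores."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)      # (batch, seq_q, seq_k)
    if mask is not None:
        # Masked positions get -inf so they receive ~zero weight after softmax.
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                   # (batch, seq_q, dim)
```

PyTorch 2.x also ships a fused torch.nn.functional.scaled_dot_product_attention; writing the computation out by hand is what the question asks for.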
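
For question 2, a compact feed-forward block; the 4x expansion is a common default, not a requirement:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise MLP: dim -> hidden_dim -> dim with a GELU in between."""
    def __init__(self, dim, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim or 4 * dim
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):          # x: (batch, seq_len, dim)
        return self.net(x)
```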
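
For questions 3 and 4, single-step top-k and nucleus sampling over a 1-D logits vector (a batch dimension would be handled analogously):

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits, k=50):
    """Keep the k largest logits and sample among them."""
    top_vals, top_idx = torch.topk(logits, k)
    probs = F.softmax(top_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return top_idx[choice]                        # index into the full vocabulary

def sample_top_p(logits, p=0.9):
    """Sample from the smallest set of tokens whose cumulative probability >= p."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1     # include the token crossing p
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()   # renormalize
    choice = torch.multinomial(kept, num_samples=1)
    return sorted_idx[choice]
```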
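
For question 5, a sketch that assumes a Hugging Face-style causal LM interface (the model accepts past_key_values/use_cache and returns .logits and .past_key_values); with a hand-rolled model you would instead store each layer's new K and V tensors yourself and pass them back into the attention blocks:

```python
import torch

@torch.no_grad()
def generate_with_cache(model, input_ids, eos_id, max_new_tokens=50):
    """input_ids: (1, prompt_len). Reuses cached K/V instead of re-encoding the prefix."""
    generated = input_ids
    next_input = input_ids          # first call processes the whole prompt
    past_key_values = None
    for _ in range(max_new_tokens):
        out = model(input_ids=next_input,
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values            # updated cache
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == eos_id:
            break
        next_input = next_token     # later calls feed only the newest token
    return generated
```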
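
For question 6, a sketch that assumes the model takes (input_ids, attention_mask) and returns one output row per input; results are written back in the original order:

```python
import torch

def run_in_length_buckets(model, sequences, batch_size=8, pad_id=0):
    """Sort by length, batch neighbours, pad within each batch, keep input order."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    results = [None] * len(sequences)
    with torch.no_grad():
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            batch_seqs = [sequences[i] for i in idx]
            max_len = max(len(s) for s in batch_seqs)
            batch = torch.full((len(idx), max_len), pad_id, dtype=torch.long)
            mask = torch.zeros(len(idx), max_len, dtype=torch.long)
            for row, seq in enumerate(batch_seqs):
                batch[row, :len(seq)] = torch.tensor(seq, dtype=torch.long)
                mask[row, :len(seq)] = 1
            out = model(batch, mask)              # assumed model signature
            for row, original_i in enumerate(idx):
                results[original_i] = out[row]    # restore original order
    return results
```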
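
For question 7, a sketch built on torch.split, which preserves row order so the concatenated outputs line up with the original inputs:

```python
import torch

@torch.no_grad()
def micro_batched_forward(model, inputs, micro_batch_size=16):
    """Split a large batch along dim 0, run sub-batches sequentially, then concat."""
    outputs = []
    for chunk in torch.split(inputs, micro_batch_size, dim=0):
        outputs.append(model(chunk))
    return torch.cat(outputs, dim=0)      # same row order as the input
```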