Below are 25 basic (yet thorough) coding-focused questions that test fundamental PyTorch skills relevant to building and running LLMs. They range from creating and manipulating tensors, to implementing small transformer components, to applying sampling methods. Each question should prompt you to write working code (in a live environment or whiteboard style), ensuring you can demonstrate good coding practices in PyTorch for LLM use cases.
Question: Create a tensor of shape (3, 4) of random floats.
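A minimal sketch of one acceptable answer, assuming the intended shape is (3, 4):

```python
import torch

# A (3, 4) tensor of floats drawn uniformly from [0, 1).
x = torch.rand(3, 4)
print(x.shape, x.dtype)  # torch.Size([3, 4]) torch.float32

# torch.randn gives standard-normal floats instead, if that is what is asked.
y = torch.randn(3, 4)
```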
Question: Create an `nn.Embedding` for a small vocabulary. Then embed a batch of token IDs of shape `[batch_size=4, seq_len=5]`.
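A possible sketch; the vocabulary size and embedding dimension are chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

vocab_size = 100      # assumed vocabulary size
embedding_dim = 32    # assumed embedding dimension

emb = nn.Embedding(vocab_size, embedding_dim)

# A batch of token IDs with batch_size=4, seq_len=5.
token_ids = torch.randint(0, vocab_size, (4, 5))

out = emb(token_ids)
print(out.shape)  # torch.Size([4, 5, 32]) -> (batch_size, seq_len, embedding_dim)
```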
Question: Write a small `nn.Module` that includes: an `nn.Embedding` layer, and an `nn.Linear` layer mapping from embedding dimension to a “hidden” dimension of your choice.
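One way this might look; the vocabulary size, embedding dimension, and hidden dimension are placeholder choices:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Embedding followed by a linear projection to a hidden dimension."""
    def __init__(self, vocab_size: int, embedding_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.proj = nn.Linear(embedding_dim, hidden_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)   # (batch, seq_len, embedding_dim)
        return self.proj(x)         # (batch, seq_len, hidden_dim)

model = TinyModel(vocab_size=100, embedding_dim=32, hidden_dim=64)
out = model(torch.randint(0, 100, (4, 5)))
print(out.shape)  # torch.Size([4, 5, 64])
```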
Question: Write a function that takes embeddings of shape `(batch, seq_len, embedding_dim)` and adds sinusoidal positional encodings of shape `(seq_len, embedding_dim)` to them. Show how you would compute the encodings (using `sin` and `cos`).
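A sketch of the standard sinusoidal encoding; the 10000 base and the even/odd sin/cos interleaving follow the usual Transformer convention, and the concrete shapes are just for illustration:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Sin/cos positional encodings of shape (seq_len, dim); dim assumed even."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))                          # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even indices -> sin
    pe[:, 1::2] = torch.cos(position * div_term)   # odd indices  -> cos
    return pe

embeddings = torch.randn(4, 5, 32)                 # (batch, seq_len, embedding_dim)
pe = sinusoidal_positional_encoding(5, 32)         # (seq_len, embedding_dim)
with_pos = embeddings + pe.unsqueeze(0)            # broadcast over the batch dimension
```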
Question: Suppose you have a `model.forward(input_ids)` that returns logits over your vocabulary. Write a greedy decoding loop that takes the `argmax` of the logits at each step and stops at an `<EOS>` token or a maximum length.
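A sketch assuming a batch size of 1 and a model whose forward pass returns logits of shape `(1, seq_len, vocab_size)`; `eos_id` and `max_len` are illustrative parameter names:

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids: torch.Tensor, eos_id: int, max_len: int = 50) -> torch.Tensor:
    """Greedy decoding: pick the argmax token at every step until <EOS> or max_len."""
    for _ in range(max_len):
        logits = model(input_ids)                                    # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)      # (1, 1)
        input_ids = torch.cat([input_ids, next_id], dim=1)           # append the chosen token
        if next_id.item() == eos_id:
            break
    return input_ids
```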
Question: Implement top-k sampling: use `torch.topk` on the logits to keep only the top-k tokens, then sample from them with `torch.multinomial`.
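A possible implementation operating on logits of shape `(batch, vocab_size)`; the `temperature` argument is an optional extra, not required by the question:

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> torch.Tensor:
    """Top-k sampling over logits of shape (batch, vocab_size); returns (batch, 1) token IDs."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)   # keep only the k largest logits
    probs = torch.softmax(topk_vals, dim=-1)               # renormalize over the kept tokens
    sampled = torch.multinomial(probs, num_samples=1)      # index into the top-k set
    return topk_idx.gather(-1, sampled)                    # map back to vocabulary IDs
```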
Question: Implement a simple scaled dot-product attention from scratch. Given Q, K, V of shape `(batch, seq_len, dim)`, compute:

$$ \text{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{Q K^\top}{\sqrt{d}}\Big) V $$
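A direct translation of the formula; the optional `mask` argument anticipates the masking question below:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V, for (batch, seq_len, dim) inputs."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # (batch, seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # block disallowed positions
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                     # (batch, seq_len, dim)

q = k = v = torch.randn(2, 5, 16)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 16])
```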
Question: Implement `LayerNorm` manually (i.e., do not use `nn.LayerNorm`). Show how you’d normalize over the last dimension, scale by a learnable `gamma`, and add a learnable `beta`.
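A sketch of a hand-written LayerNorm; the `eps` value mirrors the common default but is an assumption here:

```python
import torch
import torch.nn as nn

class ManualLayerNorm(nn.Module):
    """LayerNorm over the last dimension, written out by hand."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

x = torch.randn(4, 5, 32)
print(ManualLayerNorm(32)(x).shape)  # torch.Size([4, 5, 32])
```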
Question: Apply a boolean attention mask to attention scores of shape `(batch, seq_len, seq_len)`. Any position where the mask is `False` should be assigned a very negative value (like `-1e9`) before the softmax.
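A sketch using a causal (lower-triangular) boolean mask as the example mask; the concrete shapes are arbitrary:

```python
import torch

seq_len = 5
# Lower-triangular boolean mask: True where attention is allowed.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(2, seq_len, seq_len)      # (batch, seq_len, seq_len) raw scores
masked = scores.masked_fill(~causal, -1e9)     # very negative where the mask is False
weights = torch.softmax(masked, dim=-1)        # rows sum to 1 over the visible positions
```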
Question: Implement a simple KV cache. Suppose each decoding step returns `(logits, new_k, new_v)`, and you want to store `(k, v)` from all previous timesteps to avoid recomputing them. Demonstrate how you’d append `new_k, new_v` at each time step and pass the cached `(k, v)` to the attention mechanism.
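A rough sketch only: the model interface `model(token, k=..., v=...)` returning `(logits, new_k, new_v)`, with per-step keys/values of shape `(batch, 1, dim)`, is assumed for illustration and will differ from any real implementation:

```python
import torch

def decode_with_kv_cache(model, input_ids: torch.Tensor, steps: int):
    """Greedy decoding with a per-step KV cache (assumed model interface, see comment above)."""
    k_cache, v_cache = None, None
    token = input_ids
    for _ in range(steps):
        logits, new_k, new_v = model(token, k=k_cache, v=v_cache)
        # Append the new keys/values along the sequence dimension instead of recomputing them.
        k_cache = new_k if k_cache is None else torch.cat([k_cache, new_k], dim=1)
        v_cache = new_v if v_cache is None else torch.cat([v_cache, new_v], dim=1)
        # Feed only the newest token on the next step; past context lives in the cache.
        token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    return token, (k_cache, v_cache)
```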
Question: Write a `collate_fn(batch)` that pads variable-length sequences into a single tensor of shape `(batch_size, max_len)`.
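A possible `collate_fn` built on `torch.nn.utils.rnn.pad_sequence`; the `PAD_ID` of 0 and the returned attention mask are assumptions beyond what the question states:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed padding token ID

def collate_fn(batch):
    """Pads a list of variable-length 1-D LongTensors into (batch_size, max_len)."""
    padded = pad_sequence(batch, batch_first=True, padding_value=PAD_ID)
    attention_mask = (padded != PAD_ID).long()   # 1 for real tokens, 0 for padding
    return padded, attention_mask

batch = [torch.tensor([5, 6, 7]), torch.tensor([8, 9]), torch.tensor([1, 2, 3, 4])]
ids, mask = collate_fn(batch)
print(ids.shape)  # torch.Size([3, 4]) -> (batch_size, max_len)
```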