Below are 25 basic (yet thorough) coding-focused questions that test fundamental PyTorch skills relevant to building and running LLMs. They range from creating and manipulating tensors, to implementing small transformer components, to applying sampling methods. Each question asks you to write working code (in a live environment or whiteboard style) and to demonstrate good PyTorch coding practices for LLM use cases.
Question: Create a tensor of shape [3,4] of random floats, then reshape it to shape [2,6].
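A minimal sketch of one possible answer (torch.rand and reshape are one of several equivalent choices):

```python
import torch

# Create a [3, 4] tensor of random floats (uniform on [0, 1)).
x = torch.rand(3, 4)

# Reshape to [2, 6]; the 12 elements are unchanged, only the view of them.
y = x.reshape(2, 6)

print(x.shape)  # torch.Size([3, 4])
print(y.shape)  # torch.Size([2, 6])
```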
Question: Create an nn.Embedding for a small vocabulary. Then:

- pass a batch of token IDs (shape [batch_size=4, seq_len=5]) through it.
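One way to sketch this; the vocabulary size and embedding dimension are assumptions chosen for illustration:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 100, 16   # assumed sizes for illustration
embedding = nn.Embedding(vocab_size, embedding_dim)

# A batch of token IDs with shape [batch_size=4, seq_len=5].
input_ids = torch.randint(0, vocab_size, (4, 5))

embedded = embedding(input_ids)
print(embedded.shape)  # torch.Size([4, 5, 16])
```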
Question: Define an nn.Module that includes:

- an nn.Embedding layer, and
- an nn.Linear layer mapping from the embedding dimension to a “hidden” dimension of your choice.
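A minimal sketch; the class name and the dimensions are placeholders, not given by the question:

```python
import torch
import torch.nn as nn

class EmbedAndProject(nn.Module):  # hypothetical name
    def __init__(self, vocab_size: int, embedding_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.proj = nn.Linear(embedding_dim, hidden_dim)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) -> (batch, seq_len, embedding_dim) -> (batch, seq_len, hidden_dim)
        return self.proj(self.embedding(input_ids))

model = EmbedAndProject(vocab_size=100, embedding_dim=16, hidden_dim=32)
print(model(torch.randint(0, 100, (4, 5))).shape)  # torch.Size([4, 5, 32])
```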
Question: Write a function that takes embeddings of shape (batch, seq_len, embedding_dim) and adds sinusoidal positional encodings of shape (seq_len, embedding_dim) to them. Show how you would:

- compute the encodings (using sin and cos), and
- add them to the embeddings, broadcasting over the batch dimension.
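One possible sketch of the standard sin/cos formulation (the 10000 base follows the original Transformer paper; an even embedding_dim is assumed):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, embedding_dim: int) -> torch.Tensor:
    """Return a (seq_len, embedding_dim) tensor of sinusoidal positional encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, embedding_dim, 2, dtype=torch.float32)
        * (-math.log(10000.0) / embedding_dim)
    )                                                                     # (embedding_dim/2,)
    pe = torch.zeros(seq_len, embedding_dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use sin
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use cos
    return pe

def add_positional_encoding(embeddings: torch.Tensor) -> torch.Tensor:
    # embeddings: (batch, seq_len, embedding_dim); the (seq_len, embedding_dim)
    # encoding broadcasts over the batch dimension when added.
    _, seq_len, dim = embeddings.shape
    return embeddings + sinusoidal_positional_encoding(seq_len, dim).to(embeddings.device)

print(add_positional_encoding(torch.randn(4, 5, 16)).shape)  # torch.Size([4, 5, 16])
```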
Question: Assume model.forward(input_ids) returns logits over your vocabulary. Write a greedy decoding loop that:

- takes the argmax of the logits at each step, and
- stops at an <EOS> token or a maximum length.

Question: Implement top-k sampling:

- use torch.topk on the logits to keep only the top-k tokens, and
- sample the next token from them with torch.multinomial.
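A sketch of both, assuming model(input_ids) returns logits of shape (batch, seq_len, vocab_size); eos_token_id, max_new_tokens, and k are placeholder parameters:

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, eos_token_id, max_new_tokens=50):
    # input_ids: (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                    # (1, seq_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy: argmax at each step
        input_ids = torch.cat([input_ids, next_token], dim=1)
        if next_token.item() == eos_token_id:                        # stop on <EOS>
            break
    return input_ids

def top_k_sample(logits, k=50):
    # logits: (vocab_size,) for a single step.
    topk_vals, topk_idx = torch.topk(logits, k)        # keep only the top-k tokens
    probs = torch.softmax(topk_vals, dim=-1)           # renormalize over the kept tokens
    choice = torch.multinomial(probs, num_samples=1)   # sample one of them
    return topk_idx[choice]
```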
Question: Implement a simple scaled dot-product attention from scratch. Given Q, K, V of shape (batch, seq_len, dim), compute:

$$ \text{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{Q K^\top}{\sqrt{d}}\Big) V $$
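A minimal from-scratch version (no masking or dropout, matching the formula above):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)           # attention weights over keys
    return weights @ v                                # (batch, seq_len, dim)

q = k = v = torch.randn(2, 5, 16)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 16])
```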
Question: Implement LayerNorm manually (i.e., do not use nn.LayerNorm). Show how you’d:

- compute the mean and variance over the last dimension and normalize, then
- multiply by a learnable gamma, and add a learnable beta.
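One way to write it, mirroring nn.LayerNorm's behavior over the last dimension (the class name and the eps default are assumptions):

```python
import torch
import torch.nn as nn

class ManualLayerNorm(nn.Module):  # hypothetical name
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)   # normalize over the last dim
        return self.gamma * x_hat + self.beta

print(ManualLayerNorm(16)(torch.randn(4, 5, 16)).shape)  # torch.Size([4, 5, 16])
```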
Question: Show how you’d apply a boolean attention mask (shape (batch, seq_len, seq_len)) to the attention scores. Any position where the mask is False should be assigned a very negative value (like -1e9) before the softmax.
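A sketch using a boolean causal mask (the causal pattern is just one example of such a mask):

```python
import torch

def apply_mask(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # scores, mask: (batch, seq_len, seq_len); mask is boolean, True = attend, False = block.
    # False positions get a very negative value so softmax pushes them to ~0.
    return scores.masked_fill(~mask, -1e9)

batch, seq_len = 2, 5
scores = torch.randn(batch, seq_len, seq_len)
# Causal mask: position i may only attend to positions <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal = causal.unsqueeze(0).expand(batch, -1, -1)
weights = torch.softmax(apply_mask(scores, causal), dim=-1)
print(weights.shape)  # torch.Size([2, 5, 5])
```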
Question: Your model returns (logits, new_k, new_v), and you want to store (k, v) from all previous timesteps to avoid recomputing them. Demonstrate how you’d:

- append new_k, new_v to the cache at each time step, and
- pass the cached (k, v) to the attention mechanism.
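A toy decode loop showing the caching pattern; the past_k / past_v keyword arguments and the assumption that dim=1 is the sequence dimension of the cache are illustrative, not a fixed API:

```python
import torch

@torch.no_grad()
def decode_with_kv_cache(model, input_ids, max_new_tokens=20):
    past_k, past_v = None, None
    tokens = input_ids            # feed the full prompt on the first call
    for _ in range(max_new_tokens):
        logits, new_k, new_v = model(tokens, past_k=past_k, past_v=past_v)
        # Append this step's keys/values to the running cache.
        past_k = new_k if past_k is None else torch.cat([past_k, new_k], dim=1)
        past_v = new_v if past_v is None else torch.cat([past_v, new_v], dim=1)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=1)
        tokens = next_token       # later calls feed only the newly generated token
    return input_ids
```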
Question: Write a collate_fn(batch) that:

- pads variable-length sequences into a tensor of shape (batch_size, max_len).
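A minimal version built on pad_sequence; the padding id is an assumption:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed padding token id

def collate_fn(batch):
    # batch: a list of 1-D LongTensors of varying length.
    # Returns a (batch_size, max_len) tensor padded with PAD_ID.
    return pad_sequence(batch, batch_first=True, padding_value=PAD_ID)

batch = [torch.tensor([5, 2, 9]), torch.tensor([7, 1]), torch.tensor([3, 3, 3, 3])]
print(collate_fn(batch).shape)  # torch.Size([3, 4])
```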