Below are 25 basic (yet thorough) coding-focused questions that test fundamental PyTorch skills relevant to building and running LLMs. They range from creating and manipulating tensors, to implementing small transformer components, to applying sampling methods. Each question should prompt you to write working code (in a live environment or whiteboard style), ensuring you can demonstrate good coding practices in PyTorch for LLM use cases.
Question: Create a tensor of shape (3, 4) of random floats.
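A minimal sketch of one acceptable answer, assuming the intended shape is (3, 4):

```python
import torch

# A (3, 4) tensor of floats drawn uniformly from [0, 1).
x = torch.rand(3, 4)
print(x.shape, x.dtype)  # torch.Size([3, 4]) torch.float32

# torch.randn gives standard-normal floats instead, if that is what is asked.
y = torch.randn(3, 4)
```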
Question: Create an `nn.Embedding` for a small vocabulary. Then embed a batch of token IDs of shape `[batch_size=4, seq_len=5]`.
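A possible sketch; the vocabulary size and embedding dimension are chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

vocab_size = 100      # assumed vocabulary size
embedding_dim = 32    # assumed embedding dimension

emb = nn.Embedding(vocab_size, embedding_dim)

# A batch of token IDs with batch_size=4, seq_len=5.
token_ids = torch.randint(0, vocab_size, (4, 5))

out = emb(token_ids)
print(out.shape)  # torch.Size([4, 5, 32]) -> (batch_size, seq_len, embedding_dim)
```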
Question: Write a small `nn.Module` that includes: an `nn.Embedding` layer, and an `nn.Linear` layer mapping from embedding dimension to a “hidden” dimension of your choice.
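One way this might look; the vocabulary size, embedding dimension, and hidden dimension are placeholder choices:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Embedding followed by a linear projection to a hidden dimension."""
    def __init__(self, vocab_size: int, embedding_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.proj = nn.Linear(embedding_dim, hidden_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)   # (batch, seq_len, embedding_dim)
        return self.proj(x)         # (batch, seq_len, hidden_dim)

model = TinyModel(vocab_size=100, embedding_dim=32, hidden_dim=64)
out = model(torch.randint(0, 100, (4, 5)))
print(out.shape)  # torch.Size([4, 5, 64])
```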
Question: Write a function that takes embeddings of shape `(batch, seq_len, embedding_dim)` and adds sinusoidal positional encodings of shape `(seq_len, embedding_dim)` to them. Show how you would compute the encodings (using `sin` and `cos`).
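A sketch of the standard sinusoidal encoding; the 10000 base and the even/odd sin/cos interleaving follow the usual Transformer convention, and the concrete shapes are just for illustration:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Sin/cos positional encodings of shape (seq_len, dim); dim assumed even."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))                          # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even indices -> sin
    pe[:, 1::2] = torch.cos(position * div_term)   # odd indices  -> cos
    return pe

embeddings = torch.randn(4, 5, 32)                 # (batch, seq_len, embedding_dim)
pe = sinusoidal_positional_encoding(5, 32)         # (seq_len, embedding_dim)
with_pos = embeddings + pe.unsqueeze(0)            # broadcast over the batch dimension
```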
Question: Suppose you have a `model.forward(input_ids)` that returns logits over your vocabulary. Write a greedy decoding loop that takes the `argmax` of the logits at each step and stops at an `<EOS>` token or a maximum length.
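A sketch assuming a batch size of 1 and a model whose forward pass returns logits of shape `(1, seq_len, vocab_size)`; `eos_id` and `max_len` are illustrative parameter names:

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids: torch.Tensor, eos_id: int, max_len: int = 50) -> torch.Tensor:
    """Greedy decoding: pick the argmax token at every step until <EOS> or max_len."""
    for _ in range(max_len):
        logits = model(input_ids)                                    # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)      # (1, 1)
        input_ids = torch.cat([input_ids, next_id], dim=1)           # append the chosen token
        if next_id.item() == eos_id:
            break
    return input_ids
```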
Question: Implement top-k sampling: use `torch.topk` on the logits to keep only the top-k tokens, then sample from them with `torch.multinomial`.
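A possible implementation operating on logits of shape `(batch, vocab_size)`; the `temperature` argument is an optional extra, not required by the question:

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> torch.Tensor:
    """Top-k sampling over logits of shape (batch, vocab_size); returns (batch, 1) token IDs."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)   # keep only the k largest logits
    probs = torch.softmax(topk_vals, dim=-1)               # renormalize over the kept tokens
    sampled = torch.multinomial(probs, num_samples=1)      # index into the top-k set
    return topk_idx.gather(-1, sampled)                    # map back to vocabulary IDs
```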
Question: Implement a simple scaled dot-product attention from scratch. Given Q, K, V of shape `(batch, seq_len, dim)`, compute:

$$ \text{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{Q K^\top}{\sqrt{d}}\Big) V $$
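A direct translation of the formula; the optional `mask` argument anticipates the masking question below:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V, for (batch, seq_len, dim) inputs."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # (batch, seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # block disallowed positions
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                     # (batch, seq_len, dim)

q = k = v = torch.randn(2, 5, 16)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 16])
```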
Question: Implement `LayerNorm` manually (i.e., do not use `nn.LayerNorm`). Show how you’d normalize over the last dimension, scale by a learnable `gamma`, and add a learnable `beta`.
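A sketch of a hand-written LayerNorm; the `eps` value mirrors the common default but is an assumption here:

```python
import torch
import torch.nn as nn

class ManualLayerNorm(nn.Module):
    """LayerNorm over the last dimension, written out by hand."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

x = torch.randn(4, 5, 32)
print(ManualLayerNorm(32)(x).shape)  # torch.Size([4, 5, 32])
```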
Question: Apply a boolean attention mask to attention scores of shape `(batch, seq_len, seq_len)`. Any position where the mask is `False` should be assigned a very negative value (like `-1e9`) before the softmax.
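A sketch using a causal (lower-triangular) boolean mask as the example mask; the concrete shapes are arbitrary:

```python
import torch

seq_len = 5
# Lower-triangular boolean mask: True where attention is allowed.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(2, seq_len, seq_len)      # (batch, seq_len, seq_len) raw scores
masked = scores.masked_fill(~causal, -1e9)     # very negative where the mask is False
weights = torch.softmax(masked, dim=-1)        # rows sum to 1 over the visible positions
```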
Question: Implement a simple KV cache. Suppose each decoding step returns `(logits, new_k, new_v)`, and you want to store `(k, v)` from all previous timesteps to avoid recomputing them. Demonstrate how you’d append `new_k, new_v` at each time step and pass the cached `(k, v)` to the attention mechanism.
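A rough sketch only: the model interface `model(token, k=..., v=...)` returning `(logits, new_k, new_v)`, with per-step keys/values of shape `(batch, 1, dim)`, is assumed for illustration and will differ from any real implementation:

```python
import torch

def decode_with_kv_cache(model, input_ids: torch.Tensor, steps: int):
    """Greedy decoding with a per-step KV cache (assumed model interface, see comment above)."""
    k_cache, v_cache = None, None
    token = input_ids
    for _ in range(steps):
        logits, new_k, new_v = model(token, k=k_cache, v=v_cache)
        # Append the new keys/values along the sequence dimension instead of recomputing them.
        k_cache = new_k if k_cache is None else torch.cat([k_cache, new_k], dim=1)
        v_cache = new_v if v_cache is None else torch.cat([v_cache, new_v], dim=1)
        # Feed only the newest token on the next step; past context lives in the cache.
        token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    return token, (k_cache, v_cache)
```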
Question: Write a `collate_fn(batch)` that pads variable-length sequences into a single tensor of shape `(batch_size, max_len)`.
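A possible `collate_fn` built on `torch.nn.utils.rnn.pad_sequence`; the `PAD_ID` of 0 and the returned attention mask are assumptions beyond what the question states:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed padding token ID

def collate_fn(batch):
    """Pads a list of variable-length 1-D LongTensors into (batch_size, max_len)."""
    padded = pad_sequence(batch, batch_first=True, padding_value=PAD_ID)
    attention_mask = (padded != PAD_ID).long()   # 1 for real tokens, 0 for padding
    return padded, attention_mask

batch = [torch.tensor([5, 6, 7]), torch.tensor([8, 9]), torch.tensor([1, 2, 3, 4])]
ids, mask = collate_fn(batch)
print(ids.shape)  # torch.Size([3, 4]) -> (batch_size, max_len)
```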