Building a Primitive GPT-2: Layers & Dimensions
I recently took on the challenge of implementing a simplified GPT-2 Text Generation function on Deep-ML. This exercise was a fantastic way to bring together separate concepts—embeddings, layer normalization, and autoregression—into one cohesive system.
🧩 The Puzzle Pieces
Building a generator like this forces you to confront the reality of the architecture:
- Embeddings: Combining word identity with position identity (`wte + wpe`).
- Layer Norm: Stabilizing the features before projection.
- The Loop: Autoregressively predicting the next token and feeding it back.
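The embedding step above can be sketched in a few lines of NumPy. This is a minimal sketch with made-up sizes and random weights (real GPT-2 small uses a 50,257-token vocabulary and `d_model = 768`); the point is just that the token row and the position row are added elementwise:

```python
import numpy as np

# Hypothetical sizes for illustration only.
vocab_size, context_len, d_model = 1000, 16, 64
rng = np.random.default_rng(0)

wte = rng.normal(size=(vocab_size, d_model))   # token embedding table
wpe = rng.normal(size=(context_len, d_model))  # position embedding table

token_ids = np.array([5, 42, 7])               # (seq_len,)
seq_len = token_ids.shape[0]

# Each row combines "which word" (wte lookup) with "where in the
# sequence" (wpe row for that position).
x = wte[token_ids] + wpe[:seq_len]             # (seq_len, d_model)
print(x.shape)                                 # (3, 64)
```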
💡 Lightbulb Moments
Dimensions are Everything
I used to get confused by the endless reshaping in PyTorch/NumPy. But during this project, I realized that noting down the dimensions (like (seq_len, d_model)) makes everything fall into place. It turns abstract math into a shape-matching puzzle.
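As a tiny illustration of that shape-matching habit (sizes are arbitrary, not GPT-2's real ones), annotating every array makes the final projection to vocabulary logits read like a proof:

```python
import numpy as np

seq_len, d_model, vocab_size = 3, 64, 1000
rng = np.random.default_rng(1)

x = rng.normal(size=(seq_len, d_model))         # (seq_len, d_model)
w_out = rng.normal(size=(d_model, vocab_size))  # (d_model, vocab_size)

# (seq_len, d_model) @ (d_model, vocab_size) -> (seq_len, vocab_size)
logits = x @ w_out
print(logits.shape)                             # (3, 1000)
```

Once the inner dimensions line up, the math is forced to be correct; if they don't, NumPy tells you immediately.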
Layer Norm vs. Batch Norm
For a while I was stuck on which axis to normalize. Then it clicked: Layer Norm gives every feature a fair shot. Unlike Batch Norm, which computes statistics across the batch, Layer Norm standardizes the features within a single example, ensuring that no single feature dominates the gradients.
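The axis difference is easiest to see side by side. A minimal sketch (plain NumPy, no learnable scale/shift, small epsilon for stability):

```python
import numpy as np

# Rows = examples (or sequence positions), columns = features.
x = np.random.default_rng(2).normal(size=(4, 8))

# Layer Norm: statistics per example, computed ACROSS features (axis=-1).
ln_mean = x.mean(axis=-1, keepdims=True)
ln_std = x.std(axis=-1, keepdims=True)
x_ln = (x - ln_mean) / (ln_std + 1e-5)

# Batch Norm: statistics per feature, computed ACROSS the batch (axis=0).
bn_mean = x.mean(axis=0, keepdims=True)
bn_std = x.std(axis=0, keepdims=True)
x_bn = (x - bn_mean) / (bn_std + 1e-5)

# Each ROW of x_ln is now roughly zero-mean; each COLUMN of x_bn is.
print(np.allclose(x_ln.mean(axis=-1), 0, atol=1e-4))  # True
print(np.allclose(x_bn.mean(axis=0), 0, atol=1e-4))   # True
```

Layer Norm's independence from the batch is also why it works at batch size 1 during generation, which Batch Norm cannot do.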
The Cost of “Memory-less” Generation
Writing the generation loop manually really highlights the inefficiency. To generate the 5th token, we re-process the first 4. This redundancy is exactly why KV Caching is a critical optimization in modern LLM serving!
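The quadratic blow-up is easy to count. Here `toy_forward` is a hypothetical stand-in for a full transformer pass (its logits are meaningless); what matters is that each step re-runs the model over the entire prefix:

```python
import numpy as np

def toy_forward(token_ids):
    """Stand-in for a full forward pass; cost scales with len(token_ids)."""
    # Hypothetical dummy logits over a 7-token vocabulary.
    return np.array([(sum(token_ids) * 31 + i) % 7 for i in range(7)])

tokens = [1, 2, 3]
token_passes = 0
for _ in range(4):                   # generate 4 new tokens
    logits = toy_forward(tokens)     # re-processes the WHOLE prefix
    token_passes += len(tokens)      # work this step ~ prefix length
    tokens.append(int(np.argmax(logits)))

# Without a KV cache: 3 + 4 + 5 + 6 = 18 token passes for 4 new tokens.
print(token_passes)  # 18
```

A KV cache turns this into one new-token pass per step by storing each layer's keys and values for the already-seen prefix.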
💻 The Code
You can see my simplified implementation here:
👉 GPT-2 Implementation on GitHub
It uses a dummy encoder and random weights, but the logic is identical to the real thing!