Building a Primitive GPT-2: Layers & Dimensions

I recently took on the challenge of implementing a simplified GPT-2 Text Generation function on Deep-ML. This exercise was a fantastic way to bring together separate concepts—embeddings, layer normalization, and autoregression—into one cohesive system.

🧩 The Puzzle Pieces

Building a generator like this forces you to confront the reality of the architecture:

  1. Embeddings: Combining word identity with position identity (wte + wpe).
  2. Layer Norm: Stabilizing the features before projection.
  3. The Loop: Autoregressively predicting the next token and feeding it back.
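The three pieces above can be wired together in a few lines. Here is a minimal NumPy sketch with random weights and made-up sizes (the names `wte`, `wpe`, `w_out`, and `generate` are my own, not from the real GPT-2 code), just to show how the parts connect:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 50, 16, 8

# Random stand-ins for trained parameters (hypothetical names).
wte = rng.normal(size=(vocab_size, d_model))  # token embeddings
wpe = rng.normal(size=(max_len, d_model))     # positional embeddings
w_out = wte.T                                 # weight tying: project back to vocab

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def generate(tokens, n_new):
    for _ in range(n_new):
        x = wte[tokens] + wpe[np.arange(len(tokens))]   # 1. embeddings
        h = layer_norm(x)                               # 2. layer norm
        logits = h @ w_out                              # project to vocab
        tokens = tokens + [int(np.argmax(logits[-1]))]  # 3. feed back (greedy)
    return tokens
```

The real model of course has attention and MLP blocks between steps 1 and 2, but the skeleton of embed, normalize, project, append is the same.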

💡 Lightbulb Moments

Dimensions are Everything

I used to get confused by the endless reshaping in PyTorch/NumPy. During this project, I realized that writing down the dimensions of every array (like (seq_len, d_model)) makes everything fall into place. It turns abstract math into a shape-matching puzzle.
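In practice the habit looks like this: annotate each array with its shape and assert the result, so a mismatch fails loudly instead of silently broadcasting. A tiny sketch with made-up sizes:

```python
import numpy as np

seq_len, d_model, vocab_size = 4, 8, 50

x = np.zeros((seq_len, d_model))     # (seq_len, d_model)
w = np.zeros((d_model, vocab_size))  # (d_model, vocab_size)

logits = x @ w                       # (seq_len, d_model) @ (d_model, vocab_size)
assert logits.shape == (seq_len, vocab_size)
```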

Layer Norm vs. Batch Norm

For a while I got stuck on which axis to normalize. Then it clicked: Batch Norm computes statistics for each feature across the batch, while Layer Norm computes statistics across the features of a single example. Every token gets standardized independently, so no single feature dominates the gradients.
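The whole distinction is one axis argument. A quick NumPy check (my own toy example, not from the project code):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(32, 8))  # (batch, features)

# Layer Norm: statistics over the features of each example (axis=-1).
ln = (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

# Batch Norm: statistics over the batch for each feature (axis=0).
bn = (x - x.mean(axis=0, keepdims=True)) / x.std(axis=0, keepdims=True)

# Each ROW of ln is standardized; each COLUMN of bn is standardized.
```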

The Cost of “Memory-less” Generation

Writing the generation loop manually really highlights the inefficiency. To generate the 5th token, we re-process the first 4. This redundancy is exactly why KV Caching is a critical optimization in modern LLM serving!
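A back-of-the-envelope count makes the gap concrete (a sketch with hypothetical helper names): generating n tokens from a prompt of length p costs on the order of n·(p+n) token-level forward passes naively, but only p+n with a KV cache.

```python
def naive_cost(p, n):
    # Each step re-runs the model on the entire sequence so far.
    return sum(p + i for i in range(n))

def cached_cost(p, n):
    # With a KV cache: one full pass over the prompt, then one token per step.
    return p + n

assert naive_cost(4, 1) == 4  # the 5th token re-processes the first 4
```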


💻 The Code

You can see my simplified implementation here:

👉 GPT-2 Implementation on GitHub

It uses a dummy encoder and random weights, but the logic is identical to the real thing!



