Scaling LLMs: From Alchemy to Science (Part 0)
I’ve started working through “How to Scale Your Model: A Systems View of LLMs on TPUs” — a book that aims to turn ML training from alchemy into science.
This post covers Part 0: Introduction, where I explored the fundamental trade-offs of parallelizing models across multiple chips.
🎯 The Core Question
Why doesn’t adding more chips always make training faster?
The answer lies in understanding three bottlenecks:
- Compute (Peak FLOPs)
- Memory (HBM Bandwidth)
- Communication (Interconnect Bandwidth)
📝 My Learning Journey: 6 Key Questions
Round 1: Understanding the Fundamentals
Q1: The “Strong Scaling” Trade-off
Question: Why does adding more chips not always result in a linear increase in speed?
My Answer: Strong scaling means adding chips to reduce time for a fixed problem size. However, as you shard a model further, communication becomes the bottleneck.
Key Insight:
- Weak Scaling = More chips → More users/data.
- Strong Scaling = More chips → Less time (for the same workload).
- As the Compute-to-Communication ratio drops, you hit a wall.
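To make the ratio concrete, here's a toy sketch (all numbers are made up for illustration, not taken from any real chip): under strong scaling, per-chip compute shrinks as you add chips, but per-chip communication shrinks much more slowly, so the compute-to-communication ratio falls.

```python
# Toy strong-scaling model with invented round numbers (not real hardware specs).
def compute_to_comm_ratio(n_chips,
                          total_flops=1e15,    # fixed workload (strong scaling)
                          flops_per_s=1e14,    # assumed peak FLOP/s per chip
                          bytes_moved=1e10,    # assumed per-chip bytes to sync
                          bytes_per_s=1e11):   # assumed interconnect bandwidth
    compute_s = total_flops / n_chips / flops_per_s  # shrinks with more chips
    comm_s = bytes_moved / bytes_per_s               # roughly constant per chip
    return compute_s / comm_s

# The ratio drops as chips are added; once it falls below ~1, you are
# spending more time communicating than computing.
for n in (1, 4, 16, 64, 256):
    print(n, compute_to_comm_ratio(n))
```

With these made-up numbers the ratio starts at 100 on one chip and crosses 1 at 100 chips, which is the "wall" the insight above describes.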
Q2: The Bottleneck Theory
Question: What are the three primary factors that form the “roofline”?
My Answer: The three pillars are:
- Peak FLOPs (compute capacity)
- HBM Bandwidth (memory bandwidth)
- Interconnect Bandwidth (communication between chips)
Key Insight: The “roofline model” helps you identify which of these three is limiting your performance.
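The roofline idea fits in a few lines of code. Here's a minimal sketch with assumed round-number specs (not any specific TPU or GPU): attainable FLOP/s is capped either by peak compute or by how fast memory can feed the chip, depending on arithmetic intensity (FLOPs performed per byte moved).

```python
# Assumed, illustrative hardware numbers -- not a real chip's datasheet.
PEAK_FLOPS = 2e14   # peak compute, FLOP/s
HBM_BW = 1e12       # memory bandwidth, bytes/s

def attainable_flops(arithmetic_intensity):
    """Roofline: attainable FLOP/s for a kernel with the given
    arithmetic intensity (FLOPs per byte moved from HBM)."""
    return min(PEAK_FLOPS, HBM_BW * arithmetic_intensity)

# The "ridge point": below this intensity you are memory-bound,
# above it you are compute-bound.
critical_intensity = PEAK_FLOPS / HBM_BW  # 200 FLOPs/byte here
```

A kernel at 100 FLOPs/byte would be memory-bound (capped at 1e14 FLOP/s with these numbers); one at 400 FLOPs/byte would hit the compute roof.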
Q3: Architecture vs. Hardware
Question: Why is understanding interconnects more important today than five years ago?
My Answer: Algorithms have consolidated into Transformers, which are built entirely on matrix multiplications—exactly what TPUs/GPUs excel at. At high scale, any latency in the interconnect adds up.
Key Insight: Modern ML research is less about “algorithmic cleverness” and more about hardware efficiency. The Memory Wall is now the primary design constraint.
Round 2: Deeper Systems Thinking
Q4: The “Alchemy” vs. “Science” Distinction
Question: What distinguishes ML “alchemy” from ML “science”?
My Definition:
- Science = Predicting performance and resource usage accurately.
- Alchemy = Relying on intuition where things “just work.”
Key Insight: Understanding the systems view lets us move from black-box experimentation to principled design.
Q5: The “Communication-Bound” Nightmare
Question: Why can doubling chips sometimes slow down training?
My Answer: The bottleneck shifts to the ICI (Inter-Chip Interconnect). If communication takes longer than computation, we become communication-bound, and strong scaling breaks down.
Key Insight: Adding chips increases communication overhead. If the interconnect bandwidth can’t keep up, you’re just wasting resources.
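Here's a toy model of that nightmare (invented numbers, purely illustrative): compute time halves with each doubling of chips, but if per-chip synchronization overhead grows with the chip count, total step time eventually gets *worse*.

```python
# Toy step-time model with made-up constants (not measured on real hardware).
def step_time(n_chips,
              total_flops=1e15,        # fixed workload
              flops_per_chip=1e14,     # assumed peak FLOP/s per chip
              comm_s_per_chip=0.05):   # assumed sync overhead added per chip
    compute_s = total_flops / (flops_per_chip * n_chips)  # shrinks
    comm_s = comm_s_per_chip * n_chips                    # grows
    return compute_s + comm_s

# Step time improves, flattens, then reverses as communication dominates:
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(step_time(n), 4))
```

With these numbers, 16 chips is faster than 8, but 32 chips is *slower* than 16: doubling chips slowed training down, exactly the communication-bound failure mode.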
Q6: Hardware-Driven Design
Question: Why did Transformers “win” over other architectures?
My Answer: Transformers were designed for parallelism. Their math (attention as matrix ops) perfectly aligns with TPU/GPU capabilities, allowing them to scale with compute.
Key Insight: This isn’t just about better language understanding—it’s about scaling efficiently on modern hardware.
🧠 Personal Reflection
At Samsung, when we talk about “optimizing models,” we’re really asking:
- Are we compute-bound? (Waiting for matrix ops)
- Are we memory-bound? (Waiting for data to load)
- Are we communication-bound? (Waiting for chips to sync)
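Those three questions amount to asking which phase of a training step dominates. A trivial sketch (the timing names and values are hypothetical, and real steps overlap these phases):

```python
def bottleneck(compute_s, memory_s, comm_s):
    """Name the dominant cost given (possibly profiled) per-step times.
    Real systems overlap these phases, so this is a rough diagnostic."""
    times = {
        "compute-bound": compute_s,        # waiting for matrix ops
        "memory-bound": memory_s,          # waiting for data to load
        "communication-bound": comm_s,     # waiting for chips to sync
    }
    return max(times, key=times.get)

# e.g. a step spending 1.0 s in matmuls, 0.2 s on HBM traffic, 0.1 s syncing:
print(bottleneck(1.0, 0.2, 0.1))  # compute-bound
```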
The days of “just throw more GPUs at it” are over. If you don’t account for these bottlenecks, you’re just burning money.
💻 Code & Notes
All my notes and future implementations are in the Model_scaling_jax repository.
Next up: Part 1 - Roofline Analysis
Turning alchemy into science, one section at a time.