-
Training LLaMA 3 on TPUs: Putting Theory Into Practice
Applying parallelism equations to real models — parameter counting, FLOPs, memory, and sharding for LLaMA 3-70B and 405B on TPU v5p pods.
-
Training at Scale: When Communication Becomes the Enemy
Data Parallelism, FSDP, Tensor Parallelism — how to train 13B models without becoming communication-bound.
-
Transformer Math: The 6ND Rule and Other Accounting Tricks
Parameter counting, FLOPs estimation, attention complexity — the economics of modern LLMs.
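The 6ND rule estimates training compute as FLOPs ≈ 6 × N (parameters) × D (training tokens). A minimal sketch of the arithmetic; the 70B/15T figures are illustrative, not taken from the articles above:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    # 6ND rule of thumb: ~2ND FLOPs for the forward pass,
    # ~4ND for the backward pass, so ~6ND total per training run.
    return 6 * n_params * n_tokens

# Illustrative: a 70B-parameter model trained on 15T tokens.
print(f"{training_flops(70e9, 15e12):.1e}")  # 6.3e+24
```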
-
Sharding Strategies: The Art of Distributed Matrix Multiplication
AllGather, ReduceScatter, AllToAll — mastering the four communication primitives that make LLM training scale.
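The semantics of two of these primitives can be sketched as toy single-process models (pure Python lists standing in for per-device shards; this is an illustration of the data movement, not a real collective implementation):

```python
def all_gather(shards):
    # AllGather: every device ends up with the concatenation of all shards.
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

def reduce_scatter(arrays):
    # ReduceScatter: elementwise-sum the per-device arrays,
    # then each device keeps only its own shard of the result.
    n = len(arrays)
    summed = [sum(vals) for vals in zip(*arrays)]
    chunk = len(summed) // n
    return [summed[i * chunk:(i + 1) * chunk] for i in range(n)]

# Two "devices", each holding a shard of length 2:
print(all_gather([[1, 2], [3, 4]]))
# [[1, 2, 3, 4], [1, 2, 3, 4]]
print(reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]]))
# [[11, 22], [33, 44]]
```

An AllReduce is then just a ReduceScatter followed by an AllGather, which is why these two are the workhorses of sharded training.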
-
TPU Architecture: Understanding the Bandwidth Hierarchy
From MXUs and VMEM to ICI and DCN — mapping the performance landscape of Google's TPUs.