-
Scaling LLMs: MoE Routing & JAX Parallelism on TPU
A deep dive into JAX's three parallelism modes — from auto-sharding to manual shard_map — implementing Mixture-of-Experts routing with a 43× speedup on TPU v5e-8.
-
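The routing step behind this post's Mixture-of-Experts discussion can be sketched in a few lines. This is an illustrative NumPy toy, not the post's TPU implementation: softmax the gate logits, keep the top-2 experts per token, renormalize their weights (function name `top2_route` and all shapes are assumptions for illustration).

```python
import numpy as np

def top2_route(gate_logits):
    """Toy top-2 MoE router: softmax gate, pick the 2 highest-scoring
    experts per token, renormalize their weights. Real implementations
    fuse this with capacity limits and expert dispatch."""
    probs = np.exp(gate_logits - gate_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # indices of the top-2 experts per token, highest first
    idx = np.argsort(probs, axis=-1)[..., -2:][..., ::-1]
    w = np.take_along_axis(probs, idx, axis=-1)
    w /= w.sum(-1, keepdims=True)  # renormalize over the chosen experts
    return idx, w

tokens, experts = 4, 8
logits = np.random.default_rng(0).normal(size=(tokens, experts))
idx, w = top2_route(logits)
print(idx.shape, w.shape)  # (4, 2) (4, 2)
```

The interesting systems problem starts after this step: dispatching each token to its experts across chips is what `shard_map` makes explicit.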
GPUs for LLMs: The Same Rooflines, Different Numbers
How GPUs differ from TPUs at the chip level, NVLink networking, collective communications, and why pipeline parallelism is the key to training LLaMA-3 70B at scale.
-
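"Same rooflines, different numbers" is a one-liner to verify: divide peak FLOP/s by memory bandwidth to get each chip's ridge point, the arithmetic intensity where kernels cross from memory-bound to compute-bound. The spec numbers below are approximate public figures (assumptions, not taken from the post).

```python
# Roofline ridge point = peak FLOP/s / memory bandwidth (bytes/s):
# the arithmetic intensity above which a kernel becomes compute-bound.
# Approximate public bf16 specs -- treat as illustrative assumptions.
chips = {
    "H100 SXM": (989e12, 3.35e12),  # peak FLOP/s, HBM bytes/s
    "TPU v5e":  (197e12, 0.82e12),
}
for name, (flops, bw) in chips.items():
    print(f"{name}: ridge point ~ {flops / bw:.0f} FLOPs/byte")
```

Both land in the low hundreds of FLOPs/byte, which is why the same roofline reasoning transfers between the two platforms even though every constant changes.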
TPU Profiling: When Math Meets Reality
Digging into JAX profiles and HLO traces, and watching our theoretical FLOP scaling limits play out on real hardware.
-
Serving LLaMA-3 70B: From Theory to Production Numbers
Real hardware serving decisions — KV cache sizing, topology selection, critical batch sizes, and the disaggregated serving ratio for LLaMA-3 70B on TPU v5e.
-
Transformer Inference: Two Problems in Disguise
Prefill is compute-bound like training. Generation is always memory-bandwidth-bound. The KV cache changes everything.
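The prefill/generation split above falls out of one ratio. For a [d, d] weight matmul over T tokens, the FLOPs are 2·T·d² while reading the bf16 weights costs about 2·d² bytes, so arithmetic intensity is roughly T FLOPs/byte. The values of d, T, and the ~240 FLOPs/byte ridge point below are illustrative assumptions:

```python
# Why prefill and generation land on opposite sides of the roofline:
# a [d, d] weight matmul over T tokens does 2*T*d^2 FLOPs while the
# bf16 weight read alone costs 2*d^2 bytes -> intensity ~ T FLOPs/byte.
def intensity(tokens, d=8192, bytes_per=2):
    flops = 2 * tokens * d * d
    mem_bytes = bytes_per * d * d  # weight read dominates at small T
    return flops / mem_bytes

ridge = 240  # rough TPU ridge point, FLOPs/byte (assumed)
for name, t in [("prefill, 4096-token prompt", 4096),
                ("generation, batch of 8", 8)]:
    i = intensity(t)
    side = "compute" if i > ridge else "memory"
    print(f"{name}: {i:.0f} FLOPs/byte -> {side}-bound")
```

Prefill sees thousands of tokens at once and sails past the ridge; generation sees one token per sequence, so only the batch size stands between it and the bandwidth wall.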