-
Training LLaMA 3 on TPUs: Putting Theory Into Practice
Applying parallelism equations to real models — parameter counting, FLOPs, memory, and sharding for LLaMA 3-70B and 405B on TPU v5p pods.
-
Training at Scale: When Communication Becomes the Enemy
Data Parallelism, FSDP, Tensor Parallelism — how to train 13B models without becoming communication-bound.
-
Transformer Math: The 6ND Rule and Other Accounting Tricks
Parameter counting, FLOPs estimation, attention complexity — the economics of modern LLMs.
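The 6ND rule estimates training compute as FLOPs ≈ 6 × N (parameters) × D (training tokens). A minimal sketch of the arithmetic; the 70B/15T figures are illustrative, not taken from the articles above:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    # 6ND rule of thumb: ~2ND FLOPs for the forward pass,
    # ~4ND for the backward pass, so ~6ND total per training run.
    return 6 * n_params * n_tokens

# Illustrative: a 70B-parameter model trained on 15T tokens.
print(f"{training_flops(70e9, 15e12):.1e}")  # 6.3e+24
```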
-
Sharding Strategies: The Art of Distributed Matrix Multiplication
AllGather, ReduceScatter, AllToAll — mastering the four communication primitives that make LLM training scale.
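The semantics of two of these primitives can be sketched as toy single-process models (pure Python lists standing in for per-device shards; this is an illustration of the data movement, not a real collective implementation):

```python
def all_gather(shards):
    # AllGather: every device ends up with the concatenation of all shards.
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

def reduce_scatter(arrays):
    # ReduceScatter: elementwise-sum the per-device arrays,
    # then each device keeps only its own shard of the result.
    n = len(arrays)
    summed = [sum(vals) for vals in zip(*arrays)]
    chunk = len(summed) // n
    return [summed[i * chunk:(i + 1) * chunk] for i in range(n)]

# Two "devices", each holding a shard of length 2:
print(all_gather([[1, 2], [3, 4]]))
# [[1, 2, 3, 4], [1, 2, 3, 4]]
print(reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]]))
# [[11, 22], [33, 44]]
```

An AllReduce is then just a ReduceScatter followed by an AllGather, which is why these two are the workhorses of sharded training.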
-
TPU Architecture: Understanding the Bandwidth Hierarchy
From MXUs and VMEM to ICI and DCN — mapping the performance landscape of Google's TPUs.