-
Scaling LLMs: MoE Routing & JAX Parallelism on TPU
A deep dive into JAX's three parallelism modes — from auto-sharding to manual shard_map — implementing Mixture-of-Experts routing with a 43× speedup on TPU v5e-8.
-
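The routing step behind this post's Mixture-of-Experts discussion can be sketched in a few lines. This is an illustrative NumPy toy, not the post's TPU implementation: softmax the gate logits, keep the top-2 experts per token, renormalize their weights (function name `top2_route` and all shapes are assumptions for illustration).

```python
import numpy as np

def top2_route(gate_logits):
    """Toy top-2 MoE router: softmax gate, pick the 2 highest-scoring
    experts per token, renormalize their weights. Real implementations
    fuse this with capacity limits and expert dispatch."""
    probs = np.exp(gate_logits - gate_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # indices of the top-2 experts per token, highest first
    idx = np.argsort(probs, axis=-1)[..., -2:][..., ::-1]
    w = np.take_along_axis(probs, idx, axis=-1)
    w /= w.sum(-1, keepdims=True)  # renormalize over the chosen experts
    return idx, w

tokens, experts = 4, 8
logits = np.random.default_rng(0).normal(size=(tokens, experts))
idx, w = top2_route(logits)
print(idx.shape, w.shape)  # (4, 2) (4, 2)
```

The interesting systems problem starts after this step: dispatching each token to its experts across chips is what `shard_map` makes explicit.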
GPUs for LLMs: The Same Rooflines, Different Numbers
How GPUs differ from TPUs at the chip level, NVLink networking, collective communications, and why pipeline parallelism is the key to training LLaMA-3 70B at scale.
-
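"Same rooflines, different numbers" is a one-liner to verify: divide peak FLOP/s by memory bandwidth to get each chip's ridge point, the arithmetic intensity where kernels cross from memory-bound to compute-bound. The spec numbers below are approximate public figures (assumptions, not taken from the post).

```python
# Roofline ridge point = peak FLOP/s / memory bandwidth (bytes/s):
# the arithmetic intensity above which a kernel becomes compute-bound.
# Approximate public bf16 specs -- treat as illustrative assumptions.
chips = {
    "H100 SXM": (989e12, 3.35e12),  # peak FLOP/s, HBM bytes/s
    "TPU v5e":  (197e12, 0.82e12),
}
for name, (flops, bw) in chips.items():
    print(f"{name}: ridge point ~ {flops / bw:.0f} FLOPs/byte")
```

Both land in the low hundreds of FLOPs/byte, which is why the same roofline reasoning transfers between the two platforms even though every constant changes.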
TPU Profiling: When Math Meets Reality
Digging into JAX profiles and HLO traces, and watching our theoretical FLOP scaling limits play out on real hardware.
-
Serving LLaMA-3 70B: From Theory to Production Numbers
Real hardware serving decisions — KV cache sizing, topology selection, critical batch sizes, and the disaggregated serving ratio for LLaMA-3 70B on TPU v5e.
-
Transformer Inference: Two Problems in Disguise
Prefill is compute-bound like training. Generation is always memory-bandwidth-bound. The KV cache changes everything.
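The prefill/generation split above falls out of one ratio. For a [d, d] weight matmul over T tokens, the FLOPs are 2·T·d² while reading the bf16 weights costs about 2·d² bytes, so arithmetic intensity is roughly T FLOPs/byte. The values of d, T, and the ~240 FLOPs/byte ridge point below are illustrative assumptions:

```python
# Why prefill and generation land on opposite sides of the roofline:
# a [d, d] weight matmul over T tokens does 2*T*d^2 FLOPs while the
# bf16 weight read alone costs 2*d^2 bytes -> intensity ~ T FLOPs/byte.
def intensity(tokens, d=8192, bytes_per=2):
    flops = 2 * tokens * d * d
    mem_bytes = bytes_per * d * d  # weight read dominates at small T
    return flops / mem_bytes

ridge = 240  # rough TPU ridge point, FLOPs/byte (assumed)
for name, t in [("prefill, 4096-token prompt", 4096),
                ("generation, batch of 8", 8)]:
    i = intensity(t)
    side = "compute" if i > ridge else "memory"
    print(f"{name}: {i:.0f} FLOPs/byte -> {side}-bound")
```

Prefill sees thousands of tokens at once and sails past the ridge; generation sees one token per sequence, so only the batch size stands between it and the bandwidth wall.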