Efficient MoE Pre-training at Scale on 1K AMD GPUs with TorchTitan
Training massive MoE models (DeepSeek-V3, Llama 4 Scout, etc.) pushes hardware to its limits. AMD and Meta's PyTorch team took this on: optimized TorchTitan + Primus-Turbo on MI325X → near-perfect scaling across 1,024 GPUs. Big scale + high efficiency is now real. https://tinyurl.com/c3jjzxbb