In the world of Large Language Models (LLMs), speed is the ultimate currency. As models like Kimi K2.5 and Qwen 3.5 push into the hundreds of billions of parameters with million-token context windows, the industry has turned to Speculative Decoding to keep latency in check.
However, training the “draft models” required for speculative decoding has hit a massive wall: data movement. TorchSpec, a recent release from the PyTorch team, is a disaggregated training framework that tackles the “hidden state bottleneck.”
Here is how TorchSpec is redefining how we train draft models at scale.
The Problem: The Hidden State Bottleneck
Speculative decoding works by having a small, fast “draft model” predict tokens that a large “target model” then verifies. To make draft models accurate, they need to be trained on the intermediate hidden states of the target model.
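The draft/verify loop can be illustrated with a minimal greedy sketch. The two “models” below are toy deterministic functions standing in for real networks (in a production engine, verification of all drafted tokens happens in a single batched forward pass rather than a Python loop):

```python
def draft_model(prefix):
    # Toy stand-in for a small, fast draft model: predicts the next token.
    return (prefix[-1] * 7 + 3) % 100

def target_model(prefix):
    # Toy stand-in for the large target model. It agrees with the draft
    # most of the time, mimicking a well-trained draft model.
    return draft_model(prefix) if prefix[-1] % 5 else (prefix[-1] + 1) % 100

def speculative_decode(prompt, n_new, k=4):
    """Greedy speculative decoding: the draft proposes k tokens, the target
    verifies them and accepts the longest matching prefix."""
    out = list(prompt)
    while len(out) < len(prompt) + n_new:
        # 1. Draft proposes k tokens autoregressively (cheap per token).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target verifies the proposals (looped here; batched in practice).
        ctx = list(out)
        for t in proposal:
            if target_model(ctx) != t:
                break
            out.append(t)
            ctx.append(t)
        # 3. Always commit one token from the target itself, so progress is
        #    guaranteed and the output matches plain greedy target decoding.
        out.append(target_model(out))
    return out[:len(prompt) + n_new]
```

Because rejected drafts are discarded and replaced by the target’s own token, the output is identical to what the target model would produce alone; the draft only changes how many target forward passes are needed.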
As models and context windows grow, this forces a choice between two untenable approaches:
- Inference Co-located Training: Running the target and draft models on the same GPUs. This exhausts GPU memory. For a 1-trillion-parameter MoE model like Kimi K2.5, the weights alone consume most of an H100’s memory, leaving almost no room for the draft model’s training activations.
- Offline Preparation: Pre-computing hidden states and saving them to disk. This is a storage nightmare. A single 128K-token sample can require ~7GB of hidden states, so scaling to 100,000 samples would demand roughly 700 terabytes of high-speed storage, an I/O burden few data centers can handle.
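A back-of-envelope calculation shows why the per-sample numbers land in the gigabyte range. The dimensions below are illustrative assumptions (hidden size, number of captured layers, and dtype are not specified in the article; EAGLE-style drafts typically consume hidden states from a few target layers), so the result is an order-of-magnitude estimate, not the exact ~7GB figure:

```python
tokens = 128 * 1024   # one 128K-token sample
hidden = 7168         # assumed hidden dimension (hypothetical)
layers = 3            # assumed number of captured target layers
bytes_per = 2         # bf16

per_sample = tokens * hidden * layers * bytes_per
print(f"{per_sample / 1e9:.1f} GB per sample")            # → 5.6 GB per sample
print(f"{per_sample * 100_000 / 1e12:.0f} TB for 100k")   # → 564 TB for 100k
```

Even under these conservative assumptions, the totals sit in the hundreds of terabytes, which is why pre-computing hidden states to disk does not scale.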
Enter TorchSpec: Disaggregated Training
TorchSpec solves this by decoupling the Inference System (which generates hidden states) from the Training System (which consumes them).
Instead of writing to disk or fighting for local GPU memory, TorchSpec streams data directly from the inference engine to the training workers via a central store called Mooncake.
Key Architectural Advantages:
- Independent Scaling: You can use 16 GPUs for inference and 8 for training, or vice versa. They no longer need to share the same sharding strategy or hardware constraints.
- Mooncake Transfer Engine: Developed by Moonshot AI, Mooncake uses RDMA (Remote Direct Memory Access) to move gigabytes of hidden states between nodes at near-line-rate, bypassing the CPU entirely.
- Zero Storage Overhead: Because data is streamed and consumed in real-time, there is no need for massive SSD arrays to hold intermediate tensors.
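The producer/consumer pattern behind this streaming design can be sketched with standard-library primitives. This is purely illustrative: a bounded in-process queue stands in for the Mooncake store (the real system moves tensors across nodes over RDMA, and none of Mooncake’s actual API appears here):

```python
import queue
import threading

# Stand-in for the Mooncake store; real transfers are cross-node RDMA.
store = queue.Queue(maxsize=8)

def inference_worker(n_batches):
    # Inference side: run the target model and publish hidden states.
    for step in range(n_batches):
        hidden_states = [step] * 4           # placeholder for a real tensor
        store.put((step, hidden_states))     # streamed, never written to disk
    store.put(None)                          # sentinel: generation finished

def training_worker(consumed):
    # Training side: consume hidden states as they arrive and train the draft.
    while (item := store.get()) is not None:
        step, hidden_states = item
        consumed.append(step)                # placeholder for a training step

consumed = []
producer = threading.Thread(target=inference_worker, args=(5,))
producer.start()
training_worker(consumed)
producer.join()
```

The bounded queue also captures the back-pressure property: if training falls behind, the inference side blocks instead of accumulating unbounded intermediate state.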
Real-World Impact: The Kimi K2.5 Case Study
To prove the system’s efficacy, the team trained an EAGLE-3 draft model for the Kimi K2.5 (1T parameter) model.
- Efficiency: They used a disaggregated setup (e.g., 8xH200 for inference and 8xH200 for training).
- Scale: They processed 600,000 training samples (6 billion tokens) in just 1,500 H200 GPU hours.
- Performance: The resulting draft model improved output throughput by 60% at batch size 1.
Without TorchSpec, training on context lengths of 100,000+ tokens would have been technically impossible due to memory fragmentation and storage I/O limits.
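To see how draft-model quality translates into throughput gains like the 60% above, the standard speculative-decoding analysis is useful. Under the simplifying assumption that each drafted token is accepted independently with probability `a` (and ignoring the draft model’s own overhead, which is small but nonzero in practice), the expected number of tokens committed per target forward pass is:

```python
def expected_tokens_per_pass(a, k):
    """Expected tokens committed per target verification pass, assuming each
    of the k drafted tokens is accepted independently with probability a.
    The accepted prefix length is truncated-geometric; the target always
    contributes one more token, giving sum(a**i for i in 0..k)."""
    return sum(a**i for i in range(k + 1))
```

For example, with `k = 4` drafted tokens, an acceptance rate around 0.4 already yields roughly 1.65 tokens per target pass, i.e. on the order of the reported speedup; higher acceptance rates push the multiplier further. These numbers are a modeling sketch, not figures from the case study.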
Why “Engine-Native” Matters
One of TorchSpec’s most powerful features is its integration with production inference engines like vLLM and SGLang.
By being “engine-native,” TorchSpec ensures that the tokenization, templates, and kernels used during draft training are identical to what the model will see in production. It also allows for “Train with Decode,” where the system generates responses autoregressively from prompt-only inputs during training, eliminating the need for a separate data-generation phase.
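The value of a single source of truth for formatting can be shown in miniature. The template below is a hypothetical minimal chat format (not the real Kimi, vLLM, or SGLang template); the point is only that training and serving call the same function, so the draft model never encounters a formatting mismatch in production:

```python
def apply_chat_template(messages):
    # Hypothetical minimal template for illustration only.
    parts = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    return "".join(parts) + "<|assistant|>"

def build_training_prompt(sample):
    # Training-data path: uses the exact same template as serving.
    return apply_chat_template(sample["messages"])

def build_serving_prompt(request):
    # Serving path: identical formatting, by construction.
    return apply_chat_template(request["messages"])
```

When the template lives in two places (a training script and an inference engine), even a one-token drift silently degrades draft acceptance rates; engine-native training removes that failure mode by construction.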
Roadmap and Future
The PyTorch team isn’t stopping here. The roadmap for TorchSpec includes:
- Broader Model Support: Upcoming support for Minimax M2.5 and GLM 5.
- Packed Sequence Training: Optimizing GPU utilization for variable-length inputs.
- New Algorithms: Expansion beyond EAGLE-3 to include DFlash and MTP (Multi-Token Prediction) architectures.
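Packed sequence training, mentioned in the roadmap, is a standard technique for variable-length inputs: instead of padding every sequence to the batch maximum, sequences are packed into fixed-capacity rows so fewer GPU cycles are spent on padding. A common heuristic (first-fit decreasing bin packing; this sketch is generic, not TorchSpec’s implementation) looks like:

```python
def pack_sequences(lengths, capacity):
    """First-fit-decreasing packing of variable-length sequences into
    fixed-capacity rows, minimizing padding waste."""
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins = []  # each entry: [remaining_capacity, [sequence indices]]
    for i in order:
        for b in bins:
            if b[0] >= lengths[i]:        # fits in an existing row
                b[0] -= lengths[i]
                b[1].append(i)
                break
        else:                             # no row fits: open a new one
            bins.append([capacity - lengths[i], [i]])
    return [b[1] for b in bins]

# Four sequences packed into 4096-token rows instead of four padded rows.
rows = pack_sequences([3000, 1000, 2000, 500], 4096)
```

Here four sequences occupy two rows rather than four, halving the padded footprint; attention masking then keeps the packed sequences from attending to each other.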
Conclusion
For engineering teams scaling LLM infrastructure, TorchSpec represents a shift away from monolithic training scripts toward a distributed, streaming architecture. By treating inference and training as two halves of a high-speed data pipeline, TorchSpec makes it feasible to build the highly optimized draft models required for the next generation of real-time AI.
You can find the open-source dataset and Kimi K2.5 draft model on the PyTorch GitHub and Hugging Face.