RESEARCHERS FOUND A WAY TO MAKE AI MODELS 8.5X FASTER.
Same model. Same accuracy. Way less waiting.
The demo is wild:
Vanilla decoding: 48.5 tokens/sec
DFlash: 415 tokens/sec
That is not a tiny optimization.
That is the difference between “wait for the chatbot” and “the chatbot keeps up with you.”
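Quick back-of-the-envelope check on what those two numbers mean. The 500-token reply length is a made-up example for illustration, not something from the demo:

```python
# Throughput figures quoted from the demo above.
VANILLA_TPS = 48.5   # tokens/sec, vanilla decoding
DFLASH_TPS = 415.0   # tokens/sec, DFlash

def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Time to generate `tokens` at a steady throughput."""
    return tokens / tokens_per_sec

speedup = DFLASH_TPS / VANILLA_TPS
print(f"speedup: {speedup:.2f}x")

# Hypothetical 500-token reply (assumed length, for illustration only).
print(f"vanilla: {seconds_for(500, VANILLA_TPS):.1f} s")
print(f"DFlash:  {seconds_for(500, DFLASH_TPS):.1f} s")
```

At those rates, the same reply goes from roughly ten seconds to roughly one.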
The bottleneck is stupid simple:
LLMs usually generate one token at a time.
Speculative decoding tried to fix this: a smaller draft model guesses the next few tokens, and the big model checks them all in one pass.
But the draft model itself was still guessing one token at a time.
DFlash changes that.
It uses a tiny block diffusion model that guesses multiple tokens at once in parallel.
Then the big model verifies the whole batch in one pass and keeps only the tokens it agrees with, which is why the output stays the same.
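Here's a rough sketch of that draft-then-verify loop. It is a toy, not DFlash's actual code: `target_next` and `draft_block` are hypothetical stand-ins for the big model and the block drafter, and verification is plain greedy matching.

```python
from typing import List, Tuple

def target_next(prefix: List[int]) -> int:
    """Toy stand-in for the big model: deterministic greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_block(prefix: List[int], k: int) -> List[int]:
    """Toy stand-in for a block drafter: one call returns k guesses.
    (A real block drafter produces them in parallel; this toy loops
    internally and is deliberately wrong at some positions.)"""
    ctx, block = list(prefix), []
    for _ in range(k):
        guess = target_next(ctx)
        if len(ctx) % 5 == 0:
            guess = (guess + 1) % 50  # deliberate wrong guess
        block.append(guess)
        ctx.append(guess)
    return block

def verify(prefix: List[int], block: List[int]) -> List[int]:
    """Keep guesses until the first mismatch, then use the target's token.
    In a real system all positions are scored in a single forward pass."""
    ctx, accepted = list(prefix), []
    for guess in block:
        correct = target_next(ctx)
        if guess != correct:
            accepted.append(correct)   # replace the bad guess and stop
            return accepted
        accepted.append(guess)
        ctx.append(guess)
    accepted.append(target_next(ctx))  # all matched: one bonus token
    return accepted

def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < n_new:
        seq.extend(verify(seq, draft_block(seq, k)))
        passes += 1                    # one big-model pass per block
    return seq[:len(prompt) + n_new], passes

def vanilla_decode(prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):             # one big-model pass per token
        seq.append(target_next(seq))
    return seq

spec_out, spec_passes = speculative_decode([1, 2, 3], 40)
vanilla_out = vanilla_decode([1, 2, 3], 40)
print(spec_out == vanilla_out, spec_passes)
```

The point of the toy: the output matches vanilla decoding token for token, but the big model runs far fewer times. The better the drafter guesses, the more tokens each pass accepts.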
The craziest part:
It plugs into vLLM, SGLang, and Transformers, and already supports Qwen, Llama, Kimi, gpt-oss, and more.