Twitter/X

Author @hasantoxr (published 2026-05-11) reports researchers achieved an 8.5×…

2026-05-11 · 00:58 UTC · @hasantoxr · 1 min read

Brief

DFlash is reported to speed up decoding for existing LLMs by about 8.5× with the same model and the same accuracy: in the demo, throughput rises from 48.5 to 415 tokens/sec. Instead of a draft model that proposes tokens one at a time, as in conventional speculative decoding, it uses a tiny block-diffusion model to predict several tokens in parallel, and the large model then verifies them as a batch. According to the post, it integrates with vLLM, SGLang, and Transformers and supports Qwen, Llama, Kimi, and gpt-oss.

Why it matters

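@hasantoxr (published 2026-05-11) reports that researchers achieved an 8.5× speedup with the same model and the same accuracy: 48.5 tokens/sec with vanilla decoding versus 415 tokens/sec with DFlash in the demo. The post frames this as the difference between waiting for a chatbot and a chatbot that keeps up with the user.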
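As a quick sanity check on the headline number, the two demo throughputs can be turned into a speedup ratio and a per-token latency. This is only arithmetic on the figures quoted in the post; the 500-token response length is an illustrative assumption, not something the source states.

```python
# Throughput figures quoted in the post (tokens per second).
VANILLA_TPS = 48.5
DFLASH_TPS = 415.0

# Assumption for illustration only: a 500-token chatbot reply.
RESPONSE_TOKENS = 500


def speedup(baseline_tps: float, fast_tps: float) -> float:
    """Ratio of the faster throughput to the baseline throughput."""
    return fast_tps / baseline_tps


def ms_per_token(tps: float) -> float:
    """Average time spent per generated token, in milliseconds."""
    return 1000.0 / tps


def seconds_for(tokens: int, tps: float) -> float:
    """Time to generate `tokens` tokens at a steady throughput."""
    return tokens / tps


ratio = speedup(VANILLA_TPS, DFLASH_TPS)
vanilla_wait = seconds_for(RESPONSE_TOKENS, VANILLA_TPS)
dflash_wait = seconds_for(RESPONSE_TOKENS, DFLASH_TPS)

print(f"speedup: {ratio:.2f}x")
print(f"per token: {ms_per_token(VANILLA_TPS):.1f} ms vs {ms_per_token(DFLASH_TPS):.1f} ms")
print(f"{RESPONSE_TOKENS} tokens: {vanilla_wait:.1f} s vs {dflash_wait:.1f} s")
```

The ratio works out slightly above 8.5, which is consistent with the rounded figure in the post.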

Key details

DFlash replaces the token-by-token draft step of speculative decoding with a tiny block-diffusion draft model that predicts multiple tokens in parallel; the large model then verifies the entire batch.
DFlash already plugs into vLLM, SGLang, and Transformers and claims support for models including Qwen, Llama, Kimi, and gpt-oss.
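The draft-then-verify loop described above can be sketched with toy stand-ins. This is a generic greedy speculative-decoding sketch, not DFlash's actual code or API: `toy_target_next` stands in for the large model, `toy_draft_block` stands in for the block-diffusion drafter, and every name here is hypothetical. A real system would score all draft positions in a single forward pass of the large model; the per-position calls below are only for readability.

```python
from typing import Callable, List, Tuple

Token = int


def toy_target_next(seq: List[Token]) -> Token:
    """Hypothetical stand-in for the large model: greedy next token."""
    return (seq[-1] + 1) % 1000


def toy_draft_block(seq: List[Token], block_size: int) -> List[Token]:
    """Hypothetical stand-in for the drafter: proposes a whole block at once.

    It is deliberately wrong whenever the true token is a multiple of 5,
    so that the verification step has something to reject.
    """
    block = []
    for i in range(1, block_size + 1):
        guess = (seq[-1] + i) % 1000
        block.append(guess + 50 if guess % 5 == 0 else guess)
    return block


def verify_block(prefix: List[Token], draft: List[Token],
                 target_next: Callable[[List[Token]], Token]) -> List[Token]:
    """Keep the longest draft prefix the target agrees with.

    On the first mismatch, the target's own token replaces the draft token.
    If the whole block is accepted, the target contributes one extra token.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], max_new_tokens: int,
             block_size: int) -> Tuple[List[Token], int]:
    """Block-draft decoding; returns the sequence and the number of verify steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = toy_draft_block(out, block_size)
        out.extend(verify_block(out, draft, toy_target_next))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


def vanilla(prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(toy_target_next(out))
    return out


PROMPT = [1]
NEW_TOKENS = 20
result, verify_steps = generate(PROMPT, NEW_TOKENS, block_size=4)
baseline = vanilla(PROMPT, NEW_TOKENS)
print(result == baseline, verify_steps, "verify steps for", NEW_TOKENS, "tokens")
```

Because every emitted token is either confirmed or supplied by the target model, the output matches plain greedy decoding, which is the sense in which the "same accuracy" claim holds in this sketch. The gain comes from needing fewer verification rounds than generated tokens, and it shrinks as the drafter's guesses are rejected more often.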

Source evidence

RESEARCHERS FOUND A WAY TO MAKE AI MODELS 8.5X FASTER.

Same model. Same accuracy. Way less waiting.

The demo is wild:

Vanilla decoding: 48.5 tokens/sec
DFlash: 415 tokens/sec

That is not a tiny optimization.

That is the difference between “wait for the chatbot” and “the chatbot keeps up with you.”

The bottleneck is stupid simple:

LLMs usually generate one token at a time.

Speculative decoding tried to fix this by using a smaller draft model to guess the next tokens, then the big model checks them.

But even the draft model was still guessing one by one.

DFlash changes that.

It uses a tiny block diffusion model that guesses multiple tokens at once in parallel.

Then the big model verifies the batch.

The craziest part:

It plugs into vLLM, SGLang, Transformers, and already supports Qwen, Llama, Kimi, gpt-oss, and more.
