Testing accuracy, latency, memory usage, and MTP efficiency after quantization.
Qwen3.6 27B Quantization: FP8 vs INT4 vs NVFP4
Benjamin Marie
May 12
As I did for Qwen3.5 and Gemma 4, let’s see how well Qwen3.6 27B holds up when quantized into different formats: FP8, INT4, and NVFP4.
The goal is to measure Qwen3.6’s robustness to quantization: how much accuracy is affected, what happens to latency and token efficiency, and how fast the model can become after quantization.
In this article, I evaluate five Qwen3.6 27B variants. Across them, quantization mostly affects the large `Linear` weights, while the output head, token embeddings, normalization layers, and small bias or state tensors are kept in higher precision. The main differences between variants are how they handle linear-attention layers and MTP (multi-token prediction) weights.
Intel/Qwen3.6-27B-int4-AutoRound
Quantized: AutoRound INT4, group size 128, symmetric.
Not quantized: linear-attention `in_proj_a` and `in_proj_b` are kept in 16-bit floating point.
Qwen/Qwen3.6-27B-FP8
Quantized: FP8, block size 128.
Not quantized: linear-attention `in_proj_a`, `in_proj_b`, and `in_proj_ba`.
hampsonw/Qwen3.6-27B-AWQ-BF16-INT4-mtp-bf16
Quantized: AWQ INT4, group size 32, asymmetric, `Linear` weights.
Not quantized: all MTP tensors are restored as BF16; linear-attention `Linear` modules are ignored, including `out_proj`, `in_proj_qkv`, `in_proj_z`, `in_proj_a`, `in_proj_b`, and `in_proj_ba`.
Peutlefaire/Qwen3.6-27B-NVFP4
Quantized: NVFP4 with llm-compressor, targeting all `Linear` layers.
Not quantized: MTP is saved as-is from the original model.
kaitchup/Qwen3.6-27B-autoround-nvfp4-linearattn-BF16
Quantized: AutoRound NVFP4, 4-bit floating-point weights and local dynamic 4-bit activations for `Linear` layers.
Not quantized: linear-attention layers are kept in 16-bit, including `out_proj`, `in_proj_qkv`, `in_proj_z`, `in_proj_a`, and `in_proj_b`; MTP is noted as not working for this release.
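To make "group size" and "symmetric vs. asymmetric" more concrete, here is a minimal sketch of plain round-to-nearest 4-bit quantization in pure Python. This is only an illustration of the storage scheme: it is not AutoRound, AWQ, or NVFP4, all of which choose scales and rounding far more carefully. The function names `quantize_group` and `quantize_row` are mine, and the asymmetric branch stores the group minimum as a floating-point offset for simplicity, where real formats typically store an integer zero point.

```python
def quantize_group(values, symmetric=True):
    """Quantize one group to 4 bits (round-to-nearest) and return the dequantized values."""
    if symmetric:
        # Signed grid [-8, 7] centered on zero: one scale per group, no offset.
        max_abs = max(abs(v) for v in values)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        codes = [min(7, max(-8, round(v / scale))) for v in values]
        return [c * scale for c in codes]
    # Unsigned grid [0, 15] stretched over [min, max]: one scale and one offset per group.
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return [lo + c * scale for c in codes]


def quantize_row(row, group_size=128, symmetric=True):
    """Split one weight row into contiguous groups and quantize each group independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    out = []
    for start in range(0, len(row), group_size):
        out.extend(quantize_group(row[start:start + group_size], symmetric))
    return out
```

A smaller group size means more scales to store, but each scale only has to cover 32 values instead of 128, so a single outlier distorts fewer weights.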
> Acknowledgments
> This article would not have been possible without the compute sponsorship generously provided by Verda, whose B200 and RTX Pro 6000 GPUs I used throughout this work.
> Verda provides access to high-end GPUs such as the B200 and B300, with GB300 support coming soon, as well as smaller GPUs such as the RTX 6000 Ada, which are among the most affordable per hour on the market.
> Verda is a European, AI-focused cloud and GPU infrastructure provider with sovereignty, sustainability, data privacy, and performance at its core.
> You can check them out here.
Quantized Qwen3.6: Accuracy with Thinking Enabled
Benchmark accuracy does not reveal a clear winner, but it does reveal one clear loser: the NVFP4 variant with quantized linear attention, shown by the green bars. This model consistently and significantly underperforms on most benchmarks. The next sections examine why.
Even my NVFP4 model, shown by the orange bars, tends to trail the others despite keeping linear attention in 16-bit precision. The quality drop is much more subtle on average, but it is still visible.
Intel’s INT4 model is especially strong, even though it quantizes most linear-attention modules. This suggests that Intel found a very effective recipe: linear attention can be quantized, as long as `in_proj_a` and `in_proj_b` are left in higher precision.
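As a rough sketch of what such a recipe looks like in practice, the snippet below decides, module by module, whether a layer gets quantized. The helper `should_quantize`, the `KEEP_HIGH_PRECISION` tuple, and the module paths in the usage example are hypothetical; they only mirror the "not quantized" notes above and are not taken from Intel's actual configuration.

```python
# Hypothetical keep-list modelled on the notes above: the output head plus
# the two small linear-attention projections stay in higher precision.
KEEP_HIGH_PRECISION = ("lm_head", "in_proj_a", "in_proj_b")


def should_quantize(module_name, is_linear, keep=KEEP_HIGH_PRECISION):
    """Return True if a module should be quantized under this keep-list."""
    if not is_linear:
        # Norms, embeddings, and other non-Linear modules are left untouched.
        return False
    # Compare the last path component exactly, so that "in_proj_b"
    # does not accidentally match "in_proj_ba".
    leaf = module_name.rsplit(".", 1)[-1]
    return leaf not in keep
```

With this keep-list, a (hypothetical) path such as `model.layers.0.linear_attn.in_proj_qkv` is quantized while `model.layers.0.linear_attn.in_proj_a` is not. Matching on the exact leaf name matters here because several of these projections share a prefix.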
The pass@k curves tell a similar story. On coding tasks such as LiveCodeBench, the quantized models follow curves that are very close to the original BF16 model. The main exception is the full NVFP4 version, which clearly falls behind.
More details on how to exploit pass@k here:
How to Reduce LLM Inference Cost and Improve Accuracy with Pass@k and Majority Voting (Benjamin Marie, Apr 27)
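For readers who have not computed pass@k before, here is a minimal sketch of the unbiased estimator commonly used for code benchmarks: with `n` samples per problem, of which `c` are correct, pass@k is 1 − C(n−c, k) / C(n, k). Whether the curves in this article were produced with exactly this estimator is an assumption on my part, and the function name `pass_at_k` is just illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples (c of them correct), is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is then the average of `pass_at_k` over all problems. Plotting it for increasing `k` gives the kind of curve discussed above.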
Quantized Qwen3.6 Generates More Tokens
Generated token count appears to be negatively correlated with accuracy: the more tokens a quantized model produces, the lower its accuracy often becomes...
© 2026 The Kaitchup