
Qwen3.6 27B Quantization: FP8 vs INT4 vs NVFP4

Brief

The article tests Qwen3.6 27B under five quantization recipes: FP8 (block size 128); AutoRound INT4 (group size 128, symmetric); AWQ INT4 (group size 32, with MTP tensors restored as BF16 and linear-attention Linears left unquantized); NVFP4 via llm-compressor, targeting all Linear layers; and AutoRound NVFP4 with local dynamic 4-bit activations and linear attention kept in 16-bit. The evaluation covers benchmark accuracy (including pass@k on coding tasks such as LiveCodeBench), latency, memory usage, and MTP efficiency. Large Linear weights are the primary quantization targets, while the token embeddings, output head, normalization layers, and small bias or state tensors stay in higher precision. Results: the full NVFP4 variant, which quantizes linear attention, underperforms significantly on most benchmarks; the NVFP4 variant that keeps linear attention in 16-bit still trails slightly; and Intel’s INT4 recipe is strong even though it quantizes most linear-attention modules, keeping only `in_proj_a` and `in_proj_b` in higher precision. The author also observes that quantized variants generate more tokens and that higher token counts often coincide with lower accuracy. Experiments ran on Verda B200 and RTX Pro 6000 GPUs.

Why it matters

The author evaluated five Qwen3.6 27B quantized variants (published 2026-05-12) — Intel AutoRound INT4 (group=128, symmetric), Qwen FP8 (block=128), hampsonw AWQ INT4 (group=32, MTP restored as BF16), Peutlefaire NVFP4 (llm-compressor targeting all Linears), and kaitchup AutoRound NVFP4 (NVFP4 with local dynamic 4-bit activations; linear-attn kept in 16-bit).

Key details

  • Benchmark accuracy and pass@k show that the full NVFP4 variant, which quantizes linear attention, is the clear loser: it consistently and significantly underperforms on most benchmarks. Even the NVFP4 variant that keeps linear attention in 16-bit tends to trail the other recipes.
  • Intel’s INT4 AutoRound recipe is notably strong despite quantizing most linear-attention modules. Accuracy holds up when `in_proj_a` and `in_proj_b` are left in higher precision, which suggests that selective 16-bit retention in linear attention matters.
  • Quantized models tend to generate more tokens, and generated token count appears to be negatively correlated with accuracy. The experiments measure accuracy, latency, memory usage, and MTP efficiency on Verda GPUs (B200 and RTX Pro 6000).
Cleaned source text

Testing accuracy, latency, memory usage, and MTP efficiency after quantization.


Qwen3.6 27B Quantization: FP8 vs INT4 vs NVFP4

Benjamin Marie

May 12


Like I did for Qwen3.5 and Gemma 4, let’s see how well Qwen3.6 27B holds up when quantized into different formats: FP8, INT4, and NVFP4.

The goal is to measure Qwen3.6’s robustness to quantization: how much accuracy is affected, what happens to latency and token efficiency, and how fast the model can become after quantization.


In this article, I evaluate five Qwen3.6 27B variants. Across them, quantization mostly affects the large `Linear` weights, while the output head, token embeddings, normalization layers, and small bias or state tensors are kept in higher precision. The main differences between the variants are how they handle linear-attention layers and MTP weights.
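As a rough illustration of this split, the sketch below selects modules by name: only `Linear` modules are candidates, and anything matching a keep pattern stays in higher precision. The module paths and the `should_quantize` helper are hypothetical, chosen for illustration only; the actual checkpoints and quantization tools may name and select modules differently.

```python
import re

# Patterns for tensors kept in higher precision (hypothetical module names).
KEEP_PATTERNS = [
    r"(^|\.)lm_head$",       # output head
    r"(^|\.)embed_tokens$",  # token embeddings
    r"norm",                 # normalization layers
]


def should_quantize(name, is_linear, extra_ignore=()):
    """Return True if a module would be quantized under this simplified rule."""
    if not is_linear:
        return False  # only Linear weights are quantization targets
    patterns = list(KEEP_PATTERNS) + list(extra_ignore)
    return not any(re.search(p, name) for p in patterns)


# Toy module list: (name, is_linear). Names are assumptions, not read from a checkpoint.
modules = [
    ("model.embed_tokens", False),
    ("model.layers.0.mlp.down_proj", True),
    ("model.layers.0.linear_attn.in_proj_a", True),
    ("model.layers.0.linear_attn.in_proj_qkv", True),
    ("lm_head", True),
]

# A recipe-specific ignore list: keep in_proj_a and in_proj_b in 16-bit.
ignore = (r"linear_attn\.in_proj_[ab]$",)
selected = [name for name, is_linear in modules if should_quantize(name, is_linear, ignore)]
print(selected)
```

Changing `ignore` is how the variants would differ under this simplified view: a broader pattern such as `r"linear_attn\."` would leave every linear-attention Linear unquantized.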

Intel/Qwen3.6-27B-int4-AutoRound

Quantized: AutoRound INT4, group size 128, symmetric.

Not quantized: linear-attention `in_proj_a` and `in_proj_b` are kept in 16-bit floating point.

Qwen/Qwen3.6-27B-FP8

Quantized: FP8, block size 128.

Not quantized: linear-attention `in_proj_a`, `in_proj_b` and `in_proj_ba`.

hampsonw/Qwen3.6-27B-AWQ-BF16-INT4-mtp-bf16

Quantized: AWQ INT4, group size 32, asymmetric, Linear weights.

Not quantized: all MTP tensors are restored as BF16; linear-attention Linears are ignored, including `out_proj`, `in_proj_qkv`, `in_proj_z`, `in_proj_a`, `in_proj_b`, and `in_proj_ba`.

Peutlefaire/Qwen3.6-27B-NVFP4

Quantized: NVFP4 with llm-compressor, targeting all Linear layers.

Not quantized: MTP is saved as-is from the original model.

kaitchup/Qwen3.6-27B-autoround-nvfp4-linearattn-BF16

Quantized: AutoRound NVFP4, 4-bit floating-point weights and local dynamic 4-bit activations for Linear layers.

Not quantized: linear-attention layers are kept in 16-bit, including `out_proj`, `in_proj_qkv`, `in_proj_z`, `in_proj_a`, and `in_proj_b`; MTP is noted as not working for this release.
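To give a feel for what the group and block sizes above mean for storage, here is a back-of-envelope estimate of the effective bits per quantized weight. The scale and zero-point formats are assumptions, not taken from the checkpoints: FP16 scales for INT4, a 4-bit zero point per group for the asymmetric AWQ recipe, one FP32 scale per 128×128 block for FP8, and one 8-bit scale per block of 16 values for NVFP4. The estimate ignores every tensor kept in 16-bit, so real checkpoint sizes will differ.

```python
def bits_per_weight(weight_bits, group_elems, scale_bits, zero_bits=0):
    """Effective storage per weight: payload bits plus amortized scale/zero-point bits."""
    return weight_bits + (scale_bits + zero_bits) / group_elems


# Scale and zero-point widths below are assumptions for illustration.
formats = {
    "INT4, group 128, symmetric": bits_per_weight(4, 128, 16),
    "INT4, group 32, asymmetric": bits_per_weight(4, 32, 16, zero_bits=4),
    "FP8, block 128x128": bits_per_weight(8, 128 * 128, 32),
    "NVFP4, block 16": bits_per_weight(4, 16, 8),
}

for name, bits in formats.items():
    print(f"{name}: {bits:.3f} bits per weight")
```

Under these assumptions, smaller groups buy finer-grained scales at the cost of more metadata per weight, which is the trade-off between the group-128 and group-32 INT4 recipes.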

> Acknowledgments

> This article would not have been possible without the compute sponsorship generously provided by Verda, whose B200 and RTX Pro 6000 GPUs I used throughout this work.

> Verda provides access to high-end GPUs such as the B200 and B300, with GB300 support coming soon, as well as smaller GPUs such as the RTX 6000 Ada, which are among the most affordable per hour on the market.

> Verda is a European, AI-focused cloud and GPU infrastructure provider with sovereignty, sustainability, data privacy, and performance at its core.


Quantized Qwen3.6: Accuracy with Thinking Enabled

Benchmark accuracy does not reveal a clear winner, but it does reveal one clear loser: the NVFP4 variant with quantized linear attention, shown by the green bars. This model consistently and significantly underperforms on most benchmarks. The next sections examine why.

Even my NVFP4 model, shown by the orange bars, tends to trail the others despite keeping linear attention in 16-bit precision. The quality drop is much more subtle on average, but it is still visible.
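For readers unfamiliar with the format, the sketch below shows roughly what 4-bit floating-point rounding does to a block of weights. It is a simplified quantize-dequantize in plain Python, not the AutoRound or llm-compressor implementation: it assumes blocks of 16 values and the eight E2M1 magnitudes, keeps the block scale in full precision (real NVFP4 stores it in 8-bit floating point, with an additional per-tensor scale), and breaks ties toward the smaller magnitude. The function name `fake_quant_fp4` is made up for this example.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant_fp4(values, block=16):
    """Quantize-dequantize a flat list of floats with one scale per block."""
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Map the largest magnitude in the block to the largest grid value (6.0).
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        for v in chunk:
            # Nearest grid point to the scaled magnitude.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(math.copysign(mag * scale, v))
    return out


print(fake_quant_fp4([0.1, 0.7, 2.4, 6.0], block=4))
```

The grid is coarse near the top of the range (nothing between 4 and 6), so a single large value in a block reduces the resolution available to the other values in that block.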

Intel’s INT4 model is especially strong, even though it quantizes most linear-attention modules. This suggests that Intel found a very effective recipe: linear attention can be quantized, as long as `in_proj_a` and `in_proj_b` are left in higher precision.
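To make "INT4, group size 128, symmetric" concrete, here is a minimal quantize-dequantize sketch in plain Python. It uses simple round-to-nearest, which is not what AutoRound does (AutoRound tunes the rounding rather than taking the nearest value), and the function name `fake_quant_int4_sym` is made up for this example. It only illustrates the storage format: one scale per group of weights and no zero point.

```python
def fake_quant_int4_sym(values, group_size=128):
    """Round-to-nearest symmetric INT4 with one scale per group (no zero point)."""
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        # Symmetric: the largest magnitude maps to +/-7, and zero maps exactly to zero.
        scale = amax / 7 if amax > 0 else 1.0
        for v in group:
            q = max(-8, min(7, round(v / scale)))  # signed 4-bit integer code
            out.append(q * scale)                  # dequantized value
    return out


print(fake_quant_int4_sym([7.0, -3.4, 0.0, 1.6], group_size=4))
```

With round-to-nearest, the error on each weight is at most half a scale step, so a group containing one outlier gets a larger step and coarser values for everything else in the group.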

The pass@k curves tell a similar story. On coding tasks such as LiveCodeBench, the quantized models follow curves that are very close to the original BF16 model. The main exception is the full NVFP4 version, which clearly falls behind.
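For reference, pass@k is commonly computed with the unbiased estimator below: generate n samples per problem, count the c correct ones, and estimate the probability that at least one of k randomly drawn samples is correct. The article does not say which estimator produced its curves, so treat this as the standard formulation rather than a description of the exact evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every draw of k samples contains at least one correct sample
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Per-problem estimates are then averaged over the benchmark.
print(pass_at_k(n=10, c=3, k=1), pass_at_k(n=10, c=3, k=5))
```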

More details on how to exploit pass@k here: How to Reduce LLM Inference Cost and Improve Accuracy with Pass@k and Majority Voting (Benjamin Marie, Apr 27).

Quantized Qwen3.6 Models Generate More Tokens

Generated token count appears to be negatively correlated with accuracy: the more tokens a quantized model produces, the lower its accuracy often becomes...
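One simple way to check a relationship like this on your own evaluation logs is a Pearson correlation between the average number of generated tokens and accuracy, with one data point per model and benchmark. The helper below is a minimal sketch; the article does not state how the correlation was assessed, and the numbers in the example call are made-up placeholders, not results from the article.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT measurements from the article).
avg_tokens = [1000, 1200, 1500, 2000]
accuracy = [0.80, 0.78, 0.75, 0.60]
print(pearson(avg_tokens, accuracy))  # negative when more tokens go with lower accuracy
```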


© 2026 The Kaitchup
