title: New Qwen3.5-35B-A3B Unsloth Dynamic GGUFs + Benchmarks
author: u/danielhanchen
contenttype: redditpost
publication: r/LocalLLaMA
published: 2026-02-27T18:23:50+00:00
sourceurl: https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/newqwen3535ba3bunslothdynamicggufsbenchmarks/
word_count: 1001
Hey r/LocalLLaMA! We just updated the Qwen3.5-35B Unsloth Dynamic quants, which are now SOTA at nearly all bit widths. We ran over 150 KL Divergence benchmarks, totalling 9TB of GGUFs, and uploaded all research artifacts. We also fixed a tool-calling chat template bug (affects all quant uploaders).
- We tested Bartowski, Ubergarm, AesSedai, Noctrex and our new Dynamic GGUFs
- 99.9% KL Divergence shows SOTA on Pareto Frontier for UD-Q4_K_XL, IQ3_XXS & more.
- Retiring MXFP4 from all GGUF quants: Q2_K_XL, Q3_K_XL and Q4_K_XL, except for a select few layers.
- Qwen3.5-35B-A3B GGUFs are updated to use new fixes (112B, 27B still converting, re-download once they are updated)
https://preview.redd.it/5hmdthgyp2mg1.png?width=2320&format=png&auto=webp&s=3dbd0480bbc38512a8bbbba0e4e01444feec99fb
- Imatrix definitely helps reduce KLD & PPL.
- I-quants (iq3_xxs, iq2_s, etc.) make inference 5-10% slower.
- Quantizing ssm_out (Mamba layers) or ffn_down_exps is not a good idea.
Some tensors are very sensitive to quantization
- We made over 9TB of research artifacts available for the community to investigate further on our Experiments page. It includes KLD metrics and all 121 configs we tested.
- We varied bit widths across each tensor type, and generated a best and worst Pareto Frontier plot below vs 99.9% KLD.
- For the best items to quantize, ffn_up_exps and ffn_gate_exps are generally ok to quantize to 3bit. ffn_down_exps is slightly more sensitive.
- For the worst items, quantizing ssm_out dramatically increases KLD while the disk space savings are minuscule. For example, ssm_out at q2_k does dramatically worse. The attn_* tensors are especially sensitive to quantization in hybrid architectures, so leaving them in higher precision works well.
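The Pareto frontier mentioned in the bullets above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used for the plots: `pareto_frontier` is a hypothetical helper, and the quant names, sizes and KLD values are made-up placeholders rather than measured results.

```python
from typing import List, Tuple

# (name, size in GB, 99.9% KLD). Hypothetical placeholder values, NOT measurements.
Quant = Tuple[str, float, float]


def pareto_frontier(quants: List[Quant]) -> List[Quant]:
    """Return the quants that no other quant beats on both size and KLD.

    Lower is better on both axes.
    """
    frontier: List[Quant] = []
    best_kld = float("inf")
    # Sort by size (ties broken by KLD), then sweep from smallest to largest,
    # keeping only entries that strictly improve on the best KLD seen so far.
    for quant in sorted(quants, key=lambda q: (q[1], q[2])):
        if quant[2] < best_kld:
            frontier.append(quant)
            best_kld = quant[2]
    return frontier


candidates: List[Quant] = [
    ("quant_a", 10.0, 0.90),
    ("quant_b", 12.0, 0.50),
    ("quant_c", 13.0, 0.70),  # dominated by quant_b: larger and worse KLD
    ("quant_d", 16.0, 0.20),
    ("quant_e", 16.0, 0.30),  # dominated by quant_d: same size, worse KLD
]

for name, size_gb, kld in pareto_frontier(candidates):
    print(f"{name}: {size_gb:.1f} GB, 99.9% KLD {kld:.2f}")
```

A quant on the frontier is one where you cannot get a lower KLD without paying for a larger file, which is why the frontier is a handy way to compare uploads at different sizes.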
https://preview.redd.it/pakdmbv1n2mg1.png?width=1183&format=png&auto=webp&s=be8940bf7c49157d1e34bb82053e70b44f0e1744
Tensor type vs bits on 99.9% KL Divergence
- We plot all quant levels vs 99.9% KLD, and sort from worst KLD to best. Quantizing ffn_* layers down too heavily is not a good idea.
- However, some bit widths are good, especially 3bit. For example, leaving ffn_* (down, up, gate) at around iq3_xxs seems to be the best compromise between disk space and 99.9% KLD change. 2 bits causes more degradation.
MXFP4 is much worse on many tensors: using MXFP4 for attn_gate, attn_q, ssm_beta or ssm_alpha is not a good idea, and Q4_K is better. For reference, MXFP4 uses 4.25 bits per weight, whilst Q4_K uses 4.5 bits per weight. When choosing between them, it's better to use Q4_K than MXFP4.
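The 4.25 and 4.5 bits-per-weight figures follow directly from the block layouts. The sketch below assumes the layouts as I understand them from llama.cpp (a 32-weight MXFP4 block with one shared 8-bit scale, and a 256-weight Q4_K super-block with 12 bytes of packed sub-block scales/mins plus two fp16 values); `bits_per_weight` is a hypothetical helper name.

```python
def bits_per_weight(block_bytes: int, weights_per_block: int) -> float:
    """Effective storage cost of a block-quantized format, in bits per weight."""
    return block_bytes * 8 / weights_per_block


# MXFP4 (assumed layout): 32 four-bit values (16 bytes) + one shared 8-bit scale.
mxfp4_bpw = bits_per_weight(block_bytes=16 + 1, weights_per_block=32)

# Q4_K (assumed layout): 256 four-bit values (128 bytes), 12 bytes of packed
# 6-bit sub-block scales and mins, and two fp16 super-block values (4 bytes).
q4_k_bpw = bits_per_weight(block_bytes=128 + 12 + 4, weights_per_block=256)

print(f"MXFP4: {mxfp4_bpw} bpw, Q4_K: {q4_k_bpw} bpw")
```

So Q4_K costs roughly 6% more storage per weight (4.5 / 4.25) than MXFP4 for the same tensor.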
https://preview.redd.it/xgugdgzmv2mg1.png?width=989&format=png&auto=webp&s=eddc2c32d343410a27f405289fd976e858d6f6a8
Imatrix works remarkably well
- Imatrix definitely helps weight the quantization process in the right way. For example, ssm_out at 2 bits was previously really bad, but imatrix reduces its 99.9% KLD by a lot.
- Imatrix generally helps on lower bits, and works on all quants and bit widths.
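To give a feel for what "weighting the quantization process" means, here is a heavily simplified sketch: for one block of weights, pick the scale that minimizes squared rounding error, where each weight's error is multiplied by an importance value. This is an illustrative toy and not llama.cpp's actual imatrix code; `best_scale`, `weighted_error`, the candidate-scale grid and the sample numbers are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def weighted_error(x: Sequence[float], q: Sequence[int], scale: float,
                   w: Sequence[float]) -> float:
    """Importance-weighted squared error between x and its dequantized form."""
    return sum(wi * (xi - scale * qi) ** 2 for xi, qi, wi in zip(x, q, w))


def best_scale(x: Sequence[float], w: Sequence[float], bits: int = 3,
               steps: int = 200) -> Tuple[float, List[int]]:
    """Grid-search a block scale that minimizes the weighted rounding error."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    amax = max(abs(v) for v in x)
    if amax == 0.0:
        return 0.0, [0] * len(x)
    best: Tuple[float, float, List[int]] = (float("inf"), 0.0, [])
    for i in range(steps + 1):
        # Candidate scales from 0.5x to 1.5x of the naive max-abs scale.
        scale = (amax / qmax) * (0.5 + i / steps)
        q = [min(qmax, max(qmin, round(v / scale))) for v in x]
        err = weighted_error(x, q, scale, w)
        if err < best[0]:
            best = (err, scale, q)
    return best[1], best[2]


# Toy block: one large outlier that (per the made-up importances) matters little.
block = [0.10, -0.40, 2.00, 0.05, -0.30, 0.20, 0.15, -0.25]
importance = [10.0, 10.0, 0.1, 10.0, 10.0, 10.0, 10.0, 10.0]
uniform = [1.0] * len(block)

scale_w, q_w = best_scale(block, importance)
scale_u, q_u = best_scale(block, uniform)
print("weighted error, imatrix-style scale:", weighted_error(block, q_w, scale_w, importance))
print("weighted error, unweighted scale:   ", weighted_error(block, q_u, scale_u, importance))
```

Because both searches run over the same candidate scales, the importance-aware choice can never do worse than the unweighted one when judged by the weighted error, and at low bit widths the gap tends to be larger since there are fewer levels to spend.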
https://preview.redd.it/yidhlf79o2mg1.png?width=1389&format=png&auto=webp&s=c9b5f1f6510d0aa5ebbf4b06ba9908947a21e93e
I-quants (iq3_xxs, iq2_s, etc.) make inference 5-10% slower. They're definitely better in terms of efficiency, but there is a tradeoff.
Benjamin’s recent MiniMax-M2.5 analysis shows a case where perplexity and KLD can still be very misleading. Unsloth Dynamic IQ2_XXS performs better than AesSedai’s IQ3_S on real world evals (LiveCodeBench v6, MMLU Pro) despite being 11GB smaller. Yet, AesSedai’s perplexity and KLD benchmarks suggest the opposite. (KLD: 0.3552 vs 0.2441; PPL: 9.0338 vs 8.2849 - lower is better).
https://preview.redd.it/hwif5hfex2mg1.png?width=1078&format=png&auto=webp&s=d6fef62ede6626f47991a3dbc90183b9d621d0bc
Perplexity and KLD can be misleading, but as a precaution we replaced the MXFP4 layers anyway. Real-world evals (LiveCodeBench v6 etc.) are much better benchmarks, but can take many days. This mismatch shows how lower perplexity or KLD doesn’t necessarily translate to better real-world performance. The graph also shows UD-Q4_K_XL outperforming other Q4 quants, while being ~8GB smaller.
This doesn’t mean perplexity or KLD is useless, as they provide a rough signal. So, going forward, we’ll publish perplexity and KLD for every quant so the community has some reference.
Updated GGUFs here: https://huggingface.co/collections/unsloth/qwen35
For more investigation deets and benchmarks you can read: https://unsloth.ai/docs/models/qwen3.5
Thank you for reading and once again for the feedback and incredible support. Huge thanks to the Qwen team as well for releasing Qwen3.5. If there are any suggestions please let us know and have a great Friday / weekend guys!
Benchmarking Details & Appreciation:
- We utilized bartowski's wonderful imatrix file to make the comparisons more fair - our Dynamic 2.0 method uses a conversational format, but we found benchmarking to be fairer if we used a more general imatrix
- We appreciated some friendly guidance from Ubergarm and the community!
- For perplexity we used the command below, with the BF16 model as the KLD base file.
LLAMA_SET_ROWS=1 ./llama.cpp/llama-perplexity --flash-attn on --fit off --batch-size 16384 --ubatch-size 16384 --device {device} --model {model} --ctx-size 512
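For readers less familiar with the two metrics, here is a minimal pure-Python sketch of what is being measured. It assumes that "99.9% KLD" means the 99.9th percentile of the per-token KL divergence between the BF16 model's next-token distribution and the quantized model's; the helper names and the toy logits are made up for illustration and this is not llama.cpp's implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; p is the reference (BF16), q the quantized model."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-probability assigned to the true tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy logits over a 3-token vocabulary at 3 positions (made-up numbers).
base_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [1.0, 1.0, 1.0]]
quant_logits = [[1.9, 1.1, 0.1], [0.4, 2.2, 0.6], [1.0, 0.9, 1.1]]

klds = [kl_divergence(softmax(b), softmax(q))
        for b, q in zip(base_logits, quant_logits)]
print("mean KLD: ", sum(klds) / len(klds))
print("99.9% KLD:", percentile(klds, 99.9))
```

The high percentile focuses on the worst-behaved tokens instead of the average, so a quant that is fine on most tokens but occasionally diverges badly still shows up.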
Score: 546 | Comments: 213 | Subreddit: r/LocalLLaMA
Top Comments
u/Round_Document6821 (116 pts):
Indeed, double-checking on downstream tasks is a must these days since PPL and KLD are not enough.
Nice analysis from the Unsloth team. Feels like this is research in itself, actually :D
u/Digger412 (93 pts):
Hi Daniel, AesSedai here - thanks for publishing this research! KLD and PPL don't tell the entire story but they are good starting points when deciding what quantization (both uploader and which quant level) to use! I'm happy to see more investigation being done here as it benefits the entire community.
I think it helps that this model is very accessible to test, too - many of the recent releases have been larger MoEs (GLM-5, M2.5, Step-3.5, etc.) and that makes doing this comparison challenging for the average person in terms of required compute, disk space, and time. This Qwen3.5-35B-A3B is very accessible in comparison.
I had recently tried to PR some of IK's quants into mainline but that was shut down, and I know pwilkin has a PR up now for mainline llama.cpp for a new quant type IQ3_PT. Seeing more research and effort being put into quantization research is awesome.
Thanks again for the post!
u/Far-Low-4705 (66 pts):
going forward, we’ll publish perplexity and KLD for every quant so the community has some reference.
This is absolutely huge. Honestly, this should already have been standard, but it is absolutely an extremely useful addition
U guys rock
u/Educational_Rent1059 (38 pts):
Holy… this is how testing should be done!!! Insane work
u/VoidAlchemy (48 pts):
ubergarm here, thanks for sharing more of your methodologies and results such that others can reproduce and analyze the data too! (the AesSedai KLD logs are missing at the moment tho, probably forgot to upload them into the HF repo?).
As most folks know quantizing is all trade-offs. Thanks for including my mainline compatible Vulkan optimized "Q4_0" custom mix which performs quite well given legacy quantization methods which can be faster for AMD hardware backends.
I'll have to cook more of my usual ik_llama.cpp's SOTA quantization types and see how they fare too as they offer the best quality in the given memory footprint over mainline quants but require CUDA backend.
Cheers and good job cleaning up the bugged quants and taking the opportunity to improve your recipes!
u/segfawlt (53 pts):
I love the smell of fresh sloth in the morning
Thanks so much for this work! There's been a real uptick in this sub of detailed quant comparisons for this release cycle; it's really nice to see and has been really helpful!
u/sine120 (10 pts):
This is excellent. I'm glad to see the chart up on the GGUF HF page. When I was starting out, I had no idea which one to pick so pretty much picked at random. Would love to see more info to assist with that for newcomers when they find the GGUF. Any plans on doing the same analysis on the dense model? I'm very curious to know how the sub-16GB quants actually perform.
u/jslominski (9 pts):
https://preview.redd.it/wgdrj9qd83mg1.png?width=1099&format=png&auto=webp&s=32b29fc6e17546da5418558ddba6b15f1fa885d1
Great job, thanks. I just tested UD-IQ2_XXS.gguf on an RP5 16GB with full RAM load, and I’m getting 2.8 t/s with a 16k context (with llama.cpp built with KleidiAI-optimised kernels). Pretty dope!
The Pi draws 12W tops (no SSD HAT, etc.). It’s roughly half the performance on the 8GB model.
I’m waiting for the NVMe HAT and a better cooler to arrive so I can push a bit more out of it, since I’m having throttling issues with my current setup.
It's amazing what I can do now, thanks again!
u/xXprayerwarrior69Xx (7 pts):
I love you guys so much
u/Longcommentsan (7 pts):
I'm gonna exaggerate a little, but to me it seems the best way is to just use Q4. It's basically a model-independent statement. It seems to be the best bang for the buck. There's a very large question as to whether you should go below Q4 or just pick a smaller model. Savings below Q4 are not that massive unless the model itself is above 150B total, but the hit to the head becomes quite visible.
u/fallingdowndizzyvr (9 pts):
Sweet. Thanks for that. I'll also be retiring using MXFP4. I bought into the hype and I have been using those when I needed a 4 bit quant.
u/JumboShock (5 pts):
I’m too dumb to understand what I should take away from this really impressive looking analysis…
My previous understanding was that the UD quants had some issues when used on MoE models like the Qwen3.5-35B-A3B, and the regular GGUFs were a better choice. Do the updated UD quants address this? Is there still a performance gap?
u/Chromix_ (6 pts):
You hid some unexpected free lunch in the middle of your posting.
Unsloth Dynamic IQ2_XXS performs better than AesSedai’s IQ3_S on real world evals (LiveCodeBench v6, MMLU Pro) despite being 11GB smaller. Yet, AesSedai’s perplexity and KLD benchmarks suggest the opposite. (KLD: 0.3552 vs 0.2441; PPL: 9.0338 vs 8.2849 - lower is better).
Now maybe looking at the quant type differences between the individual layers might yield some insight into what affects real-world tasks that much, despite better perplexity, and could thus be used to make better quants. That's still counter-intuitive though, a token distribution that's closer to the original model should perform better, unless you won the regularization lottery with that quant.
u/am17an (10 pts):
Btw did you guys pick up the latest changes to fuse gate and up exps? Gives a nice PP boost on CUDA
u/ItankForCAD (8 pts):
Can the mmproj be appreciably quantized? If so, what is the influence of different quants?
u/sir_creamy (8 pts):
I know it isn't GGUF format, but I'd love to see a comparison to nvfp4 in the charts. AFAIK, Nvidia is saying nvfp4 is awesome mostly because of the accuracy improvement vs other 4-bit quants. However, from your charts it looks like your 4-bit quant is already very close to 8-bit.
thank you so much for everything you're doing!