Twitter/X

Brief

@stevibe ran a long-horizon reasoning and tool-calling benchmark: six open-source LLMs each tried to solve a scrambled 3×3 sliding puzzle through a move_tile tool. Each model got five runs per scramble depth with the best run kept, and failed a round if it exceeded 6× the optimal move count. As depth increased, GLM 5.1, Gemma4 26B, and both DeepSeek V4 variants dropped out. At the final depth of 22, Qwen3.6 35B-A3B (~3B active parameters, fits on a single 3090) solved the board in 36 moves, while Kimi K2.6 used 81 moves and left the board still scrambled.

Why it matters

Test setup: @stevibe evaluated six open-source LLMs on a scrambled 3×3 sliding puzzle using a move_tile tool, running five trials per scramble depth (best run kept); a model was marked failed if it exceeded 6× the optimal move count.
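
The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the author's harness: the post does not say how the optimal move count was computed, so the sketch assumes a breadth-first search over board states, and the goal layout and the names `GOAL`, `optimal_moves`, and `round_passes` are hypothetical.

```python
from collections import deque

# Assumed goal layout (not stated in the post); 0 marks the blank square.
GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)


def neighbors(state):
    """Yield every board reachable by sliding one tile into the blank."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            board = list(state)
            target = r * 3 + c
            board[blank], board[target] = board[target], board[blank]
            yield tuple(board)


def optimal_moves(start, goal=GOAL):
    """Fewest slides from start to goal via BFS, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        for nxt in neighbors(state):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def round_passes(run_move_counts, optimal, limit_factor=6):
    """Apply the 'best run kept, fail above 6x optimal' rule.

    run_move_counts holds the moves used in each run, with None for a
    run that never reached the goal (a hypothetical encoding).
    """
    solved = [m for m in run_move_counts if m is not None]
    return bool(solved) and min(solved) <= limit_factor * optimal
```

With this rule, a board whose optimal solution is 7 moves allows at most 42, so `round_passes([43, None, 50], 7)` is `False` while `round_passes([42], 7)` is `True`.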

Key details

  • Progressive failures by depth: at depth 5, every model solved the puzzle; at depth 10, GLM 5.1 melted down with 43 moves and was cut; at depth 12, Gemma4 26B lost the plot, shuffling tiles in circles; at depth 15, DeepSeek V4 Flash, DeepSeek V4 Pro, Gemma4, and GLM 5.1 all failed; at depth 18, only Qwen3.6 35B-A3B and Kimi K2.6 still solved it; at depth 22 (final), Qwen3.6 solved it in 36 moves, Kimi used 81 moves and left the board still scrambled, and DeepSeek V4 Pro finished at 90.
  • Surprising winner: Qwen3.6 35B-A3B performed best despite being the smallest model tested, with ~3B active parameters and a footprint that fits on a single RTX 3090. Kimi K2.6 produced an 11-move solve at depth 15 that the author said "looked like cheating"; the author calls Kimi 'elegant' and Qwen3.6 'unstoppable.'

Source evidence

Six open-source LLMs. One sliding puzzle. A brutal test of long-horizon reasoning and tool calling.

Five of them broke. One didn't.

I gave each model a move_tile tool and a scrambled 3×3 board, then asked it to solve the puzzle through pure turn-by-turn reasoning. The deeper the scramble, the more brutal the search.

Five runs per depth, best run kept. A model fails the round if it exceeds 6x the optimal move count.

> Depth 5:
Everyone solves it. Yawn.

> Depth 10:
GLM 5.1 melts down. 43 moves. Cut.

> Depth 12:
Gemma4 26B loses the plot, shuffling tiles in circles. Gone.

> Depth 15:
The wall. DeepSeek V4 Flash, out. DeepSeek V4 Pro, out. Gemma4, out again. GLM 5.1, out.

Two survivors: Qwen3.6 35B-A3B, and Kimi K2.6 with an 11-move solve that looked like cheating.

> Depth 18:
Same two. Everyone else hallucinating tiles that weren't there.

> Depth 22:
Final boss. Kimi, flawless for five rounds, finally cracks. 81 moves. Still scrambled. DeepSeek V4 Pro limps home at 90.

Qwen3.6 35B-A3B solves it in 36.

The smallest model in the room. ~3B active params. Fits on a single 3090. It beat everything.

Kimi was elegant. Qwen3.6 was unstoppable.
