Six open-source LLMs. One sliding puzzle. A brutal test of long-horizon reasoning and tool calling.
Five of them broke. One didn't.
I gave each model a move_tile tool and a scrambled 3×3 board, then asked it to solve the puzzle through pure turn-by-turn reasoning. The deeper the scramble, the harder the search.
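For a feel of what the model is working with, here is a minimal sketch of such a tool in Python. This is not the actual harness: the board encoding (a 9-tuple with 0 as the blank), the `move_tile(board, tile)` signature, and the `movable_tiles` and `scramble` helpers are all assumptions made for illustration. It also assumes "depth" means the number of random moves applied to the solved board.

```python
import random

# Assumed encoding: row-major 3x3 board, 0 marks the blank square.
GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)

def move_tile(board, tile):
    """Slide `tile` into the blank if the two are adjacent; return the new board."""
    t, b = board.index(tile), board.index(0)
    tr, tc = divmod(t, 3)
    br, bc = divmod(b, 3)
    # Legal only when the tile sits directly above, below, left or right of the blank.
    if abs(tr - br) + abs(tc - bc) != 1:
        raise ValueError(f"tile {tile} is not next to the blank")
    cells = list(board)
    cells[t], cells[b] = cells[b], cells[t]
    return tuple(cells)

def movable_tiles(board):
    """Tiles that can legally slide into the blank right now."""
    br, bc = divmod(board.index(0), 3)
    spots = ((br - 1, bc), (br + 1, bc), (br, bc - 1), (br, bc + 1))
    return [board[r * 3 + c] for r, c in spots if 0 <= r < 3 and 0 <= c < 3]

def scramble(depth, seed=0):
    """Apply `depth` random legal moves to the solved board, never undoing the last one."""
    rng = random.Random(seed)
    board, last = GOAL, None
    for _ in range(depth):
        tile = rng.choice([t for t in movable_tiles(board) if t != last])
        board, last = move_tile(board, tile), tile
    return board
```

Under these assumptions a depth-N scramble can always be solved in at most N moves, and the true optimum may be shorter, because a random walk can loop back on itself.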
Five runs per depth, best run kept. A model fails the round if it exceeds 6× the optimal move count.
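The scoring rule can be sketched roughly like this. It is an illustration, not the real scoring code: `optimal_moves` uses plain breadth-first search, and `round_result` assumes that "best run" means the fewest moves among runs that actually solved the board, and that an unsolved run never counts as a pass. All names here are hypothetical.

```python
from collections import deque

# Assumed encoding: row-major 3x3 board, 0 marks the blank square.
GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)

def neighbors(board):
    """Yield every board reachable by sliding one tile into the blank."""
    b = board.index(0)
    r, c = divmod(b, 3)
    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= nr < 3 and 0 <= nc < 3:
            n = nr * 3 + nc
            cells = list(board)
            cells[b], cells[n] = cells[n], cells[b]
            yield tuple(cells)

def optimal_moves(start):
    """Shortest solution length via breadth-first search (None if unsolvable)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        board, dist = queue.popleft()
        if board == GOAL:
            return dist
        for nxt in neighbors(board):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def round_result(start, runs, limit_factor=6):
    """runs: list of (moves_used, solved) pairs, one per attempt.

    Returns (best_solved_move_count_or_None, passed).
    """
    optimum = optimal_moves(start)
    if optimum is None:
        raise ValueError("start board is unsolvable")
    solved = [moves for moves, ok in runs if ok]
    best = min(solved) if solved else None
    passed = best is not None and best <= limit_factor * optimum
    return best, passed
```

For example, on a board whose optimum is 2 moves, the cutoff is 12: a best solved run of 12 passes, 13 fails, and a round with no solved run fails regardless of move count.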
> Depth 5:
Everyone solves it. Yawn.
> Depth 10:
GLM 5.1 melts down. 43 moves. Cut.
> Depth 12:
Gemma4 26B loses the plot, shuffling tiles in circles. Gone.
> Depth 15:
The wall. DeepSeek V4 Flash, out. DeepSeek V4 Pro, out. Gemma4, out again. GLM 5.1, out again.
Two survivors: Qwen3.6 35B-A3B, and Kimi K2.6 with an 11-move solve that looked like cheating.
> Depth 18:
Same two. Everyone else hallucinating tiles that weren't there.
> Depth 22:
Final boss. Kimi, flawless for five rounds, finally cracks. 81 moves. Still scrambled. DeepSeek V4 Pro limps home at 90.
Qwen3.6 35B-A3B solves it in 36.
The smallest model in the room. ~3B active params. Fits on a single 3090. It beat everything.
Kimi was elegant. Qwen3.6 was unstoppable.