W4M · Graph of Models · Blog
99.82% of 13,000 hard Sudoku puzzles.
97,000 parameters.
Frontier language models guess, backtrack, and get stuck. Purpose-built reasoning systems from venture-backed labs reach 96.2%. GoM solves 99.82% of a 13,000-puzzle hard benchmark at a median of 120ms per puzzle, with a model small enough to fit in this page, trained in 26 minutes on a MacBook.
Watch it solve
17 clues is the proven minimum for a Sudoku to have a unique solution, which makes these the sparsest puzzles that can exist. This replay shows a real recorded run of the actual model: amber digits are attempts it reconsidered; green are final.
The numbers
| | GoM (ours) | Kona 1.0 | Frontier LLMs |
|---|---|---|---|
| Hard Sudoku solved (13,000-puzzle benchmark) | 99.82% | 96.2% | solve rates collapse on hard grids |
| Median solve time | 120ms / puzzle | "seconds" | minutes, when they succeed |
| Model size | 97K parameters | not disclosed | hundreds of billions |
| Training hardware | one MacBook | not disclosed | GPU datacenters |
| Training time | 26 minutes | not disclosed | weeks to months |
| Code execution / external solver | none | none | often required |
Benchmark: 13,000 unique-solution, hard-rated puzzles, drawn as a fixed-seed sample from the 50th–90th percentile solver-difficulty band of the public Sudoku-Extreme test set, none seen in training; 12,976 were solved. Solve time: 120ms median wall-clock per puzzle (mean 160ms), measured end-to-end on an Apple M5 laptop with our compiled search core. The Kona 1.0 figure is as published by Logical Intelligence. The minimum-clue (17-clue) frontier is harder still; that tier is where our research continues.
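The headline percentage follows directly from the counts in the note above. A minimal sketch of that arithmetic, using only the published figures (the variable names are ours, for illustration):

```python
# Counts as stated in the benchmark note.
solved = 12_976
total = 13_000

solve_rate = solved / total   # fraction of puzzles solved
unsolved = total - solved     # puzzles the model did not solve

print(f"solve rate: {solve_rate:.2%}")  # solve rate: 99.82%
print(f"unsolved:   {unsolved}")        # unsolved:   24
```

In other words, the 0.18% gap corresponds to 24 puzzles out of 13,000.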
Why this matters
Sudoku at the 17-clue limit cannot be pattern-matched. Every cell depends on the whole grid, and one wrong commitment cascades into contradiction. Language models fail here for a structural reason: they reason token by token, locally. GoM evaluates the entire puzzle as a single structured object and refines the whole solution at once.
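To make the cascade concrete, here is a minimal sketch of ordinary Sudoku constraint checking. It illustrates the rules of the puzzle only and is not GoM's method; the `candidates` helper and the toy grid are hypothetical, written for this example.

```python
def candidates(grid, r, c):
    """Digits that can legally go in cell (r, c) of a 9x9 grid (0 = empty)."""
    used = set(grid[r])                          # digits in the same row
    used |= {grid[i][c] for i in range(9)}       # digits in the same column
    br, bc = 3 * (r // 3), 3 * (c // 3)          # top-left corner of the 3x3 box
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used

# Toy position: row 0 holds 1..8, so its last cell can only be 9.
grid = [[0] * 9 for _ in range(9)]
grid[0][:8] = [1, 2, 3, 4, 5, 6, 7, 8]
before = candidates(grid, 0, 8)

# Placing 9 directly below looks legal when judged locally...
locally_legal = 9 in candidates(grid, 1, 8)
grid[1][8] = 9

# ...but it leaves the cell above with no legal digit at all.
after = candidates(grid, 0, 8)

print(before, locally_legal, after)  # {9} True set()
```

Here the contradiction surfaces one step later. On hard puzzles it can take many placements to appear, which is why committing cell by cell is fragile.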
The result is not just accuracy; it is efficiency. A reasoning system roughly six orders of magnitude smaller than a frontier LLM, trained in the time it takes to drink a coffee, on hardware you already own. Small enough for a phone. Small enough for a watch. We know, because we run it on both.
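A back-of-envelope check of the size gap, taking the 97K figure from the table and treating "hundreds of billions" as anywhere from 100 to 900 billion parameters. That range is our assumption, since frontier model sizes are generally not published.

```python
import math

gom_params = 97_000  # from the comparison table

def orders_of_magnitude(llm_params: float) -> float:
    """Base-10 log of how many times larger the LLM is than GoM."""
    return math.log10(llm_params / gom_params)

# Assumed bounds for "hundreds of billions" of parameters.
low = orders_of_magnitude(100e9)
high = orders_of_magnitude(900e9)

print(f"{low:.2f} to {high:.2f} orders of magnitude")  # 6.01 to 6.97 orders of magnitude
```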
GoM is a general reasoning architecture — Sudoku is one of its proving grounds. More results soon. w4m.ai