W4M · Graph of Models · Blog

99.82% of 13,000 hard Sudoku puzzles.
97,000 parameters.

Frontier language models guess, backtrack, and get stuck. Purpose-built reasoning systems from venture-backed labs reach 96.2%. GoM (Graph of Models) solves 99.82% of a 13,000-puzzle hard benchmark at a median of 120ms per puzzle, with a model small enough to fit in this page, trained in 26 minutes on a MacBook.

Watch it solve

17 clues is the proven minimum for a Sudoku to have a unique solution, the sparsest class of puzzle that can exist. This replay shows a real recorded run of the actual model: amber digits are attempts it reconsidered; green are final.

The numbers

| | GoM (ours) | Kona 1.0 | Frontier LLMs |
|---|---|---|---|
| Hard Sudoku solved (13,000-puzzle benchmark) | 99.82% | 96.2% | solve rates collapse on hard grids |
| Median solve time | 120ms / puzzle | "seconds" | minutes, when they succeed |
| Model size | 97K parameters | not disclosed | hundreds of billions |
| Training hardware | one MacBook | not disclosed | GPU datacenters |
| Training time | 26 minutes | not disclosed | weeks to months |
| Code execution / external solver | none | none | often required |

Benchmark: 13,000 unique-solution hard-rated puzzles (50th–90th percentile solver-difficulty band of the public Sudoku-Extreme test set, fixed-seed sample, none seen in training) — 12,976 solved. Solve time: 120ms median wall-clock per puzzle (mean 160ms) measured end-to-end on an Apple M5 laptop with our compiled search core. Kona 1.0 figure as published by Logical Intelligence. The minimum-clue (17-clue) frontier is harder still — that tier is where our research continues.
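The headline percentage follows directly from the counts in the note above. A minimal check, using only the two figures stated there (the function name is ours, for illustration):

```python
# Counts as stated in the benchmark note.
TOTAL_PUZZLES = 13_000
SOLVED_PUZZLES = 12_976


def solve_rate_percent(solved: int, total: int) -> float:
    """Share of puzzles solved, as a percentage."""
    return 100.0 * solved / total


unsolved = TOTAL_PUZZLES - SOLVED_PUZZLES
rate = solve_rate_percent(SOLVED_PUZZLES, TOTAL_PUZZLES)

print(f"unsolved puzzles: {unsolved}")  # 24
print(f"solve rate: {rate:.2f}%")       # 99.82%
```

In other words, 24 of the 13,000 sampled puzzles were not solved.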

Why this matters

Sudoku at the 17-clue limit cannot be pattern-matched. Every cell depends on the whole grid, and one wrong commitment cascades into contradiction. Language models fail here for a structural reason: they reason token by token, locally. GoM evaluates the entire puzzle as a single structured object and refines the whole solution at once.
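To make "every cell depends on the whole grid" concrete, here is a small sketch of the standard Sudoku constraint structure. It is not GoM's code and says nothing about how the model works internally; the names `peers` and `within_two_steps` are ours. It shows that each cell is directly constrained by 20 others, and that any two cells on the board are at most two constraint links apart, which is why one wrong digit can propagate anywhere.

```python
from itertools import product

# All 81 (row, col) coordinates of a standard 9x9 grid.
CELLS = list(product(range(9), repeat=2))


def peers(row: int, col: int) -> set[tuple[int, int]]:
    """Cells sharing a row, column, or 3x3 box with (row, col)."""
    same_row = {(row, c) for c in range(9)}
    same_col = {(r, col) for r in range(9)}
    box_r, box_c = 3 * (row // 3), 3 * (col // 3)
    same_box = {
        (r, c)
        for r in range(box_r, box_r + 3)
        for c in range(box_c, box_c + 3)
    }
    return (same_row | same_col | same_box) - {(row, col)}


def within_two_steps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if a and b are equal, are peers, or share at least one peer."""
    if a == b or b in peers(*a):
        return True
    return bool(peers(*a) & peers(*b))


all_linked = all(within_two_steps(a, b) for a in CELLS for b in CELLS)

print(len(peers(0, 0)))  # 20 direct constraints per cell
print(all_linked)        # True: no pair of cells is more than two links apart
```

The second result holds because for any two cells in different rows and columns, the cell at the first one's row and the second one's column is a peer of both.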

The result is not just accuracy; it is efficiency. A reasoning system roughly six orders of magnitude smaller than a frontier LLM (97K parameters against hundreds of billions), trained in the time it takes to drink a coffee, on hardware you already own. Small enough for a phone. Small enough for a watch. We know, because we run it on both.

GoM is a general reasoning architecture — Sudoku is one of its proving grounds. More results soon. w4m.ai