Series · The Edge-Native Future

Research Notes

Results from our Graph of Models work — a small reasoning core that scales on a clean power law, learns continuously, and runs on-device. What it does, not how.

W4M Research · A fourteen-part series · July 2026

Overview The series so far

Three weeks, measured

Every published result on one page: a dated timeline of the cadence, and a map of the whole GoM stack — the core models we train, the brains that attach to frozen LLMs, and the innovations around them. Every number links to its note.

See the map

Note 1 Scaling

A 962M-parameter run that follows a clean power law and lands on its own final loss. One recipe holds across a 32× ladder from 30M to 962M parameters, with zero divergences.

Read the note

Note 2 Learning

It learns while you use it

Adaptation in ~6 trials, a measured gain from a single round of self-play (83 → 86%), and +53 points on grade-school math (GSM8K) added to a frozen 8B model — without retraining it.

Read the note

Note 3 Thinking

Reasoning that holds at depth

On Tower of Hanoi at 11 disks and beyond, every frontier model we tested scores zero. Committee search went 100-for-100 at 11 and 12 disks and 20-for-20 at 13 and 14, and the subgoal architecture solved the full 30-disk tower: 1,073,741,823 moves, every one exact. Same answer every run.

Read the note

Note 4 Resources

Reasoning at phone scale

Serious reasoning in a 2.54 GB package, under two watts, on a phone — no datacenter, no data egress, no per-token bill.

Read the note

Note 5 The future

The swarm that grows with you

Our thesis: millions of small models that learn with you, coordinate with the cloud, and share what they learn — so the whole network grows together.

Read the note

Note 6 On-device

The assistant that stays on your phone

A private voice assistant that does its thinking on the device, tested head-to-head against the new Siri. A tie on accuracy (n = 6), every query kept on the phone, and zero confidently-wrong answers in the test set.

Read the note

Note 7 New · The life cycle

It sleeps, it remembers, it grows up

The continual-learning core given a full life: it taught itself from 87% to a perfect score with zero human data, mastered a never-seen tool in one day of corrections, keeps memories on a spaced schedule (81% vs 32% at day ten) — and lifted a completely stock LLM by 64 points through a 3 MB thought-token adapter, with the pair scoring higher than either alone.

Read the note

Note 8 New · The swarm

The swarm, measured

Note 5's thesis gets its first data: three brains lived three different lives, synced overnight, and every member woke up perfect on the full exam — including material only its teammates had seen (45.6 → 100%, replicated ×3). Experience transplants between units as a file: 0 → 100% with no retraining.

Read the note

Note 9 New · Follow-up to Note 3

Thirty disks. 1,073,741,823 moves. Zero errors.

The wall from Note 3, revisited: the decomposition architecture produced the full 30-disk Tower of Hanoi plan (1,073,741,823 moves, exactly the textbook optimum of 2^30 − 1, every move verified) in ~9 minutes on a laptop CPU, a 4,200,000× depth extrapolation over the neural rung's training. Decomposition turns impossible problems into compositions of proven ones.
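For readers who want the arithmetic behind that move count, here is a minimal Python sketch of the classic recursive decomposition for Tower of Hanoi, plus a checker that replays every move. This is the textbook algorithm, not the architecture described in the note; the names `hanoi_moves`, `verify`, and `optimal_move_count` are illustrative and do not come from the series.

```python
def hanoi_moves(n, src="A", dst="C", via="B"):
    """Yield the optimal moves for n disks as (disk, from_peg, to_peg)."""
    if n == 0:
        return
    # Sub-problem 1: park the top n-1 disks on the spare peg.
    yield from hanoi_moves(n - 1, src, via, dst)
    # Move the largest disk directly to its destination.
    yield (n, src, dst)
    # Sub-problem 2: bring the n-1 disks from the spare peg onto it.
    yield from hanoi_moves(n - 1, via, dst, src)


def verify(n, moves):
    """Replay moves on three pegs and return the move count.

    Raises ValueError if any move is illegal or the tower is not
    fully rebuilt on peg C at the end.
    """
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    count = 0
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            raise ValueError(f"disk {disk} is not on top of peg {src}")
        if pegs[dst] and pegs[dst][-1] < disk:
            raise ValueError(f"disk {disk} placed on a smaller disk")
        pegs[dst].append(pegs[src].pop())
        count += 1
    if pegs["C"] != list(range(n, 0, -1)):
        raise ValueError("tower was not completed on peg C")
    return count


def optimal_move_count(n):
    """Closed form for the minimum number of moves: 2^n - 1."""
    return 2 ** n - 1


if __name__ == "__main__":
    print(verify(10, hanoi_moves(10)))   # 1023
    print(optimal_move_count(30))        # 1073741823
```

The sketch replays a small tower and uses the closed form for 30 disks, since replaying over a billion moves in pure Python is possible but slow. Each call reduces the problem to two smaller instances of itself plus one direct move, which is the sense in which a decomposition composes already-solved pieces.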

Read the note

Note 10 New · Exploratory research

What we're borrowing from consciousness science

Zero claims of machine consciousness — but the science is an engineering sourcebook, and our architecture already implements measurable analogs of several mechanisms. Two imports running as experiments, plus the exploration we care most about: multi-channel rewards and an empathic, do-no-harm valuation channel — by architecture, not by pledge.

Read the note

Note 11 Precision

Sudoku’s hardest tier: solved. 4,865 for 4,865.

July 6: the only published learned-solver score on the complete 17-clue tier (96.86%). July 9: the same 400 KB model reaches 100.00% — all 4,865 puzzles, independently audited — after testing a competitor’s own published mechanism on our architecture. Fast mode: 98.7% in one 0.2s deterministic pass.

Read the note

Note 12 Planning

Planning: past GPT-4, at one-millionth the size

On PlanBench — the benchmark that exposed how badly language models plan — our ~1M-parameter brain produces 37–44% fully valid plans on 500 real instances vs GPT-4's published ~34%. Every plan machine-verified, replicated across two independent trainings, randomized names by default.

Read the note

Note 13 New · Frozen-model composition

Three sizes of Nemotron, one upgrade path

NVIDIA's Nemotron family measured before and after GoM, with zero weights touched: the 550B flagship +69% relative on ARC-AGI-2 at the identical reasoning budget (13 → 22 / 120, self-run), the 120B Super +23 points on GSM8K (53.5 → 76.5%, n=200, interim), and the 3B-active Nano +12.5 points from a brain trained in under two hours on one GPU.
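The figures above mix two units: a relative gain in percent (ARC-AGI-2) and absolute gains in percentage points (GSM8K). A small sketch of that arithmetic, using only numbers quoted in this note; the helper names are illustrative.

```python
def relative_gain_pct(before, after):
    """Relative improvement, in percent of the starting value."""
    return (after - before) / before * 100.0


def point_gain(before_pct, after_pct):
    """Absolute improvement, in percentage points."""
    return after_pct - before_pct


def score_pct(correct, total):
    """Convert a raw count into a percentage of the total."""
    return correct / total * 100.0


if __name__ == "__main__":
    # ARC-AGI-2: 13 -> 22 solved out of 120.
    print(f"{relative_gain_pct(13, 22):.1f}")                    # 69.2
    print(f"{score_pct(13, 120):.1f} -> {score_pct(22, 120):.1f}")  # 10.8 -> 18.3
    # GSM8K: 53.5% -> 76.5%.
    print(point_gain(53.5, 76.5))                                # 23.0
```

Read this way, the ARC-AGI-2 result is a 7.5-point absolute gain (about 10.8% to 18.3% of 120) that is also a 69% relative gain, while the GSM8K result is a 23-point absolute gain.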

Read the note

Note 14 New · Smaller and better

We made the brain leaner. It scored higher.

Everyone assumes a brain gets better by adding parts. We asked the opposite — was ours carrying dead weight? A leaner variant (fewer moving parts, ~16% fewer parameters) was expected to regress a little. It scored the highest number in the series: on the same frozen Nemotron Nano, same 376M-class budget, 25.0 → 82.5% on GSM8K (+57.5, n=200) — cleanest outputs we've measured and a clean do-no-harm sweep. One week, one model: +12.5 → +44.5 → +57.5.

Read the note

Demos Interactive · recorded runs

Seven interactive demos replayed from real logged runs: watch it learn in-session, reason step by step, recover from silent rule switches, produce byte-identical answers, and solve live.

Open the demo deck

Qualified partners: request access to the full evidence →