Goedel-Architect, a Princeton framework running on an open-weight DeepSeek model, scored 75.6% on the PutnamBench math benchmark for $294 in compute. A comparable pipeline powered by Google's Gemini scored a lower 70% while reportedly spending roughly $170,000, a gap the paper describes as up to 500x. The difference was not the model. It was how the agent was built.

Formal theorem proving is among the most unforgiving tasks in machine learning. There is no partial credit, no plausible-sounding answer that slips past a tired grader. A proof written in Lean 4, the proof assistant and programming language mathematicians use to make their reasoning machine-checkable, either compiles or it does not. Every step is verified by the computer, line by line. An AI that is merely confident gets nothing.
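To make "compiles or it does not" concrete, here is a deliberately tiny Lean 4 illustration. It is not taken from the paper or from PutnamBench, and the theorem names are made up for this example; it only shows the all-or-nothing character of the check.

```lean
-- Accepted: both sides reduce to the same numeral, so `rfl` closes the goal.
theorem two_add_two : 2 + 2 = 4 := rfl

-- Rejected if uncommented: 2 + 2 does not reduce to 5, so Lean reports an error.
-- theorem two_add_two_bad : 2 + 2 = 5 := rfl
```

A competition-level proof runs to many more lines, but the verdict works the same way: the file is accepted in full or it is not accepted.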

That is what makes the number at the center of a new Princeton paper so striking. On June 4, researchers at Princeton Language and Intelligence published Goedel-Architect, an agentic framework for proving math theorems in Lean 4. On PutnamBench, a benchmark drawn from the famously brutal William Lowell Putnam Mathematical Competition, it solved 75.6% of the 672 problems. The total compute bill came to $294.

A competing open-source system called Hilbert, powered by Google's Gemini, scored lower on the same benchmark while spending close to $170,000, per reporting on the paper. The Princeton team's own description of the gap, stated in the paper's abstract, is that Goedel-Architect reaches state-of-the-art results for an open-source pipeline "at a price point up to 500x less than comparable open-source pipelines."

The lesson that has circulated through AI research circles since is not really about mathematics. It is about agents.

The Blueprint Replaces the Spiral

To understand why Goedel-Architect is cheap, you have to understand how most AI theorem provers waste money.

The dominant approach is recursive decomposition. The system takes a hard theorem, breaks it into sub-problems, breaks those into smaller sub-problems, and keeps going, with each layer spawning its own search and consuming its own compute budget. The trouble is that the system commits to this tree before it knows whether the branches lead anywhere. It can spiral deep into a dead-end strategy, burning thousands of dollars in model calls, only to discover near the bottom that the whole approach was wrong.

Goedel-Architect does something different. Before it tries to prove anything, it drafts a blueprint: a dependency graph of every definition and lemma needed to reach the final theorem, with the relationships between them mapped out in advance. A blueprint is essentially a plan that a compiler can sanity-check. Only after the plan exists does the system dispatch the open pieces, the unproven lemma nodes, to a tool-equipped Lean prover that works on them in parallel. When a lemma fails, that failure feeds back and refines the global blueprint rather than triggering a fresh recursive descent.
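The plan-then-dispatch loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the blueprint contents, the `try_prove` stub, and the `run_blueprint` helper are all hypothetical, and a real system would call a Lean prover and revise the blueprint where this sketch merely records a failure.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical blueprint: each node maps to the set of nodes it depends on.
BLUEPRINT = {
    "main_theorem": {"lemma_a", "lemma_b"},
    "lemma_a": {"def_x"},
    "lemma_b": {"def_x"},
    "def_x": set(),
}


def try_prove(node: str) -> bool:
    """Stand-in for a call to a Lean prover; always succeeds in this sketch."""
    return True


def run_blueprint(blueprint, prover, max_workers=4):
    """Attempt nodes in dependency order, running independent nodes in parallel.

    Returns (proved, failed). Nodes that depend on a failed node are never
    attempted, so a bad branch stops consuming prover calls.
    """
    sorter = TopologicalSorter(blueprint)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    proved, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = list(sorter.get_ready())
            if not ready:
                break  # everything left is blocked behind a failed node
            results = list(pool.map(prover, ready))
            for node, ok in zip(ready, results):
                if ok:
                    proved.append(node)
                    sorter.done(node)  # unlocks the nodes that depend on it
                else:
                    failed.append(node)  # a real system would refine the plan here
    return proved, failed


if __name__ == "__main__":
    print(run_blueprint(BLUEPRINT, try_prove))
```

The design point is that the dependency graph exists before any prover call is made, so independent lemmas (`lemma_a` and `lemma_b` here) can run side by side, and a failure is visible at the level of the whole plan rather than at the bottom of a recursion.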

The effect is that the system catches bad strategies early, against the Lean compiler, instead of late, against its bank account. It also parallelizes the parts that can run independently, which a strict top-down recursion cannot. The architecture, not the underlying intelligence, is what drives the efficiency.

The Engine Is Open and Already Cheap

The model under the hood reinforces the point. Goedel-Architect runs on DeepSeek-V4-Flash, the open-weight model from the Chinese lab DeepSeek, configured as a 284-billion-parameter Mixture-of-Experts that activates only 13 billion parameters per call. It is the same DeepSeek V4 family that already matched frontier models on several benchmarks at a fraction of the API price of closed competitors.
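For a sense of scale, the two parameter counts quoted above imply that only a small slice of the model is active on any one call. The snippet below is plain arithmetic on those two figures; the function name is chosen for this example, and the result says nothing about actual latency or price, which depend on serving details the article does not give.

```python
# Parameter counts as quoted in the article, in billions.
TOTAL_PARAMS_B = 284
ACTIVE_PARAMS_B = 13


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of a Mixture-of-Experts model's parameters used per call."""
    return active_b / total_b


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"Active per call: {share:.1%} of parameters")
```

That works out to under 5% of the parameters per call.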

So the Princeton result stacks two efficiencies on top of each other: a cheap open model, orchestrated by an agent design that refuses to waste calls. The headline figures look like this.

System	Backbone	PutnamBench (pass@1)	Reported compute cost
Goedel-Architect	DeepSeek-V4-Flash (open)	75.6%	$294
Hilbert	Google Gemini	70.0%	~$170,000
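The table's figures can be turned into per-problem costs with simple arithmetic. The sketch below uses only the numbers reported above plus the 672-problem benchmark size; the Hilbert cost is an approximate, secondhand figure, and the helper names are chosen for this example. Solved counts are rounded from the reported percentages, so treat the outputs as rough.

```python
# Figures as reported in the table; the Hilbert cost is approximate.
PROBLEMS = 672
SYSTEMS = {
    "Goedel-Architect": {"pass_rate": 0.756, "cost_usd": 294},
    "Hilbert": {"pass_rate": 0.700, "cost_usd": 170_000},
}


def solved_count(pass_rate: float, problems: int = PROBLEMS) -> int:
    """Number of problems solved, rounded from the reported percentage."""
    return round(pass_rate * problems)


def cost_per_solved(pass_rate: float, cost_usd: float) -> float:
    """Reported compute cost divided by the number of problems solved."""
    return cost_usd / solved_count(pass_rate)


def total_cost_ratio(expensive: dict, cheap: dict) -> float:
    """How many times larger one reported compute bill is than the other."""
    return expensive["cost_usd"] / cheap["cost_usd"]


if __name__ == "__main__":
    for name, s in SYSTEMS.items():
        per = cost_per_solved(s["pass_rate"], s["cost_usd"])
        print(f"{name}: {solved_count(s['pass_rate'])} solved, ${per:,.2f} per solved problem")
    ratio = total_cost_ratio(SYSTEMS["Hilbert"], SYSTEMS["Goedel-Architect"])
    print(f"Total cost ratio: about {ratio:.0f}x")
```

On these inputs the ratio of total bills lands somewhat above 500, which suggests the paper's "up to 500x" wording and the roughly $170,000 estimate are not computed on exactly the same basis. Either way, the gap is measured in hundreds of multiples.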

And the raw capability is not a fluke of one benchmark. On MiniF2F-test, a standard set of competition-level problems, the system hit 99.2% pass@1. When the team seeded the blueprint with an optional natural-language proof on the hardest problems, the results climbed further: a perfect score on MiniF2F-test, 88.8% on PutnamBench (597 of 672 problems), 4 of 6 problems on the 2025 International Mathematical Olympiad, 11 of 12 on the 2025 Putnam, and 3 of 6 on the 2026 USA Mathematical Olympiad.

These are problems that stump most strong undergraduates and many graduate students. The system is solving them in a formal language where the computer rejects anything short of a complete, valid proof.

Why Practitioners Should Care About a Math Paper

Almost nobody reading this builds theorem provers. The reason Goedel-Architect matters anyway is that it is a clean, measurable demonstration of a principle every AI engineer is now wrestling with: how you orchestrate an agent can matter more than which model you put inside it.

The industry spent 2025 and early 2026 assuming capability scaled with model size and spend. Bigger model, bigger budget, better result. Goedel-Architect is a counterexample with a hard number attached. A smaller open model wrapped in a smarter control loop beat a system built on a larger commercial model, at a tiny fraction of the cost, on a task where you cannot fake the answer.

That gap, hundreds of dollars versus six figures for a comparable or better outcome, is the kind of difference that decides whether an autonomous agent is economically viable in production or a research toy. As agents take on longer, multi-step jobs across coding, step-by-step reasoning, data analysis, and engineering, the design choices around planning, tool use, and verification become the dominant cost lever. Goedel-Architect's blueprint-first approach is one concrete answer: plan globally, verify continuously against a ground-truth checker, and parallelize what you can.

It also lands at a moment when AI's relationship with rigorous mathematics is being scrutinized hard. When OpenAI claimed a model had made progress on a decades-old Erdős conjecture, skeptics tore the claim apart before it had passed any formal review. Formal proving in Lean sidesteps that entire fight. There is nothing to second-guess. The Lean compiler is the referee, and it does not grade on a curve.

The Other Side of the Ledger

The result deserves the asterisks its critics are attaching to it.

The "up to 500x" figure is a ceiling, not a typical case, and it compares against one particular rival pipeline rather than the entire field. Cost comparisons across theorem provers are notoriously sensitive to configuration, hardware, and how many attempts each system is allowed, so the precise multiple should be read as a striking illustration, not a fixed law. The pass@1 numbers are genuinely strong, but the very highest scores, like the 88.8% on PutnamBench, depend on seeding the system with a natural-language proof on the hardest problems, which is a meaningful human assist that the headline 75.6% does not require.

There is a dependency worth flagging too. The framework's efficiency rides on DeepSeek-V4-Flash, an open model from a Chinese lab that has drawn intense scrutiny in Washington over how its underlying models were trained. Teams in regulated or government-adjacent settings may not be free to adopt the exact stack that produced these numbers, even if the orchestration idea transfers to any capable model.

And formal theorem proving, for all its rigor, is a narrow domain. A blueprint of lemmas with a compiler to check them is a near-ideal setting for the plan-and-verify approach, because the verifier is exact and fast. Most real-world agent tasks have no such oracle. Whether the architecture's efficiency holds up when the feedback signal is noisy, slow, or subjective is the open question the paper does not answer.

The Bottom Line

Strip away the mathematics and Goedel-Architect makes one argument: in a year when the entire AI industry is committing trillions of dollars to ever-larger models and ever-bigger compute clusters, a Princeton lab proved hard theorems for the price of a nice dinner by being smarter about how it spent each model call. The model was open and cheap. The cleverness was in the wiring.

That should unsettle the assumption that capability is something you buy by the gigawatt. Sometimes it is something you architect. The systems that win the agent era may not be the ones running on the most expensive models. They may be the ones that, like Goedel-Architect, refuse to take a single wasted step.

A perfect verifier made that discipline measurable here. The harder, more valuable problem is teaching agents to be that disciplined when nobody is checking their work line by line.