Models & Researchllmsagentic codingopen sourcereinforcement learning

DeepReinforce Releases Ornith-1.0 Agentic Coding Models

|June 29, 2026|By LDS Team

7.2

Relevance Score

DeepReinforce Releases Ornith-1.0 Agentic Coding Models — Photo: static.simonwillison.net · rights & takedowns

DeepReinforce released Ornith-1.0, an MIT-licensed open-weights family of coding models (9B, 31B, 35B MoE, and 397B MoE) built on top of Gemma 4 and Qwen 3.5, introducing a reinforcement-learning method where the model learns to generate its own task-specific scaffold alongside the solution rather than relying on a hand-built harness. DeepReinforce reports its flagship Ornith-1.0-397B scores 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, edging out Claude Opus 4.7's reported 70.3 and 80.8 on the same benchmarks (though well behind Claude Opus 4.8's 85 and 87.6). Independent developer Simon Willison ran the 35B variant locally via LM Studio, confirming it handled real coding tasks competently at roughly 103 tokens per second. DeepReinforce also describes a three-layer defense against reward-hacking during training: a fixed execution boundary, a deterministic monitor, and a frozen LLM judge as veto.

For practitioners building code agents, the interesting part of Ornith-1.0 isn't just the benchmark scores, it's the training method: the model learns to write its own task-specific harness during RL rather than relying on a hand-engineered one, which could reduce the manual scaffolding work that currently limits how many agentic-coding tasks a team can automate. An independent developer already ran the model and confirmed it works as advertised on real tasks, which is unusually strong corroboration for a fresh open-weights release.

What happened

DeepReinforce released Ornith-1.0, an MIT-licensed family of open-weights models for agentic coding, in four sizes: 9B Dense, 31B Dense, 35B MoE, and 397B MoE, built on top of pretrained Gemma 4 and Qwen 3.5 (both Apache 2.0 licensed). DeepReinforce reports Ornith-1.0-397B scores 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, which it says surpasses Claude Opus 4.7 (70.3 and 80.8 on the same benchmarks) and open-weight peers Minimax M3 and DeepSeek-V4-Pro, though Claude Opus 4.8 remains well ahead at 85 and 87.6. The edge-deployable Ornith-1.0-9B scores 43.1 and 69.4 on the same two benchmarks, which DeepReinforce says matches or exceeds larger models like Gemma 4-31B.

Technical context

The core method is a self-scaffolding RL loop: at each training step the model first proposes a task-specific scaffold, then generates a solution rollout conditioned on that scaffold, and reward is propagated to both stages jointly, so the model learns to author its own orchestration logic rather than following a fixed, human-designed harness. DeepReinforce says this lets task-specific search strategies emerge automatically instead of requiring hand-engineered harnesses per task category. To limit reward-hacking, the company describes three layers of defense: an immutable outer trust boundary around the environment and test isolation, a deterministic monitor that zeroes out reward for any attempt to read withheld files or modify verification scripts, and a frozen LLM judge that can veto trajectories showing intent-level gaming even within the allowed tool surface.

For practitioners

Independent developer Simon Willison ran the Ornith-1.0-35B GGUF quantization locally via LM Studio, connected to his Pi agent harness, and reported it handled multi-step codebase search tasks competently at about 103 tokens per second, along with his standard "draw a pelican riding a bicycle" test. That kind of hands-on, independent confirmation is notably stronger corroboration than most fresh open-weights releases get on day one. Willison also noted DeepReinforce has little public track record; the earliest paper he could trace from the team is a June 2025 CUDA optimization paper, which is worth weighing against the benchmark claims until more independent evaluations appear.

What to watch

Track independent reproductions of the Terminal-Bench 2.1 and SWE-Bench Verified scores under third-party evaluation setups (DeepReinforce's own numbers use a 5-run average with specific harness configurations), community testing of the reward-hacking mitigations, and whether other teams adopt the self-scaffolding training technique for agentic tasks beyond coding.

Key Points

1DeepReinforce released Ornith-1.0, an MIT-licensed open-weights coding model family spanning 9B to 397B parameters, built on Gemma 4 and Qwen 3.5.
2The model learns to generate its own RL training scaffold jointly with each solution, reducing the need for hand-engineered agentic-coding harnesses.
3Independent developer Simon Willison ran the model locally and confirmed it performs competently on real coding tasks, unusually strong day-one corroboration.

Scoring Rationale

A genuinely novel RL technique (self-generated scaffolds) released as MIT-licensed open weights across a wide parameter range, with benchmark claims that hold up against Claude Opus 4.7 on two major coding benchmarks and, unusually, independent hands-on verification from a respected developer (Simon Willison) rather than only vendor-reported numbers.

MoreLLMs news

Sources

Primary source and supporting public references used for this report.

4 sources

Primary sourcesimonwillison.netOrnith-1.0: Self-Scaffolding LLMs for Agentic Coding

View 3 more sources

Practice interview problems based on real data

1,625 SQL & Python problems across 15 industry datasets — the exact type of data you work with.

Try 250 free problems