Developer Open-Sources Bottega Agent Orchestration Tool
Developer Vincent Daubry has open-sourced Bottega, an internal coding-agent orchestration tool, after his team shipped its 1,000th user story with it, according to his blog post. Daubry says that for the past eight months 100% of the team's production code has been written by agents, with humans owning only the plan and the review. After Anthropic changed Claude Code pricing, he added support for Codex and OpenCode so that the pipeline can mix models per step, including open-source options such as Kimi and DeepSeek: for example, Opus for planning, Sonnet for implementation, Codex for review, and an open model for PR management. Daubry also reports a two-week bug in which the tool silently defaulted to Sonnet 3.7 instead of a frontier model while output quality stayed comparable to Opus 4.6, which he argues is evidence that a rigorous plan-and-review harness matters more than which specific model executes it.
Daubry's account is a rare, concrete data point on production-scale coding-agent use. It is not a vendor benchmark but one team's record of shipping 1,000 user stories with agents writing all of its production code for eight months. It also makes a specific claim, that harness and process design matter more than model choice, which is directly actionable for any team building an agent-driven development workflow.
What happened
In a blog post, Vincent Daubry announced he was open-sourcing Bottega, an internal agent-orchestration tool his team built and used in production, after shipping their 1,000th user story with it. He reports that for the past 8 months, 100% of the team's production code has been written by agents, with humans owning the plan and the review. Around the same time, Anthropic announced new Claude Code pricing; Daubry says Bottega's orchestration layer was easy to extend, so the team added support for Codex and OpenCode alongside Claude Code, which in turn lets Bottega run open-source models such as Kimi and DeepSeek. The practical benefit, per the post, is mixing models within a single task - for example, Claude Opus for planning, Claude Sonnet for implementation, Codex for code review, and an open-source model to manage the pull request.
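The per-step model mix described above can be pictured as a small lookup table. The sketch below is illustrative only: the stage names, the `ModelChoice` type, the model labels, and `resolve_model` are all hypothetical and do not reflect Bottega's actual configuration format. It also assumes one design choice the post does not state, namely that an unconfigured stage should fail loudly rather than fall back to a default model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    runner: str  # hypothetical label for the agent CLI that runs the step
    model: str   # hypothetical model label handed to that runner


# Mirrors the example mix from the post; every label here is a placeholder.
PIPELINE_MODELS = {
    "plan": ModelChoice("claude-code", "opus"),
    "implement": ModelChoice("claude-code", "sonnet"),
    "review": ModelChoice("codex", "codex-placeholder"),
    "pr": ModelChoice("opencode", "kimi"),
}


def resolve_model(stage: str, mapping: dict = PIPELINE_MODELS) -> ModelChoice:
    """Return the model configured for a stage; raise if none is configured."""
    try:
        return mapping[stage]
    except KeyError:
        # Assumption: no implicit default, so a missing entry is an error.
        raise LookupError(f"no model configured for stage {stage!r}") from None


if __name__ == "__main__":
    for stage in PIPELINE_MODELS:
        choice = resolve_model(stage)
        print(f"{stage}: {choice.runner} / {choice.model}")
```

Raising on a missing entry is one way to make a misconfiguration visible immediately; whether Bottega does this is not stated in the post.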
Technical context
Bottega runs as a web UI that is team-first (multi-user) and remote-first, designed to run on a shared server rather than a single developer's laptop, which Daubry says made sandboxing agents with skip-permission settings less risky and let non-developers at the company use it too. The pipeline separates planning, implementation plus unit tests, manual testing against scenarios defined in the plan, and PR management (an agent opens the PR, then iterates on CI failures and merge conflicts, including in response to GitHub review comments). Daubry positions Bottega alongside similar tools that emerged over the same period - Conductor (Melty Labs, YC S24), GasTown (Steve Yegge), gstack (Garry Tan), and GitHub's SpecKit - and says the overlap is confirmation the team converged on a workflow others are independently building toward.
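As a rough illustration of the stage ordering described above, the sketch below runs planning, implementation, and manual testing in sequence, then loops on the PR stage until it passes or a retry budget runs out. The `Stage` enum, `run_pipeline`, the `run_stage` callback, and the `max_pr_rounds` limit are hypothetical names chosen for this example, not Bottega's API; in a real system each callback invocation would launch an agent.

```python
from enum import Enum
from typing import Callable, List


class Stage(Enum):
    PLAN = "plan"
    IMPLEMENT = "implement"      # implementation plus unit tests
    MANUAL_TEST = "manual_test"  # agent runs the scenarios defined in the plan
    PR = "pr"                    # open the PR, then iterate on CI and review feedback


def run_pipeline(run_stage: Callable[[Stage], bool], max_pr_rounds: int = 3) -> List[str]:
    """Run the stages in order and return a log of what happened.

    run_stage is a caller-supplied callback that returns True on success.
    """
    log: List[str] = []
    for stage in (Stage.PLAN, Stage.IMPLEMENT, Stage.MANUAL_TEST):
        if not run_stage(stage):
            log.append(f"{stage.value}: failed")
            return log
        log.append(f"{stage.value}: ok")

    # The PR stage is retried: each round stands in for fixing CI failures,
    # merge conflicts, or review comments.
    for attempt in range(1, max_pr_rounds + 1):
        if run_stage(Stage.PR):
            log.append(f"pr: ok after {attempt} round(s)")
            return log
        log.append(f"pr: round {attempt} needs fixes")
    log.append("pr: gave up, needs a human")
    return log
```

The retry cap is an assumption added for safety in the sketch; the post does not say how many PR iterations Bottega allows.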
For practitioners
Daubry's central argument is that the plan, not the model, is the centerpiece of the agentic development cycle: his team's earlier failure mode was treating the plan as a disposable, session-bound artifact rather than an enduring one, which produced large, hard-to-review PRs requiring heavy back-and-forth. Bottega's fix was to make the task, requirement, and technical spec persistent artifacts that live alongside the implementation, add a dedicated manual-testing step where the agent runs the scenarios defined in the plan before opening a PR, and add an adversarial review agent that checks the implementation strictly matches the plan. Separately, Daubry reports a two-week production bug in which Claude subprocesses silently defaulted to Sonnet 3.7 instead of a frontier model; output quality during that period stayed comparable to Opus 4.6, which he takes as evidence that a tight harness and rigorous process reduce sensitivity to model choice more than most teams assume.
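One way to picture a plan that outlives the session, together with a review that checks the implementation against it, is the sketch below. It is a deterministic stand-in, not the author's method: Bottega's reviewer is an agent, whereas this example uses plain set and dictionary comparisons. The `Plan` and `Implementation` types, the `files_in_scope` field, and `review_against_plan` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Plan:
    task: str
    scenarios: List[str]                                     # manual-test scenarios written up front
    files_in_scope: Set[str] = field(default_factory=set)    # assumption: the plan names the files it touches


@dataclass
class Implementation:
    changed_files: Set[str]
    scenario_results: Dict[str, bool]                        # scenario name -> did it pass?


def review_against_plan(plan: Plan, impl: Implementation) -> List[str]:
    """Return a list of deviations from the plan; an empty list means none were found."""
    findings: List[str] = []
    for scenario in plan.scenarios:
        if scenario not in impl.scenario_results:
            findings.append(f"scenario not run: {scenario}")
        elif not impl.scenario_results[scenario]:
            findings.append(f"scenario failed: {scenario}")
    # Anything changed outside the planned scope is flagged for a human to look at.
    for path in sorted(impl.changed_files - plan.files_in_scope):
        findings.append(f"out-of-plan change: {path}")
    return findings
```

Because the plan is a stored object rather than chat context, the same check can be rerun on every PR iteration, which is the property the post attributes to treating the plan as an enduring artifact.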
What to watch
Whether other teams report comparable production-scale agent usage and similar harness-over-model conclusions; whether open-source orchestration layers like Bottega gain adoption alongside Conductor, GasTown, gstack, and SpecKit; and how cost pressure from Anthropic's new Claude Code pricing shapes broader adoption of multi-provider model mixing and open-weight models such as Kimi and DeepSeek in production coding pipelines.
Editorial analysis
This is a single team's self-reported account, not an independently audited benchmark, so the 100%-agent-written and comparable-to-Opus-4.6 claims should be read as one practitioner's experience rather than a general result. That said, Daubry's observation that multiple teams converged on similar plan-first, multi-agent-role workflows around the same time (citing Conductor, GasTown, gstack, and SpecKit) suggests a real pattern in how production coding-agent workflows are being designed, independent of whether Bottega specifically gains adoption.
Key Points
- Vincent Daubry open-sourced Bottega, a coding-agent orchestration tool, after his team shipped 1,000 user stories with agents writing 100% of production code for eight months.
- After Anthropic changed Claude Code pricing, Bottega added Codex and OpenCode support to mix models across planning, implementation, review and PR steps.
- A two-week bug that silently defaulted to Sonnet 3.7 produced output comparable to Opus 4.6, suggesting a rigorous plan-and-review harness reduces sensitivity to model choice.
Scoring Rationale
A concrete, well-documented practitioner account of production-scale coding-agent usage with a specific, actionable claim about process mattering more than model choice - genuinely useful for teams building agentic development workflows. It is a single team's self-reported, single-source account with no independent verification of the production-usage figures, which caps it below the major tier.