Open-source Models Match Mythos in Bug Finding

At Black Hat Asia, OpenAI's first security hire, Ari Herbert-Voss, argued that open-source models can find software bugs as effectively as Anthropic's Mythos when combined into orchestration pipelines. He credited Mythos with strong performance on both shallow and complex vulnerabilities, attributing that capability to what he called "supralinear scaling." Herbert-Voss says defenders can replicate Mythos-grade results by building scaffolding that runs several open models in concert, producing defense in depth and covering each model's blind spots. Cost and access make open-source an attractive option for many organizations, but human experts remain essential to orchestrate models and triage the high volume of findings generated by fuzzing and AI-assisted testing. The net effect, he predicts, is improved security practices rather than large-scale job displacement.
What happened
At Black Hat Asia, Ari Herbert-Voss, OpenAI's first security hire and CEO of RunSybil, argued that ensembles of open-source models can match the bug-finding effectiveness of Anthropic's Mythos. He highlighted Mythos's ability to find both "shallow" bugs (well-described flaws that are easy to validate) and more complex vulnerabilities, and attributed part of its edge to a phenomenon he called "supralinear scaling." Herbert-Voss said that practical scaffolding running multiple open models in concert provides defense in depth and similar coverage while avoiding Mythos's access and cost constraints.
Technical details
Herbert-Voss emphasized that replication of Mythos performance is not a single-model drop-in but a systems problem requiring human orchestration. Practitioners need to combine models, routing logic, and validation layers so results complement each other rather than duplicate noise. Key practical levers include:
- model diversity and ensembling across architectures and checkpoints
- automated triage and prioritization to reduce false positives
- integration with fuzzing pipelines and runtime instrumentation
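The ensembling-plus-validation idea above can be sketched in a few lines. This is a minimal illustration, not any tool Herbert-Voss described: the `Finding` schema, the model-as-callable interface, and the `validate` hook are all hypothetical stand-ins for real scanner outputs and proof-of-concept replay.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass(frozen=True)
class Finding:
    """One candidate vulnerability report (hypothetical schema)."""
    file: str
    line: int
    description: str

def ensemble_scan(
    models: List[Callable[[str], Iterable[Finding]]],
    target: str,
    validate: Callable[[Finding], bool],
) -> List[Tuple[Finding, int]]:
    """Run several models over the same target, merge, validate, and rank.

    models: callables that return candidate Findings for a target
            (stand-ins for open-model scanners behind a common interface).
    validate: e.g. replays a proof-of-concept input to confirm the bug.
    Returns (finding, agreement_count) pairs, highest agreement first.
    """
    merged = {}
    for model in models:
        for f in model(target):
            key = (f.file, f.line)  # dedupe by code location
            merged.setdefault(key, []).append(f)
    results = []
    for group in merged.values():
        representative = group[0]
        if validate(representative):  # drop unvalidated noise
            results.append((representative, len(group)))
    # Findings flagged by more models float to the top of the queue.
    return sorted(results, key=lambda r: -r[1])
```

The point of the sketch is the shape of the system: models are interchangeable behind one interface, duplicates collapse by location, and cross-model agreement becomes a cheap prioritization signal.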
He also flagged that fuzzing and AI-generated test cases produce high volumes of warnings. Humans remain necessary to validate exploitability and prioritize actionable findings. He mentioned cost as a major differentiator: building and operating a proprietary Mythos-class model is expensive, whereas ensembles of open models are cheaper to iterate with but require staff who can manage orchestration and compute.
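One common way to tame that warning volume is to bucket raw reports by a crash-site signature and rank buckets by severity before any human looks at them. The sketch below assumes a hypothetical warning schema (`stack`, `kind` keys) rather than any specific fuzzer's output format.

```python
import hashlib
from collections import defaultdict

def triage(warnings, severity_weight):
    """Bucket raw fuzzer/AI warnings by crash signature and rank buckets.

    warnings: iterable of dicts with 'stack' (list of frame names) and
              'kind' keys -- a hypothetical schema for illustration.
    severity_weight: dict mapping warning kind -> numeric severity.
    Returns (signature, bucket) pairs, highest-severity buckets first.
    """
    buckets = defaultdict(list)
    for w in warnings:
        # Hash the top stack frames so duplicate crashes collapse together.
        sig = hashlib.sha256("|".join(w["stack"][:3]).encode()).hexdigest()[:12]
        buckets[sig].append(w)
    return sorted(
        buckets.items(),
        key=lambda kv: -max(severity_weight.get(w["kind"], 0) for w in kv[1]),
    )
```

Deduplicating before ranking matters because fuzzers routinely hit the same crash thousands of times; humans then validate one representative per bucket instead of every warning.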
Context and significance
This is a practical reframing in the ongoing debate over proprietary frontier models versus open-source alternatives. Mythos represents specialized, restricted-access tooling tailored for security; the broader lesson Herbert-Voss offered is that domain-specific capability often materializes through system design, tooling, and workflows as much as raw model scale. For security teams and ML engineers, that means investment in orchestration, automated validation, and developer workflows may deliver outsized returns compared with single-model procurement.
What to watch
Teams should prototype ensembles and invest in triage automation to measure real-world signal-to-noise. Watch for open-source toolkits that standardize scaffolding, and for commercial vendors packaging orchestration and validation layers around model ensembles. The practical constraint remains cost of compute and the human labor needed to validate and exploit findings; those will determine adoption speed.
Scoring Rationale
This is a notable, practitioner-facing observation: open-source ensembles can deliver competitive bug-finding capability, shifting emphasis to orchestration and triage. Not a paradigm shift, but important for security teams evaluating proprietary tooling versus build approaches.