What happened
Per the Microsoft Security Blog, Microsoft introduced a new multi-model agentic scanning harness codenamed MDASH, which the company reports helped researchers discover 16 vulnerabilities in the Windows networking and authentication stack, including four critical remote code execution (RCE) flaws. The blog post states the harness was built by Microsoft's Autonomous Code Security team and orchestrates an ensemble of frontier and distilled models using more than 100 specialized AI agents. Microsoft reports test results including 21 of 21 planted vulnerabilities found with zero false positives on a private test driver, 96% recall against five years of confirmed Microsoft Security Response Center (MSRC) cases in clfs.sys, 100% recall in tcpip.sys, and an 88.45% score on the public CyberGym benchmark, which the post describes as the top leaderboard entry.
Infosecurity reports that Microsoft's May Patch Tuesday fixed 120 CVEs in total, 16 of which were discovered using the MDASH system. CSO Online reports that the agentic tool will open to enterprise customers in a private preview in June.
Technical details
Per Microsoft's post, MDASH is presented as a model-agnostic, agentic system that combines many specialized agents to generate candidate vulnerabilities, debate and triage findings, and produce end-to-end exploit proofs. The Microsoft writeup quantifies the harness's performance on internal and public benchmarks (see the figures under "What happened") and includes an architecture diagram showing agent orchestration across multiple model families. Microsoft also reports running the system against both synthetic test drivers and historical MSRC cases to measure recall and false-positive rates.
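The three stages named in Microsoft's description (candidate generation, debate and triage, exploit proof) can be pictured as a simple pipeline. The Python sketch below is illustrative only: it is not Microsoft's implementation, and every name in it (`Finding`, `run_pipeline`, the toy agents, the `min_votes` threshold) is a hypothetical stand-in. Plain functions take the place of model-backed agents, and "debate" is reduced to a vote count.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


# Hypothetical data model; none of these names come from Microsoft's post.
@dataclass(frozen=True)
class Finding:
    location: str      # e.g. a function name in the code under test
    description: str


# An "agent" here is just a callable; a real system would wrap model calls.
CandidateAgent = Callable[[str], Iterable[Finding]]
Reviewer = Callable[[Finding], bool]   # True = reviewer believes the finding
Prover = Callable[[Finding], bool]     # True = a proof of concept succeeded


def run_pipeline(
    source: str,
    generators: List[CandidateAgent],
    reviewers: List[Reviewer],
    prover: Prover,
    min_votes: int,
) -> List[Finding]:
    # Stage 1: every generator proposes candidates; duplicates are merged.
    candidates = {f for gen in generators for f in gen(source)}
    confirmed = []
    for finding in sorted(candidates, key=lambda f: f.location):
        # Stage 2: triage by counting how many reviewers accept the finding.
        votes = sum(1 for review in reviewers if review(finding))
        if votes < min_votes:
            continue
        # Stage 3: keep only findings the prover can demonstrate.
        if prover(finding):
            confirmed.append(finding)
    return confirmed


# Toy agents that pattern-match on strings (assumption: stand-ins for models).
def memcpy_agent(source: str):
    if "memcpy" in source:
        yield Finding("copy_packet", "unchecked memcpy length")


def free_agent(source: str):
    if "free(" in source:
        yield Finding("release_ctx", "possible double free")


reviewers = [
    lambda f: "memcpy" in f.description,
    lambda f: f.location == "copy_packet",
    lambda f: True,  # an overly credulous reviewer
]

toy_source = (
    "void copy_packet() { memcpy(dst, src, len); } "
    "void release_ctx() { free(ctx); }"
)
result = run_pipeline(
    toy_source,
    [memcpy_agent, free_agent],
    reviewers,
    prover=lambda f: True,  # placeholder: always "proves" the finding
    min_votes=2,
)
print(result)
```

In this toy run, the `release_ctx` candidate is dropped because only one reviewer accepts it, which is the kind of filtering a triage stage is meant to provide before any costly proof step runs.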
Industry context
Editorial analysis: Agentic ensembles combining multiple specialist agents and model families are increasingly used in security research to cover diverse tactics, techniques, and procedures (TTP) patterns and to reduce single-model blind spots. Observers who track AI-assisted offensive and defensive tooling have noted similar multi-agent approaches in red-team research and synthetic telemetry generation, where ensemble debate and structured verification improve signal-to-noise compared with single-model output.
Context and significance
Editorial analysis: The combination of reported high recall and low false-positive rates matters to defenders because vulnerability discovery workloads are sensitive to both missed findings and analyst time spent triaging false positives. If external verification reproduces Microsoft's claimed benchmark gains, defenders and tool vendors may accelerate experimentation with agentic orchestration patterns for fuzzing, static analysis augmentation, and exploit proof generation. From a research perspective, the use of historical MSRC cases as a recall baseline provides a stronger empirical signal than purely synthetic evaluations, though independent replication will be required to validate generalization to other codebases and classes of bugs.
What to watch
Editorial analysis: Observers should watch for:
- independent benchmark replications or third-party evaluations of MDASH or similar agentic systems
- the scope of the private preview and whether enterprise testers report similar recall/false-positive characteristics in their environments
- how vendor tooling and open-source projects adopt agent orchestration patterns for vulnerability discovery, fuzzing, and exploit generation
Also monitor Patch Tuesday disclosures and proof-of-concept activity for the specific CVEs Microsoft and other researchers disclosed, since rapid public exploit development can change enterprise remediation priorities.
Practical implications for practitioners
For practitioners: Security teams evaluating AI-assisted vulnerability research tools should treat vendor benchmark claims as a starting point and plan for instrumented pilots that measure both recall and triage burden in their own code and telemetry. For ML engineers exploring agentic systems, this announcement reinforces that multi-agent orchestration and verification pipelines are a practical research direction rather than a purely academic exercise, with measurable operational outcomes reported by an industry-scale team.
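As a rough illustration of what an instrumented pilot might record, the Python sketch below computes recall, precision, and a simple triage-cost estimate from a set of known bugs and a set of tool reports. The function name `pilot_metrics`, the bug identifiers, and the 30-minutes-per-report figure are all hypothetical assumptions, not values from Microsoft's post.

```python
def pilot_metrics(known_bugs: set, reported: set, minutes_per_report: float) -> dict:
    """Score a pilot run against a ground-truth set of known bugs.

    Assumption: any report not in `known_bugs` is counted as a false
    positive. In practice such a report may be a genuinely new bug and
    would need manual review before being scored.
    """
    true_positives = known_bugs & reported
    false_positives = reported - known_bugs
    recall = len(true_positives) / len(known_bugs) if known_bugs else 0.0
    precision = len(true_positives) / len(reported) if reported else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "false_positives": len(false_positives),
        # Analyst time spent on reports that turned out not to be real.
        "triage_minutes_wasted": len(false_positives) * minutes_per_report,
    }


# Hypothetical pilot: four seeded bugs, four tool reports, one of them spurious.
known = {"BUG-1", "BUG-2", "BUG-3", "BUG-4"}
reported = {"BUG-1", "BUG-2", "BUG-3", "X-9"}
metrics = pilot_metrics(known, reported, minutes_per_report=30)
print(metrics)
```

Tracking both numbers side by side makes the trade-off visible: a tool can raise recall by reporting more aggressively, but the triage cost column shows what that costs analysts.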
Key Points
1. Microsoft reports its multi-model agentic harness MDASH found 16 Windows CVEs, including four critical RCEs, that were fixed in May's Patch Tuesday.
2. Published benchmark claims (21/21 planted vulnerabilities found, 96% recall on clfs.sys, 88.45% on CyberGym) suggest agentic ensembles can raise recall while keeping false positives low.
3. Industry testing and independent replication will determine whether agent orchestration generalizes beyond Microsoft's internal workloads.
Scoring Rationale
The story shows an industry-leading vendor reporting production use of an agentic multi-model system to find real, high-severity vulnerabilities. That is notable for defenders and ML practitioners building security tooling, but independent validation and broader adoption are still required before this becomes paradigm-shifting.

