AISI Evaluates GPT-5.5 Cybersecurity Performance Against Advanced Tasks
The UK AI Security Institute (AISI) has published an evaluation of OpenAI's GPT-5.5 on a suite of 95 narrow capture-the-flag cyber tasks. Per the report, models have saturated the basic tier, an early GPT-5.5 checkpoint completed AISI's corporate network attack simulation end-to-end, and GPT-5.5 led the Expert-level tasks with a 71.4% average pass rate. AISI frames the similar performance across frontier models as evidence of a broad trend in cyber capability rather than a breakthrough unique to one model.
What happened
The AI Security Institute (AISI) published an evaluation of OpenAI's GPT-5.5 that exercises models on a suite of 95 narrow cyber tasks in capture-the-flag format, per AISI. The institute reports that models have fully saturated its basic tasks since at least February 2026, and that an early checkpoint of GPT-5.5 completed AISI's corporate network attack simulation end-to-end, a task AISI says "we estimate would take a human around 20 hours." On AISI's Expert-level advanced tasks, the report states GPT-5.5 achieved an average pass rate of 71.4% (±8.0%), compared with 68.6% (±8.7%) for Mythos Preview, 52.4% (±9.8%) for GPT-5.4, and 48.6% (±10.0%) for Opus 4.7, per the institute.
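The ± figures read like confidence intervals, though the excerpt does not say how AISI computed them. As a hedged sketch only, assuming a normal-approximation binomial margin of error, the reported rates can be compared for overlap; the interval semantics here are an assumption, not AISI's stated method:

```python
import math

def margin(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% margin of error for a pass rate p over n trials.
    (Hypothetical: the report does not specify its interval method or n.)"""
    return z * math.sqrt(p * (1 - p) / n)

def intervals_overlap(p1: float, m1: float, p2: float, m2: float) -> bool:
    """True if the intervals [p1 - m1, p1 + m1] and [p2 - m2, p2 + m2] overlap."""
    return (p1 - m1) <= (p2 + m2) and (p2 - m2) <= (p1 + m1)

# Reported Expert-level rates and +/- figures from the article.
gpt55  = (0.714, 0.080)
mythos = (0.686, 0.087)
gpt54  = (0.524, 0.098)

print(intervals_overlap(*gpt55, *mythos))  # True  -- GPT-5.5 and Mythos Preview overlap
print(intervals_overlap(*gpt55, *gpt54))   # False -- GPT-5.5 clears GPT-5.4's interval
```

Under this reading, the GPT-5.5 and Mythos Preview intervals overlap while the GPT-5.4 interval does not, which is consistent with AISI's "broader trend across frontier models" framing.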
Technical details
Editorial analysis: AISI's task suite divides tasks into basic, Practitioner, and Expert difficulty tiers and emphasizes multi-step, realistic vulnerability-research and exploitation scenarios. The Expert tasks measure skills including reverse engineering of stripped binaries, developing exploits against modern mitigations, cryptanalytic key recovery, and weaponising synthetic vulnerabilities, according to the report. For practitioners, these task specifications make the evaluation a focused probe of exploit generation, not a broad test of general coding or dialogue capability.
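The headline numbers are average pass rates over a task set. A minimal sketch of that aggregation, assuming (the excerpt does not specify) that each task is attempted several times and per-task pass rates are averaged with equal weight; the task names and outcomes below are illustrative, not AISI data:

```python
# Hypothetical per-task results: task name -> list of pass/fail outcomes.
results = {
    "stripped-binary-re": [True, False, True],   # 2/3 attempts passed
    "mitigation-bypass":  [False, False, True],  # 1/3 attempts passed
    "key-recovery":       [True, True, True],    # 3/3 attempts passed
}

def average_pass_rate(results: dict[str, list[bool]]) -> float:
    """Mean of per-task pass rates, each task weighted equally."""
    per_task = [sum(outcomes) / len(outcomes) for outcomes in results.values()]
    return sum(per_task) / len(per_task)

print(round(average_pass_rate(results), 3))  # 0.667
```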
Context and significance
Industry context
Public reporting from AISI frames the near-parity between GPT-5.5 and Mythos Preview as evidence of a broader upward trend in model cyber capability rather than an isolated model-specific jump. For security teams and ML practitioners, the reported Expert-level pass rates indicate frontier LLMs can perform complex, multi-step offensive cyber tasks at nontrivial success rates on synthetic but realistic challenges, per AISI.
What to watch
Observers should track how evaluation methodologies evolve, whether vendors or third parties publish red-team results calibrated to real-world defensive controls, and any changes in responsible disclosure or usage policies referenced by developers. AISI's results highlight measurable capability growth that teams monitoring model misuse will likely watch closely.
Scoring Rationale
AISI's reported Expert-level pass rates for GPT-5.5 show meaningful capability in offensive cyber tasks, which is notable for security teams and model developers. The story is technically important but not a paradigm shift.

