
SWE-bench Updates Bash-Only Coding Leaderboard With New Model Rankings

February 19, 2026 | By LDS Team

Relevance Score: 8.1

On 19 February 2026, SWE-bench published a fresh full run of its 'Bash Only' coding benchmark as part of its February 2026 leaderboard update. The benchmark evaluates models on 2,294 real-world problems drawn from 12 open-source repositories. Claude Opus 4.5 ranked first, followed by Gemini 3 Flash and MiniMax M2.5. OpenAI's GPT-5.2 placed sixth, and GPT-5.3-Codex was absent from the results. The run used the same system prompt for every model so that scores are directly comparable.

Key Points

1. Ranks models: Claude Opus 4.5 first, Gemini 3 Flash second, MiniMax M2.5 third.
2. Provides independent, non-self-reported benchmarking using an identical system prompt across all evaluated models.
3. Informs model selection: highlights competitive Chinese models and notes the absence of GPT-5.3-Codex for API users.

Scoring Rationale

Independent, uniform benchmarking makes the results more comparable and more useful for model selection, but the Bash-only workload limits how far they generalize to other coding work.

Sources

Public references used for this report.

01 simonwillison.net: SWE-bench February 2026 leaderboard update
