Google Ranks GPT-5.4 and Gemini Highest

What happened
Google updated its Android Bench leaderboard on April 9, 2026, adding two OpenAI entries: GPT-5.4 and GPT-5.3-Codex. The new ranking places GPT-5.4 and Google’s Gemini 3.1 Pro Preview in a tie for the top spot at 72.4%. GPT-5.3-Codex follows at 67.7%, then Claude Opus 4.6 at 66.6% and the earlier GPT-5.2 Codex at 62.5%. The rest of the published list is unchanged: Google reused results from a late-February run for models it had previously evaluated, while OpenAI’s models were tested in mid-March.
Technical context
Android Bench is a Google-maintained benchmark that measures model effectiveness on Android-specific coding tasks. The methodology emphasizes integration with Android idioms and libraries: Jetpack Compose for UI generation, Coroutines and Flows for asynchronous code, Room for local persistence, and Hilt for dependency injection. Scores reflect how well a model handles those practical, framework-specific tasks rather than synthetic NLP metrics.
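To make that concrete, the sketch below shows the kind of framework-specific Android code such a benchmark would exercise: a Room DAO exposing a Flow, a Hilt-injected ViewModel converting it to StateFlow, and a Compose screen collecting it. This is a hypothetical illustration, not a task taken from Android Bench, and it only compiles inside an Android project with the Room, Hilt, Compose, and Coroutines dependencies configured.

```kotlin
// Hypothetical example of an Android-idiom task; names (Note, NotesViewModel)
// are illustrative, not from the benchmark. Requires an Android project with
// Room, Hilt, Jetpack Compose, and kotlinx.coroutines on the classpath.

@Entity(tableName = "notes")
data class Note(@PrimaryKey val id: Long, val text: String)

@Dao
interface NoteDao {
    // Room returns a cold Flow that re-emits whenever the table changes.
    @Query("SELECT * FROM notes ORDER BY id DESC")
    fun observeNotes(): Flow<List<Note>>
}

@HiltViewModel
class NotesViewModel @Inject constructor(dao: NoteDao) : ViewModel() {
    // Share the cold Flow as a StateFlow scoped to the ViewModel's lifecycle.
    val notes: StateFlow<List<Note>> = dao.observeNotes()
        .stateIn(viewModelScope, SharingStarted.WhileSubscribed(5_000), emptyList())
}

@Composable
fun NotesScreen(viewModel: NotesViewModel = hiltViewModel()) {
    // Lifecycle-aware collection of the StateFlow into Compose state.
    val notes by viewModel.notes.collectAsStateWithLifecycle()
    LazyColumn {
        items(notes, key = { it.id }) { note -> Text(note.text) }
    }
}
```

A model scoring well on framework-specific tasks would be expected to produce code in this shape unprompted: Flow rather than callbacks, constructor injection rather than service locators, and lifecycle-aware collection in Compose.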
Key details
The April refresh is the first update that includes GPT-5.4 and GPT-5.3-Codex. Exact published scores: GPT-5.4 72.4%; Gemini 3.1 Pro Preview 72.4%; GPT-5.3-Codex 67.7%; Claude Opus 4.6 66.6%; GPT-5.2 Codex 62.5%. Google frames Android Bench as a developer productivity tool intended to help teams deliver higher-quality apps across the Android ecosystem.
Why practitioners should care
If you build Android apps or integrate code-generation models into mobile CI/CD, Android Bench offers a focused signal about real-world developer workflows and framework compatibility. A high leaderboard score suggests stronger handling of Android idioms and likely fewer post-generation edits. The benchmark is not ground truth, however: model choice should still weigh latency, cost, licensing, and how well a model fits your specific codebase and testing pipeline.
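One way to fold those extra factors into selection is a simple weighted score alongside the benchmark number. The sketch below is illustrative only: the `ModelProfile` type, the weights, and the normalized latency/cost inputs are made-up placeholders, not data Google or OpenAI publish.

```kotlin
// Illustrative model-selection heuristic. All fields and weights are
// hypothetical; a real decision should also cover licensing and codebase fit.
data class ModelProfile(
    val name: String,
    val benchScore: Double,   // e.g. Android Bench score, normalized to 0..1
    val latencyScore: Double, // normalized 0..1, higher is better (faster)
    val costScore: Double     // normalized 0..1, higher is better (cheaper)
)

// Weighted sum over the three factors; weights must reflect your priorities.
fun rank(
    models: List<ModelProfile>,
    wBench: Double = 0.5,
    wLatency: Double = 0.3,
    wCost: Double = 0.2
): List<ModelProfile> = models.sortedByDescending {
    wBench * it.benchScore + wLatency * it.latencyScore + wCost * it.costScore
}
```

With weights like these, a model a few points behind on the leaderboard can still rank first for a team that is latency- or cost-sensitive, which is the practical point: the benchmark is one input, not the decision.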
What to watch
Watch for more frequent refreshes as vendors release model updates, and for expanded benchmarks that measure end-to-end developer experience (e.g., test generation, refactors, or multi-file projects). Also track latency/cost data and any method changes on developer.android.com that could shift rankings.
Scoring Rationale
This update matters to practitioners who select models for Android development because it provides framework-specific performance signals. It’s not a field-changing research breakthrough, but it affects tool choice and integration decisions for mobile engineering teams.


