Google Ranks GPT-5.4 and Gemini Highest

Google refreshed its Android Bench on April 9, 2026, adding OpenAI's GPT-5.4 and GPT-5.3-Codex. GPT-5.4 ties Gemini 3.1 Pro Preview for first place at 72.4%; GPT-5.3-Codex follows at 67.7%. Android Bench measures model performance on Android-specific development tasks: UI with Jetpack Compose, asynchronous code with Coroutines and Flows, local persistence with Room, and dependency injection with Hilt. It aims to help developers ship higher-quality apps. The update notes that the new OpenAI models were tested in mid-March, while all other results remain from a late-February run. Benchmarks are context-dependent; workflow, integration, and value trade-offs still determine which model is best for a given project.
What happened
Google updated its Android Bench on April 9, 2026, adding two OpenAI entries: GPT-5.4 and GPT-5.3-Codex. The new ranking places GPT-5.4 and Google's Gemini 3.1 Pro Preview in a tie for the top spot at 72.4%. GPT-5.3-Codex appears next at 67.7%, followed by Claude Opus 4.6 at 66.6% and the earlier GPT-5.2 Codex at 62.5%. The remainder of the published list is unchanged; Google reused late-February results for models it had previously evaluated, while the new OpenAI models were tested in mid-March.
Technical context
Android Bench is a Google-maintained benchmark that measures model effectiveness on Android-specific coding tasks. The methodology emphasizes integration with Android idioms and libraries: Jetpack Compose for UI generation, Coroutines and Flows for asynchronous code, Room for local persistence, and Hilt for dependency injection. Scores reflect how well a model handles those practical, framework-specific tasks rather than synthetic NLP metrics.
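To make the kinds of tasks concrete, here is a minimal sketch of the Android idioms the benchmark description names working together: a Room DAO exposing a query as a Flow, a Hilt-injected ViewModel turning it into UI state, and a Compose screen collecting it. The Note/NoteDao/NotesViewModel names are hypothetical illustrations, not actual benchmark tasks, and the long Android import list is elided.

```kotlin
// Hypothetical example of the framework idioms Android Bench targets.
// (androidx/Room/Hilt imports omitted for brevity.)

@Entity
data class Note(@PrimaryKey val id: Long, val text: String)

@Dao
interface NoteDao {
    // Room can expose a query as a cold Flow that re-emits when the table changes.
    @Query("SELECT * FROM Note ORDER BY id DESC")
    fun observeNotes(): Flow<List<Note>>
}

// Hilt injects the DAO; the ViewModel converts the Flow into hot UI state.
@HiltViewModel
class NotesViewModel @Inject constructor(dao: NoteDao) : ViewModel() {
    val notes: StateFlow<List<Note>> = dao.observeNotes()
        .stateIn(viewModelScope, SharingStarted.WhileSubscribed(5_000), emptyList())
}

// Compose collects the state lifecycle-aware and renders it declaratively.
@Composable
fun NotesScreen(viewModel: NotesViewModel = hiltViewModel()) {
    val notes by viewModel.notes.collectAsStateWithLifecycle()
    LazyColumn {
        items(notes, key = { it.id }) { note -> Text(note.text) }
    }
}
```

A model scoring well on such tasks would need to compose these libraries correctly end to end, not just emit syntactically valid Kotlin, which is what distinguishes this benchmark from generic code metrics.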
Key details
The April refresh is the first update that includes GPT-5.4 and GPT-5.3-Codex. Exact published scores: GPT-5.4 72.4%; Gemini 3.1 Pro Preview 72.4%; GPT-5.3-Codex 67.7%; Claude Opus 4.6 66.6%; GPT-5.2 Codex 62.5%. Google frames Android Bench as a developer productivity tool intended to help teams deliver higher-quality apps across the Android ecosystem.
Why practitioners should care
If you build Android apps or integrate code-generation models into mobile CI/CD, Android Bench gives a focused signal about real-world developer workflows and framework compatibility. A high leaderboard score suggests stronger handling of Android idioms and likely fewer post-generation edits. However, the benchmark is not the whole story: model choice should still weigh latency, cost, licensing, and how a model fits your specific codebase and testing pipeline.
What to watch
Watch for more frequent refreshes as vendors release model updates, and for expanded benchmarks that measure end-to-end developer experience (e.g., test generation, refactors, or multi-file projects). Also track latency/cost data and any method changes on developer.android.com that could shift rankings.
Scoring Rationale
This update matters to practitioners who select models for Android development because it provides framework-specific performance signals. It's not a field-changing research breakthrough, but it affects tool choice and integration decisions for mobile engineering teams.