AI Systems Fail Real Remote Work Tasks

Researchers from Scale AI and the Center for AI Safety published the Remote Labor Index in October, testing top AI systems including OpenAI's ChatGPT, Google's Gemini and Anthropic's Claude on hundreds of real freelance tasks. The team found that the best-performing AI autonomously completed only 2.5% of projects; the systems often failed at visual design and long-term memory tasks and introduced technical errors. The results suggest current models can assist human contractors but are far from replacing them.
Key Points
- Finds the best AI completes only 2.5% of tested freelance projects across diverse tasks
- Shows AI lacks visual understanding and long-term memory, causing many practical task failures
- Warns businesses can't fully replace contractors; AI may augment work but needs human oversight
Scoring Rationale
Provides a systematic evaluation on real work with strong methodology; limited because it is a snapshot in time and models continue to improve.
