Research | LLM | remote work | visual understanding | Scale AI

AI Systems Fail Real Remote Work Tasks

January 8, 2026 | By LDS Team

Relevance Score: 8.0


Researchers from Scale AI and the Center for AI Safety published the Remote Labor Index in October, testing top AI systems, including OpenAI's ChatGPT, Google's Gemini and Anthropic's Claude, on hundreds of real freelance tasks. The best-performing system autonomously completed only 2.5% of projects. It often failed at visual design and at tasks requiring long-term memory, and it produced technical errors. The results suggest current models can assist human contractors but are far from replacing them.

Key Points

1. Finds best AI completes only 2.5% of tested freelance projects across diverse tasks
2. Shows AI lacks visual understanding and long-term memory, causing many practical task failures
3. Warns businesses can't fully replace contractors; AI may augment work but needs human oversight

Scoring Rationale

Provides a systematic evaluation on real work with strong methodology; limited because the tests are a point-in-time snapshot and models continue to improve.

Sources

Public references used for this report.

1. washingtonpost.com: "Analysis | Can AI do your job? See the results from hundreds of tests."


