Anthropic Uses Contractors to Improve Claude Code
Business Insider reports that an Anthropic project nicknamed "Marlin," run through the data vendor Snorkel AI, is collecting developer feedback to fine-tune Claude Code. Two contractors told Business Insider they were paid up to $280 per task to create prompts and review code, and that tasks typically took about an hour, with some requiring additional review by Snorkel's approval layer. Business Insider reports that the contractors A/B tested outputs from two models and chose the code they preferred, without knowing which model versions they were evaluating. Editorial analysis: Companies building coding models often rely on high-skill contractors for nuanced labels; practitioners should view this as an example of scaled, paid human feedback rather than an automated benchmark.
What happened
Business Insider reports that an Anthropic project called Marlin, run via the data vendor Snorkel AI, is gathering feedback from human software engineers to improve Claude Code. According to project guidelines reviewed by Business Insider, freelancers with software engineering backgrounds were directed to A/B test code outputs from two different models and select the output they preferred. Two contractors told Business Insider they were paid up to $280 per task, that tasks took about an hour on average, and that some submissions required additional back-and-forth with Snorkel's approval layer. Business Insider reports that the project is ongoing and that the contractors did not know which model versions they were evaluating.
Technical details (reported)
Business Insider reports the work focused on creating prompts and reviewing code, with reviewers comparing paired outputs to assess detail and maintainability. Business Insider reviewed project guidelines that instructed contractors to prefer outputs meeting the prompt's expected level of detail.
Editorial analysis - technical context
Label generation for coding models frequently uses A/B preference collection and targeted prompt-writing to shape style and maintainability. Companies and vendors commonly hire experienced developers for these tasks because evaluating code requires domain knowledge that generic labelers typically lack. For practitioners, this pattern implies that high-quality training signals for coding models often depend on curated human comparisons and prompt-engineering expertise rather than purely automated metrics.
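To make the idea of A/B preference collection concrete, the following is a minimal sketch of how paired-comparison labels could be recorded and summarized. It is an illustration only, not a description of the Marlin workflow: the `Comparison` record, the `win_rate` and `majority_by_prompt` helpers, and the sample data are all hypothetical, and the reporting does not say how Anthropic or Snorkel AI store or aggregate reviewer choices.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """One reviewer's choice between two anonymized outputs (hypothetical schema)."""
    prompt_id: str
    reviewer_id: str
    preferred: str  # "A" or "B"; reviewers do not know which model is which


def win_rate(comparisons, model="A"):
    """Return the fraction of comparisons in which `model` was preferred."""
    if not comparisons:
        raise ValueError("no comparisons to summarize")
    counts = Counter(c.preferred for c in comparisons)
    return counts[model] / len(comparisons)


def majority_by_prompt(comparisons):
    """Return the majority choice per prompt, or "tie" when votes are even."""
    votes = {}
    for c in comparisons:
        votes.setdefault(c.prompt_id, Counter())[c.preferred] += 1
    result = {}
    for prompt_id, counter in votes.items():
        a, b = counter["A"], counter["B"]
        result[prompt_id] = "A" if a > b else "B" if b > a else "tie"
    return result


# Made-up sample data: three prompts, each judged by two or three reviewers.
sample = [
    Comparison("p1", "r1", "A"),
    Comparison("p1", "r2", "A"),
    Comparison("p1", "r3", "B"),
    Comparison("p2", "r1", "B"),
    Comparison("p2", "r2", "B"),
    Comparison("p3", "r1", "A"),
    Comparison("p3", "r2", "B"),
]

print(f"Output A preferred in {win_rate(sample, 'A'):.0%} of comparisons")
print(majority_by_prompt(sample))
```

In practice, preference data like this is more often used to train a reward model or to fit a pairwise ranking model than summarized as a raw win rate; the tallies here only show the shape of the signal that human comparisons produce.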
Context and significance
Industry reporting places this story in a broader trend where data-labeling platforms and vendor-managed contractor pools play a critical, paid role in improving commercial coding assistants. Tracking contractor compensation and review workflows is relevant to reproducibility and auditability of model behavior.
What to watch
Editorial analysis: Observers should watch for vendor disclosures about reviewer instructions, sample sizes, and repeatability of A/B tests, and public reporting on whether similar programs disclose model versions or evaluation datasets.
Scoring Rationale
The story reveals operational details of how a commercial coding model is improved using paid developer feedback and vendor-managed A/B testing. That matters to practitioners building or auditing code-generation systems, but it is not a frontier-model release or new architecture.