Applause Finds AI Adoption Outpaces Quality

Applause's fourth annual State of Digital Quality in Testing AI report finds rapid enterprise and consumer adoption of AI alongside a rise in quality problems. The survey of more than 1,000 developers and QA professionals and over 4,000 consumers shows 55% of organizations have released AI-powered features, yet more than half of AI initiatives fail to reach full production. Reported issues include rising hallucinations, misunderstood prompts, and unreliable outputs, even as respondents cite productivity gains. Testing remains heavily human-driven, with 61% relying on human evaluation and 33% using LLM-as-judge methods. The gap between velocity and validation creates retention, revenue, and reputational risks for product teams.
What happened
Applause released its fourth annual State of Digital Quality in Testing AI report showing accelerated deployment of AI-powered applications alongside worsening quality signals. The survey covers more than 1,000 developers and QA professionals and over 4,000 consumers, and finds 55% of organizations have shipped AI features. At the same time, reported quality problems, including hallucinations, misunderstood prompts, and unreliable outputs, are rising again after a prior decline. More than half of AI initiatives still fail to reach full production, driven by integration, cost, and quality barriers.
Technical details
The report documents the testing approaches practitioners are using and their limits. The dominant methods are:
- Human evaluation, used by 61% of organizations, for context-aware validation and edge-case detection
- LLM-as-judge, used by 33%, where multiple models assess outputs in parallel to find blind spots
- Automated testing and monitoring pipelines, which are growing but still struggle with non-deterministic model behavior
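The LLM-as-judge approach can be sketched as an ensemble that scores each output and escalates disagreements to a human reviewer. This is an illustrative design, not Applause's tooling: the judge functions below are offline stand-ins for what would normally be calls to separate models, and all names are hypothetical.

```python
from statistics import mean

def _terms(text: str) -> set:
    # Lowercase and strip trailing punctuation so term overlap is not
    # defeated by periods and commas.
    return {w.strip(".,!?") for w in text.lower().split()}

def judge_length(prompt: str, output: str) -> float:
    # Placeholder judge: penalizes empty or truncated answers.
    return 1.0 if len(output) > 20 else 0.0

def judge_grounding(prompt: str, output: str) -> float:
    # Placeholder judge: rewards answers that reuse terms from the prompt.
    overlap = len(_terms(output) & _terms(prompt))
    return min(1.0, overlap / 5)

JUDGES = [judge_length, judge_grounding]

def evaluate(prompt: str, output: str, disagreement_threshold: float = 0.5) -> dict:
    # Run every judge in parallel conceptually; large score spread between
    # judges is the "blind spot" signal that routes the case to a human.
    scores = [judge(prompt, output) for judge in JUDGES]
    spread = max(scores) - min(scores)
    return {
        "mean_score": mean(scores),
        "needs_human_review": spread > disagreement_threshold,
    }

result = evaluate(
    "Summarize the refund policy for annual plans.",
    "Annual plans can be refunded within 30 days of purchase.",
)
print(result)
```

In a real pipeline each judge would be a distinct model with its own rubric; the disagreement threshold is the knob that trades human review volume against missed failures.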
The dataset shows productivity wins (40% of respondents say AI boosts productivity by more than 75%), but those gains come with brittle behavior that current QA processes fail to catch at scale. The report highlights that teams frequently push to production without mature validation frameworks, increasing downstream risk.
Context and significance
This is a practical signal that AI adoption has moved from experimentation into product velocity, but quality engineering has not kept pace. The tension is familiar: model capability and release cadence have outstripped test design, metrics, and tooling for nondeterministic outputs. The rise of LLM-as-judge indicates a shift toward model-based evaluation, but the report shows hybrid strategies that combine human expertise with model-assisted checks produce the most trustworthy outcomes. For ML engineering and QA, that means investment priorities should include robust evaluation datasets, adversarial testing, real-user monitoring, and human-in-the-loop review where domain nuance matters.
Why it matters for practitioners
Quality failures manifest as user churn, revenue loss, and compliance exposure. The findings imply teams need to treat AI validation as a first-class engineering domain, designing test harnesses for hallucinations, prompt robustness, and context preservation, rather than bolting QA onto a completed model. Vendors and platform teams should expect demand for observability, synthetic-data generation for edge cases, differential testing across model versions, and policy-enforcement tooling.
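A prompt-robustness harness of the kind described above can be sketched as a differential test: run semantically equivalent prompt variants through the model under test and flag any divergence. This is an assumed design, not a tool named in the report; `call_model` is a stand-in lookup table simulating a deployed model so the sketch runs offline.

```python
def call_model(prompt: str) -> str:
    # Stand-in for a real model client; the third entry simulates a
    # robustness regression where a paraphrase changes the answer.
    canned = {
        "What is the capital of France?": "Paris",
        "Name the capital city of France.": "Paris",
        "France's capital is which city?": "Lyon",
    }
    return canned.get(prompt, "I don't know")

def robustness_case(variants: list[str]) -> dict:
    # Semantically equivalent prompts should yield one consistent answer;
    # more than one distinct answer marks the case as a failure.
    answers = {v: call_model(v) for v in variants}
    return {"consistent": len(set(answers.values())) == 1, "answers": answers}

case = robustness_case([
    "What is the capital of France?",
    "Name the capital city of France.",
    "France's capital is which city?",
])
print(case["consistent"])
```

The same harness doubles as a differential test across model versions: point `call_model` at the old and new deployments and diff the answer sets per case.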
What to watch
Product and QA teams will likely accelerate adoption of hybrid evaluation stacks that combine human reviewers, LLM-as-judge workflows, and automated monitoring tied to SLIs for safety and fidelity. Watch for vendor feature releases that simplify adversarial testing, lineage tracking, and production feedback loops. Regulators and enterprise procurement teams may start requiring stronger validation evidence as quality problems translate into material business risk.
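An SLI for safety or fidelity tied to production monitoring might look like the following rolling-window check. This is a minimal sketch under assumed conventions (the class name, window size, and objective are hypothetical), not a design from the report.

```python
from collections import deque

class RollingSLI:
    """Track the fraction of outputs passing safety checks over a window."""

    def __init__(self, window: int = 100, objective: float = 0.95):
        self.events = deque(maxlen=window)  # True = output passed checks
        self.objective = objective

    def record(self, passed: bool) -> None:
        self.events.append(passed)

    @property
    def sli(self) -> float:
        # Fraction of recent outputs that passed; 1.0 when no data yet.
        return sum(self.events) / len(self.events) if self.events else 1.0

    def breached(self) -> bool:
        # A breach would page the team or gate further rollout.
        return self.sli < self.objective

monitor = RollingSLI(window=10, objective=0.9)
for passed in [True] * 8 + [False] * 2:
    monitor.record(passed)
print(monitor.sli, monitor.breached())  # 0.8 is below the 0.9 objective
```

Wiring `record` to the output of an evaluation pipeline closes the production feedback loop the report says most teams lack.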
"AI development isn't slowing down, and quality is falling behind," said Chris Sheehan, EVP of High Tech and AI at Applause, underscoring the need for both speed and rigorous human-in-the-loop validation to maintain trust.
Scoring Rationale
The report is a notable industry signal that adoption has outpaced validation, creating operational and risk management needs for ML teams. It is practically important to practitioners but not a frontier technical breakthrough.