Research | llm | robustness | metamorphic testing | benchmarks

LLM Agents Exhibit Semantic Fragility Across Variations

March 16, 2026 | By LDS Team

Relevance Score: 9.0

A March 13, 2026 arXiv preprint by J. De Curtò presents a metamorphic testing framework that assesses the robustness of LLM reasoning agents under eight semantic-preserving transformations, applied to seven foundation models from four architectural families. The study evaluates 19 multi-step reasoning problems spanning eight scientific domains and finds that model scale does not predict robustness: Qwen3-30B produces invariant responses in 79.6% of cases with a semantic similarity of 0.91, while larger models exhibit greater fragility.
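To make the idea of an invariance rate concrete, here is a minimal sketch of a metamorphic test loop. It is not the preprint's implementation: the three transformations, the `normalize` helper, and the `toy_model` stand-in are all hypothetical, since this report does not list the eight transformations the study uses. The sketch only illustrates the general pattern of rewriting a prompt without changing its meaning and checking whether the answer stays the same.

```python
from typing import Callable, Dict, List

# Hypothetical semantic-preserving transformations (illustrative only;
# these are NOT the eight transformations from the preprint).
def add_polite_prefix(prompt: str) -> str:
    return "Please answer the following. " + prompt

def collapse_whitespace(prompt: str) -> str:
    return " ".join(prompt.split())

def append_neutral_context(prompt: str) -> str:
    return prompt + " Take your time."

TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "polite_prefix": add_polite_prefix,
    "collapse_whitespace": collapse_whitespace,
    "neutral_context": append_neutral_context,
}

def normalize(answer: str) -> str:
    """Light normalization so trivial formatting differences do not count."""
    return answer.strip().lower().rstrip(".")

def invariance_rate(
    model: Callable[[str], str],
    prompts: List[str],
    transforms: Dict[str, Callable[[str], str]],
) -> float:
    """Fraction of (prompt, transform) pairs whose answer matches the baseline."""
    total = 0
    invariant = 0
    for prompt in prompts:
        baseline = normalize(model(prompt))
        for transform in transforms.values():
            total += 1
            if normalize(model(transform(prompt))) == baseline:
                invariant += 1
    return invariant / total if total else 0.0

def toy_model(prompt: str) -> str:
    """Deliberately brittle stand-in for an LLM call (hypothetical)."""
    if prompt.startswith("Please"):
        return "I am not sure."
    return "42"

if __name__ == "__main__":
    prompts = ["What is 6 times 7?", "Compute  6 * 7."]
    rate = invariance_rate(toy_model, prompts, TRANSFORMS)
    print(f"invariance rate: {rate:.3f}")
```

In a real evaluation, `toy_model` would be replaced by a call to the model under test, and exact-match comparison would likely give way to a task-appropriate answer check.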

Key Points

1. The study applies metamorphic testing across eight semantic-preserving transformations and seven foundation models.
2. Model scale does not predict robustness; the smaller Qwen3-30B achieves 79.6% invariance.
3. Practitioners are advised to include semantic-preserving transformations in evaluation pipelines to detect brittle reasoning.
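One way an evaluation pipeline might flag brittle reasoning is to compare a model's response to an original prompt with its response to a transformed prompt and flag pairs that drift apart. The sketch below uses bag-of-words cosine similarity as a simple stand-in; this report does not say how the preprint computes semantic similarity, and an embedding-based measure would be the more likely choice in practice. The function names and the 0.8 threshold are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a crude proxy for meaning (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

def flag_brittle(pairs: List[Tuple[str, str]], threshold: float = 0.8) -> List[int]:
    """Indices of (original, transformed) response pairs below the threshold.

    The default threshold is a hypothetical value, not one from the preprint.
    """
    return [
        i for i, (original, transformed) in enumerate(pairs)
        if cosine_similarity(original, transformed) < threshold
    ]

if __name__ == "__main__":
    pairs = [
        ("The answer is 42", "The answer is 42"),
        ("The answer is 42", "Cannot determine"),
    ]
    print(flag_brittle(pairs))
```

Flagged indices point to transformations under which the response changed substantially, which is where manual review of the reasoning would start.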

Scoring Rationale

High novelty and broad applicability, but based on a single preprint evaluation without peer review.

Sources

Public references used for this report.

1. arxiv.org, [2603.13173] Semantic Invariance in Agentic AI
