LLM Agents Exhibit Semantic Fragility Across Variations

A March 13, 2026 arXiv preprint by J. De Curtò presents a metamorphic testing framework that assesses the robustness of LLM reasoning agents under eight semantic-preserving transformations, applied to seven foundation models from four architectural families. The preprint evaluates 19 multi-step reasoning problems in eight scientific domains and finds that model scale does not predict robustness: Qwen3-30B achieves 79.6% invariant responses with a semantic similarity of 0.91, while larger models exhibit greater fragility.
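To make the idea of an "invariant response" concrete, here is a minimal sketch of a metamorphic test loop. It is an illustration only, not the preprint's implementation: the two transformations, the `toy_agent` stand-in for an LLM, and the exact-match comparison are all assumptions (the preprint also reports semantic similarity, which this sketch does not attempt to reproduce).

```python
import re
from typing import Callable, Dict, List

Agent = Callable[[str], str]
Transform = Callable[[str], str]


def add_preamble(prompt: str) -> str:
    # Hypothetical meaning-preserving transformation: prepend an instruction.
    return "Please think carefully. " + prompt


def append_distractor(prompt: str) -> str:
    # Hypothetical meaning-preserving transformation: append an irrelevant fact.
    return prompt + " Note: the lab was built in 1999."


def toy_agent(prompt: str) -> str:
    # Stand-in for an LLM agent (assumption): sums every integer in the prompt,
    # so an irrelevant number in the text changes its answer.
    return str(sum(int(n) for n in re.findall(r"\d+", prompt)))


def invariance_rate(
    agent: Agent, problems: List[str], transforms: Dict[str, Transform]
) -> float:
    # Fraction of (problem, transformation) pairs whose answer matches the
    # answer to the untransformed problem. Exact string match is used here.
    invariant = 0
    total = 0
    for problem in problems:
        baseline = agent(problem)
        for transform in transforms.values():
            total += 1
            if agent(transform(problem)) == baseline:
                invariant += 1
    return invariant / total if total else 0.0


problems = ["Add 2 and 3.", "Add 10 and 5."]
transforms = {"preamble": add_preamble, "distractor": append_distractor}
rate = invariance_rate(toy_agent, problems, transforms)
print(f"invariant responses: {rate:.1%}")
```

In this toy setup the agent survives the preamble but not the distractor, so half of the transformed prompts yield an invariant answer. A real evaluation would replace `toy_agent` with a model call and exact matching with a more tolerant comparison.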
Scoring Rationale
High novelty and broad applicability, but the findings rest on a single preprint that has not been peer reviewed.

