LLMs Exhibit Fragile Robustness Under Perturbations

This arXiv preprint (v1, Feb 19, 2026) tests 23 contemporary LLMs on MMLU, SQuAD and AMEGA under controlled, meaning-preserving lexical and syntactic perturbations. The authors find that lexical substitutions consistently cause substantial, statistically significant performance drops, whereas syntactic changes have heterogeneous effects and sometimes improve accuracy. Both perturbation types disrupt model leaderboards, and robustness does not reliably scale with model size. They recommend standardizing robustness testing in LLM evaluation.
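
To make the two reported effects concrete (accuracy drops and leaderboard disruption), here is a minimal sketch of how they could be measured. It is not the authors' pipeline: the model names, answers, and helper functions below are hypothetical toy values, and Kendall's tau is used here only as one common way to compare two rankings.

```python
from itertools import combinations


def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by model name.

    +1 means identical ordering, -1 means fully reversed ordering.
    Tied pairs count as neither concordant nor discordant.
    """
    models = sorted(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: gold answers plus each model's answers on the
# original questions and on meaning-preserving rewrites of the same questions.
gold = ["A", "C", "B", "D"]
original = {
    "model_x": ["A", "C", "B", "D"],
    "model_y": ["A", "C", "B", "A"],
    "model_z": ["A", "B", "B", "A"],
}
perturbed = {
    "model_x": ["A", "B", "C", "A"],
    "model_y": ["A", "C", "B", "A"],
    "model_z": ["A", "B", "B", "A"],
}

acc_orig = {m: accuracy(p, gold) for m, p in original.items()}
acc_pert = {m: accuracy(p, gold) for m, p in perturbed.items()}

# Per-model accuracy drop caused by the perturbation (positive = worse).
drops = {m: acc_orig[m] - acc_pert[m] for m in acc_orig}

# Agreement between the original and perturbed leaderboards.
tau = kendall_tau(acc_orig, acc_pert)

print(drops)
print(tau)
```

In this toy setup the top model on the original questions falls to last place after perturbation, so the rank correlation turns negative even though the other two models are unaffected. A real evaluation would also need a significance test on the drops, which this sketch omits.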

Scoring Rationale
High methodological breadth and clear findings, limited by preprint status and lack of peer review.


