LLMs Exhibit Fragile Robustness Under Perturbations

February 20, 2026 | By LDS Team

Relevance Score: 9.1

This arXiv preprint (v1, Feb 19, 2026) tests 23 contemporary LLMs on MMLU, SQuAD, and AMEGA under controlled, meaning-preserving lexical and syntactic perturbations. The authors find that lexical substitutions consistently cause substantial, statistically significant performance drops, while syntactic changes have heterogeneous effects and sometimes improve accuracy. Both kinds of perturbation disrupt model leaderboards, and robustness does not reliably scale with model size. The authors recommend standardizing robustness testing in LLM evaluation.
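To make the idea of a meaning-preserving lexical perturbation concrete, here is a minimal sketch of how an accuracy drop could be measured. It is not the authors' pipeline: the synonym table, the `perturb_lexical` and `accuracy_drop` helpers, and the toy model are all hypothetical names and data invented for illustration.

```python
import re

# Hypothetical synonym table. The preprint's actual substitution
# procedure is not described in this report.
SYNONYMS = {"big": "large", "quick": "fast", "buy": "purchase"}


def perturb_lexical(text, synonyms=SYNONYMS):
    """Swap whole words found in the synonym table, keeping initial capitalization."""
    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, text)


def accuracy(model, items):
    """Fraction of (question, answer) pairs the model gets right."""
    return sum(model(question) == answer for question, answer in items) / len(items)


def accuracy_drop(model, items, perturb=perturb_lexical):
    """Accuracy on the original questions minus accuracy on perturbed ones."""
    perturbed_items = [(perturb(question), answer) for question, answer in items]
    return accuracy(model, items) - accuracy(model, perturbed_items)


# Toy stand-in for an LLM (an assumption): it keys on the exact word "big",
# so it is deliberately brittle to synonym swaps.
def toy_model(question):
    return "yes" if "big" in question.lower() else "no"


TOY_ITEMS = [
    ("Is a big truck heavy?", "yes"),
    ("Is a feather heavy?", "no"),
]
```

With these toy inputs, `accuracy_drop(toy_model, TOY_ITEMS)` compares the same questions before and after substitution, so any difference is attributable to wording alone. A real study would also need a significance test over many items, which this sketch leaves out.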

Key Points

1. Lexical perturbations cause substantial, statistically significant performance degradation across the 23 LLMs and the tasks tested.
2. Syntactic perturbations yield heterogeneous effects, sometimes improving accuracy, which indicates non-uniform model behavior.
3. The authors recommend adding robustness testing to evaluations because leaderboards and model rankings become unstable under perturbation.
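One common way to quantify how much a leaderboard shifts is a rank correlation between scores before and after perturbation. The sketch below computes Kendall's tau-a by hand; whether the preprint uses this statistic is an assumption, and the model names and scores are made-up placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two {model: score} dicts over the same models.

    Tied pairs count as neither concordant nor discordant.
    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both score tables must cover the same models")
    models = sorted(scores_a)
    if len(models) < 2:
        raise ValueError("need at least two models to compare rankings")

    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: the pair is ordered the same way in both tables.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for illustration only.
ORIGINAL = {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70}
PERTURBED = {"model_a": 0.66, "model_b": 0.70, "model_c": 0.60}
```

In this toy case `model_a` and `model_b` swap places after perturbation, so one of the three pairs is discordant and `kendall_tau(ORIGINAL, PERTURBED)` falls below 1. Values near 1 would indicate a stable ranking.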

Scoring Rationale

High methodological breadth and clear findings, limited by preprint status and lack of peer review.

Sources

Public references used for this report.

1. arxiv.org: [2602.17316] Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
