Large Language Models Struggle With DFA Construction

January 21, 2026 | By LDS Team

Relevance Score: 6.1

Researchers have introduced a benchmark that tests whether large language models can construct deterministic finite automata (DFAs) from regular-language descriptions; the preprint was submitted Jan. 19, 2026. Models achieve perfect accuracy on factual items and 84–90% on seen constructions, but accuracy drops 30–64% on unseen problems, whether handcrafted or generated with Arden's theorem. Failures include constraint misinterpretation, Kleene-star errors, and global inconsistency. A hint protocol corrects shallow mistakes but leaves structural reasoning flaws in place.
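To make the task concrete, here is a minimal sketch of what one DFA construction problem and a correctness check could look like. The language ("strings over {a, b} that end in ab"), the state names, and the brute-force comparison are illustrative assumptions, not items or evaluation code from the benchmark.

```python
from itertools import product

# Hypothetical task: build a DFA for "strings over {a, b} that end in ab".
ALPHABET = "ab"
START = "q0"
ACCEPTING = {"q2"}

# q0: no useful suffix, q1: last symbol was 'a', q2: last two symbols were 'ab'.
# A DFA needs exactly one transition per (state, symbol) pair.
TRANSITIONS = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}


def dfa_accepts(word: str) -> bool:
    """Run the DFA on the word and report whether it halts in an accepting state."""
    state = START
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING


def first_mismatch(max_len: int = 8):
    """Return the shortest word on which the DFA and the description disagree."""
    for n in range(max_len + 1):
        for letters in product(ALPHABET, repeat=n):
            word = "".join(letters)
            if dfa_accepts(word) != word.endswith("ab"):
                return word
    return None


print(first_mismatch())  # None: no disagreement on any word up to length 8
```

An exhaustive check up to a bounded length only shows agreement on short strings. It can still surface the kinds of errors the summary lists, such as a missing transition or a wrong accepting state, because those tend to produce a short counterexample.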

Key Points

1. Models demonstrate perfect factual accuracy and 84–90% on seen tasks, with a 30–64% drop on unseen tasks
2. Results reveal systematic failures in constraint interpretation, Kleene-star handling, and global consistency
3. Prompting and hint protocols correct shallow errors but not structural reasoning flaws

Scoring Rationale

Moderate empirical novelty and broad relevance, limited by single preprint source and constrained problem scope.

Sources

Public references used for this report.

1 source

01 arxiv.org [2601.13392] Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks
