Analysis · llm · semantic preservation · trail of bits · software correctness

LLMs Undermine Compiler Semantic Guarantees and Reliability

December 19, 2025 | By LDS Team

Relevance Score: 7.0

At the AI Engineer Code Summit in New York, the author argues that large language models (LLMs) differ fundamentally from compilers: a compiler is deterministic and guarantees that its output preserves the meaning of its input, while an LLM offers neither property. He shows that models like Claude, Gemini, and ChatGPT can change program semantics (for example, integer overflow behavior when translating C to Python) and recounts a Vendetect case in which an LLM "fix" removed a crash but broke the program's logic and tests.
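To make the C-to-Python overflow point concrete, here is a minimal sketch; it is not taken from the talk. It assumes a 32-bit C `int` on a typical two's-complement platform, where incrementing the maximum value is commonly observed to wrap around. Strictly speaking, signed overflow is undefined behavior in C, and only unsigned arithmetic is defined to wrap. Python integers have arbitrary precision, so a line-by-line translation silently changes the result. The helper `to_int32` is a hypothetical name introduced only for this illustration.

```python
def to_int32(value: int) -> int:
    """Reduce a Python int to the range of a 32-bit two's-complement integer."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # Values with the top bit set represent negative numbers.
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


INT32_MAX = 2**31 - 1

# Naive translation of C's `x + 1`: Python ints grow instead of overflowing.
naive = INT32_MAX + 1

# Translation that emulates 32-bit wraparound explicitly (an assumption about
# the behavior the original C program relied on).
wrapped = to_int32(INT32_MAX + 1)

print(naive)    # 2147483648
print(wrapped)  # -2147483648
```

Both versions look like faithful translations of `x + 1`, yet they disagree once the value leaves the 32-bit range. A compiler is obliged to preserve defined source semantics; an LLM performing the translation gives no such guarantee.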

Key Points

1. Demonstrates LLM nondeterminism: identical prompts and model updates produce inconsistent code outputs
2. Explains that compilers preserve semantics deterministically, while LLMs offer no semantic guarantees, which increases correctness risks
3. Warns practitioners that automated LLM fixes can break logic, tests, and security-sensitive properties
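One way to picture the third point is a small differential check that compares a patched function against the original on inputs where the original already worked. The sketch below is illustrative only: `original_ratio`, `patched_ratio`, and `find_divergences` are hypothetical names, and the "fix" is invented for this example rather than taken from the Vendetect case.

```python
from typing import Callable, Iterable, List, Tuple

Args = Tuple[int, int]


def original_ratio(a: int, b: int) -> float:
    """Hypothetical original: correct, but raises ZeroDivisionError when b == 0."""
    return a / b


def patched_ratio(a: int, b: int) -> float:
    """Hypothetical automated 'fix': no crash, but floor division alters results."""
    if b == 0:
        return 0.0
    return float(a // b)


def find_divergences(
    old: Callable[[int, int], float],
    new: Callable[[int, int], float],
    inputs: Iterable[Args],
) -> List[Args]:
    """Return the inputs where `old` succeeds but `new` gives a different answer."""
    diverging = []
    for args in inputs:
        try:
            expected = old(*args)
        except Exception:
            continue  # the original crashed here, so there is no reference result
        if new(*args) != expected:
            diverging.append(args)
    return diverging


cases = [(4, 2), (1, 2), (7, 0), (9, 3)]
print(find_divergences(original_ratio, patched_ratio, cases))  # [(1, 2)]
```

The patched version no longer crashes on `(7, 0)`, but the check reports `(1, 2)`, where the result changed from 0.5 to 0.0. A check like this only covers the inputs it is given, so it can surface a semantic change but cannot prove that none exists.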

Scoring Rationale

Highlights pervasive LLM nondeterminism and real-world failures, but relies on single-author examples without broader empirical validation.

Sources

Public references used for this report.

1. blog.trailofbits.com: "Can chatbots craft correct code?"
