Paper Measures LLM Confidence in Code Completion

An arXiv paper (arXiv:2508.16131) by Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, and Panos Louridas evaluates LLM confidence in code completion by measuring code perplexity across programming languages, models, and datasets, using a sample of 2254 files from 881 GitHub projects. The paper reports that strongly-typed languages such as Java exhibit lower perplexity, scripting languages and shell show higher perplexity, and including code comments often increases perplexity. Editorial analysis: These language- and model-level perplexity patterns can serve as a lightweight proxy for expected completion reliability when teams evaluate code-completion tools.
What happened
Per the arXiv paper (arXiv:2508.16131) by Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, and Panos Louridas, the authors measure model confidence for code completion by computing code perplexity across programming languages, models, and corpora, using a sample of 2254 files drawn from 881 GitHub projects. The paper documents that strongly-typed languages exhibit lower perplexity and scripting languages higher, with Shell consistently high across settings and Java consistently low. The paper also reports that including code comments often increases perplexity but does not materially change the language-level ranking.
Technical details
Editorial analysis - technical context: The study uses intrinsic uncertainty metrics such as perplexity, entropy, and mutual information as proxies for model confidence and potential hallucination risk. These intrinsic metrics are presented in the paper as simpler, model-agnostic alternatives to downstream task metrics, which the authors describe as sometimes unreliable or difficult to compute across domains. The paper compares results across multiple LLMs fine-tuned on code and across evaluation corpora; under a fixed model, the authors report moderate stability in relative language-level perplexity rankings.
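To make the intrinsic-metric approach concrete, the sketch below computes perplexity for a small code snippet with a causal language model. The checkpoint name, snippet, and use of the Hugging Face transformers/torch APIs are illustrative assumptions, not the paper's exact measurement pipeline.

```python
# Minimal sketch: perplexity of a code snippet under a causal code LLM.
# Assumptions: a placeholder Hugging Face checkpoint and the standard
# transformers/torch APIs; this is not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigcode/starcoderbase-1b"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

code = "def add(a, b):\n    return a + b\n"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    # With labels equal to the input ids, the model returns the mean
    # cross-entropy (negative log-likelihood per token) over the sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()  # PPL = exp(mean NLL)
print(f"perplexity: {perplexity:.2f}")
```

Lower perplexity means the model assigns higher average probability to the observed tokens, which the paper treats as a sign of higher model confidence.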
Context and significance
Industry context
The findings connect language characteristics to model uncertainty. Lower intrinsic uncertainty in strongly-typed languages, as reported, aligns with the intuition that stronger syntactic and type constraints reduce token-level entropy. Higher uncertainty for scripting and shell code, as documented in the paper, signals areas where completions may be less reliable and where downstream validation is more important. For researchers, the paper provides empirical evidence that intrinsic metrics can complement functional benchmarks when comparing code LLMs.
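The token-level entropy intuition can also be sketched directly: the entropy of the model's next-token distribution at each position, averaged over a file, gives a rough uncertainty score that can be compared across snippets. As above, the checkpoint and example snippets are placeholders, not the paper's setup.

```python
# Minimal sketch: mean per-token predictive entropy of a snippet.
# The checkpoint name and snippets are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigcode/starcoderbase-1b"  # placeholder, as above
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_token_entropy(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Entropy of the predicted next-token distribution at each position.
    entropy = -(probs * log_probs).sum(dim=-1)  # (1, seq_len)
    return entropy.mean().item()

# Illustrative comparison: a syntactically constrained Java declaration vs. a
# loosely constrained shell one-liner would typically differ in mean entropy.
print(mean_token_entropy("public final class Point { }"))
print(mean_token_entropy("cat logs/*.txt | grep ERR | wc -l"))
```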
What to watch
For practitioners: Follow subsequent work that correlates the paper's intrinsic metrics with concrete downstream correctness measures such as test-passing rates, static analysis violations, or vulnerability introduction. Observers should also compare the paper's language rankings on project-specific corpora, since the authors note some corpus-dependent variation in absolute perplexity values. Finally, tooling vendors and evaluator teams will likely weigh these intrinsic signals when designing benchmarks for code-completion quality assessment.
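One hedged way to run the kind of follow-up analysis described above: given per-file perplexities and a matching downstream signal such as test-pass rates, a rank correlation indicates whether the intrinsic metric tracks correctness. The lists below are dummy placeholder values showing only the shape of the computation; the paper itself does not report this correlation.

```python
# Minimal sketch: rank-correlating an intrinsic metric with a downstream
# correctness signal. The lists are dummy placeholders for demonstration only;
# in practice they would come from a real evaluation harness.
from scipy.stats import spearmanr

perplexities = [2.1, 3.8, 1.9, 5.2, 4.4]      # per-file perplexity (dummy)
test_pass_rates = [0.9, 0.5, 0.95, 0.3, 0.4]  # per-file pass rate (dummy)

rho, p_value = spearmanr(perplexities, test_pass_rates)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A strongly negative rho would suggest that higher perplexity tracks lower
# downstream correctness.
```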
Scoring Rationale
The paper offers empirical, cross-language measurements of intrinsic uncertainty that are directly relevant to practitioners evaluating code-completion tools. It is a notable research contribution but not a paradigm-shifting model release.