Paper Measures LLM Confidence in Code Completion

An arXiv paper (arXiv:2508.16131) by Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, and Panos Louridas evaluates LLM confidence in code completion by measuring code perplexity across programming languages, models, and datasets, using a sample of 2254 files from 881 GitHub projects. The paper reports that strongly-typed languages such as Java exhibit lower perplexity, scripting languages and shell show higher perplexity, and including code comments often increases perplexity. Editorial analysis: These language- and model-level perplexity patterns can serve as a lightweight proxy for expected completion reliability when teams evaluate code-completion tools.
What happened
Per the arXiv paper (arXiv:2508.16131) by Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, and Panos Louridas, the authors measure model confidence for code completion by computing code perplexity across programming languages, models, and corpora, using a sample of 2254 files drawn from 881 GitHub projects. The paper documents that strongly-typed languages exhibit lower perplexity and scripting languages higher, with Shell consistently high across settings and Java consistently low. The paper also reports that including code comments often increases perplexity but does not materially change the language-level ranking.
Technical details
Editorial analysis - technical context: The study uses intrinsic uncertainty metrics such as perplexity, entropy, and mutual information as proxies for model confidence and potential hallucination risk. These intrinsic metrics are presented in the paper as simpler, model-agnostic alternatives to downstream task metrics, which the authors describe as sometimes unreliable or difficult to compute across domains. The paper compares results across multiple LLMs fine-tuned on code and across evaluation corpora; under a fixed model, the authors report moderate stability in relative language-level perplexity rankings.
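To make the intrinsic-metric approach concrete, the sketch below computes perplexity for a small code snippet with a causal language model. The checkpoint name, snippet, and use of the Hugging Face transformers/torch APIs are illustrative assumptions, not the paper's exact measurement pipeline.

```python
# Minimal sketch: perplexity of a code snippet under a causal code LLM.
# Assumptions: a placeholder Hugging Face checkpoint and the standard
# transformers/torch APIs; this is not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigcode/starcoderbase-1b"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

code = "def add(a, b):\n    return a + b\n"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    # With labels equal to the input ids, the model returns the mean
    # cross-entropy (negative log-likelihood per token) over the sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()  # PPL = exp(mean NLL)
print(f"perplexity: {perplexity:.2f}")
```

Lower perplexity means the model assigns higher average probability to the observed tokens, which the paper treats as a sign of higher model confidence.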
Context and significance
Industry context
The findings connect language characteristics to model uncertainty. Lower intrinsic uncertainty in strongly-typed languages, as reported, aligns with the intuition that stronger syntactic and type constraints reduce token-level entropy. Higher uncertainty for scripting and shell code, as documented in the paper, signals areas where completions may be less reliable and where downstream validation is more important. For researchers, the paper provides empirical evidence that intrinsic metrics can complement functional benchmarks when comparing code LLMs.
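The token-level entropy intuition can also be sketched directly: the entropy of the model's next-token distribution at each position, averaged over a file, gives a rough uncertainty score that can be compared across snippets. As above, the checkpoint and example snippets are placeholders, not the paper's setup.

```python
# Minimal sketch: mean per-token predictive entropy of a snippet.
# The checkpoint name and snippets are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigcode/starcoderbase-1b"  # placeholder, as above
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_token_entropy(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Entropy of the predicted next-token distribution at each position.
    entropy = -(probs * log_probs).sum(dim=-1)  # (1, seq_len)
    return entropy.mean().item()

# Illustrative comparison: a syntactically constrained Java declaration vs. a
# loosely constrained shell one-liner would typically differ in mean entropy.
print(mean_token_entropy("public final class Point { }"))
print(mean_token_entropy("cat logs/*.txt | grep ERR | wc -l"))
```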
What to watch
For practitioners: Follow subsequent work that correlates the paper's intrinsic metrics with concrete downstream correctness measures such as test-passing rates, static analysis violations, or vulnerability introduction. Observers should also compare the paper's language rankings on project-specific corpora, since the authors note some corpus-dependent variation in absolute perplexity values. Finally, tooling vendors and evaluator teams will likely weigh these intrinsic signals when designing benchmarks for code-completion quality assessment.
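One hedged way to run the kind of follow-up analysis described above: given per-file perplexities and a matching downstream signal such as test-pass rates, a rank correlation indicates whether the intrinsic metric tracks correctness. The lists below are dummy placeholder values showing only the shape of the computation; the paper itself does not report this correlation.

```python
# Minimal sketch: rank-correlating an intrinsic metric with a downstream
# correctness signal. The lists are dummy placeholders for demonstration only;
# in practice they would come from a real evaluation harness.
from scipy.stats import spearmanr

perplexities = [2.1, 3.8, 1.9, 5.2, 4.4]      # per-file perplexity (dummy)
test_pass_rates = [0.9, 0.5, 0.95, 0.3, 0.4]  # per-file pass rate (dummy)

rho, p_value = spearmanr(perplexities, test_pass_rates)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A strongly negative rho would suggest that higher perplexity tracks lower
# downstream correctness.
```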
Scoring Rationale
The paper offers empirical, cross-language measurements of intrinsic uncertainty that are directly relevant to practitioners evaluating code-completion tools. It is a notable research contribution but not a paradigm-shifting model release.