Paper introduces delegation-based aggregator for multi-sample LLM inference

An arXiv paper, "When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference" (submitted 6 Jun 2026), introduces Propagational Proxy Voting (PPV), an unsupervised consensus rule for multi-sample LLM outputs. According to the abstract, PPV improves accuracy over majority voting on MMLU-Pro by +1.5 percentage points overall and by +2.24 percentage points on a non-trivial subset, with a paired McNemar p ~ 1.0e-14 (n = 8,099). PPV partitions 128 sampled generations per question into 16 groups, computes each group's letter-level semantic entropy and reasoning embedding centroid, and builds a stochastic delegation matrix whose stationary distribution selects the consensus answer. The authors also report delegation strategies that did not work, negative results that constrain the design space for unsupervised aggregation.
What happened
The arXiv submission "When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference" (submitted 6 Jun 2026) introduces Propagational Proxy Voting (PPV) as an unsupervised aggregator for multi-sample LLM outputs. Per the abstract, PPV outperforms majority voting on MMLU-Pro by +1.5 percentage points overall and by +2.24 percentage points on a non-trivial subset, with a paired McNemar p ~ 1.0e-14 on n = 8,099 questions.
Technical details
Per the arXiv abstract, the method partitions 128 sampled generations per question into 16 groups. For each group the authors compute two signals: group-level letter entropy (a within-group confidence proxy) and a reasoning embedding centroid (a between-group geometric signal). The two signals feed into a stochastic delegation matrix with two per-voter levers the authors call WHEN (weight retained on a group's own pick) and WHOM (how remaining weight is split across peers). The matrix's stationary distribution determines the consensus answer, and the authors describe an example where geometric coherence reverses a 10-6 majority.
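The pipeline described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give formulas for WHEN or WHOM, so the sketch assumes WHEN is one minus normalized entropy (clamped to a range) and WHOM is a split proportional to non-negative cosine similarity between centroids. All function names and parameters (`floor`, `cap`, `n_options`) are hypothetical, and the toy input uses 3 groups of 4 samples rather than 16 groups of 8.

```python
import math
from collections import Counter

def letter_entropy(letters):
    """Shannon entropy (in nats) of the answer letters inside one group."""
    total = len(letters)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(letters).values())

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def delegation_matrix(entropies, centroids, n_options, floor=0.05, cap=0.9):
    """Row-stochastic matrix: row i says where group i's voting weight goes."""
    h_max = math.log(n_options)  # entropy of a uniform answer distribution
    n = len(entropies)
    rows = []
    for i in range(n):
        # WHEN (assumed form): low-entropy groups keep more of their own weight.
        keep = min(max(1.0 - entropies[i] / h_max, floor), cap)
        # WHOM (assumed form): split the rest by cosine similarity to peers.
        sims = [max(cosine(centroids[i], centroids[j]), 0.0) if j != i else 0.0
                for j in range(n)]
        total = sum(sims)
        if total == 0.0:
            keep = 1.0  # no similar peer to delegate to
        row = [(1.0 - keep) * s / total if total else 0.0 for s in sims]
        row[i] = keep
        rows.append(row)
    return rows

def stationary(P, iters=500):
    """Approximate the stationary distribution pi = pi P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def consensus(groups, centroids, n_options):
    """Weight each group's modal letter by its stationary mass; return the winner."""
    picks = [Counter(g).most_common(1)[0][0] for g in groups]
    P = delegation_matrix([letter_entropy(g) for g in groups], centroids, n_options)
    pi = stationary(P)
    score = Counter()
    for pick, weight in zip(picks, pi):
        score[pick] += weight
    return score.most_common(1)[0][0], pi

# Toy input (made up for illustration): one unanimous group and two noisy ones.
groups = [list("AAAA"), list("BBAC"), list("BBCA")]
centroids = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
answer, pi = consensus(groups, centroids, n_options=4)
print(answer, [round(w, 3) for w in pi])
```

In this toy input, a plain majority over the three group picks would choose B (two groups to one). Under the assumed WHEN and WHOM forms, the two noisy groups delegate most of their weight, the unanimous group ends up holding most of the stationary mass, and the consensus is A. The clamp on retained weight keeps the chain from having absorbing or purely periodic states, which is a design choice of this sketch rather than something the abstract specifies.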
Editorial analysis
The paper frames majority voting as discarding two inexpensive signals: within-group entropy and between-group embedding geometry. As a broader pattern, gains in unsupervised aggregation often come from auxiliary signals that are free at inference time, such as entropy or embedding-based agreement, rather than from training additional classifiers. The reported effect size (+1.5 pp) is modest but meaningful for a benchmark like MMLU-Pro. The very small p-value indicates that the gain is unlikely to be a chance artifact on the evaluated set; it does not by itself say how large the gain is or whether it transfers to other benchmarks.
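For readers unfamiliar with the paired McNemar test, the sketch below shows how an exact two-sided version can be computed. The test looks only at questions where the two methods disagree, which is why a modest accuracy gain can still produce a very small p-value when n is large. The abstract reports p and n but not the discordant counts, so the counts used here are hypothetical and chosen only to keep the arithmetic small; `mcnemar_exact` is an illustrative name, not a function from the paper.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: questions the first method gets right and the second gets wrong
    c: questions the second method gets right and the first gets wrong
    Under the null hypothesis, each discordant pair falls on either
    side with probability 1/2, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical discordant counts, NOT taken from the paper.
p = mcnemar_exact(b=10, c=2)
print(f"{p:.4f}")
```

Questions that both methods get right, or both get wrong, do not enter the statistic at all, so the p-value reflects the direction and imbalance of disagreements rather than overall accuracy.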
Context and significance
For practitioners who use multi-sample inference with majority voting as the default ensemble technique, the paper documents a concrete, label-free alternative that integrates confidence and geometric coherence signals. As a broader pattern, methods that expose per-voter delegation or weighted voting can outperform naive majority voting when samples cluster heterogeneously, especially on questions with multimodal answer distributions.
What to watch
Follow-up items include replication on other benchmarks, sensitivity to the number of samples and groups, and whether the choice of embedding model materially affects WHOM decisions. The arXiv abstract also mentions several negative-result delegation strategies that narrow the viable design choices for unsupervised aggregation; the details are left to the full paper.
Scoring rationale
The paper presents a label-free aggregator for multi-sample LLM inference and reports a modest improvement over majority voting on one benchmark (MMLU-Pro). The idea is interesting for practitioners experimenting with ensemble or self-consistency inference, but the gain is small and the single preprint has not yet been independently verified, which places it in the solid-but-niche band.
