
Paper introduces delegation-based aggregator for multi-sample LLM inference

June 9, 2026 | By LDS Team

Relevance Score: 5.6

The arXiv paper "When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference" (submitted 6 Jun 2026) introduces Propagational Proxy Voting (PPV), an unsupervised consensus rule for multi-sample LLM outputs. According to the abstract, PPV improves accuracy over majority voting on MMLU-Pro by +1.5 percentage points overall and +2.24 percentage points on a non-trivial subset, with a paired McNemar p ≈ 1.0e-14 (n = 8,099). The method partitions 128 sampled generations per question into 16 groups, computes each group's letter-level semantic entropy and reasoning embedding centroid, and builds a stochastic delegation matrix whose stationary distribution selects the consensus answer. The authors also report negative results for several delegation strategies, which constrain the design space for unsupervised aggregation.

What happened

The arXiv submission "When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference" (submitted 6 Jun 2026) introduces Propagational Proxy Voting (PPV) as an unsupervised aggregator for multi-sample LLM outputs. Per the arXiv abstract, PPV outperforms majority voting on MMLU-Pro by +1.5 percentage points overall and +2.24 percentage points on a non-trivial subset, with a paired McNemar p ≈ 1.0e-14 at sample size n = 8,099.
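A paired McNemar comparison looks only at the questions where the two aggregators disagree. The following is a minimal Python sketch of the exact (binomial) form of the test. The function names and the toy correctness lists are hypothetical illustrations; the source does not state which variant of the test the authors used or what their discordant counts were.

```python
from math import comb


def discordant_counts(baseline_correct, candidate_correct):
    """Count questions where exactly one of the two methods is correct."""
    pairs = list(zip(baseline_correct, candidate_correct))
    b = sum(1 for base, cand in pairs if base and not cand)  # only baseline right
    c = sum(1 for base, cand in pairs if cand and not base)  # only candidate right
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant question is equally likely
    to favor either method, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # the methods never disagree, so there is no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical toy data (not from the paper): per-question correctness flags.
majority = [True, True, False, False, True, False]
ppv = [True, False, True, True, True, True]
b, c = discordant_counts(majority, ppv)
p_value = mcnemar_exact(b, c)
```

Because the test ignores questions where both methods agree, a small accuracy gain can still yield a very small p-value when the disagreements fall overwhelmingly on one side.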

Technical details

Per the arXiv abstract, the method partitions 128 sampled generations per question into 16 groups. For each group, the authors compute two signals: group-level letter entropy (a within-group confidence proxy) and a reasoning embedding centroid (a between-group geometric signal). The two signals feed into a stochastic delegation matrix with two per-voter levers the authors call WHEN (the weight a group retains on its own pick) and WHOM (how the remaining weight is split across peers). The matrix's stationary distribution determines the consensus answer, and the authors describe an example in which geometric coherence reverses a 10-to-6 majority.
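The pipeline described above can be sketched in Python with NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not give the formulas for WHEN or WHOM, so the entropy-based retention weight, the softmax over cosine similarity, the `temperature` and `damping` parameters, and all function names here are hypothetical choices.

```python
from collections import Counter

import numpy as np


def letter_entropy(letters):
    """Shannon entropy (in nats) of the answer letters within one group."""
    counts = np.array(list(Counter(letters).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())


def delegation_matrix(entropies, centroids, n_options=10,
                      temperature=0.1, damping=0.01):
    """Row-stochastic matrix: row i says where group i sends its vote weight."""
    entropies = np.asarray(entropies, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    n = len(entropies)
    # WHEN (assumed form): confident, low-entropy groups keep more of their weight.
    keep = 1.0 - np.clip(entropies / np.log(n_options), 0.0, 1.0)
    # WHOM (assumed form): softmax over cosine similarity to the other groups.
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a group never delegates to itself here
    whom = np.exp((sim - sim.max(axis=1, keepdims=True)) / temperature)
    whom /= whom.sum(axis=1, keepdims=True)
    P = np.diag(keep) + (1.0 - keep)[:, None] * whom
    # Assumed uniform damping so that the stationary distribution is unique.
    return (1.0 - damping) * P + damping / n


def stationary_distribution(P):
    """Solve pi @ P = pi with sum(pi) = 1 as a linear system."""
    n = P.shape[0]
    A = P.T - np.eye(n)
    A[-1, :] = 1.0  # replace one redundant equation with the sum constraint
    rhs = np.zeros(n)
    rhs[-1] = 1.0
    return np.linalg.solve(A, rhs)


def ppv_style_consensus(groups, centroids):
    """Pick the letter whose groups hold the most stationary mass."""
    picks = [Counter(g).most_common(1)[0][0] for g in groups]
    entropies = [letter_entropy(g) for g in groups]
    pi = stationary_distribution(delegation_matrix(entropies, centroids))
    mass = Counter()
    for pick, weight in zip(picks, pi):
        mass[pick] += weight
    return max(mass, key=mass.get), pi


# Hypothetical toy input: 3 groups of 8 sampled letters with 2-d "centroids".
groups = [["A"] * 8, ["A"] * 7 + ["C"], ["B", "B", "B", "B", "C", "D", "A", "E"]]
centroids = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
answer, pi = ppv_style_consensus(groups, centroids)
```

In this toy setup the uncertain third group hands most of its weight to the confident groups, so their shared pick accumulates the stationary mass. A real run would use 16 groups per question and embedding centroids from an actual encoder.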

Editorial analysis

The paper frames majority voting as discarding two inexpensive signals: within-group entropy and between-group embedding geometry. As a general pattern, gains in unsupervised aggregation often come from auxiliary signals that are free at inference time, such as entropy or embedding-based agreement, rather than from training additional classifiers. The reported effect size (+1.5 pp) is modest but meaningful for a benchmark like MMLU-Pro, and the extremely small p-value suggests the improvement is consistent on the evaluated set.

Context and significance

For practitioners who rely on multi-sample inference with majority voting as the default ensemble technique, the paper documents a concrete, label-free alternative that integrates confidence and geometric coherence signals. As a general pattern, methods that expose per-voter delegation or weighted voting can outperform naive majority when samples cluster heterogeneously, especially on questions with multimodal answer distributions.

What to watch

Follow-up items include replication on other benchmarks, sensitivity to the number of samples and groups, and whether embedding choices materially affect WHOM decisions. The arXiv abstract also reports several negative-result delegation strategies that narrow the viable design choices for unsupervised aggregation; the full paper should give the details.

Key Points

1. PPV uses group-level letter entropy and per-group reasoning embedding centroids to reweight sampled votes without labels, improving consensus accuracy.
2. On MMLU-Pro, the authors report a +1.5 pp overall gain and +2.24 pp on a non-trivial subset, with a paired McNemar p ≈ 1.0e-14 (n = 8,099), per arXiv.
3. As a general pattern, leveraging free inference-time signals (entropy, embedding geometry) can outperform naive majority voting when sample clusters are heterogeneous.

Scoring Rationale

A label-free aggregator for multi-sample LLM inference that reports a modest improvement over majority voting on one benchmark (MMLU-Pro). The idea is interesting for practitioners experimenting with ensemble or self-consistency inference, but the gain is small and the single preprint is not yet independently verified, placing it in the solid-but-niche band.

Sources

Primary source and supporting public references used for this report.

Primary source: arxiv.org, [2606.08098] When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
