MLA-UCB Improves Multi-Armed Bandits with Generated Surrogates

The paper introduces `MLA-UCB`, an algorithm that integrates pre-trained ML models to convert offline side information into surrogate rewards for multi-armed bandits. The approach addresses bias in offline predictions and provably reduces cumulative regret when predicted and true rewards are jointly Gaussian, while requiring no prior knowledge of the covariance between surrogate and true rewards. The authors extend the method to batched feedback and non-Gaussian rewards, derive computable confidence bounds, and show empirical gains in simulations and real-world tasks including language model selection and video recommendation. Gains occur with moderate offline surrogate sample sizes and modest correlation between surrogate and true rewards.
What happened
The authors introduce `MLA-UCB`, a machine learning-assisted variant of the classic Multi-Armed Bandits framework that uses pre-trained ML models to produce surrogate rewards from offline side information. The paper proves regret improvements under joint Gaussian assumptions and provides extensions for batched feedback and non-Gaussian observations, while requiring no prior knowledge of the covariance between surrogate and true rewards.
Technical details
`MLA-UCB` wraps any reward prediction model and augments the standard UCB exploration rule with confidence bounds that account for surrogate-target uncertainty. Key algorithmic properties include:
- Applicability to any offline reward predictor and arbitrary auxiliary data sources
- Provable cumulative regret improvement when predicted and true rewards are jointly Gaussian, even if surrogate means misalign with true means
- No requirement for prior knowledge of the covariance matrix between surrogate and true rewards
- A batched-feedback extension that handles multiple observations per pull and non-Gaussian rewards with computable confidence bounds
The paper derives regret bounds for the Gaussian case and computable confidence bounds for the batched, non-Gaussian setting. Empirical evaluation uses both synthetic Gaussian surrogates and ML-generated surrogates, plus real-world tests in language model selection and video recommendation, showing consistent regret reductions with moderate offline sample sizes and modest surrogate-target correlations.
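To make the idea concrete, here is a minimal toy sketch of a surrogate-assisted UCB loop. This is an illustrative simplification, not the paper's algorithm: it warm-starts each arm's mean estimate by treating the offline surrogate prediction as `n_offline` pseudo-pulls, then adds a standard UCB exploration bonus. The names (`mla_ucb_toy`, `n_offline`) and the pseudo-count weighting are assumptions for illustration; the actual MLA-UCB confidence bounds additionally correct for surrogate bias and unknown surrogate-target covariance, which this sketch does not attempt.

```python
import numpy as np

def mla_ucb_toy(mu, surrogate_mu, horizon, n_offline=200, noise=0.5, seed=0):
    """Toy surrogate-assisted UCB (illustrative only; not the paper's rule).

    Each arm's mean estimate blends a (possibly biased) offline surrogate
    prediction with the online sample mean, weighting the surrogate like
    n_offline pseudo-pulls, then adds a UCB exploration bonus. Unlike
    MLA-UCB, this sketch does not debias the surrogate.
    """
    rng = np.random.default_rng(seed)
    k = len(mu)
    counts = np.zeros(k, dtype=int)   # online pulls per arm
    sums = np.zeros(k)                # online reward totals per arm
    regret = 0.0
    best = max(mu)
    for t in range(1, horizon + 1):
        # Blended mean: surrogate acts as n_offline pseudo-observations.
        est = (n_offline * surrogate_mu + sums) / (n_offline + counts)
        bonus = noise * np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
        bonus[counts == 0] = np.inf   # pull every arm at least once
        arm = int(np.argmax(est + bonus))
        reward = rng.normal(mu[arm], noise)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - mu[arm]      # pseudo-regret against the best mean
    return counts, regret
```

Because the surrogate weight is fixed, a badly biased surrogate would mislead this sketch indefinitely; handling exactly that failure mode with provable guarantees, and without knowing the surrogate-target covariance, is the paper's contribution.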
Context and significance
This work formalizes a common practical workflow: using offline ML models to warm-start online bandit problems. By giving a theoretically-grounded way to exploit biased ML predictions while controlling exploration, the paper connects counterfactual/offline learning ideas with classic sequential decision theory. The no-covariance requirement and batched extensions make the approach broadly applicable to production settings where full statistical characterization of predictors is unavailable. For practitioners, the result means safer, provable gains from using historical models to accelerate exploration in recommender systems, A/B testing, and model selection.
What to watch
Validate MLA-UCB on larger-scale, nonstationary deployments and quantify sensitivity to surrogate bias and unmodeled heteroskedasticity. Practical adoption will hinge on easy-to-compute confidence terms and robustness to heavy-tailed reward noise.
Scoring Rationale
This is a notable theoretical advance that bridges offline ML prediction and online bandit algorithms, offering provable regret improvements and practical batched extensions. Its very recent arXiv posting tempers the novelty score marginally.