MLA-UCB Improves Multi-Armed Bandits with Generated Surrogates

The paper introduces `MLA-UCB`, an algorithm that integrates pre-trained ML models to convert offline side information into surrogate rewards for multi-armed bandits. The approach addresses bias in offline predictions and provably reduces cumulative regret when predicted and true rewards are jointly Gaussian, while requiring no prior knowledge of the covariance between surrogate and true rewards. The authors extend the method to batched feedback and non-Gaussian rewards, derive computable confidence bounds, and show empirical gains in simulations and real-world tasks including language model selection and video recommendation. Gains occur with moderate offline surrogate sample sizes and modest correlation between surrogate and true rewards.
What happened
The authors introduce `MLA-UCB`, a machine learning-assisted variant of the classic Multi-Armed Bandits framework that uses pre-trained ML models to produce surrogate rewards from offline side information. The paper proves regret improvements under joint Gaussian assumptions and provides extensions for batched feedback and non-Gaussian observations, while requiring no prior knowledge of the covariance between surrogate and true rewards.
Technical details
`MLA-UCB` wraps any reward prediction model and augments the standard UCB exploration rule with confidence bounds that account for surrogate-target uncertainty. Key algorithmic properties include:
- Applicability to any offline reward predictor and arbitrary auxiliary data sources
- Provable cumulative regret improvement when predicted and true rewards are jointly Gaussian, even if surrogate means misalign with true means
- No requirement for prior knowledge of the covariance matrix between surrogate and true rewards
- A batched-feedback extension that handles multiple observations per pull and non-Gaussian rewards with computable confidence bounds
The paper derives regret bounds for the Gaussian case and computable confidence bounds for the batched, non-Gaussian setting. Empirical evaluation uses both synthetic Gaussian surrogates and ML-generated surrogates, plus real-world tests in language model selection and video recommendation, showing consistent regret reductions with moderate offline sample sizes and modest surrogate-target correlations.
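To make the idea concrete, here is a minimal toy sketch of a surrogate-assisted UCB loop. This is an illustrative simplification, not the paper's algorithm: it warm-starts each arm's mean estimate by treating the offline surrogate prediction as `n_offline` pseudo-pulls, then adds a standard UCB exploration bonus. The names (`mla_ucb_toy`, `n_offline`) and the pseudo-count weighting are assumptions for illustration; the actual MLA-UCB confidence bounds additionally correct for surrogate bias and unknown surrogate-target covariance, which this sketch does not attempt.

```python
import numpy as np

def mla_ucb_toy(mu, surrogate_mu, horizon, n_offline=200, noise=0.5, seed=0):
    """Toy surrogate-assisted UCB (illustrative only; not the paper's rule).

    Each arm's mean estimate blends a (possibly biased) offline surrogate
    prediction with the online sample mean, weighting the surrogate like
    n_offline pseudo-pulls, then adds a UCB exploration bonus. Unlike
    MLA-UCB, this sketch does not debias the surrogate.
    """
    rng = np.random.default_rng(seed)
    k = len(mu)
    counts = np.zeros(k, dtype=int)   # online pulls per arm
    sums = np.zeros(k)                # online reward totals per arm
    regret = 0.0
    best = max(mu)
    for t in range(1, horizon + 1):
        # Blended mean: surrogate acts as n_offline pseudo-observations.
        est = (n_offline * surrogate_mu + sums) / (n_offline + counts)
        bonus = noise * np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
        bonus[counts == 0] = np.inf   # pull every arm at least once
        arm = int(np.argmax(est + bonus))
        reward = rng.normal(mu[arm], noise)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - mu[arm]      # pseudo-regret against the best mean
    return counts, regret
```

Because the surrogate weight is fixed, a badly biased surrogate would mislead this sketch indefinitely; handling exactly that failure mode with provable guarantees, and without knowing the surrogate-target covariance, is the paper's contribution.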
Context and significance
This work formalizes a common practical workflow: using offline ML models to warm-start online bandit problems. By giving a theoretically-grounded way to exploit biased ML predictions while controlling exploration, the paper connects counterfactual/offline learning ideas with classic sequential decision theory. The no-covariance requirement and batched extensions make the approach broadly applicable to production settings where full statistical characterization of predictors is unavailable. For practitioners, the result means safer, provable gains from using historical models to accelerate exploration in recommender systems, A/B testing, and model selection.
What to watch
Validate MLA-UCB on larger-scale, nonstationary deployments and quantify sensitivity to surrogate bias and unmodeled heteroskedasticity. Practical adoption will hinge on easy-to-compute confidence terms and robustness to heavy-tailed reward noise.
Scoring Rationale
This is a notable theoretical advance that bridges offline ML prediction and online bandit algorithms, offering provable regret improvements and practical batched extensions. Its very recent arXiv posting tempers the novelty score marginally.