3D World Generators Improve VLA Fine-Tuning

Andrew Choi et al. (arXiv, 19 Mar 2026) show that fine-tuning vision-language-action (VLA) models with reinforcement learning, using 3D world generative models and a language-driven scene designer, substantially improves performance. In simulation, their approach raises the success rate from 9.7% to 79.8% with a 1.25× speedup in completion; in sim-to-real transfer, real-world success rises from 21.7% to 75% with a 1.13× speedup. Ablations show that increased scene diversity improves zero-shot generalization.
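To put the reported success rates in perspective, the short sketch below converts each before/after pair into an absolute gain (percentage points) and a relative multiplier. The helper name `summarize_gain` is hypothetical and used only for illustration; the input figures are the ones quoted above, and nothing here comes from the paper's own code.

```python
def summarize_gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative multiplier).

    Illustrative helper only; not part of the paper's method.
    """
    absolute = round(after_pct - before_pct, 1)
    relative = round(after_pct / before_pct, 2)
    return absolute, relative


# Success rates as quoted in the summary above.
sim_abs, sim_rel = summarize_gain(9.7, 79.8)
real_abs, real_rel = summarize_gain(21.7, 75.0)

print(f"Simulation: +{sim_abs} points ({sim_rel}x the baseline success rate)")
print(f"Real world: +{real_abs} points ({real_rel}x the baseline success rate)")
```

By this arithmetic, the simulation gain is about 70.1 percentage points (roughly 8.2× the baseline) and the real-world gain is about 53.3 points (roughly 3.5× the baseline).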
Scoring Rationale
Strong empirical gains and a scalable simulation technique, limited by preprint status and evaluation by a single research group.
