ShengShu Unveils Motubrain World Action Model

What happened
Per a PR Newswire release reproduced by The Manila Times and Yahoo Finance, ShengShu Technology announced Motubrain, described as a unified world action model that combines perception, prediction, generation, and action in a single architecture. The release quotes founder Jun Zhu: "A true world model must be able to build a unified representation of the real world and predict how it evolves."
Per the release, Motubrain achieved an EWM score of 63.77 on WorldArena and an average of 96.0 across 50 predetermined tasks on RoboTwin 2.0, with the release stating it is the only model to exceed 95.0 in randomized environments. Reporting by TipRanks states that ShengShu has secured a $293 million Series B round led by Alibaba Cloud and that Motubrain is already in active deployment with multiple robotics partners.
Technical details
Per the company release and TipRanks coverage, Motubrain builds on ShengShu's prior video model Vidu and uses a multimodal training setup that learns video and action jointly. TipRanks and the PR release describe a three-stream Mixture-of-Transformers architecture connecting video, language, and action, with a training cycle that spans vision-language-action control, world modeling, video generation, inverse dynamics modeling, and joint video-action prediction. The release frames generative video as the foundation for simulating robots at scale and for reducing reliance on physical data collection.
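The release does not disclose architectural specifics, so the following is a minimal sketch of how a three-stream Mixture-of-Transformers is commonly laid out: each modality (video, language, action) keeps its own projection and feed-forward weights, while a single global self-attention runs over the concatenated token sequence so the streams can condition on one another. All names, dimensions, and the PyTorch framing here are illustrative assumptions, not details from ShengShu.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlock(nn.Module):
    """One block with per-modality weights and shared global attention."""

    def __init__(self, d_model=256, n_heads=8, n_streams=3):
        super().__init__()
        self.n_heads, self.d_model = n_heads, d_model
        # Per-stream (video / language / action) parameters: each modality
        # gets its own norms, QKV projection, output projection, and FFN.
        self.norm1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_streams))
        self.norm2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_streams))
        self.qkv = nn.ModuleList(nn.Linear(d_model, 3 * d_model) for _ in range(n_streams))
        self.proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_streams))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_streams))

    def forward(self, streams):
        # streams: [video, language, action] token tensors, each (B, T_i, D).
        B = streams[0].shape[0]
        lengths = [s.shape[1] for s in streams]

        def heads(x):  # (B, T, D) -> (B, H, T, D/H)
            return x.view(B, -1, self.n_heads,
                          self.d_model // self.n_heads).transpose(1, 2)

        # Modality-specific QKV projections...
        qkv = [self.qkv[i](self.norm1[i](s)).chunk(3, -1)
               for i, s in enumerate(streams)]
        # ...but one global attention over the concatenated sequence, so
        # action tokens can attend to video and language tokens directly.
        attn = F.scaled_dot_product_attention(
            heads(torch.cat([q for q, _, _ in qkv], 1)),
            heads(torch.cat([k for _, k, _ in qkv], 1)),
            heads(torch.cat([v for _, _, v in qkv], 1)),
        ).transpose(1, 2).reshape(B, sum(lengths), self.d_model)

        # Route the shared attention output back through each stream's own
        # output projection and feed-forward network.
        outs, start = [], 0
        for i, (s, t) in enumerate(zip(streams, lengths)):
            h = s + self.proj[i](attn[:, start:start + t])
            outs.append(h + self.ffn[i](self.norm2[i](h)))
            start += t
        return outs

# Toy usage: 16 video tokens, 8 language tokens, 4 action tokens.
video, lang, action = (torch.randn(2, t, 256) for t in (16, 8, 4))
video2, lang2, action2 = MoTBlock()([video, lang, action])
```

The appeal of this layout for an action model is that the action stream can attend directly to predicted video tokens, which is one plausible way to couple world prediction with control.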
Industry context
Editorial analysis: companies and research groups working on embodied AI increasingly combine large-scale video pretraining with simulation to bootstrap policies and world models. In similar efforts, simulation-driven pretraining has sped up iteration and broadened environment coverage, but independent replication of benchmark claims is usually needed before practitioners rebuild production stacks around a new model. Integrating perception and control in a single multimodal architecture also raises engineering tradeoffs around latency, safety gating, and domain transfer that teams commonly hit when moving from simulation to real robots.
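To make the safety-gating point concrete, here is a hypothetical sketch of the kind of runtime gate teams often place between a learned policy and a robot: bound each command, enforce a control-loop latency budget, and fall back to a safe action when either check fails. The limits, names, and 7-DoF setup are illustrative assumptions, not anything from the Motubrain release.

```python
import time
import numpy as np

ACTION_LIMIT = 1.0       # assumed per-joint command bound (e.g. rad/s)
LATENCY_BUDGET_S = 0.05  # assumed control-loop deadline (20 Hz)

def gated_step(policy, observation, safe_action):
    start = time.monotonic()
    action = np.asarray(policy(observation), dtype=np.float64)
    elapsed = time.monotonic() - start
    # Gate 1: the model must answer within the control-loop deadline.
    if elapsed > LATENCY_BUDGET_S:
        return safe_action, "fallback:latency"
    # Gate 2: reject non-finite commands, clip the rest into range.
    if not np.all(np.isfinite(action)):
        return safe_action, "fallback:non_finite"
    return np.clip(action, -ACTION_LIMIT, ACTION_LIMIT), "ok"

# Usage with a dummy policy emitting 7-DoF joint velocities.
policy = lambda obs: np.random.uniform(-2, 2, size=7)
action, status = gated_step(policy, observation=None, safe_action=np.zeros(7))
```

A gate like this is deliberately simple: it protects the hardware regardless of what the upstream model does, which matters when that model is a large multimodal network with variable inference latency.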
What to watch
For practitioners: track independent benchmark reproductions and third-party evaluations of Motubrain on WorldArena and RoboTwin 2.0. Watch for published technical papers or model cards that disclose training data composition, compute used, and evaluation protocols. Monitor reported deployment case studies from robotics partners for details on latency, compute footprint, fine-tuning requirements, and safety or failure modes. Also watch for any SDKs, APIs, or open weights that would affect adoption workflows and cost tradeoffs for teams experimenting with embodied agents.
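As a sketch of what an independent reproduction involves, the snippet below shows the minimal bookkeeping needed before comparing against a headline figure like an average of 96.0 over 50 tasks: a fixed task list, several randomized seeds per task, and per-task plus overall means. Every name here is hypothetical, and run_episode is a stand-in for a real RoboTwin-style rollout.

```python
import random
import statistics

def run_episode(model, task, seed):
    # Stand-in rollout: a real harness would reset the simulator with this
    # seed, execute the policy, and return a success score in [0, 100].
    rng = random.Random(hash((task, seed)))
    return rng.uniform(80, 100)

def evaluate(model, tasks, seeds_per_task=20):
    # Per-task mean over randomized seeds, then an overall mean, so the
    # single headline number stays traceable to task-level results.
    per_task = {t: statistics.mean(run_episode(model, t, s)
                                   for s in range(seeds_per_task))
                for t in tasks}
    return statistics.mean(per_task.values()), per_task

overall, breakdown = evaluate(model=None, tasks=[f"task_{i:02d}" for i in range(50)])
print(f"overall mean: {overall:.2f} over {len(breakdown)} tasks")
```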
Scoring rationale
The announcement introduces a unified embodied AI model with strong benchmark claims and significant Series B funding, which matters to robotics and ML practitioners. Independent verification and technical detail remain limited, which tempers the immediate industry impact.