BridgeV2W

Bridging Video Generation Models to Embodied World Models via Embodiment Masks


TL;DR:

BridgeV2W bridges pretrained video generation models to embodied world models via embodiment masks that align actions with pixel space, improving robustness to camera viewpoint, unifying the architecture across embodiments, and effectively reusing pretrained visual and motion priors.

Abstract

Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To address these challenges, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses learning on dynamic and task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning.


Figure 1: BridgeV2W vs. previous methods. Pixel-aligned embodiment masks bridge video generation models to embodied world models, addressing the action–video gap, improving viewpoint robustness, and yielding a unified architecture across embodiments.

Method

BridgeV2W is an action-conditioned embodied world model that predicts future videos from an initial image and an action sequence. The core idea is to bridge actions into the visual domain by converting them into pixel-aligned embodiment masks and injecting them into a pretrained video diffusion model.
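To make the interface concrete, here is a minimal Python sketch of such an action-conditioned world model: one RGB observation plus a coordinate-space action sequence in, a predicted video clip out. The class and field names are placeholders assumed for illustration, not the released implementation.

from dataclasses import dataclass
import numpy as np

@dataclass
class WorldModelInputs:
    initial_image: np.ndarray   # (H, W, 3) first observation
    actions: np.ndarray         # (T, action_dim) coordinate-space actions
    intrinsics: np.ndarray      # (3, 3) camera matrix K
    extrinsics: np.ndarray      # (4, 4) world-to-camera transform

class EmbodiedWorldModel:
    def __init__(self, mask_renderer, mask_conditioned_diffusion):
        self.mask_renderer = mask_renderer               # actions -> pixel-space masks
        self.video_model = mask_conditioned_diffusion    # image + masks -> video

    def predict_video(self, x: WorldModelInputs) -> np.ndarray:
        # 1) Bridge actions into the visual domain as embodiment masks.
        masks = self.mask_renderer(x.actions, x.intrinsics, x.extrinsics)   # (T, H, W)
        # 2) Condition a pretrained image-to-video diffusion model on the masks.
        return self.video_model(x.initial_image, masks)                     # (T, H, W, 3)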

Action-to-Mask Conditioning

The action-to-mask conditioning module converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters.

  • From Actions to Pixel-Space Masks: BridgeV2W converts actions into temporally aligned embodiment masks in pixel space by simulating embodiment motion (e.g., via URDF) and projecting it onto the image plane. These masks explicitly indicate where the embodiment is expected to move at each timestep (a projection sketch follows this list).
  • Mask-Conditioned Video Diffusion: The mask sequence is injected into a pretrained image-to-video diffusion model through a ControlNet-style conditioning branch, guiding generation while preserving pretrained visual and motion priors. This design also supports settings without action annotations, where masks can be obtained directly from video segmentation.
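As referenced in the first bullet, the projection step can be sketched as follows, assuming a hypothetical forward-kinematics helper fk_surface_points(urdf, joint_state) that returns 3D points sampled on the embodiment's links in the world frame; the pinhole projection itself is standard, but this is not the authors' exact renderer.

import numpy as np

def render_embodiment_mask(joint_state, urdf, fk_surface_points, K, T_wc, image_hw):
    """Project the posed embodiment onto the image plane as a binary mask."""
    H, W = image_hw
    pts_world = fk_surface_points(urdf, joint_state)                 # (N, 3) link surface points
    pts_h = np.hstack([pts_world, np.ones((pts_world.shape[0], 1))]) # homogeneous coordinates
    pts_cam = (T_wc @ pts_h.T).T[:, :3]                              # world -> camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]                          # drop points behind the camera
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                      # perspective divide
    mask = np.zeros((H, W), dtype=np.uint8)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    mask[uv[valid, 1].astype(int), uv[valid, 0].astype(int)] = 1     # rasterize sampled points
    return mask

def render_mask_sequence(joint_states, urdf, fk_surface_points, K, T_wc, image_hw):
    # One mask per simulated timestep yields a sequence temporally aligned with the video.
    return np.stack([render_embodiment_mask(q, urdf, fk_surface_points, K, T_wc, image_hw)
                     for q in joint_states])

In practice a mesh rasterizer would fill full link silhouettes rather than scatter sampled points, but the geometry of the action-to-pixel mapping is the same.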

Motion-Centric Training Objectives

Standard diffusion losses supervise frames independently and may overlook temporal structure. To improve spatiotemporal consistency, BridgeV2W combines two motion-aware objectives (a code sketch follows the list):

  • Latent dynamics consistency, which explicitly aligns temporal differences between predicted and ground-truth video latents across multiple time offsets, encouraging coherent long-horizon dynamics.
  • Optical flow supervision, which compares motion fields between predicted and ground-truth videos using a frozen flow estimator, focusing learning on embodiment and object motion rather than static backgrounds.
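A hedged PyTorch sketch of both objectives is given below; the tensor shapes, offset set, and flow-estimator interface are assumptions for illustration rather than the exact training code.

import torch
import torch.nn.functional as F

def latent_dynamics_consistency(z_pred, z_gt, offsets=(1, 2, 4)):
    """Match temporal latent differences z[t + k] - z[t] across several offsets k.
    z_pred, z_gt: (B, T, C, H, W) predicted / ground-truth video latents."""
    used, loss = 0, z_pred.new_zeros(())
    for k in offsets:
        if k >= z_pred.shape[1]:
            continue                                 # skip offsets longer than the clip
        d_pred = z_pred[:, k:] - z_pred[:, :-k]      # predicted latent dynamics
        d_gt = z_gt[:, k:] - z_gt[:, :-k]            # ground-truth latent dynamics
        loss = loss + F.l1_loss(d_pred, d_gt)
        used += 1
    return loss / max(used, 1)

def flow_supervision(video_pred, video_gt, flow_estimator):
    """Compare motion fields of consecutive frames with a frozen flow estimator.
    video_pred, video_gt: (B, T, 3, H, W); flow_estimator maps a frame pair to (B, 2, H, W)."""
    num_pairs = video_gt.shape[1] - 1
    with torch.no_grad():                            # ground-truth flow needs no gradients
        flow_gt = torch.stack([flow_estimator(video_gt[:, t], video_gt[:, t + 1])
                               for t in range(num_pairs)], dim=1)
    flow_pred = torch.stack([flow_estimator(video_pred[:, t], video_pred[:, t + 1])
                             for t in range(num_pairs)], dim=1)   # gradients reach video_pred
    return F.l1_loss(flow_pred, flow_gt)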

Figure 2: Overview of the BridgeV2W pipeline. Actions are projected into pixel-space masks using URDF and camera parameters. The initial image and mask sequence are encoded by a VAE, and the mask features are injected through a ControlNet branch into the DiT backbone. The model is trained with diffusion, dynamics-consistency, and flow-based objectives to generate action-consistent videos.
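The ControlNet-style pathway in this pipeline can be sketched as follows, assuming PyTorch and a DiT-like backbone whose blocks map token tensors of shape (B, N, D) to the same shape; the block structure and names are illustrative placeholders, not the actual BridgeV2W modules.

import copy
import torch.nn as nn

def zero_linear(dim):
    # Zero-initialized projection: the control branch contributes nothing at the
    # start of finetuning, so the pretrained visual and motion priors are preserved.
    proj = nn.Linear(dim, dim)
    nn.init.zeros_(proj.weight)
    nn.init.zeros_(proj.bias)
    return proj

class MaskControlBranch(nn.Module):
    """Trainable copies of backbone blocks process mask tokens and emit residuals
    that are added back into the corresponding frozen DiT blocks."""

    def __init__(self, backbone_blocks, dim):
        super().__init__()
        self.blocks = nn.ModuleList(copy.deepcopy(b) for b in backbone_blocks)
        self.zero_projs = nn.ModuleList(zero_linear(dim) for _ in backbone_blocks)

    def forward(self, mask_tokens, backbone_tokens_per_block):
        # mask_tokens: (B, N, D) tokens of the VAE-encoded mask sequence.
        # backbone_tokens_per_block: per-block (B, N, D) hidden states of the frozen DiT.
        residuals, h = [], mask_tokens
        for blk, proj, x in zip(self.blocks, self.zero_projs, backbone_tokens_per_block):
            h = blk(h + x)               # fuse mask features with backbone features
            residuals.append(proj(h))    # zero-initialized residual injected into the DiT
        return residuals

Because the residual projections start at zero, the backbone's pretrained behavior is untouched at initialization and the mask conditioning is learned gradually, which is the usual ControlNet rationale for reusing pretrained priors.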

Video Generation Capabilities

Qualitative video prediction results on DROID (single-arm), AgiBot-G1 (dual-arm), and real-world scenes.

Downstream Applications

Policy Evaluation


Figure 3: Correlation between BridgeV2W evaluation and real-world success.


Table 1: Policy evaluation across tasks and baselines using BridgeV2W.

Goal-Image Conditioned Manipulation


Table 2: Real-world planning performance with BridgeV2W.

WM-Enhanced Closed-Loop Control


Table 3: Real-world success rates with BridgeV2W + OpenVLA-OFT.

Citation

@misc{chen2026bridgev2wbridgingvideogeneration,
    title={BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks}, 
    author={Yixiang Chen and Peiyan Li and Jiabing Yang and Keji He and Xiangnan Wu and Yuan Xu and Kai Wang and Jing Liu and Nianfeng Liu and Yan Huang and Liang Wang},
    year={2026},
    eprint={2602.03793},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2602.03793}, 
}