Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation
Paper Guide Brief
Reading Brief
This paper presents Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence that unifies manipulation, driving, navigation, and human-to-robot transfer under a shared natural language action interface. It combines a 60-layer double-stream MMDiT with frozen Qwen2.5-VL action encoding, the 8.6M-pair Embodied World Knowledge (EWK) dataset, and a General+Expert progressive curriculum.
Central Claim
The paper introduces a unified language-conditioned video world model that achieves cross-embodiment and cross-domain physical generalization through MLLM action encoding, a large-scale action-language mapped dataset, and a staged training curriculum.
Contribution
The paper contributes three components: a double-stream MMDiT with MLLM action encoding, the Embodied World Knowledge (EWK) dataset with action-language mapping, and the General+Expert Progressive Curriculum, a two-stage training strategy.
Why It Matters
It provides a scalable foundation for cross-embodiment world modeling by using natural language as a universal action interface and demonstrating that joint training across diverse domains reinforces physical generalization.
Prerequisites
double-stream MMDiT, MLLM action encoding, flow matching, 3D RoPE, progressive curriculum learning
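Of these prerequisites, flow matching is the training objective, and a small sketch may help fix the idea. The version below assumes the common straight-line path between a Gaussian noise sample and a data sample, with the model regressing the constant velocity along that path. This brief does not state which path or time sampling the paper uses, so the linear path, the uniform time draw, and the names `interpolate`, `target_velocity`, `flow_matching_loss`, and `predict_velocity` are illustrative assumptions.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, cond, rng):
    """One-sample Monte Carlo estimate of a flow matching loss.

    predict_velocity(x_t, t, cond) is a hypothetical stand-in for the
    video model; cond stands in for the encoded language instruction.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_target = target_velocity(x0, x1)
    v_pred = predict_velocity(x_t, t, cond)
    # Mean squared error between predicted and target velocity.
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x1)
```

A predictor that always returns zeros gets a positive loss, while one that returns the true path velocity drives the loss to zero. That is the signal a real model would be trained on.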
Atlas Placement
Computer Vision (subfield)
Read If
You care about double-stream MMDiT, MLLM action encoding, or flow matching.
Skip If
You only care about EWMBench or DreamGen Bench.
Noosaga Placements
- The paper constitutes a video generation system conditioned on visual observations and language; it uses VAEs, diffusion models, and video tokenization, and addresses multi-view geometric consistency. Evidence: "predicts physically grounded future visual trajectories from current observations"; "Double-Stream MMDiT with MLLM Action Encoding"; "video-VAE latents".
- Diffusion Models (framework, 90%): The paper uses a diffusion transformer (MMDiT) trained with the flow matching objective; the denoising process is central to the generation pipeline. Evidence: "double-stream diffusion transformer"; "denoising process"; "flow matching objective" (Lipman et al., 2023).
- The core technology is a 60-layer double-stream diffusion transformer trained with flow matching; the work extensively uses large language models, transformer architectures, and representation learning. Evidence: "60-layer double-stream diffusion transformer"; "MMDiT adopts a double-stream architecture"; "flow matching objective".
- Transformer Architecture (framework, 90%): The core architecture is a 60-layer transformer with double-stream MMDiT blocks, joint attention, and 3D RoPE, all of which are transformer architecture components. Evidence: "60-layer double-stream diffusion transformer"; "24 attention heads (head dimension 128), hidden size 3,072"; "3D RoPE" (Su et al., 2024; Heo et al., 2024).
- The model is designed for embodied intelligence tasks including robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, all robotics-adjacent domains. Evidence: "embodied intelligence requires agents to perceive, reason, and act"; "robotic manipulation, autonomous driving, indoor navigation, human-to-robot transfer"; "20+ robot embodiments".
- Language is used as the unified action interface; the work relies on an MLLM (Qwen2.5-VL) for semantic encoding and instruction parsing, and captions are generated using LLM-based pipelines. Evidence: "natural language as a unified action interface"; "frozen Qwen2.5-VL to encode user inputs into condition signals"; "hierarchical annotation framework".
- The paper proposes a foundation model approach for embodied world modeling, combining language, vision, and planning signals into a single system; this fits a broad AI systems perspective. Evidence: "language-conditioned video world model"; "unified formulation provides ... language-guided planning signals"; "general world priors and embodied action priors".
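The architecture figures quoted in the placements above (60 layers, 24 heads, head dimension 128, hidden size 3,072) can be checked for consistency, and the 3D RoPE idea can be illustrated alongside them. The sketch below is hypothetical: `MMDiTConfig` and `rope_angles_3d` are illustrative names, the split of the 128 head channels into (32, 48, 48) for the time, height, and width axes is an assumption not taken from the paper, and the base of 10000 is the conventional RoPE default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMDiTConfig:
    """Sizes quoted in the brief; the class itself is illustrative."""
    num_layers: int = 60
    num_heads: int = 24
    head_dim: int = 128
    hidden_size: int = 3072

    def is_consistent(self) -> bool:
        # The hidden size should equal heads times per-head dimension.
        return self.num_heads * self.head_dim == self.hidden_size


def rope_angles_3d(pos, axis_dims=(32, 48, 48), base=10000.0):
    """Rotation angles for one token at pos = (t, h, w).

    Each axis owns a slice of the head channels (assumed split) and
    rotates channel pairs at frequencies base ** (-2i / d).
    """
    if any(d % 2 for d in axis_dims):
        raise ValueError("each axis needs an even number of channels")
    angles = []
    for p, d in zip(pos, axis_dims):
        for i in range(d // 2):
            freq = base ** (-2.0 * i / d)
            angles.append(p * freq)
    return angles
```

With the default configuration, `MMDiTConfig().is_consistent()` holds because 24 × 128 = 3,072, and `rope_angles_3d` returns one angle per channel pair, that is, 64 angles for a 128-channel head.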
Abstract
We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M-pair video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
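The abstract's "layer-wise joint attention" between language semantics and video latents can be pictured with a toy example. The sketch below is not the paper's implementation. It assumes the usual double-stream pattern in which each modality keeps its own projection parameters while attention runs over the concatenated token sequence. The scalar weights stand in for learned linear projections, and `project`, `joint_attention`, `text_w`, and `video_w` are hypothetical names.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def project(tokens, w):
    """Scalar stand-in for a learned per-stream linear projection."""
    return [[w * x for x in tok] for tok in tokens]


def joint_attention(text_tokens, video_tokens, text_w, video_w):
    """Attend over text and video tokens jointly, then split the result."""
    # Each stream is projected with its own weights ("double-stream").
    q = project(text_tokens, text_w["q"]) + project(video_tokens, video_w["q"])
    k = project(text_tokens, text_w["k"]) + project(video_tokens, video_w["k"])
    v = project(text_tokens, text_w["v"]) + project(video_tokens, video_w["v"])
    dim = len(q[0])
    out = []
    for qi in q:
        # Every query scores against keys from both modalities.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim)])
    n_text = len(text_tokens)
    return out[:n_text], out[n_text:]
```

Because the softmax runs over the joint sequence, changing a text token changes the video outputs. This is the mechanism by which a language instruction could steer the predicted frames in a design of this kind.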
Paper Context
Classified from the full extracted paper text (83,080 characters). The Paper Guide brief above is the user-facing synthesis; raw context is kept out of the page.