cs.CV · Jun 15, 2026 · classified

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Chenxu Lv, Deqing Li, Gengze Zhou, Hang Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zhixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Xiong-Hui Chen, Chenfei Wu

Paper Guide Brief

Reading Brief

This paper presents Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence that unifies manipulation, driving, navigation, and human-to-robot transfer under a shared natural language action interface, using a 60-layer double-stream MMDiT with frozen Qwen2.5-VL action encoding, the 8.6M-pair Embodied World Knowledge dataset, and a general+expert progressive curriculum.

Central Claim

The paper introduces a unified language-conditioned video world model that achieves cross-embodiment and cross-domain physical generalization through MLLM action encoding, a large-scale action-language mapped dataset, and a staged training curriculum.

Contribution

Why It Matters

It provides a scalable foundation for cross-embodiment world modeling by using natural language as a universal action interface and demonstrating that joint training across diverse domains reinforces physical generalization.

Prerequisites

double-stream MMDiT, MLLM action encoding, flow matching, 3D RoPE, progressive curriculum learning
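Of the prerequisites listed above, flow matching is probably the easiest to make concrete. The sketch below shows how one training pair could be built under a straight-line probability path, with noise at t = 0 and data at t = 1. The linear path, the function names, and the toy latent shape are all assumptions for illustration; the brief only states that a flow matching objective (Lipman et al., 2023) is used, not which path or time sampling the authors chose.

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Build one (x_t, t, target velocity) triple for a clean latent x1.

    Assumes a straight-line path from noise (t=0) to data (t=1);
    this is an illustrative choice, not confirmed by the source.
    """
    x0 = rng.standard_normal(x1.shape)  # Gaussian noise sample
    t = rng.uniform(0.0, 1.0)           # time drawn uniformly from [0, 1)
    x_t = (1.0 - t) * x0 + t * x1       # point on the straight path
    v_target = x1 - x0                  # velocity is constant along this path
    return x_t, t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

rng = np.random.default_rng(0)
x1 = np.ones((4, 8))                    # hypothetical stand-in for a video-VAE latent
x_t, t, v_target = flow_matching_pair(x1, rng)
# A real model would predict v from (x_t, t, language condition);
# here a perfect prediction is plugged in, so the loss is zero.
loss = flow_matching_loss(v_target, v_target)
```

In training, the network's velocity prediction would replace the second `v_target` argument, and the loss would be averaged over a batch of latents and sampled times.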

Atlas Placement

Computer Vision (subfield)

Read If

You care about double-stream MMDiT, MLLM action encoding, flow matching.

Skip If

You only care about EWMBench, DreamGen Bench.

Methods

double-stream MMDiT, MLLM action encoding, flow matching, 3D RoPE, progressive curriculum learning, video diffusion transformer, cross-modal joint attention

Tasks

language-conditioned video generation, embodied world modeling, future frame prediction, video-to-video editing, human-to-robot transfer, multi-view video generation, first-frame conditioning

Datasets

Embodied World Knowledge (EWK), action-language mapping, hierarchical five-layer annotation, task-aware temporal segmentation, multi-view concatenation, paired human-to-robot data

Benchmarks

EWMBench, DreamGen Bench, WorldModelBench, PBench, RoboTwin-IF

Noosaga Placements

Computer Vision (subfield, 95%)
The paper presents a video generation system conditioned on visual observations and language; it uses VAEs, diffusion models, and video tokenization, and addresses multi-view geometric consistency.
"predicts physically grounded future visual trajectories from current observations"; "Double-Stream MMDiT with MLLM Action Encoding"; "video-VAE latents"
Diffusion Models (framework, 90%)
The paper uses a diffusion transformer (MMDiT) trained with the flow matching objective; the denoising process is central to the generation pipeline.
"double-stream diffusion transformer"; "denoising process"; "flow matching objective Lipman et al. (2023)"
Deep Learning (subfield, 95%)
The core technology is a 60-layer double-stream diffusion transformer trained with flow matching; the work extensively uses large language models, transformer architectures, and representation learning.
"60-layer double-stream diffusion transformer"; "MMDiT adopts a double-stream architecture"; "flow matching objective"
Transformer Architecture (framework, 90%)
The core architecture is a 60-layer transformer with double-stream MMDiT blocks, joint attention, and 3D RoPE, all of which are transformer architecture components.
"60-layer double-stream diffusion transformer"; "24 attention heads (head dimension 128), hidden size 3,072"; "3D RoPE Su et al. (2024); Heo et al. (2024)"
Robotics (subfield, 85%)
The model is designed for embodied intelligence tasks including robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, all of which are robotics-adjacent domains.
"embodied intelligence requires agents to perceive, reason, and act"; "robotic manipulation, autonomous driving, indoor navigation, human-to-robot transfer"; "20+ robot embodiments"
Natural Language Processing (subfield, 80%)
Language is used as the unified action interface; the work relies on an MLLM (Qwen2.5-VL) for semantic encoding and instruction parsing, and captions are generated using LLM-based pipelines.
"natural language as a unified action interface"; "frozen Qwen2.5-VL to encode user inputs into condition signals"; "hierarchical annotation framework"
Artificial Intelligence (subfield, 70%)
The paper proposes a foundation model approach for embodied world modeling, combining language, vision, and planning signals into a single system; this fits a broad AI systems perspective.
"language-conditioned video world model"; "unified formulation provides ... language-guided planning signals"; "general world priors and embodied action priors"
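The dimensions quoted under the Transformer Architecture placement (60 layers, 24 attention heads, head dimension 128, hidden size 3,072) are mutually consistent, which a few lines of Python can confirm. The class name `MMDiTConfig` and its field names are hypothetical; only the four numbers come from the page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MMDiTConfig:
    """Hypothetical container for the dimensions quoted in the brief."""
    num_layers: int = 60
    num_heads: int = 24
    head_dim: int = 128
    hidden_size: int = 3072

    def heads_match_hidden(self) -> bool:
        # In standard multi-head attention the hidden size is split
        # evenly across heads: hidden_size = num_heads * head_dim.
        return self.num_heads * self.head_dim == self.hidden_size

config = MMDiTConfig()
consistent = config.heads_match_hidden()  # 24 * 128 = 3072
```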

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation supports three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M-pair video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench, and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
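The abstract's "layer-wise joint attention" between language semantics and video latents can be illustrated with a small NumPy sketch. It follows the common double-stream MMDiT pattern: each modality keeps its own query/key/value projections, the token sequences are concatenated for a single attention pass, and the result is split back into two streams. This is an assumption about the general technique, not the authors' implementation. The function names, weight layout, and toy sizes are hypothetical, and normalization, timestep modulation, 3D RoPE, and output projections are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project(x, weights, num_heads):
    """Map (L, D) tokens to q, k, v tensors of shape (heads, L, head_dim)."""
    length, dim = x.shape
    head_dim = dim // num_heads
    return [
        (x @ weights[name]).reshape(length, num_heads, head_dim).transpose(1, 0, 2)
        for name in ("q", "k", "v")
    ]

def joint_attention(text_tokens, video_tokens, w_text, w_video, num_heads):
    """One joint attention pass over two streams with separate projections."""
    qt, kt, vt = project(text_tokens, w_text, num_heads)
    qv, kv, vv = project(video_tokens, w_video, num_heads)
    # Concatenate along the sequence axis so every token attends to both modalities.
    q = np.concatenate([qt, qv], axis=1)
    k = np.concatenate([kt, kv], axis=1)
    v = np.concatenate([vt, vv], axis=1)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v                             # (heads, Lt + Lv, head_dim)
    out = out.transpose(1, 0, 2).reshape(q.shape[1], -1)  # merge heads: (Lt + Lv, D)
    n_text = text_tokens.shape[0]
    return out[:n_text], out[n_text:]                     # split back into streams

# Toy sizes only; the brief reports 24 heads of dimension 128 (hidden size 3,072).
rng = np.random.default_rng(0)
dim, heads = 16, 4
w_text = {n: rng.standard_normal((dim, dim)) * 0.1 for n in ("q", "k", "v")}
w_video = {n: rng.standard_normal((dim, dim)) * 0.1 for n in ("q", "k", "v")}
text_out, video_out = joint_attention(
    rng.standard_normal((3, dim)), rng.standard_normal((5, dim)),
    w_text, w_video, heads,
)
```

Keeping separate projection weights per stream lets the frozen language features and the video latents live in their own representation spaces while still exchanging information inside every attention layer.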

Paper Context

Source Context: Whole paper

Budget: 100,000 tokens

Coverage: 83,080 chars

Classified from the full extracted paper text (83,080 characters). The Paper Guide brief above is the user-facing synthesis; raw context is kept out of the page.
