iMaC: Translating Actions into Motion and Contact Images for Embodied World Models
Paper Guide Brief
Reading Brief
iMaC presents an embodied world model that translates future robot actions into explicit motion and contact images to guide video generation for robot policy evaluation, achieving strong correlation with real-world success rates on long-horizon manipulation tasks.
Central Claim
A novel action representation method that converts future actions into URDF/FK-rendered motion images and pointcloud-based contact images, injected as spatially explicit video controls into an image-to-video diffusion transformer for world model prediction an...
Contribution
A novel action representation method that converts future actions into URDF/FK-rendered motion images and pointcloud-based contact images, injected as spatially explicit video controls into an image-to-video diffusion transformer for world model prediction and closed-loop policy evaluation.
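The motion-image idea can be illustrated with a toy example. The sketch below is not the paper's renderer: it assumes a hypothetical planar two-link arm in place of a real URDF, and it rasterizes the links into a small binary image instead of rendering a camera view. The names `forward_kinematics`, `render_motion_image`, and `to_pixel` are made up for illustration. It only shows the general pattern of turning each future joint action into a spatially explicit image.

```python
import math

def forward_kinematics(joint_angles, link_lengths):
    """Return 2D positions of the base and each joint of a planar arm."""
    x, y, theta = 0.0, 0.0, 0.0
    points = [(x, y)]
    for angle, length in zip(joint_angles, link_lengths):
        theta += angle  # joint angles accumulate along the chain
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        points.append((x, y))
    return points

def to_pixel(fraction, size):
    """Map a fraction in [0, 1] to a clamped pixel index."""
    return max(0, min(size - 1, int(fraction * size)))

def render_motion_image(joint_angles, link_lengths, size=32, samples_per_link=50):
    """Rasterize the arm into a size x size binary image (list of rows)."""
    reach = sum(link_lengths)
    image = [[0] * size for _ in range(size)]
    points = forward_kinematics(joint_angles, link_lengths)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        for i in range(samples_per_link + 1):
            t = i / samples_per_link
            x = x0 + t * (x1 - x0)
            y = y0 + t * (y1 - y0)
            # Workspace [-reach, reach] maps to the image; row 0 is the top.
            col = to_pixel((x + reach) / (2 * reach), size)
            row = to_pixel((reach - y) / (2 * reach), size)
            image[row][col] = 1
    return image

# One motion image per future action in the chunk (made-up joint angles).
future_actions = [(0.0, 0.0), (math.pi / 4, math.pi / 4), (math.pi / 2, 0.0)]
motion_images = [render_motion_image(a, (1.0, 1.0)) for a in future_actions]
```

In a real pipeline the rasterization step would presumably be replaced by rendering the robot model from the observation camera, so that the control images are pixel-aligned with the frames the video model must predict.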
Why It Matters
iMaC presents an embodied world model that translates future robot actions into explicit motion and contact images to guide video generation for robot policy evaluation, achieving strong correlation with real-world success rates on long-horizon manipulation tasks.
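One common way to check whether world-model scores track real-world success rates is a rank correlation across policy checkpoints. The sketch below computes Spearman's coefficient in plain Python. The choice of Spearman is an assumption (the brief does not say which correlation measure the paper reports), and the scores are invented placeholders, not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Placeholder numbers for three checkpoints of one policy (NOT paper results).
world_model_scores = [0.20, 0.45, 0.70]
real_success_rates = [0.10, 0.35, 0.80]
rho = spearman(world_model_scores, real_success_rates)
```

A rank-based measure fits the stated goal of comparing checkpoints, since it rewards getting the ordering right even when the absolute scores differ from real success rates.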
Prerequisites
action-conditioned video generation, world model, diffusion transformer, forward kinematics rendering, pointcloud distance fields
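For readers new to the last prerequisite, a pointcloud distance field assigns each query point its distance to the nearest point of a cloud. The sketch below is a brute-force illustration, not the paper's contact-image construction. The exponential mapping to a contact intensity and the `sigma` scale are assumptions chosen only to show how a distance can become an image-like signal.

```python
import math

def nearest_distance(query, cloud):
    """Distance from a 3D query point to its nearest neighbor in the cloud."""
    return min(math.dist(query, point) for point in cloud)

def contact_intensity(query, cloud, sigma=0.02):
    """Map distance to (0, 1]: 1.0 at contact, decaying with distance.

    sigma is a hypothetical length scale in meters, not a value from the paper.
    """
    return math.exp(-nearest_distance(query, cloud) / sigma)

# Toy scene: two object points and a gripper tip 2 cm above the first one.
object_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gripper_tip = (0.0, 0.0, 0.02)
intensity = contact_intensity(gripper_tip, object_cloud)
```

With a centimeter-scale `sigma`, the intensity changes sharply over the last few centimeters before touch, which is the regime where manipulation outcomes are decided.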
Atlas Placement
Robot Learning (subfield)
Read If
You care about action-conditioned video generation, world models, or diffusion transformers.

Skip If
You only care about the π0.5 or GigaBrain-0.5 policies themselves.
Noosaga Placements
- The paper presents a learning-based world model that predicts future video conditioned on actions, used for evaluating robot policies. iMaC first uses the robot URDF and forward kinematics to render future robot-observation control videos (i.e., motion images) from future joint actions. The paper focuses on learning fθ, not improving π itself, but the core objective is to make world-model rollouts reliable enough to compare policy checkpoints.
- Learning-Based Manipulation (framework, 90%): The paper's world model is explicitly designed for and evaluated on learning-based manipulation tasks, using policy checkpoints from VLA models. We evaluate two VLA policy families, π0.5 and GigaBrain-0.5, using three checkpoints from each model. iMaC can rank different policies with different checkpoints by performance, with world-model evaluation scores strongly positively correlated with real-world success rates.
- The world model is specifically designed for and evaluated on long-horizon real-world robot manipulation tasks. Experiments on eight challenging long-horizon real-robot manipulation tasks show that iMaC can evaluate the relative performance of different policy checkpoints. This requirement is particularly stringent in manipulation, where a few centimeters can decide whether a gripper touches an object.
- Transformer Architecture (framework, 90%): The world model backbone is a Diffusion Transformer (DiT), a transformer architecture applied to video generation. iMaC builds on a WAN2.2 image-to-video (IT2V) DiT. The DiT predicts the flow only for future tokens.
- Diffusion Models (framework, 90%): The core prediction mechanism is a flow-matching diffusion model, a variant of diffusion models. We sample x0 ∼ N(0, I), τ ∼ U(0, 1), and noise only the future latent. The DiT predicts the flow only for future tokens, with objective...
- The method leverages video generation and depth prediction, which are core computer vision techniques. iMaC builds on a WAN2.2 image-to-video (IT2V) DiT. Beyond RGB prediction, iMaC predicts depth to improve the world model's spatial understanding.
- Model-Based Reinforcement Learning (framework, 80%): iMaC functions as a learned world model for policy evaluation, which is a key component of model-based reinforcement learning frameworks. World models have long been viewed as a foundation for planning and control: an agent can choose actions by predicting their consequences before executing them in the real world. iMaC is used as a world model for evaluating policies on a specified robot platform.
- The paper addresses action representation for control, specifically converting joint actions into spatially explicit image controls. iMaC (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. iMaC translates future actions into URDF/FK-based motion images and two-stream pointcloud-based contact images.
- Reinforcement Learning (framework, 70%): The world model enables policy evaluation and rollout, which are common in reinforcement learning pipelines. A world model predicts the future observation chunk given the current observation and future actions. For policy evaluation, π and fθ form a closed loop: the policy acts on generated observations, and the world model predicts the visual consequences of those actions.
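The flow-matching setup quoted in the Diffusion Models placement above (sample x0 ∼ N(0, I) and τ ∼ U(0, 1), noise only the future latent, predict the flow only for future tokens) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's objective, which the brief elides. The linear path x_τ = (1 − τ)·x0 + τ·x1 with target x1 − x0 is a common rectified-flow convention assumed here, tokens are reduced to scalars, and the function names are hypothetical.

```python
import random

def noise_future_latent(context, future, rng):
    """Noise only the future tokens; context tokens stay clean.

    Assumes a linear interpolation path (a common flow-matching convention,
    not confirmed by the source). Returns (model_input, target_flow, tau).
    """
    tau = rng.random()                          # tau ~ U(0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in future]  # x0 ~ N(0, I)
    noisy_future = [(1 - tau) * n + tau * x for n, x in zip(x0, future)]
    target_flow = [x - n for n, x in zip(x0, future)]
    return context + noisy_future, target_flow, tau

def future_only_loss(predicted_flow, target_flow, num_context):
    """Mean squared error computed on future tokens only."""
    future_pred = predicted_flow[num_context:]
    squared = [(p - t) ** 2 for p, t in zip(future_pred, target_flow)]
    return sum(squared) / len(squared)

rng = random.Random(0)
context_latent = [1.0, 2.0]       # clean tokens for the current observation
future_latent = [3.0, 4.0, 5.0]   # tokens the model must predict
model_input, target, tau = noise_future_latent(context_latent, future_latent, rng)

# A perfect predictor would output the target on future positions.
perfect_prediction = [0.0, 0.0] + target
loss = future_only_loss(perfect_prediction, target, num_context=len(context_latent))
```

Keeping the context tokens clean means the model always sees an uncorrupted current observation, and slicing the prediction before the loss means whatever it outputs on context positions is never penalized.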
Abstract
Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamic modeling for complex physical interactions. To address these limitations, this paper proposes iMaC (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMaC formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intentions, interactive geometric constraints, and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses target-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on image actions to achieve high-fidelity future state prediction and closed-loop embodied control. Extensive experiments are conducted on public embodied manipulation benchmarks and real-world robotic scenarios. The results demonstrate that iMaC outperforms vector-based action control baselines in prediction accuracy, task success rate, and cross-scene generalization ability. Moreover, our image-action design eliminates the reliance on manually defined action spaces, realizing flexible and universal control for heterogeneous embodied agents. This work provides an innovative visual-action perspective for embodied world models, offering a simple yet effective paradigm for scalable robotic perception and manipulation.
Paper Context
Classified from the full extracted paper text (49,338 characters). The Paper Guide brief above is the user-facing synthesis; raw context is kept out of the page.