cs.RO | Jun 8, 2026 | classified

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang, Zhixuan Liang, Wenjie Xu, Yinan Mao, Weinan Zhang, Xiaokang Yang, Ru Ying, Ran Zheng, Yao Mu

cs.RO, cs.AI, cs.CV

Paper Guide Brief

Reading Brief

AHA-WAM proposes an asynchronous world-action model that decouples a low-frequency video Diffusion Transformer (world planner) from a high-frequency action DiT (executor), using Observation-Guided Video-Context Routing and horizon-adaptive offset training to achieve efficient closed-loop robot manipulation without robot-data pretraining.

Central Claim

A novel asynchronous dual-DiT architecture for world-action modeling that separates slow video-based planning from fast action execution, supported by OVCR and horizon-adaptive offset training to maintain context alignment and efficiency.

Contribution

Why It Matters

By decoupling world prediction and action execution into different temporal resolutions, AHA-WAM improves closed-loop control frequency (up to 24.17 Hz) and manipulation success (92.80% average) without requiring robot-data pretraining.

Prerequisites

dual Diffusion Transformer, asynchronous world-action model, Observation-Guided Video-Context Routing, horizon-adaptive offset training, rolling KV memory

Atlas Placement

Robot Manipulation (subfield)

Read If

You care about dual Diffusion Transformer, asynchronous world-action model, Observation-Guided Video-Context Routing.

Skip If

You only care about RoboTwin 2.0, real-world manipulation.

Methods

dual Diffusion Transformer, asynchronous world-action model, Observation-Guided Video-Context Routing, horizon-adaptive offset training, rolling KV memory, ODE distillation, flow matching

Tasks

bimanual manipulation, multi-task robot manipulation, deformable object manipulation, long-horizon organization, fine-grained tool use

Datasets

RoboTwin 2.0, RoboCOIN

Benchmarks

RoboTwin 2.0, real-world manipulation

Noosaga Placements

Robot Manipulation | subfield | 95%
The paper focuses on robot manipulation tasks, including bimanual manipulation, deformable objects, and tool use, evaluated on RoboTwin 2.0 and real-world tasks.
Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining
We evaluate AHA-WAM in both RoboTwin 2.0 simulation and real-world experiment.
Diffusion Models | framework | 90%
The core generative models are Diffusion Transformers (DiTs) used for both video planning and action generation.
AHA-WAM is built on a dual Diffusion Transformer (DiT) architecture
AHA-WAM instantiates the video DiT as a low-frequency world planner
Robot Learning | subfield | 90%
The paper proposes a new learning architecture (dual-DiT) for world-action modeling, trained with flow matching and distillation, aimed at improving policy learning for manipulation.
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning.
We propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture
Transformer Architecture | framework | 80%
The model is built from Transformer layers throughout, including layerwise joint attention and rolling KV memory.
AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory
layerwise joint attention
Learning-Based Manipulation | framework | 80%
The paper focuses on robotic manipulation tasks and uses learning-based manipulation techniques, though it is method-agnostic within that framework.
We propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model for robot manipulation
learning how actions co-evolve with visual scene dynamics to inject physical priors into control
Computer Vision | subfield | 70%
The model uses video generation (video DiT) as a world model and involves visual observation encoding and context routing, which are core computer vision techniques.
AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations
Visual observations are encoded by the pretrained VAE
Deep Learning | subfield | 70%
The architecture relies on Diffusion Transformers, flow matching, and ODE distillation, which are deep learning methods.
AHA-WAM is built on a dual Diffusion Transformer (DiT) architecture
Training uses flow matching for both world modeling and action prediction
Imitation Learning | framework | 70%
The paper builds upon the world-action model paradigm, which is a form of learning from demonstration/behavioral cloning with world modeling, though not explicitly framed as imitation learning.
World-action models have emerged as a promising paradigm for robot manipulation
AHA-WAM builds on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling
Robot Control | subfield | 60%
The paper addresses closed-loop control frequency and latency, which are control-centric concerns, though the method is a learning approach.
AHA-WAM reaches up to 56.9 Hz closed-loop control frequency
AHA-WAM reorganizes WAM inference into an asynchronous world-action coupling framework
Learning from Demonstration | framework | 60%
The model is trained on demonstration data (behavioral cloning) and uses world modeling to augment imitation learning, aligning with learning from demonstration.
collect approximately 120 episodes on average
AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success

Abstract

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.
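
The asynchronous coupling described in the abstract can be illustrated with a toy scheduling loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `SlowPlanner`, `FastExecutor`, `run_episode`, `replan_every`, and `memory_size` are hypothetical names, the "context" is just a snapshot of a rolling buffer rather than layerwise latents, and `offset` here simply counts executor steps since the last planner refresh, which may differ from how the paper defines its horizon-adaptive offset.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SlowPlanner:
    """Hypothetical stand-in for the low-frequency video DiT (world planner)."""
    memory_size: int = 4
    memory: deque = field(default_factory=deque)
    calls: int = 0

    def refresh(self, observation):
        # Rolling memory: keep only the most recent `memory_size` observations.
        self.memory.append(observation)
        while len(self.memory) > self.memory_size:
            self.memory.popleft()
        self.calls += 1
        # Assumption: a real planner would expose layerwise latent context;
        # here the context is only a snapshot of the rolling buffer.
        return tuple(self.memory)


@dataclass
class FastExecutor:
    """Hypothetical stand-in for the high-frequency action DiT (executor)."""
    chunk_len: int = 2
    calls: int = 0

    def act(self, observation, context, offset):
        # Conditions on the live observation, the cached context, and the
        # number of steps elapsed since that context was computed.
        self.calls += 1
        return [(observation, offset, len(context))] * self.chunk_len


def run_episode(num_steps, replan_every, planner, executor):
    """Run the executor every step; rerun the planner every `replan_every` steps."""
    context = None
    log = []
    for step in range(num_steps):
        observation = step  # toy observation: just the step index
        offset = step % replan_every
        if offset == 0:
            # Slow path: refresh the world context from the newest observation.
            context = planner.refresh(observation)
        # Fast path: reuse the cached context without rerunning the planner.
        chunk = executor.act(observation, context, offset)
        log.append((step, offset, chunk))
    return log


planner = SlowPlanner(memory_size=2)
executor = FastExecutor()
log = run_episode(num_steps=8, replan_every=4, planner=planner, executor=executor)
print(planner.calls, executor.calls)
```

In this toy run the planner is invoked far less often than the executor, which is the source of the efficiency argument: the expensive video branch is amortized over several cheap action steps, while each action step still sees the current observation.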

Paper Context

Source Context: Whole paper

Budget: 100,000 tokens

Coverage: 63,938 chars

Classified from the full extracted paper text (63,938 characters). The Paper Guide brief above is the user-facing synthesis; raw context is kept out of the page.

Full-paper context sent 63,938 of 63,938 extracted characters to classification.