cs.CV | Jun 11, 2026 | classified

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Junke Wang, Qihang Zhang, Shuai Yang, Yiming Luo, Yujun Shen, Zuxuan Wu, Yu-Gang Jiang, Yinghao Xu

Paper Guide Brief

Reading Brief

RepWAM introduces a representation-centric world action model that uses semantic visual-action tokenizers, aligning visual latents with a frozen visual foundation model and learning latent actions as transitions between visual states, to improve instruction-following dynamics and robot control.

Central Claim

The paper proposes a representation visual-action tokenizer (RepViTok) that produces semantically aligned visual tokens and latent action tokens, together with a causal world action model trained with flow matching on these tokens to jointly predict future visual states and latent actions.
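To make the flow-matching part of the claim concrete, here is a minimal sketch of a flow-matching training target on a single token vector. It assumes a straight-line path from noise to data, which is one common choice; this brief does not state which path the paper uses, and teacher forcing over chunks is not modeled here. The names `flow_matching_pair`, `flow_matching_loss`, and `predict_velocity` are hypothetical and chosen only for illustration.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return (x_t, target_velocity) on a straight line from noise x0 to data x1.

    Assumption: linear interpolation path x_t = (1 - t) * x0 + t * x1,
    whose time derivative is the constant velocity x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity is a hypothetical stand-in for the model: it takes
    (x_t, t) and returns a list with the same length as x_t.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t, target = flow_matching_pair(x0, x1, t)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)


# Usage: a dummy "model" that always predicts zero velocity.
rng = random.Random(0)
loss = flow_matching_loss(lambda x_t, t: [0.0] * len(x_t), [1.0, -1.0], rng)
```

In a real system the dummy predictor would be replaced by the transformer, and the loss would be averaged over many tokens, times, and noise samples.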

Contribution

Why It Matters

By aligning visual latents with a frozen visual foundation model and learning latent actions as transitions within that semantic space, RepWAM overcomes the limitations of reconstruction-oriented tokenizers and decoupled action spaces, lea...

Prerequisites

representation visual-action tokenizer, latent action tokenizer, causal diffusion transformer, flow matching, visual foundation model alignment
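For readers new to the causal diffusion transformer prerequisite, the following sketch builds a block-causal attention mask over chunks of tokens. It is an illustration under stated assumptions, not the paper's implementation: the function name `block_causal_mask` is hypothetical, and letting a chunk attend to its own tokens as well as to earlier chunks is an assumption made here.

```python
def block_causal_mask(num_chunks, tokens_per_chunk):
    """Build a boolean mask where mask[i][j] is True if query token i
    may attend to key token j.

    Assumption: a token sees every token in its own chunk and in all
    earlier chunks, and never sees tokens from future chunks.
    """
    n = num_chunks * tokens_per_chunk
    mask = []
    for i in range(n):
        query_chunk = i // tokens_per_chunk
        row = [(j // tokens_per_chunk) <= query_chunk for j in range(n)]
        mask.append(row)
    return mask


# Usage: two chunks of two tokens each.
mask = block_causal_mask(2, 2)
```

With two chunks of two tokens, the first two rows allow only the first chunk, while the last two rows allow all four tokens.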

Atlas Placement

Computer Vision (subfield)

Read If

You care about representation visual-action tokenizers, latent action tokenizers, or causal diffusion transformers.

Skip If

You only care about RoboTwin 2.0 or gFVD.

Methods

representation visual-action tokenizer, latent action tokenizer, causal diffusion transformer, flow matching, visual foundation model alignment, inverse dynamics model, forward dynamics model

Tasks

world action modeling, robot manipulation, closed-loop control, instruction following, future prediction

Datasets

AgiBot, RoboMIND, RoboCOIN, InternA1, RoboTwin 2.0, ImageNet, UCF101

Benchmarks

RoboTwin 2.0, gFVD, PSNR, SSIM, open loop score (OLS), success rate

Noosaga Placements

Computer Vision | subfield | 90%
The paper focuses on visual tokenization, video representation learning, and visual foundation model alignment, which are core to computer vision. The primary arXiv category is cs.CV.
arXiv:2606.13674v1 [cs.CV]; "we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens"; "aligning the latent space of a video autoencoder with a frozen visual foundation model"
Representation Learning | framework | 90%
The paper explicitly uses representation learning to align visual latents with a frozen visual foundation model and to learn latent actions as transitions between visual states.
"we explore a semantic visual-action latent space for representation-centric world action modeling"; "aligning the latent space of a video autoencoder with a frozen visual foundation model"; "representation visual-action tokenizer"
Robotics | subfield | 90%
The paper is explicitly about world action models for robot control, with experiments on real-world manipulation tasks and simulation benchmarks for robot manipulation.
"a representation-centric world action model (WAM) built on representation visual-action tokenizers"; "followed by adaptation to real robot trajectories for closed-loop manipulation"; "Experiments on real-world manipulation tasks and simulation benchmarks"
Transformer Architecture | framework | 90%
The paper uses transformer architectures for the visual tokenizer (ViT) and the causal world action model (causal diffusion transformer).
"The visual tokenizer is a vision transformer (ViT) autoencoder"; "the world model expert is a causal diffusion transformer with 30 layers"; "A block-causal mask lets each chunk attend to s<t but not to future chunks"
Deep Learning | subfield | 80%
The method relies on vision transformers, diffusion transformers, flow matching, and representation learning, all of which are core deep learning techniques.
"The visual tokenizer is a vision transformer (ViT) autoencoder"; "We train the transformer with teacher forcing under a conditional flow-matching objective"; "causal diffusion transformer"
Diffusion Models | framework | 80%
The world action model is trained with a flow-matching objective, a generative training objective closely related to diffusion models. The paper also compares with WAN-pretrained pipelines that use diffusion models.
"We train the transformer with teacher forcing under a conditional flow-matching objective"; "We cast world action modeling as causal generation over visual-action chunks"; "the remaining performance gap to Lingbot-VA mainly comes from its use of WAN video-generation pretraining"
Robot Manipulation | subfield | 80%
The paper evaluates on robot manipulation tasks such as picking, pushing drawers, and inserting tubes, and uses robot manipulation datasets.
"Pick the fruits and put them into the plate"; "Push the drawer and put the building block into it"; "Insert the test tube into the test tube rack"
Robot Learning | framework | 80%
The paper presents a method for robot learning, specifically learning world action models and adapting them to robot control.
"we pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories"; "RepWAM delivers strong performance across diverse manipulation settings"
Robot Learning | subfield | 70%
The paper involves learning world action models and adapting them to robot control, which falls under robot learning.
"we pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories"; "RepWAM delivers strong performance across diverse manipulation settings"
Vision Transformers and Foundation Models | framework | 70%
The paper aligns its visual latents with a frozen visual foundation model (Perception Encoder), which acts as the teacher.
"aligning the latent space of a video autoencoder with a frozen visual foundation model"; "Let G denote this teacher model and let W_align be a linear projection layer that matches the teacher dimension"
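The last evidence quote describes a teacher model G and a linear projection that matches the teacher dimension. A rough sketch of how such an alignment term could be computed for one token is shown below. The cosine-distance form of the loss is an assumption, since this brief does not say which distance the paper uses, and the names `project`, `cosine_alignment_loss`, and `w_align` are hypothetical.

```python
import math


def project(w_align, latent):
    """Apply a linear projection: one output value per row of w_align."""
    return [sum(w * z for w, z in zip(row, latent)) for row in w_align]


def cosine_alignment_loss(w_align, latent, teacher_feature):
    """Return 1 - cosine similarity between the projected latent and the
    frozen teacher feature (assumed loss form, not confirmed by the source).
    """
    projected = project(w_align, latent)
    dot = sum(p * g for p, g in zip(projected, teacher_feature))
    norm_p = math.sqrt(sum(p * p for p in projected))
    norm_g = math.sqrt(sum(g * g for g in teacher_feature))
    # Small epsilon avoids division by zero for all-zero vectors.
    return 1.0 - dot / (norm_p * norm_g + 1e-8)


# Usage: a 2-dimensional latent projected to a 3-dimensional teacher space.
w_align = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latent = [1.0, 2.0]
aligned = cosine_alignment_loss(w_align, latent, [2.0, 4.0, 6.0])
opposed = cosine_alignment_loss(w_align, latent, [-1.0, -2.0, -3.0])
```

The loss is close to 0 when the projected latent points in the same direction as the teacher feature and close to 2 when it points in the opposite direction. Only the tokenizer and the projection would receive gradients, since the teacher is frozen.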

Abstract

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.

Paper Context

Source Context: Whole paper

Budget: 100,000 tokens

Coverage: 42,060 chars

Classified from the full extracted paper text (42,060 characters). The Paper Guide brief above is the user-facing synthesis; raw context is kept out of the page.

Full-paper context sent 42,060 of 42,060 extracted characters to classification.