cs.RO · Jun 15, 2026 · classified

T-Rex: Tactile-Reactive Dexterous Manipulation

Dantong Niu, Zhuoyang Liu, Zekai Wang, Boning Shao, Zhao-Heng Yin, Anirudh Pai, Yuvan Sharma, Stefano Saravalle, Ruijie Zheng, Jing Wang, Ryan Punamiya, Mengda Xu, Yuqi Xie, Yunfan Jiang, Letian Fu, Konstantinos Kallidromitis, Matteo Gioia, Junyi Zhang, Jiaxin Ge, Haiwen Feng, Fabio Galasso, Wei Zhan, David M. Chan, Yutong Bai, Roei Herzig, Jiahui Lei, Fei-Fei Li, Ken Goldberg, Jitendra Malik, Pieter Abbeel, Yuke Zhu, Danfei Xu, Jim Fan, Trevor Darrell

Paper Guide Brief

Reading Brief

T-Rex introduces a tactile-reactive dexterous manipulation framework combining a large-scale 100-hour tactile-synchronized robot dataset with a variable-rate Mixture-of-Transformers (MoT) architecture that decouples low-frequency visuomotor action planning from high-frequency tactile refinement. A spatial-temporal VQ-VAE encoder compresses tactile force history and deformation maps into compact tokens, enabling asynchronous closed-loop control. On 12 real-world contact-rich tasks, T-Rex achieves over 30% higher average success rate than strong baselines.
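The decoupling of low-frequency action planning from high-frequency tactile refinement can be pictured as two loops sharing one clock. The following is a minimal sketch only, not the paper's implementation: it runs both experts synchronously on a single tick counter, whereas the brief describes asynchronous control, and every name in it (`run_variable_rate_loop`, `plan_chunk`, `refine`, `read_tactile`, `ticks_per_plan`) and every toy number is a hypothetical stand-in.

```python
from typing import Callable, List

Vector = List[float]


def run_variable_rate_loop(
    plan_chunk: Callable[[int], List[Vector]],
    refine: Callable[[Vector, Vector], Vector],
    read_tactile: Callable[[int], Vector],
    total_ticks: int,
    ticks_per_plan: int,
) -> List[Vector]:
    """Call a slow planner every `ticks_per_plan` ticks and a fast refiner every tick."""
    if ticks_per_plan <= 0:
        raise ValueError("ticks_per_plan must be positive")
    executed: List[Vector] = []
    chunk: List[Vector] = []
    for tick in range(total_ticks):
        offset = tick % ticks_per_plan
        if offset == 0:
            # Low-rate step: ask the (hypothetical) action expert for a new chunk.
            chunk = plan_chunk(tick)
            if len(chunk) < ticks_per_plan:
                raise ValueError("chunk is shorter than the planning interval")
        # High-rate step: adjust the planned action with the latest tactile reading.
        executed.append(refine(chunk[offset], read_tactile(tick)))
    return executed


# Toy stand-ins (assumptions for illustration, not values from the paper).
plan_calls: List[int] = []


def toy_plan_chunk(tick: int) -> List[Vector]:
    plan_calls.append(tick)
    return [[0.1 * (tick + k)] for k in range(3)]


def toy_read_tactile(tick: int) -> Vector:
    # Pretend contact force appears from tick 4 onward.
    return [1.0 if tick >= 4 else 0.0]


def toy_refine(base: Vector, tactile: Vector) -> Vector:
    # Residual correction: back off in proportion to the sensed force.
    return [b - 0.05 * f for b, f in zip(base, tactile)]


trajectory = run_variable_rate_loop(
    toy_plan_chunk, toy_refine, toy_read_tactile, total_ticks=6, ticks_per_plan=3
)
```

In this toy run the planner is invoked twice while the refiner is invoked six times, which is the rate asymmetry the brief describes. A real system would run the two experts in separate threads or processes, so this single-loop version only illustrates the scheduling idea.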

Central Claim

A complete system including a large-scale tactile-motor dataset, a MoT architecture with asynchronous cascaded flow matching, a spatial-temporal tactile VQ-VAE encoder, and a three-stage training recipe (human egocentric pre-training, tactile-grounded robot m...

Contribution

Why It Matters

If this contribution is true, it provides the first unified foundation model for dexterous manipulation that effectively integrates high-frequency tactile feedback into a VLA-style architecture, achieving significant improvements in contac...

Prerequisites

Mixture-of-Transformers, cascaded flow matching, asynchronous refinement, VQ-VAE, spatial-temporal tactile encoding

Atlas Placement

Robot Manipulation (subfield)

Read If

You care about Mixture-of-Transformers, cascaded flow matching, asynchronous refinement.

Skip If

You only care about 12 tactile-reactive manipulation tasks, real-world robot benchmark.

Methods

Mixture-of-Transformers, cascaded flow matching, asynchronous refinement, VQ-VAE, spatial-temporal tactile encoding, three-stage training, egocentric pre-training

Tasks

tactile-reactive dexterous manipulation, contact-rich manipulation, force-sensitive tasks, deformable object manipulation, bimanual coordination

Datasets

T-Rex Dataset, egocentric human video, teleoperation data, verb-noun combinations, motor primitives, tactile-synchronized

Benchmarks

12 tactile-reactive manipulation tasks, real-world robot benchmark, success rate evaluation

Noosaga Placements

Robot Manipulation (subfield, 95%)
The paper directly addresses dexterous manipulation, presenting a dataset, model, and experiments focused on real-world contact-rich manipulation tasks with dual dexterous hands.
"T-Rex is a tactile-reactive dexterous manipulation framework"; "We propose a large-scale, 100-hour tactile-rich dataset collected via a novel, data-efficient recipe"; "introduce a variable-rate Mixture-of-Transformers (MoT) architecture equipped with a novel temporal tactile VQ-VAE encoder"

Learning-Based Manipulation (framework, 90%)
The paper builds a learning-based manipulation policy using imitation learning (flow matching) with tactile feedback, fitting under the Learning-Based Manipulation framework.
"T-Rex is a tactile-reactive dexterous manipulation framework"; "action generation is formulated as conditional flow matching"; "three-stage recipe that progressively transfers large-scale human visuomotor priors into tactile-reactive dexterous robot control"

Robot Learning (subfield, 90%)
The paper develops a learning-based policy using imitation learning (flow matching) with a three-stage training recipe (pre-training, mid-training, post-training) on large-scale datasets, which is central to robot learning.
"T-Rex is trained with a three-stage recipe that progressively transfers large-scale human visuomotor priors into tactile-reactive dexterous robot control"; "Following standard flow-based robot policies, action generation is formulated as conditional flow matching"; "Large-scale Human Egocentric Pre-training... Tactile Grounded Robot Mid-training... Skill-Specific Post-training"

Learning-Based Robot Control (framework, 85%)
The MoT architecture with high-frequency tactile refinement and cascaded denoising is a form of learning-based control, directly fitting Learning-Based Robot Control.
"variable-rate MoT architecture that disentangles control into a low-rate action expert for baseline dexterous manipulation and a high-rate tactile expert for rapid residual refinements"; "asynchronous tactile-reactive cascaded flow matching"

Transformer Architecture (framework, 80%)
The backbone architecture is a Mixture-of-Transformers, a direct application of the transformer architecture extended with multiple experts.
"Mixture-of-Transformers (MoT) backbone"; "transformer experts"

Robot Control (subfield, 70%)
The asynchronous cascaded denoising with high-frequency refinement directly addresses low-level control, enabling fast closed-loop responses to tactile signals.
"high-frequency tactile refinement and employs a spatial-temporal tactile encoder"; "asynchronous tactile-reactive cascaded flow matching that enables the model to respond dynamically to real-time tactile feedback"

Imitation Learning (framework, 75%)
The policy is trained via behavioral cloning (imitation learning) using flow matching on expert demonstrations, fitting Imitation Learning.
"T-Rex policy πθ receives RGB observations... predicts a future action chunk"; "Following standard flow-based robot policies, action generation is formulated as conditional flow matching"

Deep Learning (subfield, 60%)
The paper uses a MoT architecture with transformer experts and a VQ-VAE encoder, which are deep learning techniques, but the focus is on the robotic application.
"variable-rate Mixture-of-Transformers (MoT) architecture"; "spatial-temporal VQ-VAE encoder"

Deep Reinforcement Learning (framework, 50%)
The paper mentions reinforcement learning only as a future direction and does not use RL; it is positioned as an imitation-based alternative to RL.
"future work could integrate reinforcement learning or online interaction-based refinement"
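Several of the placements above quote the statement that action generation is formulated as conditional flow matching. This page does not reproduce the paper's equations, so the following is only a commonly used form of that objective, written under assumed notation: $A$ is a demonstrated action chunk, $o$ the observation, $\epsilon$ Gaussian noise, $\tau$ the flow time, and $v_\theta$ the learned velocity field. The linear noise-to-data path is likewise an assumption and may differ from the paper's convention.

```latex
\begin{aligned}
A^{\tau} &= (1-\tau)\,\epsilon + \tau\,A,
  \qquad \epsilon \sim \mathcal{N}(0, I),\quad \tau \sim \mathcal{U}[0,1],\\
\mathcal{L}(\theta) &= \mathbb{E}_{\tau,\,\epsilon,\,(o,A)}
  \bigl\| v_\theta(A^{\tau}, \tau, o) - (A - \epsilon) \bigr\|^{2}.
\end{aligned}
```

The regression target $A - \epsilon$ is the time derivative of $A^{\tau}$ along this path, so at inference an action chunk can be produced by integrating $v_\theta$ from $\tau = 0$ (pure noise) to $\tau = 1$.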

Abstract

The ability to react dynamically to tactile signals has long been considered crucial to agile human-level dexterity. Yet contemporary learning-based Vision-Language-Action (VLA) models for robotic manipulation generally either overlook the tactile modality or are limited to encoders with static cues, due in part to the scarcity of diverse training data and standardized evaluation, architectural constraints in current VLA models, and limitations of static tactile encoders. In this paper, we push the frontier of tactile-reactive manipulation by addressing all of these limitations. We propose a large-scale, 100-hour tactile-rich dataset collected via a novel, data-efficient recipe that prioritizes elementary motor primitives. To effectively exploit naturally high-frequency touch signals without sacrificing the capabilities of existing VLAs, we introduce a variable-rate Mixture-of-Transformers (MoT) architecture equipped with a novel temporal tactile VQ-VAE encoder. We demonstrate the effectiveness of tactile-reactive policies on 12 manipulation tasks requiring delicate force control and deformable object manipulation, achieving over 30% higher average success rate than the strongest baseline.
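The VQ-VAE encoder mentioned in the abstract turns continuous tactile features into discrete tokens. The sketch below shows only the generic quantization step of a VQ-VAE, a nearest-neighbor lookup in a codebook, and assumes nothing about the paper's encoder, codebook size, or training losses. The names `quantize`, `toy_codebook`, and `toy_latents` and all values are hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each latent vector to the index and value of its nearest codebook entry."""
    if not codebook:
        raise ValueError("codebook must contain at least one entry")
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        # The token is the index of the closest code vector.
        idx = min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Toy 2-D example (illustrative values only).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_latents = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]
tokens, reconstructed = quantize(toy_latents, toy_codebook)
```

Here each toy latent snaps to a different code vector, so `tokens` is a short integer sequence that a downstream transformer could consume. Training a real VQ-VAE additionally needs codebook and commitment losses plus a gradient estimator through the lookup, none of which is shown.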

Paper Context

Source Context: Whole paper

Budget: 100,000 tokens

Coverage: 91,701 chars

Classified from the full extracted paper text (91,701 characters). The Paper Guide brief above is the user-facing synthesis; raw context is kept out of the page.
