Research Radar | cs.RO | Jun 16, 2026 | classified

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

Ning Gao, Jinliang Zheng, Xing Gao, Haoxiang Ma, Hanqing Wang, Yukai Wang, Jiantong Chen, Zanxin Chen, Shujie Zhang, Mingda Jia, Xuekun Jiang, Zihou Zhu, Xinyu Li, Shuai Wang, Hao Li, Wenzhe Cai, Yuqiang Yang, Xudong Xu, Zhaoyang Lyu, Yao Mu, Tai Wang, Jiangmiao Pang, Jia Zeng, Weinan Zhang, Chunhua Shen
cs.RO

Paper Guide Brief

Reading Brief

EBench is a simulation benchmark for diagnosing generalist mobile manipulation policies across 26 tasks annotated with 5 capability and 4 generalization dimensions, revealing that models with similar overall success rates have strikingly different capability profiles and generalization behaviors.

Central Claim

Generalist mobile manipulation policies with similar overall success rates can have strikingly different capability profiles and generalization behaviors, so diagnosing them requires multi-dimensional profiling rather than a single success-rate scalar.

Contribution

EBench is a simulation benchmark for generalist mobile manipulation policies that provides multi-dimensional diagnostic profiling beyond a single success-rate scalar. It covers long-horizon, dexterous, and mobile tasks with controlled generalization dimensions.
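To make "profiling beyond a single success-rate scalar" concrete, here is a minimal sketch of how per-dimension success rates could be aggregated from task-level outcomes. The function names, the task names, and the dimension labels ("mobile", "dexterous") are hypothetical, and the numbers are made up for illustration. They are not taken from the paper, and this is not EBench's actual scoring code.

```python
from collections import defaultdict

def overall_success(results):
    """Single scalar: successful episodes divided by all episodes."""
    total = sum(len(outcomes) for outcomes in results.values())
    return sum(sum(outcomes) for outcomes in results.values()) / total

def capability_profile(results, task_dims):
    """Per-dimension success rate.

    results:   {task name: list of 0/1 episode outcomes}
    task_dims: {task name: set of dimension labels annotated on that task}
    A task contributes its episodes to every dimension it is annotated with.
    """
    successes = defaultdict(int)
    episodes = defaultdict(int)
    for task, outcomes in results.items():
        for dim in task_dims.get(task, ()):
            successes[dim] += sum(outcomes)
            episodes[dim] += len(outcomes)
    return {dim: successes[dim] / episodes[dim] for dim in episodes}

# Toy, made-up data: two hypothetical policies on two hypothetical tasks.
task_dims = {"pick": {"mobile"}, "insert": {"dexterous"}}
policy_a = {"pick": [1, 1, 1, 0], "insert": [0, 0, 0, 1]}
policy_b = {"pick": [0, 1, 0, 1], "insert": [1, 0, 1, 0]}

for name, res in (("A", policy_a), ("B", policy_b)):
    print(name, overall_success(res), capability_profile(res, task_dims))
```

In this toy data both policies share the same overall success rate, yet their per-dimension rates differ, which is the kind of gap a single scalar hides.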

Why It Matters

EBench is the first benchmark to simultaneously cover long-horizon, dexterous, and mobile manipulation in a single evaluation protocol with structured capability and generalization axes, enabling fine-grained diagnosis of policy strengths...

Prerequisites

simulation benchmark, capability profiling, generalization diagnosis, permutation test, mobile manipulation

Atlas Placement

Robot Manipulation (subfield)

Read If

You care about simulation benchmarks, capability profiling, or generalization diagnosis.

Skip If

You only care about EBench, LIBERO.

Methods
simulation benchmark, capability profiling, generalization diagnosis, permutation test
Tasks
mobile manipulation, dexterous manipulation, long-horizon manipulation, generalization evaluation
Datasets
EBench dataset, teleoperation, motion planning
Benchmarks
EBench, LIBERO, RoboTwin 2.0
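The method tags above include a permutation test, but this page does not say how the paper applies it. As a rough illustration only, a generic two-sample permutation test on per-episode success outcomes could look like the sketch below. The function name, the choice of statistic (absolute difference in success rate), and the defaults are assumptions, not the authors' procedure.

```python
import random

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in mean success.

    a, b: lists of 0/1 episode outcomes for two policies (assumed format).
    Returns an estimated p-value, with the usual +1 correction so that
    the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    n_a, n_b = len(a), len(b)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the pooled outcomes, then split them into two pseudo-groups.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A small p-value suggests the gap between two policies is unlikely to come from random episode-to-episode variation alone. With few episodes per task, even a visible gap in success rate may not clear that bar.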

Noosaga Placements

  • The benchmark evaluates generalist manipulation policies on mobile pick-and-place, long-horizon multi-stage, and dexterous-and-precise tasks, directly targeting robot manipulation capabilities.
    "EBench comprises 26 diverse and challenging manipulation tasks"; "three families: Mobile Pick-and-Place, Mobile Long-Horizon, and Table-Top Dexterous-and-Precise"
  • Learning-Based Manipulation (framework, 90%)
    The benchmark evaluates learning-based manipulation policies (VLAs) that are trained and evaluated on manipulation tasks in simulation.
    "We evaluate state-of-the-art generalist manipulation models including π0, π0.5, XVLA, and InternVLA-A1"; "All models are fine-tuned from pretrained checkpoints on the same EBench training data"
  • Robotics (subfield, 90%)
    The paper evaluates vision-language-action models (VLAs) for generalist robotics, situating the benchmark within the broader robotics AI field.
    "generalist mobile manipulation policies"; "Vision–language–action models"
  • Deep Reinforcement Learning (framework, 60%)
    The paper discusses pretraining and fine-tuning of VLA models, which can involve deep reinforcement learning principles, but the benchmark itself does not directly use RL methods.
    "post-training dataset"; "200K gradient steps, batch size 128, AdamW optimizer"
  • Robot Learning (subfield, 80%)
    The benchmark is used to evaluate robot learning models (VLAs) post-trained on demonstration data, and analyzes generalization and pretraining effects.
    "post-training dataset contains 91.4 hours demonstrations"; "pretraining ablation across EBench, LIBERO, and RoboTwin 2.0"
  • Mobile Robotics (subfield, 70%)
    Many tasks involve mobile manipulation with a dual-arm robot on a mobile base, requiring base motion coordination.
    "mobile pick-and-place tasks"; "mobile manipulation is hard to teleoperate since a single operator has to coordinate base motion and arm motion"

Abstract

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including $π_0$, $π_{0.5}$, XVLA, and InternVLA-A1, and reveal that models with similar success rates exhibit strikingly different capability profiles: $π_{0.5}$ achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a disjoint set of atomic skills compared to other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal the strengths and weaknesses of models that lie behind an overall score. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
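The abstract reports both a test success rate and a "train-test retention" without defining the latter here. One plausible reading, assumed purely for illustration, is the ratio of test success to train success. Under that assumption the two quantities can rank policies differently, as the sketch below shows with made-up numbers and hypothetical policy names.

```python
def retention(train_success, test_success):
    """Assumed definition: fraction of train-split success kept on the test split."""
    if train_success <= 0:
        raise ValueError("train_success must be positive")
    return test_success / train_success

# Made-up (train, test) success rates, not results from the paper.
policies = {"policy_x": (0.80, 0.60), "policy_y": (0.50, 0.45)}

best_test = max(policies, key=lambda name: policies[name][1])
best_retention = max(policies, key=lambda name: retention(*policies[name]))
print(best_test, best_retention)
```

Here the hypothetical policy_x has the higher test success while policy_y retains more of its train performance, so the two metrics need not pick the same policy.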
