VERM

Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation

IEEE Robotics and Automation Letters (RA-L) 2025


TL;DR:

Multi-camera setups for 3D robotic manipulation create redundancy and increase computational costs. VERM uses foundation models (GPT-4o) to automatically select a task-adaptive virtual camera view from 3D point clouds, capturing key information while reducing occlusion. VERM outperforms state-of-the-art methods with 1.89× faster training and 1.54× faster inference.

Abstract

When performing 3D manipulation tasks, robots must plan actions based on perception from multiple fixed cameras. The multi-camera setup introduces redundant and irrelevant information, which increases computational cost and forces the model to spend extra training time extracting the crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose VERM (Virtual Eye for Robotic Manipulation), which leverages the knowledge in foundation models to imagine a virtual task-adaptive view from the constructed 3D point cloud, efficiently capturing the necessary information while mitigating occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experiments on the RLBench simulation benchmark and in real-world evaluations demonstrate the effectiveness of our method: it surpasses previous state-of-the-art methods while achieving a 1.89× speedup in training and a 1.54× speedup in inference.

Method

The proposed VERM method leverages GPT-4o to find task-adaptive camera poses from 3D point clouds, and incorporates a depth-aware module and a dynamic coarse-to-fine procedure for precise 3D manipulation.

Camera Pose Selection

We develop a prompt-based paradigm for camera pose selection with GPT-4o, as illustrated in Figure 1. The prompt consists of four parts: an environment description, a task description, in-context examples, and rules. The environment description gives an overview of the fixed camera poses in the scene through a visual representation (the Set-of-Mark, or SoM, technique). The task description defines the target camera pose using two parameters: elev (elevation angle) and azim (azimuth angle). In-context examples and explicit rules are included to constrain and refine GPT-4o's output. The textual prompt is combined with the visual prompts to query GPT-4o for the desired camera pose.


Figure 1: The prompt-based paradigm for querying virtual camera poses using GPT-4o.
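To make the query concrete, the following is a minimal sketch of how such a prompt could be assembled and sent, assuming the OpenAI Python SDK and a SoM-annotated scene image. The prompt wording, the in-context example, and the query_camera_pose helper are illustrative assumptions, not the paper's exact prompt.

# Minimal illustrative sketch: query GPT-4o for a task-adaptive virtual camera
# pose. Assumes the OpenAI Python SDK; prompt wording and helper are hypothetical.
import base64
import json
from openai import OpenAI

client = OpenAI()

def query_camera_pose(scene_image_path: str, task: str) -> dict:
    """Ask GPT-4o for a virtual camera pose as {"elev": ..., "azim": ...} in degrees."""
    with open(scene_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "Environment: the annotated image marks the fixed camera poses in the scene.\n"
        f"Task: {task}\n"
        "Example: task 'open the drawer' -> {\"elev\": 30, \"azim\": 180}\n"
        "Rules: reply with a single JSON object containing integer fields "
        "'elev' (elevation angle) and 'azim' (azimuth angle)."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)

# Example: query_camera_pose("scene_som.png", "put the red block in the drawer")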

Policy Network

The architecture of the policy network is depicted in Figure 2. The original RGB-D inputs are transformed into a unified 3D point cloud and projected onto a virtual camera plane. We use this virtual image, combined with language instructions, to predict actions through a coarse-to-fine procedure.
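In code, this preprocessing step can be pictured roughly as follows. The sketch below is an illustrative numpy approximation under simplified assumptions (known camera intrinsics and extrinsics, an orthographic virtual camera, naive z-buffer splatting); the helper names backproject and virtual_view and the scale parameter are ours, not the paper's implementation.

# Illustrative sketch: lift RGB-D frames into a shared point cloud and re-render
# it from the virtual view (elev/azim) chosen by the foundation model.
import numpy as np

def backproject(depth, rgb, K, cam_to_world):
    """Lift one RGB-D frame (H, W) into world-frame colored points of shape (N, 6)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)        # homogeneous camera coords
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return np.concatenate([pts_world, rgb.reshape(-1, 3)], axis=1)

def virtual_view(points, elev_deg, azim_deg, image_size=224, scale=100.0):
    """Orthographically splat colored points onto the virtual camera plane
    defined by the elevation/azimuth angles returned by GPT-4o."""
    elev, azim = np.radians(elev_deg), np.radians(azim_deg)
    rot_z = np.array([[np.cos(azim), -np.sin(azim), 0.0],
                      [np.sin(azim),  np.cos(azim), 0.0],
                      [0.0, 0.0, 1.0]])
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0, np.cos(elev), -np.sin(elev)],
                      [0.0, np.sin(elev),  np.cos(elev)]])
    cam_pts = points[:, :3] @ (rot_x @ rot_z).T                   # rotate into the virtual frame
    img = np.zeros((image_size, image_size, 3))
    zbuf = np.full((image_size, image_size), np.inf)
    px = np.clip((cam_pts[:, 0] * scale + image_size / 2).astype(int), 0, image_size - 1)
    py = np.clip((cam_pts[:, 1] * scale + image_size / 2).astype(int), 0, image_size - 1)
    for i in range(len(points)):
        if cam_pts[i, 2] < zbuf[py[i], px[i]]:                    # keep the point closest to the camera
            zbuf[py[i], px[i]] = cam_pts[i, 2]
            img[py[i], px[i]] = points[i, 3:]
    return img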

  • Dynamic Coarse-to-Fine Module: We introduce a dynamic inference module that applies refinement only when needed: a lightweight predictor identifies task-critical phases and decides whether to activate the fine stage.
  • Depth-Aware Module: We incorporate learnable depth tokens to predict depth values for 3D action planning. Action prediction is performed by a Transformer that processes the depth tokens, language tokens, and image tokens together (both modules are sketched after Figure 2).

Figure 2: Policy network of the proposed VERM.
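As a rough illustration of how the two modules could fit together, here is a hedged PyTorch sketch: learnable depth tokens are concatenated with language and image tokens, a Transformer encoder processes them jointly, and a small gating head decides whether the fine stage runs. All module names, dimensions, output parameterizations, and the 0.5 gate threshold are assumptions for illustration, not the paper's actual architecture.

# Illustrative sketch of a depth-aware, dynamically gated policy head.
import torch
import torch.nn as nn

class DepthAwarePolicy(nn.Module):
    def __init__(self, dim=256, num_depth_tokens=8, num_layers=4, heads=8):
        super().__init__()
        # Learnable depth tokens, shared across samples and expanded per batch.
        self.depth_tokens = nn.Parameter(torch.randn(1, num_depth_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.depth_head = nn.Linear(dim, 1)       # depth value per depth token
        self.action_head = nn.Linear(dim, 7)      # e.g. translation + rotation + gripper
        self.gate = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, img_tokens, lang_tokens):
        b = img_tokens.size(0)
        depth_tokens = self.depth_tokens.expand(b, -1, -1)
        tokens = torch.cat([depth_tokens, lang_tokens, img_tokens], dim=1)
        feats = self.encoder(tokens)

        n = depth_tokens.size(1)
        depth_pred = self.depth_head(feats[:, :n]).squeeze(-1)   # (B, num_depth_tokens)
        coarse_action = self.action_head(feats.mean(dim=1))      # (B, 7)

        # Dynamic coarse-to-fine: a lightweight gate decides whether the
        # more expensive fine stage should be run for this step.
        run_fine = torch.sigmoid(self.gate(feats.mean(dim=1))) > 0.5
        return coarse_action, depth_pred, run_fine

# Usage (shapes assumed): img_tokens (B, N_img, 256), lang_tokens (B, N_lang, 256)
# policy = DepthAwarePolicy()
# action, depth, run_fine = policy(img_tokens, lang_tokens)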

Experiments

RLBench

We evaluate VERM on the RLBench simulation benchmark. VERM achieves 1.89× speedup in training time and 1.54× speedup in inference speed compared to RVT-2 (see Figure 3), while surpassing it by 1.4% in average task success rate and performing best in 11 out of 17 tasks (see Table 1). We also test VERM with different foundation models (GPT-4o, Qwen2.5, and Claude 3.5 Sonnet) and achieve comparable performance across all models (see Table 2).


Table 1: Results on the RLBench benchmark.


Figure 3: Training time (left) and inference speed (right) comparison.


Table 2: Cross-model generalization results with different foundation models.

Real-World Evaluation

We evaluate VERM on eight real-world manipulation tasks. VERM achieves strong performance with just 15 demonstrations, already outperforming RVT and RVT-2 in most tasks while significantly reducing both training time and inference latency.


Table 3: Results on real-world manipulation tasks.

Real-World Task Demonstrations

Stack Blocks

Put in Drawer

Put in Shelf

Put in Bowl

Citation

@misc{chen2025vermleveragingfoundationmodels,
    title={VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation}, 
    author={Yixiang Chen and Yan Huang and Keji He and Peiyan Li and Liang Wang},
    year={2025},
    eprint={2512.16724},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
}