Transporters with Visual Foresight (TVF)
- TVF is an image-based robotic pick-and-place planning framework that integrates a learned visual foresight model for next-state prediction.
- TVF refines the Goal-Conditioned Transporter Network with multi-modal action proposals and tree-search planning to enhance zero-shot generalization on unseen tasks.
- Experimental results demonstrate that TVF significantly outperforms GCTN in simulation and real-robot setups, reaching up to 85.6% success on unseen tasks while remaining effective with limited demonstration data.
Transporters with Visual Foresight (TVF) is an image-based task planning framework for robotic pick-and-place rearrangement, designed to generalize efficiently to unseen tasks with minimal demonstration data. TVF builds upon the Goal-Conditioned Transporter Network (GCTN) by integrating a learned Visual Foresight (VF) model for next-state prediction, enabling model-based search for anticipated outcomes. Experiments show TVF achieves significant improvements over GCTN alone in zero-shot generalization to unseen rearrangement problems, both in simulation and on real robot setups (Wu et al., 2022).
1. Formulation and Notation
The foundation of TVF is the planar tabletop pick-and-place rearrangement problem. At timestep t, the agent receives a top-down orthographic observation o_t ∈ R^(H×W×4), where the four channels correspond to RGB and height. The agent is also provided with a goal observation o_g of the desired final scene. The atomic action a_t = (T_pick, T_place) is parameterized by a pick pose T_pick and a place pose T_place, each an SE(2) pose on the table plane.
Given a dataset D = {ζ_1, …, ζ_n} of expert demonstration trajectories, where each trajectory ζ_i = {(o_1, a_1), …, (o_T, a_T), o_g} pairs observations with expert actions and ends with its goal image, the goal is to learn a goal-conditioned pick-and-place policy π(o_t, o_g) → a_t that generalizes in a zero-shot manner to unseen task configurations from a small number of demonstrations.
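The notation above can be mirrored in code. Below is a minimal sketch of the data structures, assuming pixel-space poses (u, v, θ) and H×W×4 observations; the names `Pose`, `Transition`, and `Trajectory` are illustrative, not from the paper:

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

# Illustrative containers mirroring the notation above (not the paper's code):
# an observation o_t is an H x W x 4 array (RGB + height), and an action
# a_t = (T_pick, T_place) is a pair of pixel-space SE(2) poses (u, v, theta).
Pose = Tuple[int, int, float]

@dataclass
class Transition:
    obs: np.ndarray   # o_t, shape (H, W, 4)
    pick: Pose        # T_pick
    place: Pose       # T_place

@dataclass
class Trajectory:
    steps: List[Transition]   # [(o_1, a_1), ..., (o_T, a_T)]
    goal: np.ndarray          # o_g, shape (H, W, 4)
```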
2. Architectural Components
TVF consists of two synergistic modules: (A) a multi-modal action proposal network based on GCTN, and (B) a Visual Foresight model for one-step state prediction. At inference, GCTN generates candidate pick-and-place actions; the VF model is used to predict each outcome, enabling a tree-search planner to select actions optimized with respect to the downstream goal.
Goal-Conditioned Transporter Network (GCTN)
GCTN encodes both the current and goal images to output spatial Q-value maps for pick and place operations. The pick-value head predicts the optimal pick location:

T_pick = argmax_(u,v) Q_pick((u, v) | o_t, o_g)

The place-value head incorporates discrete rotation bins over angles θ:

T_place = argmax_(u,v,θ) Q_place((u, v, θ) | o_t, o_g, T_pick)

The imitation learning objective is a dense cross-entropy loss between the softmax-normalized Q maps and one-hot maps of the demonstrated pick and place locations.
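The dense cross-entropy amounts to a softmax over every cell of a Q map against a one-hot target at the demonstrated location. A numpy sketch (the function name and shapes are illustrative):

```python
import numpy as np

def dense_cross_entropy(q_map: np.ndarray, target: tuple) -> float:
    """Cross-entropy between a softmax over every cell of a Q-value map and
    a one-hot target at the demonstrated action, e.g. (u, v) for pick or
    (u, v, theta_bin) for place."""
    logits = q_map.reshape(-1)
    logits = logits - logits.max()                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax over cells
    flat_idx = np.ravel_multi_index(target, q_map.shape)
    return float(-log_probs[flat_idx])
```

A uniform map yields loss log(H·W), and a map sharply peaked at the target drives the loss toward zero.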
Visual Foresight Model
The VF model directly encodes the action into image space. Given action a_t = (T_pick, T_place), it constructs a pick mask centered at T_pick and a place mask by rotating o_t by the relative rotation of the place pose about T_pick, cropping a same-sized patch, and pasting it at T_place. The input is processed by an encoder–decoder FCN with skip connections to yield a predicted next image ô_(t+1).
The VF loss is a pixel-wise reconstruction loss between the predicted and actual next observation,

L_VF = || ô_(t+1) − o_(t+1) ||,

with a 5× penalty on the height channel.
SE(2) equivariance is exploited: applying any rigid transform g ∈ SE(2) to both o_t and a_t amounts to applying the same transform to the prediction ô_(t+1), which simplifies data augmentation.
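The geometric part of this action encoding can be sketched directly. The sketch below assumes a square patch and restricts rotation to 90° steps so it stays exact with numpy alone; the real model uses the finer rotation bins, and it is the learned FCN, not this paste, that fills in the vacated pick region:

```python
import numpy as np

def encode_action(obs, pick, place, patch=24, rot_k=0):
    """Geometric sketch of the VF action encoding: crop a square patch
    centered at the pick pixel, rotate it about its own center (which
    coincides with the pick point), and paste it at the place pixel.
    Rotation is limited to 90-degree steps (np.rot90) for exactness."""
    H, W = obs.shape[:2]
    half = patch // 2
    pu, pv = pick
    qu, qv = place
    # Cropped patch around the pick point (assumed to lie inside the image).
    crop = obs[pu - half:pu + half, pv - half:pv + half].copy()
    crop = np.rot90(crop, k=rot_k)
    pick_mask = np.zeros((H, W), dtype=np.float32)
    pick_mask[pu - half:pu + half, pv - half:pv + half] = 1.0
    place_mask = np.zeros((H, W), dtype=np.float32)
    place_mask[qu - half:qu + half, qv - half:qv + half] = 1.0
    # Pasted image: the rotated patch lands at the place pixel. A trained
    # model would also learn to inpaint the vacated pick region.
    pasted = obs.copy()
    pasted[qu - half:qu + half, qv - half:qv + half] = crop
    return pick_mask, place_mask, pasted
```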
3. Multi-Modal Action Proposal and Planning
TVF augments the typical single-action GCTN output with a multi-modal proposal strategy:
- Compute the place Q-value map Q_place and its associated per-pixel scores.
- Apply a score threshold and keep the TopN = 100 highest-scoring candidates.
- Cluster these candidates by pixel location using K-means into K clusters.
- From each cluster, select the highest-scoring candidate as one proposal, yielding K diverse pick-and-place actions.
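The proposal steps above can be sketched end to end. This uses a tiny hand-rolled Lloyd's K-means with deterministic farthest-point initialization (an implementation choice for reproducibility; the paper does not specify the initialization):

```python
import numpy as np

def propose_actions(q_place, k=3, top_n=100, thresh=0.0, iters=10):
    """Sketch of the multi-modal proposal step: threshold the place Q-map,
    keep the TopN pixels, cluster their coordinates with a small Lloyd's
    k-means, and return the highest-scoring pixel from each cluster."""
    H, W = q_place.shape
    flat = q_place.ravel()
    idx = np.argsort(flat)[::-1][:top_n]     # TopN by score, descending
    idx = idx[flat[idx] > thresh]            # drop below-threshold pixels
    pts = np.stack(np.unravel_index(idx, (H, W)), axis=1).astype(float)
    scores = flat[idx]
    # Deterministic farthest-point initialization of the k centers.
    centers = [pts[scores.argmax()]]
    while len(centers) < min(k, len(pts)):
        d = ((pts[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(pts[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):                   # Lloyd iterations
        assign = ((pts[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(len(centers)):
            if (assign == c).any():
                centers[c] = pts[assign == c].mean(0)
    proposals = []
    for c in range(len(centers)):
        members = np.flatnonzero(assign == c)
        if members.size:
            best = members[scores[members].argmax()]
            proposals.append(tuple(int(x) for x in pts[best]))
    return proposals
```

Given a Q map with two well-separated peaks and k=2, the two proposals land on the two peak pixels, which is exactly the multi-modality the single-argmax GCTN output lacks.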
In tree-search planning, each node holds a (predicted) image, its depth d, and the action sequence leading to it. The tree is expanded to a maximum depth, exploring the K proposal candidates at each expansion, with the VF model predicting the image at each child node. Each node is scored by the discounted sum of the Q-values of the actions along its sequence, with a discount factor γ and an additive constant. At execution, the best-scored node's first action is performed, and the planner replans at every step until the goal is reached or a maximum step count is hit.
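The planning loop can be sketched with the proposal network, VF model, and Q-values abstracted as callables; `propose`, `foresight`, and `value` below are stand-ins for the trained networks, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class Node:
    obs: object    # image (here, any state) predicted at this node
    depth: int
    actions: list  # action sequence from the root
    score: float

def plan(root_obs, propose, foresight, value, k=3, max_depth=3, gamma=0.9):
    """Sketch of the TVF planning loop: expand k proposal candidates per
    node down to max_depth, score each node by the discounted sum of its
    actions' values, and return the first action of the best node."""
    best = None
    frontier = [Node(root_obs, 0, [], 0.0)]
    while frontier:
        node = frontier.pop()
        if node.depth == max_depth:
            continue  # depth limit reached: do not expand further
        for action in propose(node.obs, k):
            child = Node(
                obs=foresight(node.obs, action),  # VF next-state prediction
                depth=node.depth + 1,
                actions=node.actions + [action],
                score=node.score + gamma ** node.depth * value(node.obs, action),
            )
            if best is None or child.score > best.score:
                best = child
            frontier.append(child)
    return best.actions[0] if best else None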
4. Training Protocol and Implementation
Both GCTN and VF modules are trained using Adam with learning rate 1e−4 over 60,000 steps, batch size 1, and SE(2)-invariant data augmentation via random rotation and translation. The FCN uses 5x5 convolutional kernels, downsampling to a 4×4 latent representation, with up-projection via deconvolutions and skip connections (matching the original TransporterNet FCN architecture). Rotation space is discretized into bins, corresponding to 10° increments. K-means threshold , discount , and are used. Computationally, training requires ~0.08 s/step for GCTN, 0.14 s/step for TVF-Small (K=2), and 1.8 s/step for TVF-Large (K=3, ) on an RTX 3090.
5. Experimental Evaluation and Quantitative Results
Experiments were conducted both in simulation (Ravens suite) and with a real Franka Panda robot equipped with a Carmine depth camera. In simulation, 14 tasks (6 train, 8 held-out) involve constructing planar shapes from blocks per task. Expert demonstrations ( per task) were used for training. Real-robot evaluation used 3 training tasks (10 demos each) and 3 unseen test tasks (10 trials each).
Main results include:
| Demos/task | GCTN (%) | TVF-S (K=2,d=1) (%) | TVF-L (K=3,d=3) (%) |
|---|---|---|---|
| 1 | 1.3 | 1.7 | 2.9 |
| 10 | 55.4 | 71.5 | 78.5 |
| 100 | 49.0 | 62.3 | 71.7 |
| 1000 | 54.2 | 72.5 | 85.6 |
On 8 unseen simulation tasks (10 demos/task), TVF-Large achieves 78.5% success versus 55.4% for GCTN. Per-task breakdowns show TVF outperforming GCTN, e.g., Plane Square (TVF-L 100% vs. GCTN 86.7%), Building (TVF-L 13.3% vs. GCTN 5.0%).
In real-robot trials (30 demos total), average success for unseen tasks improves from 30.0% (GCTN) to 63.3% (TVF), and dramatic gains occur on especially difficult tasks (e.g., Twin-Tower: 0% GCTN vs. 60% TVF).
Visual Foresight ablations show pixel-wise loss (10 demos/task): Latent Dynamics baseline 0.0875 (color), 0.0873 (height); TVF 0.0242 (color), 0.0136 (height).
6. Ablation Analysis
Systematic variation of and search depth demonstrates that increased tree-search depth boosts average performance only when both the action proposal and the VF model are sufficiently accurate. The highest average success (78.5%) on the 10-demo setting is reported for . Excessive search depth can degrade performance due to accumulated VF model error, particularly when model predictions become less reliable.
7. Significance and Limitations
TVF demonstrates that explicit one-step visual prediction, when combined with multi-modal action proposals and local tree-search, significantly enhances the ability to compose skills for rearrangement tasks not seen during training, even in the low-data regime. The framework's reliance on SE(2)-equivariant representations and data augmentation is essential for sample efficiency and generalization. Notably, real robot results confirm substantial improvements on difficult compositions relative to leading imitation-only baselines.
A plausible implication is that the model-based approach, as instantiated in TVF, provides a tractable path for visual task planning under severe demonstration constraints. However, TVF’s efficacy is conditional on the accuracy of both its forward model and its action-value proposals: error accumulation in either component can impair scaling to deeper search or higher-horizon tasks.