
Transporters with Visual Foresight (TVF)

Updated 14 January 2026
  • TVF is an image-based robotic pick-and-place planning framework that integrates a learned visual foresight model for next-state prediction.
  • TVF refines the Goal-Conditioned Transporter Network with multi-modal action proposals and tree-search planning to enhance zero-shot generalization on unseen tasks.
  • Experimental results demonstrate that TVF significantly outperforms GCTN in simulation and real-robot setups, achieving up to 85.6% success with limited demonstration data.

Transporters with Visual Foresight (TVF) is an image-based task planning framework for robotic pick-and-place rearrangement, designed to generalize efficiently to unseen tasks with minimal demonstration data. TVF builds upon the Goal-Conditioned Transporter Network (GCTN) by integrating a learned Visual Foresight (VF) model for next-state prediction, enabling model-based search for anticipated outcomes. Experiments show TVF achieves significant improvements over GCTN alone in zero-shot generalization to unseen rearrangement problems, both in simulation and on real robot setups (Wu et al., 2022).

1. Formulation and Notation

The foundation of TVF is the planar tabletop pick-and-place rearrangement problem. At timestep $t$, the agent receives a top-down orthographic observation $o_t\in\mathbb{R}^{H\times W\times 4}$, where the four channels correspond to RGB and height. The agent is also provided with a goal observation $o_g\in\mathbb{R}^{H\times W\times 4}$. The atomic action $a_t$ is parameterized by a pick pose $T_{\text{pick}}=(p_{\text{pick}},\theta_{\text{pick}})\in SE(2)$ and a place pose $T_{\text{place}}=(p_{\text{place}},\theta_{\text{place}})\in SE(2)$.

Given a dataset $D=\{\xi_i\}_{i=1}^N$ of expert demonstration trajectories

$\xi_i = \{o_1, a_1, o_2, a_2, \ldots, o_{T_i}, a_{T_i}, o_{T_i+1}=o_g\},$

the goal is to learn a goal-conditioned pick-and-place policy $\pi(o_t, o_g)\to a_t$ that generalizes in a zero-shot manner to unseen task configurations from a small number of demonstrations.
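The notation can be mirrored with minimal container types; the dataclasses below are illustrative stand-ins, not from the TVF codebase:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PickPlaceAction:
    """a_t = (T_pick, T_place): pixel positions plus planar rotations."""
    p_pick: Tuple[int, int]    # (u, v) pick pixel
    theta_pick: float          # pick rotation in radians
    p_place: Tuple[int, int]   # (u, v) place pixel
    theta_place: float         # place rotation in radians

@dataclass
class Trajectory:
    """One demonstration xi_i = {o_1, a_1, ..., o_{T_i}, a_{T_i}, o_{T_i+1}}."""
    observations: list               # T_i + 1 images, each H x W x 4 (RGB + height)
    actions: List[PickPlaceAction]   # T_i pick-and-place actions

    def goal(self):
        # By construction the final observation of a demonstration is o_g.
        return self.observations[-1]
```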

2. Architectural Components

TVF consists of two synergistic modules: (A) a multi-modal action proposal network based on GCTN, and (B) a Visual Foresight model for one-step state prediction. At inference, GCTN generates $K$ candidate pick-and-place actions; the VF model predicts each outcome, enabling a tree-search planner to select the action best aligned with the downstream goal.

Goal-Conditioned Transporter Network (GCTN)

GCTN encodes both the current and goal images to output spatial Q-value maps for pick and place operations. The pick-value head $Q_{\text{pick}}(u,v\mid o_t,o_g)\in\mathbb{R}^{H\times W}$ predicts the optimal pick location:

$p_{\text{pick}} = \arg\max_{(u,v)} Q_{\text{pick}}(u,v\mid o_t,o_g).$

The place-value head $Q_{\text{place}}(u,v,r\mid o_t,o_g,p_{\text{pick}})\in\mathbb{R}^{H\times W\times R}$ incorporates $R$ discrete rotation bins:

$(u^*,v^*,r^*) = \arg\max Q_{\text{place}}.$

The imitation learning objective is a dense cross-entropy loss:

$\mathcal{L}_{\text{GCTN}} = -\sum_t \left[\log\operatorname{softmax} Q_{\text{pick}}(p^*_{\text{pick}}) + \log\operatorname{softmax} Q_{\text{place}}(p^*_{\text{place}}, r^*)\right].$
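Greedy action selection from these value maps can be sketched in a few lines; `select_action` is a hypothetical helper operating on plain numpy arrays (in the actual system, $Q_{\text{place}}$ is computed conditioned on the chosen pick):

```python
import numpy as np

def select_action(q_pick: np.ndarray, q_place: np.ndarray):
    """Greedy pick-and-place selection from GCTN value maps (minimal sketch).

    q_pick : (H, W) pick-value map.
    q_place: (H, W, R) place-value map over R discrete rotation bins,
             assumed already conditioned on the chosen pick.
    Returns the pick pixel (u, v) and the place triple (u, v, r).
    """
    p_pick = np.unravel_index(np.argmax(q_pick), q_pick.shape)
    u, v, r = np.unravel_index(np.argmax(q_place), q_place.shape)
    return p_pick, (u, v, r)
```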

Visual Foresight Model

The VF model encodes the action directly into image space. Given action $a_t$, it constructs a pick mask $M_{\text{pick}} \in \{0,1\}^{H\times W}$ centered at $p_{\text{pick}}$ and a place mask $M_{\text{place}}\in\mathbb{R}^{H\times W\times 4}$ by rotating $o_t$ by $\Delta\theta = \theta_{\text{place}} - \theta_{\text{pick}}$ about $p_{\text{pick}}$, cropping a same-sized patch, and pasting it at $p_{\text{place}}$. The concatenated input $[o_t, M_{\text{pick}}, M_{\text{place}}]\in\mathbb{R}^{H\times W\times(4+1+4)}$ is processed by an encoder–decoder FCN with skip connections to yield a predicted next image $\hat{o}_{t+1}$.
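The mask construction can be illustrated as follows; `vf_input` is a hypothetical helper, the patch size is an assumed parameter, and the rotation by $\Delta\theta$ is omitted for brevity (only the crop-and-paste step is shown, for pick/place locations away from the image border):

```python
import numpy as np

def vf_input(o_t, p_pick, p_place, patch=8):
    """Build the 9-channel VF input [o_t, M_pick, M_place] (sketch).

    o_t: (H, W, 4) observation; p_pick/p_place: (u, v) pixel locations.
    Assumes p_place is at least `patch` pixels from the border.
    """
    H, W, _ = o_t.shape
    u0, v0 = p_pick
    u1, v1 = p_place

    # Binary pick mask: a square window centered at the pick location.
    m_pick = np.zeros((H, W, 1), dtype=o_t.dtype)
    m_pick[max(u0 - patch, 0):u0 + patch, max(v0 - patch, 0):v0 + patch] = 1.0

    # Place mask: crop the patch around the pick and paste it at the place.
    m_place = np.zeros_like(o_t)
    crop = o_t[max(u0 - patch, 0):u0 + patch, max(v0 - patch, 0):v0 + patch]
    m_place[u1 - patch:u1 - patch + crop.shape[0],
            v1 - patch:v1 - patch + crop.shape[1]] = crop

    return np.concatenate([o_t, m_pick, m_place], axis=-1)  # (H, W, 9)
```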

The VF loss is

$\mathcal{L}_{\text{VF}} = \|\hat{o}_{t+1} - o_{t+1}\|_1$

with a 5× penalty on the height channel.
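A minimal numpy sketch of this weighted $L_1$ loss (the paper's exact reduction and normalization may differ):

```python
import numpy as np

def vf_loss(pred, target, height_weight=5.0):
    """Pixel-wise L1 loss with a 5x penalty on the height channel.

    pred, target: (H, W, 4) images with channels [R, G, B, height].
    Returns the mean per-channel-weighted absolute error.
    """
    err = np.abs(pred - target)
    w = np.array([1.0, 1.0, 1.0, height_weight])  # 5x weight on height
    return float((err * w).mean())
```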

SE(2) equivariance is exploited: applying any rigid transform $g\in SE(2)$ to $(o_t, a_t)$ transforms the next observation $o_{t+1}$, and hence the prediction, by the same $g$, simplifying data augmentation.

3. Multi-Modal Action Proposal and Planning

TVF augments the typical single-action GCTN output with a multi-modal proposal strategy:

  1. Compute $\tilde Q_{\text{place}}(u,v)=\max_r Q_{\text{place}}(u,v,r)$ and the associated $\tilde r(u,v)=\arg\max_r Q_{\text{place}}(u,v,r)$.
  2. Apply a threshold to form $S=\{(u,v): \tilde Q_{\text{place}}(u,v)>\alpha Q^{\max}_{\text{place}}\}$ and retain the top 100 scores.
  3. Cluster the surviving pixels into $K$ clusters with K-means.
  4. From each cluster $i$, pick $(u_i,v_i) = \arg\max_{(u,v)\in\text{cluster}_i} \tilde Q_{\text{place}}(u,v)$ and set $\theta_i = \tilde r(u_i,v_i)$.
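The four steps above can be sketched with a tiny hand-rolled K-means (a library implementation would normally be used; `propose_actions` and its defaults are illustrative):

```python
import numpy as np

def propose_actions(q_place, alpha=0.01, top_n=100, k=3, iters=10, seed=0):
    """Multi-modal place proposals via threshold + K-means (sketch).

    q_place: (H, W, R) place-value map. Returns up to k proposals
    (u, v, r), the best-scoring pixel and rotation bin per cluster.
    """
    q2d = q_place.max(axis=-1)    # \tilde{Q}(u, v)
    r2d = q_place.argmax(axis=-1) # \tilde{r}(u, v)

    # Threshold relative to the global maximum, then keep the top-N scores.
    mask = q2d > alpha * q2d.max()
    cand = np.argwhere(mask)
    order = np.argsort(q2d[mask])[::-1][:top_n]
    pts = cand[order].astype(float)

    # Tiny K-means on the surviving pixel coordinates.
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), size=min(k, len(pts)), replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(pts[:, None] - centers[None], axis=-1)
        assign = d.argmin(axis=1)
        for i in range(len(centers)):
            if (assign == i).any():
                centers[i] = pts[assign == i].mean(axis=0)

    # Best-scoring pixel in each cluster, with its best rotation bin.
    proposals = []
    for i in range(len(centers)):
        members = cand[order][assign == i]
        if len(members) == 0:
            continue
        vals = q2d[members[:, 0], members[:, 1]]
        u, v = members[vals.argmax()]
        proposals.append((int(u), int(v), int(r2d[u, v])))
    return proposals
```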

In tree-search planning, nodes are $n=[o,d,\tau]$ with image $o$, depth $d$, and action sequence $\tau$. The tree is expanded to depth $d_{\max}$, exploring $K$ candidates at each expansion. Each node is scored as

$V(n) = \gamma^{d-1}\left[C - \|o - o_g\|_1\right]$

with discount factor $\gamma\in(0,1)$ and constant $C>0$. At execution, the first action of the best-scored node is performed, and the planner replans from the resulting observation until the goal is reached or a maximum step count is hit.
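The search can be sketched as a depth-limited exhaustive expansion; `propose` and `foresee` below are stand-ins for the GCTN proposal module and the VF model, and all names are illustrative:

```python
import numpy as np

def plan(o0, o_goal, propose, foresee, k=2, d_max=2, gamma=0.99, C=1.0):
    """Depth-limited tree search over proposed actions (minimal sketch).

    propose(o) -> list of candidate actions; foresee(o, a) -> predicted
    next image. Returns the first action of the best-scoring node under
    V(n) = gamma^(d-1) * [C - ||o - o_g||_1].
    """
    best = {"v": -np.inf, "a": None}

    def value(o, depth):
        return gamma ** (depth - 1) * (C - np.abs(o - o_goal).sum())

    def expand(o, depth, first_action):
        v = value(o, depth)
        if v > best["v"]:
            best["v"], best["a"] = v, first_action
        if depth >= d_max:
            return
        for a in propose(o)[:k]:
            expand(foresee(o, a), depth + 1, first_action)

    for a in propose(o0)[:k]:
        expand(foresee(o0, a), 1, a)
    return best["a"]
```

Replanning then amounts to executing the returned action, observing the true next image, and calling `plan` again.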

4. Training Protocol and Implementation

Both GCTN and VF modules are trained with Adam at learning rate $10^{-4}$ for 60,000 steps with batch size 1, using SE(2)-invariant data augmentation via random rotation and translation. The FCN uses $5\times 5$ convolutional kernels, downsampling to a $4\times 4$ latent representation, with up-projection via deconvolutions and skip connections (matching the original TransporterNet FCN architecture). Rotation space is discretized into $R=36$ bins, corresponding to 10° increments. The proposal threshold is $\alpha=0.01$, the discount $\gamma=0.99$, and $C=1$. Computationally, training requires ~0.08 s/step for GCTN, 0.14 s/step for TVF-Small ($K=2$), and 1.8 s/step for TVF-Large ($K=3$, $d=3$) on an RTX 3090.
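For reference, the reported hyperparameters gathered into one place (a hypothetical configuration dict; a real training script would thread these into the optimizer and planner):

```python
# Hyperparameters as reported above; the dict itself is illustrative.
CONFIG = {
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "train_steps": 60_000,
    "batch_size": 1,
    "rotation_bins": 36,   # R = 36 -> 10 degree increments
    "alpha": 0.01,         # proposal threshold
    "gamma": 0.99,         # tree-search discount
    "C": 1.0,              # score offset
}
```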

5. Experimental Evaluation and Quantitative Results

Experiments were conducted both in simulation (Ravens suite) and with a real Franka Panda robot equipped with a Carmine depth camera. In simulation, 14 tasks (6 train, 8 held-out) involve constructing planar shapes from $2\leq N\leq 5$ blocks per task. Expert demonstrations (1, 10, 100, or 1000 per task) were used for training. Real-robot evaluation used 3 training tasks (10 demos each) and 3 unseen test tasks (10 trials each).

Main results include:

| Demos/task | GCTN (%) | TVF-S ($K=2$, $d=1$) (%) | TVF-L ($K=3$, $d=3$) (%) |
|---|---|---|---|
| 1 | 1.3 | 1.7 | 2.9 |
| 10 | 55.4 | 71.5 | 78.5 |
| 100 | 49.0 | 62.3 | 71.7 |
| 1000 | 54.2 | 72.5 | 85.6 |

On 8 unseen simulation tasks (10 demos/task), TVF-Large achieves 78.5% success versus 55.4% for GCTN. Per-task breakdowns show TVF outperforming GCTN, e.g., Plane Square (TVF-L 100% vs. GCTN 86.7%), Building (TVF-L 13.3% vs. GCTN 5.0%).

In real-robot trials (30 demos total), average success for unseen tasks improves from 30.0% (GCTN) to 63.3% (TVF), and dramatic gains occur on especially difficult tasks (e.g., Twin-Tower: 0% GCTN vs. 60% TVF).

Visual Foresight ablations report pixel-wise $L_1$ prediction error (10 demos/task): Latent Dynamics baseline 0.0875 (color) and 0.0873 (height), versus TVF 0.0242 (color) and 0.0136 (height).

6. Ablation Analysis

Systematic variation of $K\in\{2,3\}$ and search depth $d\in\{1,2,3,4\}$ demonstrates that increased tree-search depth boosts average performance only when both the action proposals and the VF model are sufficiently accurate. The highest average success (78.5%) in the 10-demo setting is obtained with $K=3$, $d=3$. Excessive search depth can degrade performance due to accumulated VF model error.

7. Significance and Limitations

TVF demonstrates that explicit one-step visual prediction, when combined with multi-modal action proposals and local tree-search, significantly enhances the ability to compose skills for rearrangement tasks not seen during training, even in the low-data regime. The framework's reliance on SE(2)-equivariant representations and data augmentation is essential for sample efficiency and generalization. Notably, real robot results confirm substantial improvements on difficult compositions relative to leading imitation-only baselines.

A plausible implication is that the model-based approach, as instantiated in TVF, provides a tractable path for visual task planning under severe demonstration constraints. However, TVF’s efficacy is conditional on the accuracy of both its forward model and its action-value proposals: error accumulation in either component can impair scaling to deeper search or higher-horizon tasks.
