Transporters with Visual Foresight (TVF)
- TVF is an image-based robotic pick-and-place planning framework that integrates a learned visual foresight model for next-state prediction.
- TVF refines the Goal-Conditioned Transporter Network with multi-modal action proposals and tree-search planning to enhance zero-shot generalization on unseen tasks.
- Experimental results demonstrate that TVF significantly outperforms GCTN in simulation and real-robot setups, reaching up to 85.6% success on unseen tasks while remaining effective with limited demonstration data.
Transporters with Visual Foresight (TVF) is an image-based task planning framework for robotic pick-and-place rearrangement, designed to generalize efficiently to unseen tasks with minimal demonstration data. TVF builds upon the Goal-Conditioned Transporter Network (GCTN) by integrating a learned Visual Foresight (VF) model for next-state prediction, enabling model-based search for anticipated outcomes. Experiments show TVF achieves significant improvements over GCTN alone in zero-shot generalization to unseen rearrangement problems, both in simulation and on real robot setups (Wu et al., 2022).
1. Formulation and Notation
The foundation of TVF is the planar tabletop pick-and-place rearrangement problem. At timestep t, the agent receives a top-down orthographic observation o_t ∈ R^(H×W×4), where the four channels correspond to RGB and height. The agent is also provided with a goal observation o_g of the desired final scene. The atomic action a_t = (T_pick, T_place) is parameterized by a pick pose T_pick and a place pose T_place, each an SE(2) pose on the table plane.
Given a dataset D = {ζ_1, …, ζ_n} of expert demonstration trajectories, where each trajectory ζ_i = {(o_1, a_1), …, (o_T, a_T), o_g} pairs observations with expert actions and ends with its goal image, the goal is to learn a goal-conditioned pick-and-place policy π(o_t, o_g) → a_t that generalizes in a zero-shot manner to unseen task configurations from a small number of demonstrations.
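The notation above can be mirrored in code. Below is a minimal sketch of the data structures, assuming pixel-space poses (u, v, θ) and H×W×4 observations; the names `Pose`, `Transition`, and `Trajectory` are illustrative, not from the paper:

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

# Illustrative containers mirroring the notation above (not the paper's code):
# an observation o_t is an H x W x 4 array (RGB + height), and an action
# a_t = (T_pick, T_place) is a pair of pixel-space SE(2) poses (u, v, theta).
Pose = Tuple[int, int, float]

@dataclass
class Transition:
    obs: np.ndarray   # o_t, shape (H, W, 4)
    pick: Pose        # T_pick
    place: Pose       # T_place

@dataclass
class Trajectory:
    steps: List[Transition]   # [(o_1, a_1), ..., (o_T, a_T)]
    goal: np.ndarray          # o_g, shape (H, W, 4)
```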
2. Architectural Components
TVF consists of two synergistic modules: (A) a multi-modal action proposal network based on GCTN, and (B) a Visual Foresight model for one-step state prediction. At inference, GCTN generates candidate pick-and-place actions; the VF model is used to predict each outcome, enabling a tree-search planner to select actions optimized with respect to the downstream goal.
Goal-Conditioned Transporter Network (GCTN)
GCTN encodes both the current and goal images to output spatial Q-value maps for pick and place operations. The pick-value head predicts the optimal pick location:

T_pick = argmax_(u,v) Q_pick((u, v) | o_t, o_g)

The place-value head incorporates discrete rotation bins over angles θ:

T_place = argmax_(u,v,θ) Q_place((u, v, θ) | o_t, o_g, T_pick)

The imitation learning objective is a dense cross-entropy loss between the softmax-normalized Q maps and one-hot maps of the demonstrated pick and place locations.
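The dense cross-entropy amounts to a softmax over every cell of a Q map against a one-hot target at the demonstrated location. A numpy sketch (the function name and shapes are illustrative):

```python
import numpy as np

def dense_cross_entropy(q_map: np.ndarray, target: tuple) -> float:
    """Cross-entropy between a softmax over every cell of a Q-value map and
    a one-hot target at the demonstrated action, e.g. (u, v) for pick or
    (u, v, theta_bin) for place."""
    logits = q_map.reshape(-1)
    logits = logits - logits.max()                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax over cells
    flat_idx = np.ravel_multi_index(target, q_map.shape)
    return float(-log_probs[flat_idx])
```

A uniform map yields loss log(H·W), and a map sharply peaked at the target drives the loss toward zero.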
Visual Foresight Model
The VF model directly encodes the action into image space. Given action a_t = (T_pick, T_place), it constructs a pick mask centered at T_pick and a place mask by rotating o_t by the relative rotation of the place pose about T_pick, cropping a same-sized patch, and pasting it at T_place. The input is processed by an encoder–decoder FCN with skip connections to yield a predicted next image ô_(t+1).
The VF loss is a pixel-wise reconstruction loss between the predicted and actual next observation,

L_VF = || ô_(t+1) − o_(t+1) ||,

with a 5× penalty on the height channel.
SE(2) equivariance is exploited: applying any rigid transform g ∈ SE(2) to both o_t and a_t amounts to applying the same transform to the prediction ô_(t+1), which simplifies data augmentation.
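The geometric part of this action encoding can be sketched directly. The sketch below assumes a square patch and restricts rotation to 90° steps so it stays exact with numpy alone; the real model uses the finer rotation bins, and it is the learned FCN, not this paste, that fills in the vacated pick region:

```python
import numpy as np

def encode_action(obs, pick, place, patch=24, rot_k=0):
    """Geometric sketch of the VF action encoding: crop a square patch
    centered at the pick pixel, rotate it about its own center (which
    coincides with the pick point), and paste it at the place pixel.
    Rotation is limited to 90-degree steps (np.rot90) for exactness."""
    H, W = obs.shape[:2]
    half = patch // 2
    pu, pv = pick
    qu, qv = place
    # Cropped patch around the pick point (assumed to lie inside the image).
    crop = obs[pu - half:pu + half, pv - half:pv + half].copy()
    crop = np.rot90(crop, k=rot_k)
    pick_mask = np.zeros((H, W), dtype=np.float32)
    pick_mask[pu - half:pu + half, pv - half:pv + half] = 1.0
    place_mask = np.zeros((H, W), dtype=np.float32)
    place_mask[qu - half:qu + half, qv - half:qv + half] = 1.0
    # Pasted image: the rotated patch lands at the place pixel. A trained
    # model would also learn to inpaint the vacated pick region.
    pasted = obs.copy()
    pasted[qu - half:qu + half, qv - half:qv + half] = crop
    return pick_mask, place_mask, pasted
```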
3. Multi-Modal Action Proposal and Planning
TVF augments the typical single-action GCTN output with a multi-modal proposal strategy:
- Compute the place Q-value map Q_place and its associated per-pixel scores.
- Apply a score threshold and keep the TopN = 100 highest-scoring candidates.
- Cluster these candidates by pixel location using K-means into K clusters.
- From each cluster, select the highest-scoring candidate as one proposal, yielding K diverse pick-and-place actions.
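The proposal steps above can be sketched end to end. This uses a tiny hand-rolled Lloyd's K-means with deterministic farthest-point initialization (an implementation choice for reproducibility; the paper does not specify the initialization):

```python
import numpy as np

def propose_actions(q_place, k=3, top_n=100, thresh=0.0, iters=10):
    """Sketch of the multi-modal proposal step: threshold the place Q-map,
    keep the TopN pixels, cluster their coordinates with a small Lloyd's
    k-means, and return the highest-scoring pixel from each cluster."""
    H, W = q_place.shape
    flat = q_place.ravel()
    idx = np.argsort(flat)[::-1][:top_n]     # TopN by score, descending
    idx = idx[flat[idx] > thresh]            # drop below-threshold pixels
    pts = np.stack(np.unravel_index(idx, (H, W)), axis=1).astype(float)
    scores = flat[idx]
    # Deterministic farthest-point initialization of the k centers.
    centers = [pts[scores.argmax()]]
    while len(centers) < min(k, len(pts)):
        d = ((pts[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(pts[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):                   # Lloyd iterations
        assign = ((pts[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(len(centers)):
            if (assign == c).any():
                centers[c] = pts[assign == c].mean(0)
    proposals = []
    for c in range(len(centers)):
        members = np.flatnonzero(assign == c)
        if members.size:
            best = members[scores[members].argmax()]
            proposals.append(tuple(int(x) for x in pts[best]))
    return proposals
```

Given a Q map with two well-separated peaks and k=2, the two proposals land on the two peak pixels, which is exactly the multi-modality the single-argmax GCTN output lacks.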
In tree-search planning, each node holds a (predicted) image, its depth d, and the action sequence leading to it. The tree is expanded to a maximum depth, exploring the K proposal candidates at each expansion, with the VF model predicting the image at each child node. Each node is scored by the discounted sum of the Q-values of the actions along its sequence, with a discount factor γ and an additive constant. At execution, the best-scored node's first action is performed, and the planner replans at every step until the goal is reached or a maximum step count is hit.
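The planning loop can be sketched with the proposal network, VF model, and Q-values abstracted as callables; `propose`, `foresight`, and `value` below are stand-ins for the trained networks, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class Node:
    obs: object    # image (here, any state) predicted at this node
    depth: int
    actions: list  # action sequence from the root
    score: float

def plan(root_obs, propose, foresight, value, k=3, max_depth=3, gamma=0.9):
    """Sketch of the TVF planning loop: expand k proposal candidates per
    node down to max_depth, score each node by the discounted sum of its
    actions' values, and return the first action of the best node."""
    best = None
    frontier = [Node(root_obs, 0, [], 0.0)]
    while frontier:
        node = frontier.pop()
        if node.depth == max_depth:
            continue  # depth limit reached: do not expand further
        for action in propose(node.obs, k):
            child = Node(
                obs=foresight(node.obs, action),  # VF next-state prediction
                depth=node.depth + 1,
                actions=node.actions + [action],
                score=node.score + gamma ** node.depth * value(node.obs, action),
            )
            if best is None or child.score > best.score:
                best = child
            frontier.append(child)
    return best.actions[0] if best else None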
4. Training Protocol and Implementation
Both GCTN and VF modules are trained using Adam with learning rate 1e−4 over 60,000 steps, batch size 1, and SE(2)-invariant data augmentation via random rotation and translation. The FCN uses 5x5 convolutional kernels, downsampling to a 4×4 latent representation, with up-projection via deconvolutions and skip connections (matching the original TransporterNet FCN architecture). Rotation space is discretized into bins, corresponding to 10° increments. K-means threshold , discount , and are used. Computationally, training requires ~0.08 s/step for GCTN, 0.14 s/step for TVF-Small (K=2), and 1.8 s/step for TVF-Large (K=3, ) on an RTX 3090.
5. Experimental Evaluation and Quantitative Results
Experiments were conducted both in simulation (Ravens suite) and with a real Franka Panda robot equipped with a Carmine depth camera. In simulation, 14 tasks (6 train, 8 held-out) involve constructing planar shapes from blocks per task. Expert demonstrations ( per task) were used for training. Real-robot evaluation used 3 training tasks (10 demos each) and 3 unseen test tasks (10 trials each).
Main results include:
| Demos/task | GCTN (%) | TVF-S (K=2,d=1) (%) | TVF-L (K=3,d=3) (%) |
|---|---|---|---|
| 1 | 1.3 | 1.7 | 2.9 |
| 10 | 55.4 | 71.5 | 78.5 |
| 100 | 49.0 | 62.3 | 71.7 |
| 1000 | 54.2 | 72.5 | 85.6 |
On 8 unseen simulation tasks (10 demos/task), TVF-Large achieves 78.5% success versus 55.4% for GCTN. Per-task breakdowns show TVF outperforming GCTN, e.g., Plane Square (TVF-L 100% vs. GCTN 86.7%), Building (TVF-L 13.3% vs. GCTN 5.0%).
In real-robot trials (30 demos total), average success for unseen tasks improves from 30.0% (GCTN) to 63.3% (TVF), and dramatic gains occur on especially difficult tasks (e.g., Twin-Tower: 0% GCTN vs. 60% TVF).
Visual Foresight ablations show pixel-wise loss (10 demos/task): Latent Dynamics baseline 0.0875 (color), 0.0873 (height); TVF 0.0242 (color), 0.0136 (height).
6. Ablation Analysis
Systematic variation of and search depth demonstrates that increased tree-search depth boosts average performance only when both the action proposal and the VF model are sufficiently accurate. The highest average success (78.5%) on the 10-demo setting is reported for . Excessive search depth can degrade performance due to accumulated VF model error, particularly when model predictions become less reliable.
7. Significance and Limitations
TVF demonstrates that explicit one-step visual prediction, when combined with multi-modal action proposals and local tree-search, significantly enhances the ability to compose skills for rearrangement tasks not seen during training, even in the low-data regime. The framework's reliance on SE(2)-equivariant representations and data augmentation is essential for sample efficiency and generalization. Notably, real robot results confirm substantial improvements on difficult compositions relative to leading imitation-only baselines.
A plausible implication is that the model-based approach, as instantiated in TVF, provides a tractable path for visual task planning under severe demonstration constraints. However, TVF’s efficacy is conditional on the accuracy of both its forward model and its action-value proposals: error accumulation in either component can impair scaling to deeper search or higher-horizon tasks.