PredNet: A Predictive Coding Approach
- PredNet is a hierarchical recurrent CNN that embodies predictive coding principles to predict upcoming video frames from observed inputs.
- It leverages convolutional LSTMs and error units to capture local prediction errors and propagate them across layers for refined temporal analysis.
- Its versatility has led to diverse applications, including action recognition, image coding, and robotics sensorimotor integration.
PredNet is a hierarchical, recurrent convolutional neural network architecture implementing predictive coding principles for unsupervised learning, originally introduced for next-frame prediction in video sequences. Each layer of PredNet locally predicts its own feedforward activation, forwards only the residual prediction error to higher layers, and updates its state via recurrent dynamics. This architecture, motivated by the predictive coding theory in neuroscience, has been demonstrated to learn spatiotemporal features useful for both video prediction and downstream tasks, and forms the foundation for variants targeting action modulation, image coding, and action recognition.
1. Hierarchical Predictive Coding Architecture
PredNet is formed from stacked layers, indexed l = 0, …, L, each with four interacting units:
- Representation (R) units: Recurrent states implemented via convolutional LSTMs (ConvLSTMs) that memorize past errors, prior internal state, and top-down context.
- Prediction (Â) units: Convolutional layers generating a local prediction of the layer's feedforward input.
- Error (E) units: Rectified difference (split into positive and negative ReLU channels) between actual input and prediction.
- Feedforward (A) units: Actual input at each level; the raw image at l = 0, or a convolutional transformation (with max-pooling) of the lower level's errors for l > 0.
The key hierarchical flow alternates top-down predictions (Â_l, computed from R_l), comparison with the actual drive (A_l), error computation (E_l), and bottom-up error propagation (a convolution and pooling of E_l forms the next layer's A_{l+1}). This local prediction-error update loop enforces the principle that only unexplained portions of the input propagate upward, while fully explained parts are suppressed.
The core dynamics at each layer l and time step t are:
- A_0^t is the observed frame x_t; for l > 0, A_l^t is obtained by a convolution and 2×2 max-pooling of E_{l-1}^t.
- Each R_l^t is computed by a ConvLSTM from E_l^{t-1}, R_l^{t-1}, and (when l < L) the upsampled R_{l+1}^t.
- Prediction: Â_l^t = ReLU(Conv(R_l^t)).
- Error: E_l^t = [ReLU(A_l^t − Â_l^t); ReLU(Â_l^t − A_l^t)] (channel stacking doubles the filter count).
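The error-unit computation can be sketched in a few lines of NumPy (an illustrative sketch, not the reference implementation; array names follow the equations above):

```python
import numpy as np

def error_unit(A, A_hat):
    """PredNet error unit: rectified difference between input and
    prediction, split into positive and negative channels and
    stacked along the channel axis."""
    pos = np.maximum(A - A_hat, 0.0)   # where the input exceeds the prediction
    neg = np.maximum(A_hat - A, 0.0)   # where the prediction exceeds the input
    return np.concatenate([pos, neg], axis=0)  # channel count doubles

# Example: a 3-channel "frame" and a prediction of it
A = np.random.rand(3, 4, 4)
A_hat = np.random.rand(3, 4, 4)
E = error_unit(A, A_hat)
assert E.shape == (6, 4, 4)   # 2 * 3 channels
assert np.all(E >= 0)         # both halves are rectified
```

Note that the signed difference is recoverable from the two halves (pos − neg = A − Â), so splitting loses no information while keeping all error activations nonnegative.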
The architecture can be summarized by the module arrangement and feature depths:
| Layer Index | A-channels (PredNet-4) | A-channels (PredNet-5) |
|---|---|---|
| l=0 | 3 (RGB) | 3 |
| l=1 | 48 | 48 |
| l=2 | 96 | 96 |
| l=3 | 192 | 192 |
| l=4 | – | 192 |
All convolutional kernels are 3×3. Pooling is 2×2 max-pooling; upsampling is typically nearest-neighbor.
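At the shape level, these choices can be checked with a short NumPy sketch (illustrative only; the function name is mine, not from any released implementation):

```python
import numpy as np

def maxpool2x2(x):
    """2x2 max-pooling over the spatial dims of a (C, H, W) array."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

# Spatial resolution halves at each level of the hierarchy...
x = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)
assert maxpool2x2(x).shape == (3, 2, 2)

# ...while error stacking doubles the channel count feeding the next
# layer: for PredNet-4's A-channels from the table, the E-units carry
a_channels = [3, 48, 96, 192]
e_channels = [2 * c for c in a_channels]
assert e_channels == [6, 96, 192, 384]
```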
2. Formal Update Equations and Training Objective
The full update for each layer l at time t is as follows (Fonseca, 2019, Lotter et al., 2016):
- Feedforward assignment: A_0^t = x_t; for l > 0, A_l^t = MaxPool(ReLU(Conv(E_{l-1}^t))).
- Prediction: Â_l^t = ReLU(Conv(R_l^t)), with a saturating nonlinearity at l = 0 to keep pixel predictions in range.
- Error: E_l^t = [ReLU(A_l^t − Â_l^t); ReLU(Â_l^t − A_l^t)].
- Representation update: R_l^t = ConvLSTM(E_l^{t−1}, R_l^{t−1}, Upsample(R_{l+1}^t)) (no upsampled top-down term at the top layer).
- Loss (unsupervised, per sequence): L = Σ_t λ_t Σ_l (λ_l / n_l) Σ E_l^t, where n_l is the number of units in layer l. The λ_l are layerwise weights, often λ_0 = 1 with λ_{l>0} = 0 (the L_0 setting) or 0.1 (L_all).
After advancing the recurrent state from t to t+1, the model's predicted next frame is Â_0^{t+1}, i.e., the bottom-layer prediction made before the frame is observed.
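The training objective can be written out as a short NumPy sketch (illustrative only; the λ weights follow the L_0 / L_all settings described above, and the per-layer mean supplies the 1/n_l normalization):

```python
import numpy as np

def prednet_loss(errors, layer_weights, time_weights):
    """Weighted sum of mean error activations.

    errors: nested list indexed [t][l], each an array of E_l^t activations
    layer_weights: lambda_l, e.g. [1.0, 0.0, ...] (the L_0 setting)
    time_weights:  lambda_t, typically 0 at t=0 (nothing observed yet)
    """
    total = 0.0
    for t, errors_t in enumerate(errors):
        for l, e in enumerate(errors_t):
            total += time_weights[t] * layer_weights[l] * e.mean()
    return total

# Two time steps, two layers of (rectified, hence nonnegative) errors
errs = [[np.ones((2, 4, 4)), np.ones((4, 2, 2))] for _ in range(2)]
loss = prednet_loss(errs, layer_weights=[1.0, 0.1], time_weights=[0.0, 1.0])
# t=0 is weighted to zero; t=1 contributes 1.0*1.0 + 0.1*1.0
assert abs(loss - 1.1) < 1e-9
```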
3. Implementation Details and Variants
Several published studies have used PredNet with minor adjustments:
- Initialization: R_l^0 = 0 and E_l^0 = 0 for all l.
- Training: Optimizer is Adam (default β₁ = 0.9, β₂ = 0.999); a typical learning rate is 10⁻³ with decay during training.
- Input resolution: fixed per configuration; PredNet-4 and PredNet-5 differ in input size (the original KITTI experiments use 128×160 frames).
- Batching: Sequence length is typically 10–20 frames.
- Losses: the L1 norm is used in frame prediction and occupancy grid prediction; MSE is used in some psychophysics experiments.
Variants and extensions:
- AFA-PredNet (Zhong et al., 2018): Introduces action modulation via a Multi-Layer Perceptron (MLP) to condition top-down predictions on motor signals in an embodied setting, enabling action-dependent prediction.
- Feature-level adaptation (Huang et al., 2019): Replaces pixel inputs with features from a pre-trained CNN for action recognition, reducing PredNet hierarchy to two layers with 64 channels each.
- Image coding setting (Zhang et al., 2018): A deep residual DCNN variant of PredNet for context-based image prediction and coding, trained with L1, L2, or L∞ loss and using stacked regression for robust lossless image prediction.
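AFA-PredNet's action modulation can be caricatured as gating the top-down signal with an MLP embedding of the motor command. The following is a loose NumPy sketch of that idea only, not Zhong et al.'s exact wiring; all names and shapes are illustrative:

```python
import numpy as np

def action_gate(top_down, action, W1, W2):
    """Modulate a top-down feature map by an MLP embedding of a motor
    command (illustrative sketch; AFA-PredNet's actual architecture
    differs in detail -- see Zhong et al., 2018)."""
    h = np.maximum(action @ W1, 0.0)          # MLP hidden layer, ReLU
    gate = 1.0 / (1.0 + np.exp(-(h @ W2)))    # per-channel sigmoid gate
    return top_down * gate[:, None, None]     # broadcast over H, W

rng = np.random.default_rng(0)
top_down = rng.random((8, 4, 4))              # C=8 top-down feature map
action = rng.random(2)                        # e.g. (linear, angular) velocity
W1, W2 = rng.random((2, 16)), rng.random((16, 8))
out = action_gate(top_down, action, W1, W2)
assert out.shape == (8, 4, 4)
```

The design point is that the same top-down context yields different predictions under different motor commands, which is what makes the predictions action-dependent.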
4. Empirical Properties, Successes, and Limits
PredNet has demonstrated the following empirical properties:
- Video prediction: Outperforms standard CNN–LSTM encoder-decoders in both synthetic (e.g., rotating faces) and natural video sequences (e.g., KITTI) (Lotter et al., 2016, Sinapayen et al., 2019).
- Representation learning: Latent representations contain linearly decodable information about pose, velocity, object identity, and can be used for steering-angle prediction in driving videos (Lotter et al., 2016).
- Neuroscience correspondence: Features from PredNet's recurrent states show measurable similarity to human visual cortex activations in fMRI/MEG (Fonseca, 2019).
- Illusory motion reproduction: Can qualitatively reproduce certain human visual illusions, notably in the Rotating Snakes pattern, but with inconsistency across initializations and limitations relative to actual human perception (Kirubeswaran et al., 2023).
- Action-modulated predictions: Inclusion of action signals (AFA-PredNet) leads to predictions more consistent with motor commands and sharper, less ambiguous outputs (Zhong et al., 2018).
- Image prediction and compression: PredNet variants achieve lower entropy or prediction error than classical image coders (Zhang et al., 2018).
Key limitations:
- Generalization to discrete domains: PredNet fails catastrophically on artificial domains such as Conway's Game of Life, on which a stand-alone CNN trained on the same data achieves perfect predictions. This illustrates an architectural trade-off: high performance on spatiotemporally continuous natural data precludes competence on discrete, rule-based patterns (Sinapayen et al., 2019).
- Blurry predictions for multimodal futures: Training with unimodal loss functions (L1/L2) results in regression-to-the-mean when multiple futures are possible, so predictions may lack sharpness or structure (Rane et al., 2019).
- Inconsistent hierarchical error minimization: Error fields do not generally decrease up the PredNet hierarchy as full predictive-coding theory would predict. In practice, minimizing errors at every layer (the L_all objective) underperforms focusing only on pixel-level errors (L_0) (Rane et al., 2019).
- Lack of robust long-term prediction: In challenging or non-smooth video regimes, PredNet can simply "copy last frame" and fail to extrapolate meaningful dynamics (Rane et al., 2019).
- Sensitivity to model initialization: Performance on certain qualitative tasks (psychophysical illusions) varies greatly across random initializations (Kirubeswaran et al., 2023).
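The regression-to-the-mean effect behind blurry predictions is easy to demonstrate numerically: when two futures are equally likely, the L2-optimal point prediction is their average, which matches neither. A toy illustration:

```python
import numpy as np

# Two equally likely futures for a 1-D "frame": an edge appearing
# on the left or on the right
future_a = np.array([1.0, 0.0, 0.0, 0.0])
future_b = np.array([0.0, 0.0, 0.0, 1.0])

def expected_mse(pred):
    """Expected L2 loss under the two-mode future distribution."""
    return 0.5 * np.sum((pred - future_a) ** 2) + \
           0.5 * np.sum((pred - future_b) ** 2)

# The L2-optimal point prediction is the mean of the modes: a "blurry"
# half-edge on each side that matches neither possible future.
mean_pred = 0.5 * (future_a + future_b)
assert expected_mse(mean_pred) < expected_mse(future_a)
assert expected_mse(mean_pred) < expected_mse(future_b)
```

Committing to either sharp mode doubles the expected loss relative to the blurry average, which is why unimodal L1/L2 training favors unstructured predictions under uncertainty.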
5. Applications Across Domains
PredNet's predictive coding backbone supports applications in multiple settings:
- Next-frame video prediction: Standard application for both research and benchmark comparisons; accurate for both synthetic and real videos (Lotter et al., 2016, Sinapayen et al., 2019).
- Downstream representation learning: Features extracted from PredNet's ConvLSTM states are effective for steering angle prediction and potentially for other vision tasks (Lotter et al., 2016).
- Action recognition: Used as a temporally-aware, motion-sensitive feature extractor, eliminating the need for explicit optical flow computation in state-of-the-art video classification pipelines (Huang et al., 2019).
- Robotics and sensorimotor integration: AFA-PredNet enables prediction that conditions on robot actions, demonstrating that motor plans can be integrated into deep predictive coding (Zhong et al., 2018).
- Urban environment modeling: Applied to LiDAR-derived occupancy grids for scene prediction in autonomous driving contexts, outperforming static and particle-filter baselines (Itkina et al., 2019).
- Lossless image coding: Deep residual PredNet architectures, trained for various -norm prediction losses, outperform state-of-the-art linear predictors in entropy and prediction accuracy (Zhang et al., 2018).
- Neuroscientific modeling: Used to probe correspondence between learned representations and primate visual cortex, as well as to model perceptual phenomena (Fonseca, 2019, Kirubeswaran et al., 2023).
6. Theoretical Considerations and Future Directions
PredNet demonstrates the viability of next-frame prediction as an unsupervised learning framework, where representations are shaped by the task of reducing future sensory prediction error. However, its architecture is tightly coupled to continuous, locally correlated statistics encountered in natural vision. This confers advantages for hierarchical spatiotemporal abstraction and supports neuroscientific plausibility, but reduces flexibility in domains defined by discrete or rule-based transitions (Sinapayen et al., 2019).
Limitations identified include:
- Non-monotonic error suppression along the hierarchy, contrary to canonical predictive coding theory (Rane et al., 2019).
- Inability to represent multimodal future distributions, motivating probabilistic or variational predictive coding extensions.
- Vulnerability to initialization, suggesting a need for more robust regularization or architectural stabilization techniques (Kirubeswaran et al., 2023).
- Lack of scalability in top-down conditioning for complex semantic tasks (e.g., action recognition in natural videos), where added semantic information does not improve frame-level prediction quality (Rane et al., 2019).
Areas for further development include probabilistic latent-space modeling, meta-learned or hybrid architectures to accommodate both continuous and discrete domains, temporally multi-scaled hierarchies, and explicit separation of static vs. dynamic context features. The structure of PredNet suggests additional neuroscientific hypotheses concerning the computational role of layered error signaling and recurrent feedback in cortical hierarchies.
7. Comparative Table of Core Configurations
| Variant / Domain | Layers (L) | Input | Task | Loss Type | Notable Outputs |
|---|---|---|---|---|---|
| Original PredNet (Lotter et al., 2016) | 4–5 | RGB video | Next-frame prediction | L1 (sum over layers/times) | Latent R for parameter regression |
| AFA-PredNet (Zhong et al., 2018) | 2–3 | Image+action | Action-conditioned prediction | L1/L2 norm | Sharp, action-specific outputs |
| Occupancy PredNet (Itkina et al., 2019) | 4 | LiDAR grid | Scene forecasting | L1 | Future occupancy grid |
| Feature-level (Huang et al., 2019) | 2 | CNN feats | Action recognition | Cross-entropy | Replaces optical flow |
| DCNN-PredNet (Zhang et al., 2018) | – | Image patch | Lossless coding | L1/L2/L∞ | Lower entropy residuals |
All claims and data are drawn from cited arXiv manuscripts as noted above.