
NoProp: Training Neural Networks without Back-propagation or Forward-propagation

Published 31 Mar 2025 in cs.LG and stat.ML | (2503.24322v1)

Abstract: The canonical deep learning approach for learning requires computing a gradient term at each layer by back-propagating the error signal from the output towards each learnable parameter. Given the stacked structure of neural networks, where each layer builds on the representation of the layer below, this approach leads to hierarchical representations. More abstract features live on the top layers of the model, while features on lower layers are expected to be less abstract. In contrast to this, we introduce a new learning method named NoProp, which does not rely on either forward or backwards propagation. Instead, NoProp takes inspiration from diffusion and flow matching methods, where each layer independently learns to denoise a noisy target. We believe this work takes a first step towards introducing a new family of gradient-free learning methods, that does not learn hierarchical representations -- at least not in the usual sense. NoProp needs to fix the representation at each layer beforehand to a noised version of the target, learning a local denoising process that can then be exploited at inference. We demonstrate the effectiveness of our method on MNIST, CIFAR-10, and CIFAR-100 image classification benchmarks. Our results show that NoProp is a viable learning algorithm which achieves superior accuracy, is easier to use and computationally more efficient compared to other existing back-propagation-free methods. By departing from the traditional gradient based learning paradigm, NoProp alters how credit assignment is done within the network, enabling more efficient distributed learning as well as potentially impacting other characteristics of the learning process.

Summary

  • The paper introduces NoProp, a novel training method that reframes neural network learning as independent denoising tasks per layer without relying on backpropagation.
  • It employs fixed Gaussian variational posteriors and noise scheduling inspired by diffusion models to simplify the overall training objective.
  • NoProp demonstrates competitive accuracy on image classification benchmarks while significantly reducing GPU memory usage compared to standard backpropagation.

The paper "NoProp: Training Neural Networks without Back-propagation or Forward-propagation" (2503.24322) introduces a method for training deep neural networks that relies on neither traditional back-propagation nor full forward-propagation passes during training. The approach is inspired by the denoising score matching techniques used in diffusion models.

The core idea behind NoProp (2503.24322) is to reframe the training process as a series of independent denoising tasks, one for each layer or block of the network. Instead of layers learning hierarchical representations by processing information sequentially from input to output and receiving error signals back, each layer independently learns to denoise a noisy version of the target label embedding.

Consider a neural network with $T$ blocks, processing input $x$ and aiming to predict label $y$. In traditional backprop, the input propagates forward, a loss is computed at the output, and gradients propagate backward through all layers. NoProp (2503.24322) instead defines two processes: a forward process $p$ that transforms latent variables $z_{t-1}$ to $z_t$ conditioned on $x$, and a backward process $q$ that noises the target label $y$ into a noisy latent variable $z_T$ and then successively adds noise backward in time ($z_T \to z_{T-1} \to \dots \to z_0$). The variational posterior $q((z_t)_{t=0}^T \mid y, x)$ is fixed to a tractable Gaussian distribution, specifically one derived from a variance-preserving Ornstein-Uhlenbeck process. The goal is to learn the forward process $p((z_t)_{t=0}^T, y \mid x)$ such that it explains the data.
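Because $q(z_t \mid y)$ is a fixed Gaussian determined by the noise schedule, noisy training latents can be drawn without touching any network block. A minimal NumPy sketch, assuming a hypothetical `alpha_bar` signal-level schedule (the paper derives its schedule from an Ornstein-Uhlenbeck process; the exact schedule here is only illustrative):

```python
import numpy as np

def alpha_bar(t, T):
    """Cumulative signal level of an illustrative schedule:
    alpha_bar(0) = 0 (pure noise at z_0), alpha_bar(T) = 1 (clean label at z_T)."""
    return np.sin(0.5 * np.pi * t / T) ** 2

def sample_noisy_target(u_y, t, T, rng):
    """Draw z_t ~ q(z_t | y) = N(sqrt(alpha_bar(t)) * u_y, (1 - alpha_bar(t)) * I).

    q is a fixed Gaussian centred on the label embedding u_y, so this
    sample requires no forward pass through any network block.
    """
    a = alpha_bar(t, T)
    noise = rng.standard_normal(u_y.shape)
    return np.sqrt(a) * u_y + np.sqrt(1.0 - a) * noise
```

At $t = T$ the sample is the clean embedding $u_y$ itself; at $t = 0$ it is pure Gaussian noise, matching the direction of the noising process above.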

The training objective is derived from the evidence lower bound (ELBO) on the log-likelihood $\log p(y \mid x)$. By fixing the structure of the forward process $p(z_t \mid z_{t-1}, x)$ to match the form of the fixed backward process $q(z_t \mid z_{t-1}, y)$ (up to a learned component), and parameterizing $p(z_t \mid z_{t-1}, x)$ via a neural network block $u_{\theta_t}(z_{t-1}, x)$ with parameters $\theta_t$, the ELBO simplifies significantly.

For the discrete-time version (NoProp-DT), the objective for training each block $u_{\theta_t}$ at time step $t$ reduces to an L2 loss that encourages $u_{\theta_t}(z_{t-1}, x)$ to predict the target label embedding $u_y$:

$$\mathcal{L}_t = \mathbb{E}_{q(z_{t-1}\mid y,x)} \left[ \left(\mathrm{SNR}(t) - \mathrm{SNR}(t-1)\right) \left\| u_{\theta_t}(z_{t-1},x) - u_y \right\|^2 \right]$$

plus terms for the final-layer loss and the KL divergence of the initial latent state $z_0$. The crucial point is that the expectation is taken with respect to $q(z_{t-1} \mid y, x)$, which can be sampled using only the target label $y$ and the predefined noise schedule (since $q(z_{t-1} \mid y)$ is known in closed form), together with the input $x$. Training at time step $t$ therefore requires only the input $x$, the target label $y$, and a sample $z_{t-1}$ from the noise process; it needs neither the activation from layer $t-1$ nor a gradient from layer $t$. The final layer $\hat{p}_{\theta_{\mathrm{out}}}(y \mid z_T)$ (a linear layer plus softmax) and the label embedding matrix $W_{\mathrm{Embed}}$ are trained jointly with the blocks.

For practical implementation, this means that during training you can pick a random time step $t$, sample the corresponding noisy target $z_{t-1}$, feed it along with the input $x$ to the $t$-th block $u_{\theta_t}$, compute the L2 loss against the target embedding $u_y$, and update only the parameters $\theta_t$ and the shared final layer/embeddings. The paper's Algorithm 1 for NoProp-DT sequentially updates parameters for $t = 1, \ldots, T$ within each epoch, but independent sampling of $t$ is also possible.
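A sketch of one such local update, using a plain linear map as a stand-in for the block $u_{\theta_t}$ (the real blocks are convolutional/FC modules; the names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_z, d_x = 5, 8, 8                     # hypothetical block count and feature sizes

# One weight matrix per time step stands in for the parameters theta_t.
W = [0.1 * rng.standard_normal((d_z, d_z + d_x)) for _ in range(T)]

def local_update(t, x, u_y, z_prev, lr=0.01):
    """One NoProp-DT step for block t: it sees only (z_{t-1}, x) and the clean
    target embedding u_y, never another block's activations or gradients."""
    h = np.concatenate([z_prev, x])
    pred = W[t] @ h                                # stand-in for u_{theta_t}(z_{t-1}, x)
    W[t] -= lr * 2.0 * np.outer(pred - u_y, h)     # gradient of ||pred - u_y||^2
    return float(np.sum((pred - u_y) ** 2))

x, u_y = rng.standard_normal(d_x), rng.standard_normal(d_z)
z_prev = rng.standard_normal(d_z)                  # in practice drawn from q(z_{t-1} | y)
losses = [local_update(2, x, u_y, z_prev) for _ in range(100)]
```

Only `W[2]` changes during this loop: updating one block never requires touching, or even evaluating, any other block.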

The paper also explores continuous-time variants (NoProp-CT and NoProp-FM) based on continuous diffusion and flow matching. These variants train a single neural network $u_\theta(z_t, x, t)$ (for NoProp-CT) or $v_\theta(z_t, x, t)$ (for NoProp-FM) that takes the time $t$ as an additional input. Training involves sampling a continuous time $t \in [0, 1]$ and minimizing a similar L2-based objective (Equation 8 for NoProp-CT, Equation 10 for NoProp-FM) that encourages the network to predict a target vector field or label embedding.
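For the flow-matching variant, a training pair can be formed from the linear interpolant between noise and the label embedding. A sketch assuming the network is regressed onto the standard straight-line vector field $u_y - z_0$, with $\sigma = 0$ for clarity:

```python
import numpy as np

def fm_training_pair(u_y, rng, sigma=0.0):
    """Sample (t, z_t, target) for one flow-matching regression step.

    z_t interpolates between noise z_0 and the label embedding u_y;
    v_theta(z_t, x, t) would be trained to predict the velocity u_y - z_0.
    """
    t = rng.uniform()
    z0 = rng.standard_normal(u_y.shape)
    zt = t * u_y + (1.0 - t) * z0 + sigma * rng.standard_normal(u_y.shape)
    target = u_y - z0                  # straight-line vector field toward u_y
    return t, zt, target
```

With $\sigma = 0$, following the target field from $z_t$ for the remaining time $1 - t$ lands exactly on $u_y$, which is the property the learned ODE exploits at inference.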

Practical Implementation Details:

  1. Architecture: The network consists of $T$ blocks with identical architecture (for discrete time) or a single block conditioned on time (for continuous time). Each block takes the input image $x$ and a latent variable $z$ (a noised label embedding) as input.
  2. Input Embedding: Separate pathways process $x$ and $z$. Images $x$ pass through a convolutional embedding module, while latents $z$ pass through fully connected layers (or convolutional layers if $z$ has image dimensions). The two embeddings are concatenated before being processed by subsequent fully connected layers within the block (Figure 1, 2503.24322).
  3. Output of Blocks: For discrete-time NoProp-DT, each block $u_{\theta_t}$ outputs logits that are softmaxed and used as weights to form a convex combination of the class embeddings from $W_{\mathrm{Embed}}$, producing an estimate of $u_y$. For continuous-time flow matching (NoProp-FM), the block $v_\theta$ directly outputs an unconstrained vector in the embedding space.
  4. Label Embeddings ($W_{\mathrm{Embed}}$): These can be fixed (e.g., one-hot vectors) or learned. Learned embeddings can be initialized randomly, orthogonally, or from 'prototype' images (the median image per class). Learned embeddings generally improve performance, especially on more complex datasets such as CIFAR-100.
  5. Training Loop:
    • NoProp-DT (Algorithm 1): For each epoch, iterate through time steps $t = 1, \ldots, T$; for each $t$, iterate through mini-batches. For each sample $(x_i, y_i)$ in the batch, look up $u_{y_i}$, sample $z_{t-1,i}$ from the fixed noise process $q(z_{t-1} \mid y_i)$, compute the block output $\hat{u}_{\theta_t}(z_{t-1,i}, x_i)$, evaluate the loss (Equation 6), and update $\theta_t$, $\theta_{\mathrm{out}}$, and $W_{\mathrm{Embed}}$. The loss also includes terms for the final layer and the initial KL divergence, computed using $z_{T,i}$ and $z_{0,i}$ sampled from $q(\cdot \mid y_i)$.
    • NoProp-CT/FM (Algorithms 2 and 3): For each epoch, iterate through mini-batches. For each sample $(x_i, y_i)$, draw a time $t_i \in [0, 1]$. For NoProp-CT, sample $z_{t_i,i}$ from $q(z_{t_i} \mid y_i)$. For NoProp-FM, sample $z_{0,i} \sim N(0, 1)$ and $z_{t_i,i} \sim N(t_i u_{y_i} + (1 - t_i) z_{0,i}, \sigma^2)$. Compute the block output ($\hat{u}_\theta$ or $\hat{v}_\theta$) from $z_{t_i,i}$, $x_i$, and $t_i$; evaluate the corresponding loss (Equation 8 or 10); and update $\theta$ (or $\theta, \psi$), $\theta_{\mathrm{out}}$, and $W_{\mathrm{Embed}}$.
  6. Inference (Figure 2): Start from random noise $z_0 \sim N(0, 1)$. Pass it sequentially through the learned blocks $u_1, \dots, u_T$ (or simulate the learned ODE/SDE dynamics $u_\theta$ or $v_\theta$ for $T$ steps in continuous time) while conditioning on the input $x$. The final latent $z_T$ (or $z_1$ for NoProp-CT) is passed through the final linear layer and softmax to produce the prediction $\hat{y}$. For flow matching, the prediction can alternatively be the class whose embedding $u_y$ is closest in Euclidean distance to the final $z_T$.
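The inference loop can be sketched as follows (a deterministic version that ignores the forward-process noise; `blocks` and the nearest-embedding readout are illustrative stand-ins):

```python
import numpy as np

def noprop_inference(x, blocks, W_embed, rng):
    """Sequential denoising inference: start from noise, apply the T trained
    blocks in order, then read out the nearest class embedding."""
    z = rng.standard_normal(W_embed.shape[1])    # z_0 ~ N(0, I)
    for block in blocks:                         # z_{t-1} -> z_t, conditioned on x
        z = block(z, x)
    dists = np.linalg.norm(W_embed - z, axis=1)  # nearest-embedding prediction
    return int(np.argmin(dists))
```

With trained blocks, each application moves $z$ toward the embedding of the correct class; the linear-plus-softmax head described above can replace the nearest-embedding readout.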

Implementation Considerations and Trade-offs:

  • Computational Efficiency: A key benefit is that training updates for different layers (or time steps) can be parallelized if sampled independently, although the paper's discrete-time algorithm updates them sequentially. The main win over backprop is reduced memory: intermediate activations need not be stored across the entire network stack to compute gradients for a given layer's parameters. Table 2 (2503.24322) shows a significant GPU-memory reduction compared to backprop and adjoint methods.
  • Accuracy: NoProp-DT achieves accuracy comparable to standard backprop on benchmark image-classification tasks (Table 1, 2503.24322) and significantly outperforms other backprop-free methods. The continuous-time variants are competitive with adjoint methods and often more efficient (Figure 3, 2503.24322), though currently less accurate than NoProp-DT or backprop.
  • Simplicity and Robustness: The paper claims NoProp is simpler and more robust than prior backprop-free methods, partly due to leveraging the well-understood objectives from diffusion modeling.
  • Hyperparameters: The parameter $\eta$ in the NoProp loss (Equations 6 and 8) balances the denoising objective against the other terms and needs tuning. The noise schedule and the number of steps $T$ are additional hyperparameters.
  • Representation Learning: A significant departure from standard deep learning is that NoProp's intermediate representations $z_t$ are fixed by the user-defined noising process (noisy label embeddings) rather than learned hierarchically. This simplifies the learning task for each layer, but raises questions about the method's ability to generalize to tasks requiring complex, hierarchical feature extraction. The results suggest that fixed representations can be effective for image classification given appropriate label embeddings.

In summary, NoProp offers a practical, memory-efficient alternative to back-propagation by framing neural-network training as independent denoising tasks per layer or time step. It achieves competitive performance on image-classification benchmarks and points to a potentially valuable paradigm shift toward designing representations rather than solely learning them. Implementing NoProp involves setting up the noise process (defining $q(z_t \mid y)$), designing the layer blocks ($u_\theta$ or $v_\theta$), and training each block (or the time-conditioned block) independently using samples from the noise process and the target label embedding, while jointly training the final classifier and label embeddings.
