
NoProp: Training Neural Networks without Back-propagation or Forward-propagation

Published 31 Mar 2025 in cs.LG and stat.ML | (2503.24322v1)

Abstract: The canonical deep learning approach for learning requires computing a gradient term at each layer by back-propagating the error signal from the output towards each learnable parameter. Given the stacked structure of neural networks, where each layer builds on the representation of the layer below, this approach leads to hierarchical representations. More abstract features live on the top layers of the model, while features on lower layers are expected to be less abstract. In contrast to this, we introduce a new learning method named NoProp, which does not rely on either forward or backwards propagation. Instead, NoProp takes inspiration from diffusion and flow matching methods, where each layer independently learns to denoise a noisy target. We believe this work takes a first step towards introducing a new family of gradient-free learning methods, that does not learn hierarchical representations -- at least not in the usual sense. NoProp needs to fix the representation at each layer beforehand to a noised version of the target, learning a local denoising process that can then be exploited at inference. We demonstrate the effectiveness of our method on MNIST, CIFAR-10, and CIFAR-100 image classification benchmarks. Our results show that NoProp is a viable learning algorithm which achieves superior accuracy, is easier to use and computationally more efficient compared to other existing back-propagation-free methods. By departing from the traditional gradient based learning paradigm, NoProp alters how credit assignment is done within the network, enabling more efficient distributed learning as well as potentially impacting other characteristics of the learning process.

Summary

  • The paper introduces NoProp, a novel training method that reframes neural network learning as independent denoising tasks per layer without relying on backpropagation.
  • It employs fixed Gaussian variational posteriors and noise scheduling inspired by diffusion models to simplify the overall training objective.
  • NoProp demonstrates competitive accuracy on image classification benchmarks while significantly reducing GPU memory usage compared to standard backpropagation.

The paper "NoProp: Training Neural Networks without Back-propagation or Forward-propagation" (2503.24322) introduces a method for training deep neural networks that relies on neither traditional back-propagation nor full forward-propagation passes during training. The approach is inspired by the denoising score matching techniques used in diffusion models.

The core idea behind NoProp (2503.24322) is to reframe the training process as a series of independent denoising tasks, one for each layer or block of the network. Instead of layers learning hierarchical representations by processing information sequentially from input to output and receiving error signals back, each layer independently learns to denoise a noisy version of the target label embedding.

Consider a neural network with $T$ blocks, processing input $x$ and aiming to predict label $y$. In traditional backprop, the input propagates forward, a loss is computed at the output, and gradients propagate backward through all layers. NoProp (2503.24322) instead defines two processes: a forward process $p$ that transforms latent variables $z_{t-1}$ to $z_t$ conditioned on $x$, and a backward process $q$ that noises the target label $y$ into a noisy latent variable $z_T$ and then successively adds noise backward in time ($z_T \to z_{T-1} \to \dots \to z_0$). The variational posterior $q((z_t)_{t=0}^T \mid y, x)$ is fixed to a tractable Gaussian distribution, specifically one derived from a variance-preserving Ornstein-Uhlenbeck process. The goal is to learn the forward process $p((z_t)_{t=0}^T, y \mid x)$ such that it explains the data.
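Because $q(z_t \mid y)$ is a fixed Gaussian determined by the noise schedule, noisy training latents can be drawn without touching any network block. A minimal NumPy sketch, assuming a hypothetical `alpha_bar` signal-level schedule (the paper derives its schedule from an Ornstein-Uhlenbeck process; the exact schedule here is only illustrative):

```python
import numpy as np

def alpha_bar(t, T):
    """Cumulative signal level of an illustrative schedule:
    alpha_bar(0) = 0 (pure noise at z_0), alpha_bar(T) = 1 (clean label at z_T)."""
    return np.sin(0.5 * np.pi * t / T) ** 2

def sample_noisy_target(u_y, t, T, rng):
    """Draw z_t ~ q(z_t | y) = N(sqrt(alpha_bar(t)) * u_y, (1 - alpha_bar(t)) * I).

    q is a fixed Gaussian centred on the label embedding u_y, so this
    sample requires no forward pass through any network block.
    """
    a = alpha_bar(t, T)
    noise = rng.standard_normal(u_y.shape)
    return np.sqrt(a) * u_y + np.sqrt(1.0 - a) * noise
```

At $t = T$ the sample is the clean embedding $u_y$ itself; at $t = 0$ it is pure Gaussian noise, matching the direction of the noising process above.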

The training objective is derived from the evidence lower bound (ELBO) on the log-likelihood $\log p(y \mid x)$. By fixing the structure of the forward process $p(z_t \mid z_{t-1}, x)$ to match the form of the fixed backward process $q(z_t \mid z_{t-1}, y)$ (up to a learned component), and parameterizing $p(z_t \mid z_{t-1}, x)$ via a neural network block $u_{\theta_t}(z_{t-1}, x)$ with parameters $\theta_t$, the ELBO simplifies significantly.

For the discrete-time version (NoProp-DT), the objective for training each block $u_{\theta_t}$ at time step $t$ reduces to an L2 loss that encourages $u_{\theta_t}(z_{t-1}, x)$ to predict the target label embedding $u_y$:

$$\mathcal{L}_t = \mathbb{E}_{q(z_{t-1}\mid y,x)} \left[ \left(\mathrm{SNR}(t) - \mathrm{SNR}(t-1)\right) \left\| u_{\theta_t}(z_{t-1},x) - u_y \right\|^2 \right]$$

plus terms for the final-layer loss and the KL divergence of the initial latent state $z_0$. The crucial point is that the expectation is taken with respect to $q(z_{t-1} \mid y, x)$, which can be sampled using only the target label $y$ and the predefined noise schedule (since $q(z_{t-1} \mid y)$ is known in closed form), together with the input $x$. Training at time step $t$ therefore requires only the input $x$, the target label $y$, and a sample $z_{t-1}$ from the noise process; it needs neither the activation from layer $t-1$ nor a gradient from layer $t$. The final layer $\hat{p}_{\theta_{\mathrm{out}}}(y \mid z_T)$ (a linear layer plus softmax) and the label embedding matrix $W_{\mathrm{Embed}}$ are trained jointly with the blocks.

For practical implementation, this means that during training you can pick a random time step $t$, sample the corresponding noisy target $z_{t-1}$, feed it along with the input $x$ to the $t$-th block $u_{\theta_t}$, compute the L2 loss against the target embedding $u_y$, and update only the parameters $\theta_t$ and the shared final layer/embeddings. The paper's Algorithm 1 for NoProp-DT sequentially updates parameters for $t = 1, \ldots, T$ within each epoch, but independent sampling of $t$ is also possible.
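A sketch of one such local update, using a plain linear map as a stand-in for the block $u_{\theta_t}$ (the real blocks are convolutional/FC modules; the names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_z, d_x = 5, 8, 8                     # hypothetical block count and feature sizes

# One weight matrix per time step stands in for the parameters theta_t.
W = [0.1 * rng.standard_normal((d_z, d_z + d_x)) for _ in range(T)]

def local_update(t, x, u_y, z_prev, lr=0.01):
    """One NoProp-DT step for block t: it sees only (z_{t-1}, x) and the clean
    target embedding u_y, never another block's activations or gradients."""
    h = np.concatenate([z_prev, x])
    pred = W[t] @ h                                # stand-in for u_{theta_t}(z_{t-1}, x)
    W[t] -= lr * 2.0 * np.outer(pred - u_y, h)     # gradient of ||pred - u_y||^2
    return float(np.sum((pred - u_y) ** 2))

x, u_y = rng.standard_normal(d_x), rng.standard_normal(d_z)
z_prev = rng.standard_normal(d_z)                  # in practice drawn from q(z_{t-1} | y)
losses = [local_update(2, x, u_y, z_prev) for _ in range(100)]
```

Only `W[2]` changes during this loop: updating one block never requires touching, or even evaluating, any other block.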

The paper also explores continuous-time variants (NoProp-CT and NoProp-FM) based on continuous diffusion and flow matching. These variants train a single neural network $u_\theta(z_t, x, t)$ (for NoProp-CT) or $v_\theta(z_t, x, t)$ (for NoProp-FM) that takes the time $t$ as an additional input. Training involves sampling a continuous time $t \in [0, 1]$ and minimizing a similar L2-based objective (Equation 8 for NoProp-CT, Equation 10 for NoProp-FM) that encourages the network to predict a target vector field or label embedding.
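For the flow-matching variant, a training pair can be formed from the linear interpolant between noise and the label embedding. A sketch assuming the network is regressed onto the standard straight-line vector field $u_y - z_0$, with $\sigma = 0$ for clarity:

```python
import numpy as np

def fm_training_pair(u_y, rng, sigma=0.0):
    """Sample (t, z_t, target) for one flow-matching regression step.

    z_t interpolates between noise z_0 and the label embedding u_y;
    v_theta(z_t, x, t) would be trained to predict the velocity u_y - z_0.
    """
    t = rng.uniform()
    z0 = rng.standard_normal(u_y.shape)
    zt = t * u_y + (1.0 - t) * z0 + sigma * rng.standard_normal(u_y.shape)
    target = u_y - z0                  # straight-line vector field toward u_y
    return t, zt, target
```

With $\sigma = 0$, following the target field from $z_t$ for the remaining time $1 - t$ lands exactly on $u_y$, which is the property the learned ODE exploits at inference.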

Practical Implementation Details:

  1. Architecture: The network consists of $T$ blocks with identical architecture (for discrete time) or a single block conditioned on time (for continuous time). Each block takes the input image $x$ and a latent variable $z$ (a noised label embedding) as input.
  2. Input Embedding: Separate pathways process $x$ and $z$. Images $x$ pass through a convolutional embedding module, while latents $z$ pass through fully connected layers (or convolutional layers if $z$ has image dimensions). The two embeddings are concatenated before being processed by subsequent fully connected layers within the block (Figure 1, 2503.24322).
  3. Output of Blocks: For discrete-time NoProp-DT, each block $u_{\theta_t}$ outputs logits that are softmaxed and used as weights to form a convex combination of the class embeddings from $W_{\mathrm{Embed}}$, producing an estimate of $u_y$. For continuous-time flow matching (NoProp-FM), the block $v_\theta$ directly outputs an unconstrained vector in the embedding space.
  4. Label Embeddings ($W_{\mathrm{Embed}}$): These can be fixed (e.g., one-hot vectors) or learned. Learned embeddings can be initialized randomly, orthogonally, or from 'prototype' images (the median image per class). Learned embeddings generally improve performance, especially on more complex datasets such as CIFAR-100.
  5. Training Loop:
    • NoProp-DT (Algorithm 1): For each epoch, iterate through time steps $t = 1, \ldots, T$; for each $t$, iterate through mini-batches. For each sample $(x_i, y_i)$ in the batch, look up $u_{y_i}$, sample $z_{t-1,i}$ from the fixed noise process $q(z_{t-1} \mid y_i)$, compute the block output $\hat{u}_{\theta_t}(z_{t-1,i}, x_i)$, evaluate the loss (Equation 6), and update $\theta_t$, $\theta_{\mathrm{out}}$, and $W_{\mathrm{Embed}}$. The loss also includes terms for the final layer and the initial KL divergence, computed using $z_{T,i}$ and $z_{0,i}$ sampled from $q(\cdot \mid y_i)$.
    • NoProp-CT/FM (Algorithms 2 and 3): For each epoch, iterate through mini-batches. For each sample $(x_i, y_i)$, draw a time $t_i \in [0, 1]$. For NoProp-CT, sample $z_{t_i,i}$ from $q(z_{t_i} \mid y_i)$. For NoProp-FM, sample $z_{0,i} \sim N(0, 1)$ and $z_{t_i,i} \sim N(t_i u_{y_i} + (1 - t_i) z_{0,i}, \sigma^2)$. Compute the block output ($\hat{u}_\theta$ or $\hat{v}_\theta$) from $z_{t_i,i}$, $x_i$, and $t_i$; evaluate the corresponding loss (Equation 8 or 10); and update $\theta$ (or $\theta, \psi$), $\theta_{\mathrm{out}}$, and $W_{\mathrm{Embed}}$.
  6. Inference (Figure 2): Start from random noise $z_0 \sim N(0, 1)$. Pass it sequentially through the learned blocks $u_1, \dots, u_T$ (or simulate the learned ODE/SDE dynamics $u_\theta$ or $v_\theta$ for $T$ steps in continuous time) while conditioning on the input $x$. The final latent $z_T$ (or $z_1$ for NoProp-CT) is passed through the final linear layer and softmax to produce the prediction $\hat{y}$. For flow matching, the prediction can alternatively be the class whose embedding $u_y$ is closest in Euclidean distance to the final $z_T$.
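The inference loop can be sketched as follows (a deterministic version that ignores the forward-process noise; `blocks` and the nearest-embedding readout are illustrative stand-ins):

```python
import numpy as np

def noprop_inference(x, blocks, W_embed, rng):
    """Sequential denoising inference: start from noise, apply the T trained
    blocks in order, then read out the nearest class embedding."""
    z = rng.standard_normal(W_embed.shape[1])    # z_0 ~ N(0, I)
    for block in blocks:                         # z_{t-1} -> z_t, conditioned on x
        z = block(z, x)
    dists = np.linalg.norm(W_embed - z, axis=1)  # nearest-embedding prediction
    return int(np.argmin(dists))
```

With trained blocks, each application moves $z$ toward the embedding of the correct class; the linear-plus-softmax head described above can replace the nearest-embedding readout.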

Implementation Considerations and Trade-offs:

  • Computational Efficiency: A key benefit is that training updates for different layers (or time steps) can be parallelized if sampled independently, although the paper's discrete-time algorithm updates them sequentially. The main win over backprop is reduced memory: intermediate activations need not be stored across the entire network stack to compute gradients for a given layer's parameters. Table 2 (2503.24322) shows a significant GPU-memory reduction compared to backprop and adjoint methods.
  • Accuracy: NoProp-DT achieves accuracy comparable to standard backprop on benchmark image-classification tasks (Table 1, 2503.24322) and significantly outperforms other backprop-free methods. The continuous-time variants are competitive with adjoint methods and often more efficient (Figure 3, 2503.24322), though currently less accurate than NoProp-DT or backprop.
  • Simplicity and Robustness: The paper claims NoProp is simpler and more robust than prior backprop-free methods, partly due to leveraging the well-understood objectives from diffusion modeling.
  • Hyperparameters: The parameter $\eta$ in the NoProp loss (Equations 6 and 8) balances the denoising objective against the other terms and needs tuning. The noise schedule and the number of steps $T$ are additional hyperparameters.
  • Representation Learning: A significant departure from standard deep learning is that NoProp's intermediate representations $z_t$ are fixed by the user-defined noising process (noisy label embeddings) rather than learned hierarchically. This simplifies the learning task for each layer, but raises questions about the method's ability to generalize to tasks requiring complex, hierarchical feature extraction. The results suggest that fixed representations can be effective for image classification given appropriate label embeddings.

In summary, NoProp offers a practical, memory-efficient alternative to back-propagation by framing neural-network training as independent denoising tasks per layer or time step. It achieves competitive performance on image-classification benchmarks and points to a potentially valuable paradigm shift toward designing representations rather than solely learning them. Implementing NoProp involves setting up the noise process (defining $q(z_t \mid y)$), designing the layer blocks ($u_\theta$ or $v_\theta$), and training each block (or the time-conditioned block) independently using samples from the noise process and the target label embedding, while jointly training the final classifier and label embeddings.
