- The paper introduces NoProp, a novel training method that reframes neural network learning as independent denoising tasks per layer without relying on backpropagation.
- It employs fixed Gaussian variational posteriors and noise scheduling inspired by diffusion models to simplify the overall training objective.
- NoProp demonstrates competitive accuracy on image classification benchmarks while significantly reducing GPU memory usage compared to standard backpropagation.
The paper "NoProp: Training Neural Networks without Back-propagation or Forward-propagation" (2503.24322) introduces a novel method for training deep neural networks that relies on neither traditional back-propagation nor full forward-propagation passes during training. The approach is inspired by denoising score matching techniques used in diffusion models.
The core idea behind NoProp (2503.24322) is to reframe the training process as a series of independent denoising tasks, one for each layer or block of the network. Instead of layers learning hierarchical representations by processing information sequentially from input to output and receiving error signals back, each layer independently learns to denoise a noisy version of the target label embedding.
Consider a neural network with T blocks, processing input x to predict label y. In traditional backprop, the input propagates forward, a loss is computed at the output, and gradients propagate backward through all layers. NoProp (2503.24322) instead defines two processes: a forward process p that transforms latent variables z_{t−1} into z_t conditioned on x, and a backward process q that starts from the target label y (embedded at z_T) and successively adds noise backward in time (z_T → z_{T−1} → ⋯ → z_0), ending at the noisiest latent z_0. The variational posterior q((z_t)_{t=0}^T ∣ y, x) is fixed to a tractable Gaussian distribution, specifically one derived from a variance-preserving Ornstein-Uhlenbeck process. The goal is to learn the forward process p((z_t)_{t=0}^T, y ∣ x) such that it explains the data.
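Concretely, the fixed Gaussian posterior can be sampled in closed form at any time step. A minimal NumPy sketch, assuming a simple linear ᾱ (cumulative signal) schedule; the paper derives its schedule from the OU process, so the exact values here are illustrative:

```python
import numpy as np

def sample_zt(u_y, alpha_bar_t, rng):
    """Draw z_t ~ q(z_t | y) = N(sqrt(abar_t) * u_y, (1 - abar_t) * I)."""
    eps = rng.standard_normal(u_y.shape)
    return np.sqrt(alpha_bar_t) * u_y + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
u_y = np.array([1.0, 0.0, 0.0])            # e.g. a one-hot label embedding
# Hypothetical schedule: alpha_bar[0] ~ 0 (z_0 is pure noise),
# alpha_bar[T] = 1 (z_T is the clean label embedding), with T = 10.
alpha_bar = np.linspace(1e-3, 1.0, 11)
z_mid = sample_zt(u_y, alpha_bar[5], rng)  # a mid-trajectory noisy embedding
z_T = sample_zt(u_y, alpha_bar[10], rng)   # fully clean: equals u_y exactly
```

Because the posterior is fixed, such samples can be drawn for any t without ever running the network, which is what decouples the per-block training tasks below.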
The training objective is derived from the Evidence Lower Bound (ELBO) on the log-likelihood log p(y∣x). By fixing the structure of the forward process p(z_t∣z_{t−1}, x) to match the form of the fixed backward process q(z_t∣z_{t−1}, y) (up to a learned component), and parameterizing p(z_t∣z_{t−1}, x) via a neural network block u_{θ_t}(z_{t−1}, x) with parameters θ_t, the ELBO simplifies significantly.
For the discrete-time version (NoProp-DT), the objective for training each block u_{θ_t} at time step t reduces to minimizing an L2 loss that encourages u_{θ_t}(z_{t−1}, x) to predict the target label embedding u_y:

$$\mathcal{L}_t = \mathbb{E}_{q(z_{t-1}\mid y,x)}\Big[\big(\mathrm{SNR}(t)-\mathrm{SNR}(t-1)\big)\,\big\|u_{\theta_t}(z_{t-1},x)-u_y\big\|^2\Big]$$
plus terms for the final-layer loss and the KL divergence of the initial latent state z_0. The crucial point is that the expectation is taken with respect to q(z_{t−1}∣y, x), which can be sampled using only the target label y and the predefined noise schedule (q(z_{t−1}∣y) is known in closed form), together with the input x. Training at time step t therefore requires only the input x, the target label y, and a sample z_{t−1} from the noise process, not the activation from layer t−1 or a gradient signal from later layers. The final layer p̂_{θ_out}(y∣z_T) (a linear layer plus softmax) and the label embedding matrix W_Embed are trained jointly with the blocks.
For practical implementation, this means that during training you can pick a random time step t, sample the corresponding noisy target z_{t−1}, feed it along with the input x to the t-th block u_{θ_t}, compute the L2 loss against the target embedding u_y, and update only the parameters θ_t and the shared final layers/embeddings. The paper's Algorithm 1 for NoProp-DT sequentially updates parameters for t = 1, …, T within each epoch, but independent sampling of t is also possible.
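A single such update can be sketched end to end. The toy below stands in one linear map per time step for the paper's conv/FC blocks (so the gradient of the weighted L2 loss stays explicit), with made-up sizes and a made-up linear ᾱ schedule; it is a sketch of the update pattern, not the paper's architecture or exact loss constants:

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_z, T = 8, 4, 10                      # toy sizes (hypothetical)
alpha_bar = np.linspace(1e-3, 1.0, T + 1)   # assumed schedule: z_0 noise, z_T clean
W_embed = np.eye(d_z)                       # fixed one-hot label embeddings
# One linear block per time step: u_hat = W[t] @ concat(z_prev, x).
W = [0.01 * rng.standard_normal((d_z, d_z + d_x)) for _ in range(T + 1)]

def snr(s):
    return alpha_bar[s] / (1.0 - alpha_bar[s] + 1e-8)

def train_step(t, x, y, lr=0.1):
    """Update only block t's parameters; no signal crosses block boundaries."""
    u_y = W_embed[y]
    eps = rng.standard_normal(d_z)
    # Sample z_{t-1} from the fixed noise process q(z_{t-1} | y).
    z_prev = np.sqrt(alpha_bar[t - 1]) * u_y + np.sqrt(1 - alpha_bar[t - 1]) * eps
    inp = np.concatenate([z_prev, x])
    u_hat = W[t] @ inp                      # block t's label-embedding estimate
    weight = snr(t) - snr(t - 1)            # SNR-difference weighting from the loss
    W[t] -= lr * 2 * weight * np.outer(u_hat - u_y, inp)  # gradient of weighted L2
    return weight * np.sum((u_hat - u_y) ** 2)

x = rng.standard_normal(d_x)
loss = train_step(t=5, x=x, y=2)
```

Note that `train_step` never touches `W[t-1]` or `W[t+1]`: each block's regression problem is fully specified by (x, y, noise schedule), which is what makes the per-step updates independent and parallelizable.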
The paper also explores continuous-time variants (NoProp-CT and NoProp-FM) based on continuous diffusion and flow matching. These variants train a single neural network u_θ(z_t, x, t) (for NoProp-CT) or v_θ(z_t, x, t) (for NoProp-FM) that takes the time t as an additional input. Training involves sampling a continuous time t ∈ [0, 1] and minimizing a similar L2-based objective (Equation 8 for NoProp-CT, Equation 10 for NoProp-FM) that encourages the network to predict a target vector field or label embedding.
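For the flow-matching variant, a training example pairs a point on the noisy straight-line path from Gaussian noise to the label embedding with the corresponding straight-line velocity target. A small sketch of that sampling step, following the interpolant described above (the σ noise level and function name are illustrative):

```python
import numpy as np

def fm_training_pair(u_y, sigma, rng):
    """One flow-matching training example: a point z_t on the noisy straight-line
    path from z_0 ~ N(0, I) to the label embedding u_y, plus the target velocity."""
    t = rng.uniform()                        # sample a continuous time t in [0, 1)
    z_0 = rng.standard_normal(u_y.shape)
    z_t = t * u_y + (1.0 - t) * z_0 + sigma * rng.standard_normal(u_y.shape)
    v_target = u_y - z_0                     # straight-line target vector field
    return t, z_t, v_target

rng = np.random.default_rng(0)
u_y = np.array([0.0, 1.0, 0.0])
t, z_t, v = fm_training_pair(u_y, sigma=0.1, rng=rng)
```

The network v_θ(z_t, x, t) would then be regressed onto `v_target` with an L2 loss; integrating the learned velocity from t = 0 to t = 1 transports noise to a label embedding at inference time.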
Practical Implementation Details:
- Architecture: The network consists of T identical blocks (for discrete-time) or a single block conditioned on time (for continuous-time). Each block takes the input image x and a latent variable z (a noised label embedding) as input.
- Input Embedding: Separate pathways are used for processing x and z. Images x go through a convolutional embedding module, and latents z go through FC layers (or conv if z has image dimensions). These embeddings are concatenated before being processed by subsequent FC layers within the block (Figure 1 (2503.24322)).
- Output of Blocks: For discrete-time NoProp-DT, each block u_{θ_t} outputs logits, which are softmaxed and used as weights to form a convex combination of the class embeddings in W_Embed, producing an estimate of u_y. For continuous-time flow matching (NoProp-FM), the block v_θ directly outputs an unconstrained vector in the embedding space.
- Label Embeddings (WEmbed): Can be fixed (e.g., one-hot vectors) or learned. Learned embeddings can be initialized randomly, orthogonally, or using 'prototype' images (median images per class). Learned embeddings generally improve performance, especially on more complex datasets like CIFAR-100.
- Training Loop:
- NoProp-DT (Algorithm 1): Iterate through epochs. For each epoch, iterate through time steps t = 1, …, T. For each t, iterate through mini-batches. For each sample (x_i, y_i) in the batch, look up u_{y_i}, sample z_{t−1,i} from the fixed noise process q(z_{t−1}∣y_i), compute the block output û_{θ_t}(z_{t−1,i}, x_i), compute the loss (Equation 6), and update parameters θ_t, θ_out, and W_Embed. The loss also includes terms for the final layer and the initial KL divergence, computed using z_{T,i} and z_{0,i} sampled from q(⋅∣y_i).
- NoProp-CT/FM (Algorithms 2 and 3): Iterate through epochs. For each epoch, iterate through mini-batches. For each sample (x_i, y_i), sample a time t_i ∈ [0, 1]. For NoProp-CT, sample z_{t_i,i} from q(z_{t_i}∣y_i). For NoProp-FM, sample z_{0,i} from N(0, 1) and z_{t_i,i} from N(t_i u_{y_i} + (1 − t_i) z_{0,i}, σ²). Compute the block output (û_θ or v̂_θ) using z_{t_i,i}, x_i, and t_i. Compute the respective loss (Equation 8 or 10) and update parameters θ (or θ, ψ), θ_out, and W_Embed.
- Inference (Figure 2): Start with random noise z_0 ∼ N(0, 1). Pass it sequentially through the learned blocks u_1, …, u_T (or simulate the learned ODE/SDE dynamics u_θ or v_θ for T steps in continuous time) while conditioning on the input x. The final latent z_T (or z_1 for NoProp-CT) is passed through the final linear layer and softmax to produce the prediction ŷ. For flow matching, the prediction can also be taken as the class whose embedding u_y is closest to the final z_T in Euclidean distance.
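The discrete-time inference loop above can be sketched as follows. This is a simplified variant: each block's label estimate is re-noised to step t's signal level under an assumed linear ᾱ schedule, whereas the paper's exact sampler (its Equation 3) mixes the estimate with a z_{t−1}-dependent term; the stand-in "trained" blocks are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, T = 4, 10
alpha_bar = np.linspace(1e-3, 1.0, T + 1)   # assumed schedule: z_0 noise, z_T clean

# Stand-ins for trained blocks u_theta_t(z, x): each nudges z toward a fixed
# embedding so the pipeline runs end to end. Not a trained model.
u_star = np.array([1.0, 0.0, 0.0, 0.0])
blocks = [lambda z, x: 0.5 * z + 0.5 * u_star for _ in range(T)]

def noprop_dt_infer(x, blocks):
    z = rng.standard_normal(d_z)            # z_0 ~ N(0, 1): start from pure noise
    for t, block in enumerate(blocks, start=1):
        u_hat = block(z, x)                 # block t's estimate of the label embedding
        eps = rng.standard_normal(d_z)
        # Simplified update: re-noise the estimate to step t's signal level.
        z = np.sqrt(alpha_bar[t]) * u_hat + np.sqrt(1 - alpha_bar[t]) * eps
    return z                                # z_T then goes through linear + softmax

z_T = noprop_dt_infer(x=None, blocks=blocks)
```

Unlike training, inference is necessarily sequential: z_T depends on every block's output in order, which is why the memory savings of NoProp apply to training rather than to the forward pass at test time.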
Implementation Considerations and Trade-offs:
- Computational Efficiency: A key benefit is that training updates for different layers (or time steps) can be parallelized if t is sampled independently, although the paper's discrete-time algorithm updates sequentially. The main advantage over backprop is reduced memory: computing gradients for one block's parameters does not require storing intermediate activations across the entire network stack. Table 2 (2503.24322) shows significant GPU memory reduction compared to Backprop and Adjoint methods.
- Accuracy: NoProp-DT achieves comparable accuracy to standard Backprop on benchmark image classification tasks (Table 1 (2503.24322)), and significantly outperforms other backprop-free methods. Continuous-time variants are competitive with Adjoint methods and often more efficient (Figure 3 (2503.24322)), though currently less accurate than NoProp-DT or Backprop.
- Simplicity and Robustness: The paper claims NoProp is simpler and more robust than prior backprop-free methods, partly due to leveraging the well-understood objectives from diffusion modeling.
- Hyperparameters: The parameter η in the NoProp loss (Equations 6, 8) balances the denoising objective against other terms and needs tuning. The choice of noise schedule and the number of steps T are also hyperparameters.
- Representation Learning: A significant departure from standard deep learning is that NoProp's intermediate representations (zt) are fixed by the user-defined noising process (noisy label embeddings), rather than being learned hierarchically. This simplifies the learning task for each layer but raises questions about the method's ability to generalize to tasks requiring complex, hierarchical feature extraction. The results suggest fixed representations can be effective for image classification with appropriate label embeddings.
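Since the SNR-difference weights in the discrete-time loss are determined entirely by the noise schedule, the schedule choice directly reweights the per-step objectives. A quick way to inspect a candidate schedule; a cosine schedule is assumed here for illustration, not necessarily the paper's OU-derived choice:

```python
import numpy as np

def cosine_alpha_bar(T, eps=1e-3):
    """A candidate cosine schedule, indexed so t = 0 is pure noise and t = T is
    (nearly) clean signal, matching NoProp's time direction. Illustrative only."""
    t = np.arange(T + 1) / T
    return np.sin(0.5 * np.pi * t) ** 2 * (1 - 2 * eps) + eps  # clipped off 0 and 1

abar = cosine_alpha_bar(T=10)
snr = abar / (1.0 - abar)        # signal-to-noise ratio at each step
weights = np.diff(snr)           # SNR(t) - SNR(t-1): per-step loss weights
```

Plotting `weights` for a few candidate schedules shows how strongly each time step's denoising task is emphasized, which interacts with the tuning of η and T mentioned above.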
In summary, NoProp offers a practical, memory-efficient alternative to back-propagation by framing neural network training as independent denoising tasks per layer/timestep. It achieves competitive performance on image classification benchmarks and highlights a potentially valuable paradigm shift towards designing representations rather than solely learning them. Implementing NoProp involves setting up the noise process (defining how q(z_t∣y) works), designing the layer blocks (u_θ or v_θ), and training each block (or the time-conditioned block) independently using samples from the noise process and the target label embedding, along with jointly training the final classifier and label embeddings.