
Backpropagation-Free Transformations (BFT)

Updated 18 January 2026
  • BFT is a class of training methods that replaces global error backpropagation with local and forward update rules inspired by biological learning.
  • It employs techniques such as forward-mode differentiation, layerwise objectives, and closed-form solvers to lower computational overhead and memory usage.
  • BFT has shown competitive performance in few-shot learning and embedded applications while enhancing biological plausibility in network training.

Backpropagation-Free Transformations (BFT) are a class of learning methods in neural and neuromorphic computation that eschew the traditional global error backpropagation mechanism. Instead, BFTs rely on local, forward, algorithmic, or biologically inspired rules for weight updates. These methods aim to enable efficient, biologically plausible, or hardware-friendly training of multi-layer and deep networks across both supervised and unsupervised regimes. BFT encompasses local-error schemes (e.g., layerwise or blockwise classification objectives), feedback-free and zeroth-order optimization, Hebbian and associative plasticity, closed-form local solvers, and self-organizing or competitive learning, among others. Theoretical and empirical advances in BFT demonstrate competitive performance compared to conventional backpropagation (BP) in several contexts, including resource-constrained hardware and few-shot learning.

1. Theoretical Foundations of Backpropagation-Free Transformations

BFT methods substitute the global error signal and reverse-mode gradient computation with local or forward mechanisms for parameter adaptation. The fundamental goal is to relax or eliminate the backward locking, weight symmetry, and global synchronization requirements of BP. The principal BFT paradigms include:

  • Forward-mode automatic differentiation & directional derivatives: The forward gradient method computes an unbiased stochastic estimate of the loss gradient using a single forward-mode pass with a random direction vector $v$:

$$\hat\nabla f(x) = (\nabla f(x)\cdot v)\, v, \qquad \mathbb{E}_v[\hat\nabla f(x)] = \nabla f(x)$$

This approach uses no backward sweep, only carrying tangents and computing the estimator in one forward pass (Baydin et al., 2022).
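As a quick sanity check (a toy sketch, not taken from the cited paper), the estimator's unbiasedness can be verified numerically on a quadratic whose gradient is known in closed form; the analytic directional derivative here stands in for a true forward-mode AD pass:

```python
import random

def true_grad(x):
    """Analytic gradient of f(x) = sum(x_i^2); stands in for a forward-mode JVP."""
    return [2.0 * xi for xi in x]

def forward_gradient(x):
    """Single-sample estimator (grad f(x) . v) v with a Rademacher direction v."""
    v = [random.choice((-1.0, 1.0)) for _ in x]
    jvp = sum(g * vi for g, vi in zip(true_grad(x), v))   # directional derivative
    return [jvp * vi for vi in v]

random.seed(0)
x = [1.0, -2.0, 3.0]
n = 20000
avg = [0.0] * len(x)
for _ in range(n):
    avg = [a + e / n for a, e in zip(avg, forward_gradient(x))]
# avg is now close to true_grad(x) = [2.0, -6.0, 6.0]
```

Averaging many single-sample estimates recovers the true gradient, illustrating $\mathbb{E}_v[\hat\nabla f(x)] = \nabla f(x)$; any individual sample, by contrast, is high-variance.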

  • Layerwise/local objectives with independent updates: Layer or block-wise BFTs equip each module with its own auxiliary classifier and loss function, updating parameters with respect to local predictions. These local losses provide parallelism and eliminate backward locking (Gong et al., 16 Jan 2025, Cheng et al., 2023).
  • Feedback-free closed-form solvers: Forward Projection, as a strict BFT, fits layer weights to local targets generated by random projections of each layer’s activation and the label:

$$W_i^* = \arg\min_W \|W X_i - T_i\|_F^2 + \lambda \|W\|_F^2, \qquad T_i = g(Q_i^T X_i) + g(U_i^T Y)$$

This solves for $W_i$ in closed form, requiring only a single forward pass through the dataset (O'Shea et al., 27 Jan 2025).
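A minimal sketch of the closed-form solve, assuming the local targets $T_i$ are already given (the random-projection construction of $T_i$ is omitted here) and using plain Gaussian elimination in place of a library solver:

```python
import random

def matmul(A, B):
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def solve(A, B):
    """Solve A Z = B via Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [A[i][:] + B[i][:] for i in range(n)]          # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]

def forward_projection_layer(X, T, lam):
    """Closed-form ridge fit: W* = T X^T (X X^T + lam I)^{-1}."""
    A = matmul(X, transpose(X))
    for i in range(len(A)):
        A[i][i] += lam
    B = matmul(X, transpose(T))                        # X T^T
    return transpose(solve(A, B))                      # A is symmetric: A W^T = X T^T

random.seed(0)
d, m, n = 3, 2, 40                                     # layer width, target dim, samples
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(d)]
T = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
W = forward_projection_layer(X, T, lam=0.1)

# Verify the normal equations W (X X^T + lam I) = T X^T hold.
A = matmul(X, transpose(X))
for i in range(d):
    A[i][i] += 0.1
resid = max(abs(l - r) for lr, rr in zip(matmul(W, A), matmul(T, transpose(X)))
            for l, r in zip(lr, rr))
```

The single linear solve replaces the entire iterative training loop for the layer, which is what makes the method a strict one-pass BFT.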

  • Zeroth-order (gradient-free) and Hebbian updates: BFTs may use zeroth-order optimization via random perturbations or Hebbian/plasticity-inspired updates, particularly for resource-constrained or neuromorphic hardware (Zhao et al., 2023, Li, 11 Jan 2026).
  • Self-organizing and clustering transformations: In unsupervised settings, backpropagation-free convolutional networks utilize competitive learning (SOMs) and Hebbian mask adaptation to build deep representations (Stuhr et al., 2020).

BFT thus distills to a single principle: optimize local criteria without propagating error signals across layers, whether analytically (closed-form) or algorithmically (greedy, local-gradient, competitive, or associative).

2. Algorithmic Instantiations and Principal Methodologies

Several BFT algorithms and families have been proposed, each tailored for specific architectures and workloads:

  • Mono-Forward (MF) and Layerwise Cross-Entropy: Each hidden layer is equipped with a local classifier and the cross-entropy loss

$$\mathcal{L}^{(\ell)} = -\sum_{c=1}^{C} y_c \log \mathrm{softmax}(M^{(\ell)} h^{(\ell)})_c$$

Layer parameters $W^{(\ell)}, M^{(\ell)}$ are updated using only local information, fully eliminating global backward gradients (Gong et al., 16 Jan 2025, Spyra et al., 2 Nov 2025).
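A minimal single-layer sketch of a layer-local update of this kind (the tanh activation and all dimensions are illustrative choices, not taken from the cited papers):

```python
import math, random

def softmax(z):
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    s = sum(e)
    return [v / s for v in e]

def local_step(W, M, x, y, lr=0.1):
    """One layer-local update: every gradient comes from this layer's own head M."""
    z = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]   # pre-activation
    h = [math.tanh(v) for v in z]                               # tanh keeps phi' nonzero
    logits = [sum(mi * hi for mi, hi in zip(row, h)) for row in M]
    p = softmax(logits)
    loss = -math.log(p[y])
    d_logit = [pc - (1.0 if c == y else 0.0) for c, pc in enumerate(p)]
    # hidden gradient through the local head only (use M before it is updated)
    d_h = [sum(M[c][j] * d_logit[c] for c in range(len(M))) for j in range(len(h))]
    for c in range(len(M)):                      # dL/dM = (p - y) h^T
        for j in range(len(h)):
            M[c][j] -= lr * d_logit[c] * h[j]
    for j in range(len(W)):                      # dL/dW = (d_h * phi'(z)) x^T
        g = d_h[j] * (1.0 - h[j] ** 2)
        for i in range(len(x)):
            W[j][i] -= lr * g * x[i]
    return loss

random.seed(1)
n_in, n_hid, n_cls = 4, 6, 3
W = [[random.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
M = [[random.gauss(0, 0.5) for _ in range(n_hid)] for _ in range(n_cls)]
x, y = [0.5, -1.0, 0.25, 2.0], 2
losses = [local_step(W, M, x, y) for _ in range(30)]
# the local cross-entropy decreases without any gradient crossing layer boundaries
```

Because nothing outside the layer is touched, many such layers can in principle be trained in parallel, which is the backward-unlocking property described above.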

  • Forward Gradient Descent: Using forward-mode differentiation, the gradient estimate $\hat\nabla f(x)$ is computed in a single forward pass, enabling unbiased but higher-variance stochastic gradient descent (Baydin et al., 2022).
  • Block-Wise BP-Free (BWBPF) Networks: Deep architectures are partitioned into blocks, each with its own auxiliary classifier and local cross-entropy loss; only the output layer is optimized under the global loss. Block updates can be parallelized, and local losses do not propagate across blocks (Cheng et al., 2023).
  • Feedback-Hebbian Local Plasticity: In strictly local feedback architectures, synaptic updates combine centered covariance (Hebb), Oja normalization, and per-synapse supervised drive with no need for error transport or global gradients:

$$\Delta w_i = \mathrm{lr}\left[(x_i - \langle x_i\rangle)(y_i - \langle y_i\rangle) - \beta\,(y_i - \langle y_i\rangle)^2 w_i + (t_i - y_i)\,x_i\right]$$

(Li, 11 Jan 2026)
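A sketch of the rule under simplifying assumptions: each synapse $i$ maps $x_i$ to $y_i = w_i x_i$, and exponential moving averages stand in for the $\langle\cdot\rangle$ terms; both choices are illustrative, not taken from the cited paper.

```python
def hebbian_feedback_step(w, x, t, x_bar, y_bar, lr=0.1, beta=0.01, alpha=0.1):
    """Per-synapse update: centered Hebb term + Oja-style decay + supervised drive."""
    y = [wi * xi for wi, xi in zip(w, x)]
    # exponential moving averages approximate the <x_i>, <y_i> statistics
    x_bar = [(1 - alpha) * xb + alpha * xi for xb, xi in zip(x_bar, x)]
    y_bar = [(1 - alpha) * yb + alpha * yi for yb, yi in zip(y_bar, y)]
    for i in range(len(w)):
        hebb = (x[i] - x_bar[i]) * (y[i] - y_bar[i])
        oja = beta * (y[i] - y_bar[i]) ** 2 * w[i]
        delta = (t[i] - y[i]) * x[i]            # per-synapse supervised drive
        w[i] += lr * (hebb - oja + delta)
    return w, x_bar, y_bar, y

x = [1.0, 0.8, 1.2]          # fixed inputs
t = [0.5, -0.3, 0.9]         # per-synapse teaching signals
w = [0.0, 0.0, 0.0]
x_bar, y_bar = [0.0] * 3, [0.0] * 3
for _ in range(300):
    w, x_bar, y_bar, y = hebbian_feedback_step(w, x, t, x_bar, y_bar)
err = max(abs(ti - yi) for ti, yi in zip(t, y))
# with constant inputs the Hebb term vanishes and y_i converges toward t_i
```

Note that the update needs only pre-activity, post-activity, running means, and a local teaching signal, so no error is transported between layers.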

  • Competitive Unsupervised Representation Learning: Convolutional self-organizing networks replace classical convolutional layers with SOMs, learning filter weights and inter-layer masks via local competition and Hebb–Oja adaptation (Stuhr et al., 2020).
  • Forward Projection Closed-Form Layer Solutions: Each layer’s weights are obtained analytically by regressing local projected targets generated solely from current activations and global labels (O'Shea et al., 27 Jan 2025).
  • Zeroth-Order Tensor-Compressed BFT for Edge and PINNs: Gradient estimation and loss evaluation are performed exclusively by forward passes using tensor-train (TT) compressed parameters and sparse-grid Stein quadrature:
    • Tensor-train compression reduces the trainable dimension: $d_{tt} \ll d$
    • Stein quadrature computes $\nabla_x u_\theta(x)$ and $\Delta u_\theta(x)$ by summing shifted forward evaluations, avoiding automatic differentiation
    • Two-stage hybrid optimizer: random-sign gradient descent (cheap, high-variance) followed by coordinatewise gradient estimation (expensive, low-variance) (Zhao et al., 2023)
  • Blockwise and Hybrid Local-Global Training: Scalable Forward-Forward (SFF) and related hybrid approaches allow BP within blocks while enforcing BFT between blocks, achieving favorable trade-offs in scalability and representation capacity (Krutsylo, 6 Jan 2025).
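A toy sketch of the stage-one random-sign update on a quadratic loss, with a central finite difference along a random sign direction standing in for the TT contractions and Stein quadrature (all names and constants here are illustrative):

```python
import random

def loss(theta, target):
    """Toy quadratic objective, evaluated by forward passes only."""
    return sum((t - c) ** 2 for t, c in zip(theta, target))

def zo_sign_step(theta, target, alpha=0.05, eps=1e-3):
    """Stage-1-style update theta <- theta - alpha * sign(Delta), where Delta is
    a zeroth-order gradient estimate built from two forward loss evaluations."""
    s = [random.choice((-1.0, 1.0)) for _ in theta]
    plus = loss([t + eps * si for t, si in zip(theta, s)], target)
    minus = loss([t - eps * si for t, si in zip(theta, s)], target)
    d = (plus - minus) / (2 * eps)              # directional derivative estimate
    delta = [d * si for si in s]                # gradient estimate (grad . s) s
    return [t - alpha * (1.0 if di > 0 else -1.0 if di < 0 else 0.0)
            for t, di in zip(theta, delta)]

random.seed(0)
target = [0.3, -1.2, 0.7, 2.0]
theta = [0.0, 0.0, 0.0, 0.0]
start = loss(theta, target)
for _ in range(500):
    theta = zo_sign_step(theta, target)
final = loss(theta, target)
# the loss shrinks using forward evaluations only, no backward pass
```

The sign update is cheap but noisy; the hybrid scheme above switches to coordinatewise estimates precisely because this stage stalls near the optimum.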

The following table summarizes characteristic update procedures for representative BFT methods:

| BFT Family | Update Equation/Rule | Reference |
| --- | --- | --- |
| Forward gradient estimator | $\hat\nabla f(x) = (\nabla f(x)\cdot v)\, v$ | (Baydin et al., 2022) |
| Mono-Forward local CE | $\Delta W^{(\ell)} = -\eta\,(e^{(\ell)} \odot \phi'(z^{(\ell)}))(h^{(\ell-1)})^T$ | (Gong et al., 16 Jan 2025) |
| Blockwise BP-free | $\Delta W_i = -\eta\,\partial L_i/\partial W_i$ (local) | (Cheng et al., 2023) |
| SOM competitive | $\Delta w_j = \eta\, h_{j^*,j}(x - w_j)$ | (Stuhr et al., 2020) |
| TT-ZO hybrid (Stage 1) | $\theta \leftarrow \theta - \alpha\,\mathrm{sign}(\Delta)$ | (Zhao et al., 2023) |
| FP closed-form | $W_i^* = T_i X_i^T (X_i X_i^T + \lambda I)^{-1}$ | (O'Shea et al., 27 Jan 2025) |
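For instance, the SOM competitive rule listed above amounts to a winner-take-all pull toward the input; a minimal sketch with an assumed Gaussian neighborhood over a 1-D grid of units:

```python
import math, random

def som_step(weights, x, eta=0.5, sigma=1.0):
    """One competitive update: find the best-matching unit j*, then pull every
    unit toward x, scaled by the neighborhood factor h_{j*,j}."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    j_star = min(range(len(weights)), key=lambda j: dists[j])
    for j, w in enumerate(weights):
        h = math.exp(-((j - j_star) ** 2) / (2 * sigma ** 2))  # 1-D grid neighborhood
        for i in range(len(w)):
            w[i] += eta * h * (x[i] - w[i])                    # Delta w_j = eta h (x - w_j)
    return j_star

random.seed(2)
weights = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(5)]
x = [0.9, -0.4]
before = min(sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights)
j_star = som_step(weights, x)
after = min(sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights)
# the best-matching unit has moved markedly closer to the input
```

No error signal appears anywhere in the update, which is why the rule slots directly into unsupervised BFT pipelines.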

3. Empirical and Hardware Performance

Numerous studies compare BFTs to standard BP with respect to accuracy, speed, memory, and energy:

  • Speed and Computation: Forward gradient methods reduce the forward-to-backward computational ratio $R_f/R_b$ to 0.55–0.65 in MNIST MLPs/CNNs, yielding up to 2× speedups; blockwise and layerwise methods can achieve 1.5–4× training-throughput improvements under parallelization (Baydin et al., 2022, Cheng et al., 2023, Spyra et al., 2 Nov 2025).
  • Memory: BFTs generally store only current activations, not full backward tapes. Empirically, MF achieves ~16 MB/layer versus ~204 MB/layer for BP in MLPs (Gong et al., 16 Jan 2025). However, auxiliary buffers or TT/aux-classifiers impose small additional overheads (sometimes offsetting theoretical savings) (Spyra et al., 2 Nov 2025).
  • Accuracy and Generalization: MF and related layerwise BFTs match or exceed BP on standard vision benchmarks (MNIST, Fashion-MNIST, CIFAR-10/100) using identical architectures and hyperparameter protocols. For MLPs, MF achieves +1.21 pp accuracy and −40.78% energy relative to BP on CIFAR-10 (Spyra et al., 2 Nov 2025). In few-shot and biomedical regimes, forward-projection and closed-form BFTs achieve superior generalization compared to BP and other local methods (O'Shea et al., 27 Jan 2025).
  • Specialized Hardware and Edge Devices: BFTs are compatible with photonic substrates, FPGAs, ASICs, and microcontrollers by structurally decoupling training from backward-mode AD and reducing memory/computation footprints. In photonic AMLE networks, one-shot associative learning is realized via phase-change materials at ~1.8 nJ/update, with <0.12 μm³ device volume and <100 ns update time (Tan et al., 2020). For TT-compressed PINNs, all learning is reduced to forward propagation and lightweight TT contractions, enabling microcontroller-scale implementations (Zhao et al., 2023).
  • Scaling Laws and Limitations: Direct Feedback Alignment (DFA) underperforms BP in large transformers, as the random feedback matrix cannot align true errors at scale—the loss scales poorly and the compute efficiency frontier is dominated by BP (Filipovich et al., 2022). In contrast, BFTs with strictly local or one-shot associative rules trade off lower maximal representational depth for interpretability and speed (Tan et al., 2020, Stuhr et al., 2020).

4. Biological Plausibility and Neuroscientific Relevance

BFT methods address longstanding critiques of BP’s biological plausibility:

  • Strict locality: Updates in MF, feedback-Hebbian, and competitive learning are synapse-local, requiring only pre/post-activity and a local error/teaching signal. No weight symmetry or long-range transport is needed (Gong et al., 16 Jan 2025, Li, 11 Jan 2026).
  • Unsupervised and associative learning: AMLE photonic networks implement a direct correlational (Pavlovian) rule, reminiscent of biological spike-timing-dependent plasticity. CSNNs couple competitive learning with Hebbian masks, paralleling sensory cortical development (Tan et al., 2020, Stuhr et al., 2020).
  • Temporal context and regeneration: Feedback-Hebbian architectures train dedicated feedback layers for contextual recurrence and memory trace retention, supporting continual learning without global error propagation (Li, 11 Jan 2026).
  • No global synchronization: Blockwise and layerwise methods permit asynchronous parallel updates, aligning with the observed asynchronous plasticity in biological neural circuits (Cheng et al., 2023).

These features suggest BFTs provide a conceptual bridge between computational neuroscience and efficient machine learning.

5. Applications and Practical Scenarios

BFTs have demonstrated impact in several practical domains:

  • Edge and low-resource learning: The minimal memory and processing requirements of TT-compressed, zeroth-order, and closed-form BFTs enable efficient on-device learning for embedded, IoT, and neuromorphic systems (Zhao et al., 2023, O'Shea et al., 27 Jan 2025).
  • Test-time adaptation for EEG and BCI: BFT provides a plug-and-play, privacy-preserving inference-time adaptation suite for brain-computer interfaces under signal or domain shift, requiring no parameter updates and incurring minimal latency on CPUs (Li et al., 12 Jan 2026).
  • Unsupervised and few-shot learning: CSNNs, Forward Projection, and TT-based BFTs deliver competitive performance in unsupervised feature building and few-shot classification, exploiting local learning rules to enhance sample efficiency and representation stability (Stuhr et al., 2020, O'Shea et al., 27 Jan 2025, Zhao et al., 2023).
  • Physics-informed neural networks: Stein-gradient and TT-compressed BFTs allow PINN-style PDE solvers to be realized without backpropagation, handling high-dimensional tasks (e.g., 20D HJB) within 1–2% of first-order BP accuracy (Zhao et al., 2023).
  • Transfer learning and scalability: SFF and blockwise BFTs maintain or outperform BP in deep CNNs (e.g., ResNet, MobileNet) and transfer scenarios, especially in small-data or high-class-count tasks (Krutsylo, 6 Jan 2025).
  • Photonic and non-electronic substrates: Monadic Pavlovian architectures offer high-bandwidth, ultra-fast learning in phase-change photonic platforms, demonstrating feasibility for future optical computing systems (Tan et al., 2020).

6. Limitations, Variance Issues, and Future Directions

While BFTs present substantial algorithmic and hardware merits, several open challenges persist:

  • Variance and stability: Stochastic estimators (forward gradients, zeroth-order) may exhibit high variance per step; averaging or structured perturbations can mitigate this at the cost of increased computation (Baydin et al., 2022, Zhao et al., 2023).
  • Scaling to very deep networks: Feedback alignment and random feedback projections struggle to align gradients at scale, particularly in transformers and deep residual architectures (Filipovich et al., 2022). Further, fully associative (symbolic) photonic circuits lack hierarchical depth and nonlinear representation power, limiting expressivity (Tan et al., 2020).
  • Convergence and optimization: Extensions to advanced optimizers (momentum, Adam, RMSProp) and general theoretical guarantees on convergence in nonconvex regimes remain largely unexplored for most BFTs (Baydin et al., 2022, Gong et al., 16 Jan 2025). Memory and overhead savings, while measurable, may be impacted by auxiliary classifiers or compression artifacts (Spyra et al., 2 Nov 2025).
  • Global objective mismatch: Competitive and strictly local BFTs may not align layerwise unsupervised/greedy objectives with downstream task loss, potentially limiting maximal accuracy or requiring careful hybridization with global criteria (Stuhr et al., 2020, Krutsylo, 6 Jan 2025).
  • Extendability and hybridization: Future work is expected to address combining local and occasional global updates, scaling closed-form solvers, and adapting BFT constructs to non-standard modalities (recurrent, graph, and spiking neural networks).

7. Summary and Outlook

Backpropagation-Free Transformations offer a rich, theoretically principled, and empirically validated framework for training neural networks without global error backpropagation. By embracing local, forward, or associative learning rules, BFTs enable scalable, memory- and energy-efficient learning, hardware acceleration, and enhanced biological plausibility. Current instantiations—including Mono-Forward, blockwise BP-free optimization, feedback-Hebbian plasticity, TT-compressed zeroth-order optimizers, and one-shot photonic associative learning—demonstrate that, under practical conditions, performance can match or surpass backpropagation baselines while unlocking new application scenarios and architectural freedoms (Gong et al., 16 Jan 2025, Spyra et al., 2 Nov 2025, Zhao et al., 2023, Baydin et al., 2022, Cheng et al., 2023, Tan et al., 2020, Li, 11 Jan 2026, Krutsylo, 6 Jan 2025, O'Shea et al., 27 Jan 2025). Continued progress in BFT algorithmics, hardware codesign, and theoretical analysis is likely to further expand their scope and impact in both machine learning and computational neuroscience.
