BiFPN: Bi-directional Feature Pyramid Network
- BiFPN is a multi-scale fusion architecture that integrates top-down and bottom-up pathways to combine semantic and spatial features.
- It employs learnable, normalized fusion weights to adaptively balance information from different resolutions, while lightweight depthwise-separable convolutions keep computational cost low.
- Variants like EfficientDet and RevBiFPN demonstrate significant gains in accuracy and memory optimization across vision and audio applications.
A Bi-directional Feature Pyramid Network (BiFPN) is an architectural module designed for efficient multi-scale feature fusion in deep learning models, central to scale-invariant detection, segmentation, and sequence modeling. BiFPN enables bidirectional information flow across feature maps of varying resolutions and employs normalized, learnable fusion weights to facilitate robust aggregation. This motif is widely adopted in state-of-the-art computer vision pipelines such as EfficientDet, RevBiFPN, and object detectors, and is increasingly extended to domains like sound source localization and pitch estimation. The core design principle is to maximize semantic and spatial context at every scale while minimizing computational and memory overhead.
1. Architectural Topology and Fusion Mechanism
A prototypical BiFPN takes as input a set of multi-resolution feature maps, typically representing outputs from a backbone network across discrete scales (e.g., the P3–P7 levels in EfficientDet (Tan et al., 2019); RevBiFPN (Chiley et al., 2022) operates on a comparable multi-scale pyramid). It stacks bidirectional fusion layers consisting of two principal pathways:
- Top-down fusion: Information propagates from low-resolution, semantic-rich features toward high-resolution, spatially detailed maps via upsampling and weighted fusion.
- Bottom-up fusion: After top-down enhancement, high-resolution maps are downsampled and re-fused with intermediate or coarse features.
The fusion at each node proceeds via weighted summation of inputs:

$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i, \qquad w_i \ge 0,$$

where the normalization by $\epsilon + \sum_j w_j$ ensures the weights form a (near-)convex combination. A convolution (typically depthwise-separable) mixes channel information post-fusion (Tan et al., 2019, Healy et al., 2023).
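As an illustration, the fast normalized fusion rule above can be sketched in a few lines of numpy (weights here are fixed scalars standing in for trainable parameters):

```python
import numpy as np

def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fuse same-shape feature maps with ReLU-clipped, normalized weights.

    O = sum_i (w_i / (eps + sum_j w_j)) * I_i,  w_i >= 0  (Tan et al., 2019)
    """
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # ReLU clip
    w = w / (eps + w.sum())                                     # normalize
    return sum(wi * x for wi, x in zip(w, inputs))

# Two 4x4 feature maps fused with equal weights: each normalizes to ~0.5,
# so every output element is ~(0.5 * 1 + 0.5 * 3) = 2.0.
a, b = np.ones((4, 4)), 3 * np.ones((4, 4))
out = fast_normalized_fusion([a, b], weights=[1.0, 1.0])
```

In a real BiFPN the weights are per-node trainable scalars and a depthwise-separable convolution follows the fusion.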
2. Learnable Weighting and Normalization
One of BiFPN’s key innovations is the introduction of learnable, normalized fusion weights at each node, relaxing the constraint of equal contribution from features at different scales. This mechanism appears in:
- EfficientDet BiFPN: Each input to a fusion node has a scalar trainable weight, rectified and normalized, so the network can adaptively balance semantic and spatial cues (Tan et al., 2019).
- Audio networks: Adaptations for sound classification and localization use ReLU-clipped, normalized weights for temporal fusion, implemented via custom weighted-average layers (Healy et al., 2023).
The weight normalization guarantees stability, and degenerate nodes (those with only a single input) are removed so that every fusion combines multiple sources (Tan et al., 2019).
3. Variants and Extensions
Multiple BiFPN variants have been developed to address specific bottlenecks and domains:
| Variant | Key Feature | Memory Efficiency | Additional Notes |
|---|---|---|---|
| EfficientDet (BiFPN) (Tan et al., 2019) | Weighted fusion, repeated stacking | Moderate | Depthwise-sep conv, batchnorm |
| RevBiFPN (Chiley et al., 2022) | Fully reversible fusion (RevSilo) | Very high | 20x less memory than EfficientDet |
| Residual Bi-Fusion FPN (Chen et al., 2019) | CORE/RECORE, residual fusions | High | Residual skip connections |
| MF-PAM (audio) (Chung et al., 2023) | Single-round, light BiFPN | Moderate | Pitch estimation, halved channels |
| Sound SSL (Healy et al., 2023) | Four-level, 1D convolutions | Moderate | Bilinear up/downsampling |
RevBiFPN leverages invertible additive-coupling fusions such that activations can be re-computed during backpropagation, thus only the initial pyramid inputs need to be stored. This enables scaling depth arbitrarily without growing activation memory consumption (Chiley et al., 2022).
4. Mathematical Formulation
A generalized mathematical description for bidirectional fusion is (EfficientDet-style):
- Top-down fusion:

$$P_l^{td} = \mathrm{Conv}\!\left(\frac{w_1 \cdot P_l^{in} + w_2 \cdot \mathrm{Resize}(P_{l+1}^{td})}{w_1 + w_2 + \epsilon}\right)$$

- Bottom-up fusion:

$$P_l^{out} = \mathrm{Conv}\!\left(\frac{w_1' \cdot P_l^{in} + w_2' \cdot P_l^{td} + w_3' \cdot \mathrm{Resize}(P_{l-1}^{out})}{w_1' + w_2' + w_3' + \epsilon}\right)$$

Here $\epsilon$ is a small constant for numerical stability, $\mathrm{Resize}$ denotes up- or downsampling to the target resolution, and the weights are ReLU-clipped trainable scalars.
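A minimal sketch of one full top-down plus bottom-up pass, assuming a three-level pyramid, nearest-neighbor resizing, and equal fusion weights (the post-fusion convolutions are omitted for brevity):

```python
import numpy as np

def resize(x, out_hw):
    """Nearest-neighbor resize (stand-in for bilinear/learned resampling)."""
    h, w = x.shape
    H, W = out_hw
    return x[np.arange(H) * h // H][:, np.arange(W) * w // W]

def fuse(maps, weights, eps=1e-4):
    w = np.maximum(np.asarray(weights, float), 0.0)
    w = w / (eps + w.sum())
    return sum(wi * m for wi, m in zip(w, maps))

def bifpn_layer(pyramid):
    """One bidirectional pass; pyramid[0] is the finest level."""
    L = len(pyramid)
    td = [None] * L
    td[-1] = pyramid[-1]
    for i in range(L - 2, -1, -1):  # top-down: coarse -> fine
        td[i] = fuse([pyramid[i], resize(td[i + 1], pyramid[i].shape)],
                     [1.0, 1.0])
    out = [td[0]]
    for i in range(1, L):           # bottom-up: fine -> coarse
        out.append(fuse([pyramid[i], td[i], resize(out[-1], pyramid[i].shape)],
                        [1.0, 1.0, 1.0]))
    return out

pyr = [np.random.rand(8, 8), np.random.rand(4, 4), np.random.rand(2, 2)]
fused = bifpn_layer(pyr)  # same shapes as the input pyramid
```

Stacking several such layers, each with its own learned weights and convolutions, yields the repeated BiFPN blocks used in EfficientDet.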
Extensions such as RevBiFPN employ invertible mappings:
- Forward (additive coupling, for each RevSilo): $y_1 = x_1 + F(x_2), \quad y_2 = x_2 + G(y_1)$
- Reverse: $x_2 = y_2 - G(y_1), \quad x_1 = y_1 - F(x_2)$
This ensures each fusion operation is invertible, preventing memory bottlenecks for deep and wide pyramids (Chiley et al., 2022).
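The exact invertibility of additive coupling can be checked directly; in this sketch, `F` and `G` are arbitrary fixed functions standing in for RevBiFPN's cross-scale fusion blocks (any choice of `F` and `G` makes the pair invertible):

```python
import numpy as np

def F(x):
    return np.tanh(x)      # placeholder for a learned fusion block

def G(x):
    return 0.5 * x         # placeholder for a second fusion block

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reverse(y1, y2):
    x2 = y2 - G(y1)        # recover x2 first, since y2 was built from y1
    x1 = y1 - F(x2)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
y1, y2 = forward(x1, x2)
r1, r2 = reverse(y1, y2)
# r1, r2 reconstruct x1, x2 up to floating-point error, so intermediate
# activations need not be stored for backpropagation.
```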
5. Empirical Performance Across Modalities
BiFPN’s efficacy has been demonstrated in large-scale benchmarks and diverse domains:
- COCO detection: EfficientDet with BiFPN achieves state-of-the-art AP with up to 3x reduction in FLOPs and parameter count over prior FPN-based methods (Tan et al., 2019).
- RevBiFPN vs. EfficientNet: RevBiFPN-S6 achieves 84.2% Top-1 accuracy on ImageNet with only 0.23 GB of training activation memory, compared to EfficientNet-B7's 84.3% Top-1 at 5.05 GB, a nearly 20x memory reduction (Chiley et al., 2022).
- Object detection/segmentation: ReBiF (Residual Bi-Fusion) shows that adding more pyramid levels continues to improve mAP, supported by residual skips stabilizing gradient flow (Chen et al., 2019).
- Audio domains: MF-PAM achieves 99.2% pitch accuracy; BiFPN in sound localization reduces error rate by 9.8% and direction-of-arrival error by 43% in SELD tasks (Chung et al., 2023, Healy et al., 2023).
6. Implementation Considerations and Limitations
- Fusion operators: Depthwise-separable convolutions are preferred for efficiency; batch normalization and SiLU/Swish activation are typical in vision tasks, while MF-PAM uses only depthwise-separable convolutions with Swish (Tan et al., 2019, Chung et al., 2023).
- Upsampling/downsampling: Methods include bilinear interpolation for spectrograms, learned transpose convolutions for images, or strided 1D convolutions in temporal domains (Healy et al., 2023).
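A hedged sketch of two of the resampling choices named above, applied to a 1-D temporal feature sequence: linear interpolation for upsampling, and a fixed strided moving average standing in for a strided 1-D convolution for downsampling:

```python
import numpy as np

def upsample_linear(x, factor):
    """Linearly interpolate a 1-D sequence to `factor` times its length."""
    n = len(x)
    new_positions = np.linspace(0, n - 1, n * factor)
    return np.interp(new_positions, np.arange(n), x)

def downsample_strided(x, stride):
    """Average over windows of size `stride`, then keep every stride-th value
    (the shape a strided 1-D conv with an averaging kernel would produce)."""
    kernel = np.ones(stride) / stride
    smoothed = np.convolve(x, kernel, mode="valid")
    return smoothed[::stride]

x = np.array([0.0, 1.0, 2.0, 3.0])
up = upsample_linear(x, 2)       # length 8, endpoints preserved
down = downsample_strided(x, 2)  # [0.5, 2.5]
```

In practice the downsampling kernel would be learned; bilinear interpolation generalizes the same idea to 2-D spectrograms and images.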
- Limitations: RevBiFPN requires every fusion operation to be invertible, and reversible recomputation adds a modest compute overhead (roughly 15–25%). Shape alignment and channel normalization steps are required but do not break invertibility in practice (Chiley et al., 2022).
- Residual connections: Residual Bi-Fusion FPNs avoid training instability common in deep FPNs by injecting direct skips at each fusion level (Chen et al., 2019).
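The residual idea above can be illustrated by adding an identity skip of the node's own input to the fused output (a paraphrase of the Residual Bi-Fusion design, not the exact CORE/RECORE blocks):

```python
import numpy as np

def residual_fuse(own, others, weights, eps=1e-4):
    """Weighted fusion plus an identity skip, keeping a direct gradient
    path through deep pyramids (in the spirit of Chen et al., 2019)."""
    w = np.maximum(np.asarray(weights, float), 0.0)
    w = w / (eps + w.sum())
    fused = sum(wi * m for wi, m in zip(w, [own] + list(others)))
    return fused + own  # residual skip connection

x = np.ones((4, 4))
y = residual_fuse(x, [2 * np.ones((4, 4))], weights=[1.0, 1.0])
# fused ~ 0.5*1 + 0.5*2 = 1.5, plus the skip gives ~2.5 per element
```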
7. Research Directions and Cross-Domain Adaptations
BiFPN’s modularity facilitates adaptation across domains:
- Computer Vision: Backbone fusion for detection/classification, segmentation, instance recognition.
- Audio Signal Processing: Multi-level temporal fusion for pitch estimation (MF-PAM (Chung et al., 2023)), sound event localization (SELDnet+BiFPN (Healy et al., 2023)).
- Memory-constraint settings: RevBiFPN as a solution to scaling depth without prohibitive memory, increasing model capacity for high-resolution inputs (Chiley et al., 2022).
A plausible implication is that BiFPN, especially when implemented with reversible modules, enables deeper, more powerful, and parameter-efficient multi-scale networks in resource-constrained environments. Empirical evidence across vision and audio tasks indicates high-impact improvements in both accuracy and efficiency.
Summary: BiFPN is a foundational multi-scale feature fusion architecture, combining bidirectional cross-scale aggregation, learnable fusion weights, and repeated fusion layers. State-of-the-art BiFPN variants such as EfficientDet’s weighted BiFPN and RevBiFPN’s reversible fusion demonstrate compelling improvements in both computational efficiency and representational power. Its methodology, mathematical formulation, and empirical results substantiate BiFPN as an essential component in modern neural architectures for vision and beyond (Tan et al., 2019, Chiley et al., 2022, Chen et al., 2019, Chung et al., 2023, Healy et al., 2023, Wu et al., 2018).