Bi-Level Routing Attention (BRA)
- Bi-Level Routing Attention (BRA) is defined as a dynamic sparse attention mechanism that employs a hierarchical regionwise routing paradigm to prune token interactions.
- It first aggregates region-level queries and keys to select top-ranked regions before applying fine-grained token-to-token attention, significantly reducing computational cost.
- Empirical evaluations on benchmarks like ImageNet, COCO, and ADE20K demonstrate BRA’s effectiveness in improving accuracy and efficiency in dense prediction and detection tasks.
Bi-Level Routing Attention (BRA) is a dynamic sparse attention mechanism for vision transformers and CNN backbones that achieves query-adaptive allocation of computational resources by leveraging a hierarchical regionwise routing paradigm. Unlike classic multi-head self-attention, which incurs quadratic complexity in the number of tokens, BRA first prunes the candidate key-value space at a coarse region level using content-based affinity scores, and subsequently applies fine-grained token-to-token attention within only the union of top-ranked routed regions. This approach flexibly reduces computational overhead while preserving the capacity for modeling both local and long-range dependencies, making it highly suitable for dense prediction and detection tasks in large-scale vision models (Zhu et al., 2023, Long et al., 2024, Yang et al., 2023).
1. Formal Definition and Algorithmic Steps
Let $X \in \mathbb{R}^{H \times W \times C}$ be the input feature map with $HW$ spatial tokens and $C$ channels. BRA partitions $X$ into $S \times S$ non-overlapping square regions (patches), with each region comprising $\frac{HW}{S^2}$ tokens. The algorithm proceeds as follows (Zhu et al., 2023, Yang et al., 2023, Long et al., 2024):
- Projection and Region Aggregation
- Patchify $X$ into regions: $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$
- Linear projections: $Q = X^r W^q$, $K = X^r W^k$, $V = X^r W^v$
- Region-level queries/keys via per-region averaging: $Q^r, K^r \in \mathbb{R}^{S^2 \times C}$
- Region-to-Region Routing
- Affinity: $A^r = Q^r (K^r)^\top \in \mathbb{R}^{S^2 \times S^2}$
- For each query region $i$, select top-$k$ key region indices: $I^r = \mathrm{topkIndex}(A^r) \in \mathbb{N}^{S^2 \times k}$
- Fine-Grained Token Attention
- For region $i$, gather tokens from the $k$ routed regions: $K^g = \mathrm{gather}(K, I^r)$, $V^g = \mathrm{gather}(V, I^r)$, with $K^g, V^g \in \mathbb{R}^{S^2 \times \frac{kHW}{S^2} \times C}$
- Region-local queries: $Q_i \in \mathbb{R}^{\frac{HW}{S^2} \times C}$, the token-level queries of region $i$
- Attention: $O_i = \mathrm{softmax}\!\left(\frac{Q_i (K^g_i)^\top}{\sqrt{C}}\right) V^g_i$
- Output: $O = \mathrm{Attention}(Q, K^g, V^g) + \mathrm{DWConv}(V)$ (DWConv: local depthwise convolution for context enhancement)
- Reassembly
- Unpatchify the outputs $O$ into an output tensor of shape $H \times W \times C$
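The steps above can be sketched as a single-head PyTorch module. This is a minimal simplification, assuming $H$ and $W$ are divisible by $S$; the class and argument names are illustrative, not those of the reference implementation:

```python
import torch
import torch.nn as nn

class BiLevelRoutingAttention(nn.Module):
    """Single-head BRA sketch (simplified from Zhu et al., 2023)."""
    def __init__(self, dim, n_regions=7, topk=4):
        super().__init__()
        self.S, self.k = n_regions, topk
        self.qkv = nn.Linear(dim, 3 * dim)
        self.lce = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)  # depthwise LCE
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, H, W, C); H, W divisible by S
        B, H, W, C = x.shape
        S, k = self.S, self.k
        h, w = H // S, W // S
        # 1) patchify into S^2 regions of h*w tokens each
        xr = x.reshape(B, S, h, S, w, C).permute(0, 1, 3, 2, 4, 5)
        xr = xr.reshape(B, S * S, h * w, C)
        q, kk, v = self.qkv(xr).chunk(3, dim=-1)             # (B, S^2, hw, C)
        # 2) region-to-region routing via mean-pooled queries/keys
        affinity = q.mean(2) @ kk.mean(2).transpose(-1, -2)  # (B, S^2, S^2)
        idx = affinity.topk(k, dim=-1).indices               # (B, S^2, k)
        # 3) gather key/value tokens of the k routed regions per query region
        idx_exp = idx[..., None, None].expand(-1, -1, -1, h * w, C)
        kg = torch.gather(kk[:, None].expand(-1, S * S, -1, -1, -1), 2, idx_exp)
        vg = torch.gather(v[:, None].expand(-1, S * S, -1, -1, -1), 2, idx_exp)
        kg, vg = kg.reshape(B, S * S, -1, C), vg.reshape(B, S * S, -1, C)
        # 4) fine-grained token-to-token attention within the routed union
        attn = (q @ kg.transpose(-1, -2) * C ** -0.5).softmax(-1)
        out = attn @ vg                                      # (B, S^2, hw, C)
        # 5) local context enhancement: depthwise conv over V in spatial layout
        v_sp = v.reshape(B, S, S, h, w, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        lce = self.lce(v_sp).reshape(B, C, S, h, S, w).permute(0, 2, 4, 3, 5, 1)
        out = out + lce.reshape(B, S * S, h * w, C)
        # 6) project and unpatchify back to (B, H, W, C)
        out = self.proj(out).reshape(B, S, S, h, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)
```

Note that the gather step materializes the routed key/value tokens as dense tensors, so steps 3 and 4 reduce to batched matrix multiplications, which is what makes the mechanism GPU-friendly.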
A compact LaTeX representation is as follows:

$$
\begin{aligned}
A^r &= Q^r (K^r)^\top, \qquad I^r = \mathrm{topkIndex}(A^r), \\
K^g &= \mathrm{gather}(K, I^r), \qquad V^g = \mathrm{gather}(V, I^r), \\
O &= \mathrm{Attention}(Q, K^g, V^g) + \mathrm{DWConv}(V).
\end{aligned}
$$
2. Motivation and Design Rationale
Bi-Level Routing Attention is motivated by the need to address the inefficiency of dense global self-attention in vision transformers, where both the computational cost and memory usage scale as $O(N^2)$ in the token count $N$. BRA introduces dynamic, content-aware sparsity by:
- Performing coarse-level routing using learned content rather than fixed windowing or stripes, increasing flexibility.
- Pruning irrelevant key-value interactions based on per-region affinities.
- Enabling each query region to adaptively select only the most salient regions for attention, leading to robust handling of long-range dependencies and object-centric context.
- Maintaining accuracy and expressiveness even at reduced computational budgets, especially for dense scenes or large inputs.
This adaptive query-to-region routing sharply contrasts with static sparsity patterns (e.g., fixed local attention, axial/dilated windows), allowing the mechanism to generalize across various task domains (Zhu et al., 2023, Yang et al., 2023, Long et al., 2024).
3. Computational Complexity and Implementation
Let $HW$ be the number of tokens, $S$ the partition grid size, $k$ the routing degree per query region, and $C$ the channel dimension:
- Region routing: $O(S^4 C)$ (forming $A^r \in \mathbb{R}^{S^2 \times S^2}$)
- Gathering routed tokens/values: $O(k \cdot HW \cdot C)$
- Token-to-token attention (within routed union): $O\!\left(\tfrac{k (HW)^2}{S^2} C\right)$
- Depthwise convolution: $O(HW \cdot C \cdot k_c^2)$ for kernel size $k_c$ (typically $3$ or $5$)
Total practical complexity is $O\!\left(S^4 C + \tfrac{k}{S^2}(HW)^2 C\right)$ per block, compared to $O((HW)^2 C)$ for standard multi-head self-attention; choosing $S \propto (HW)^{1/3}$ yields the $O((HW)^{4/3})$ scaling reported in (Zhu et al., 2023). In typical configurations (e.g., $S = 7$ with a small routing degree $k$ in early stages), this yields substantial savings (Zhu et al., 2023, Yang et al., 2023).
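A back-of-envelope comparison of the dominant attention terms can make the savings concrete. The sketch below counts only the routing and attention matmuls (the projection FLOPs shared by both mechanisms are ignored), and the stage size is an illustrative assumption:

```python
def mhsa_flops(n, c):
    # dense attention: Q K^T plus attention-weighted V, two n x n x c matmuls
    return 2 * n * n * c

def bra_flops(n, c, s, k):
    region_routing = 2 * (s ** 2) ** 2 * c           # S^2 x S^2 affinity scores
    token_attention = 2 * n * (k * n // s ** 2) * c  # attention within routed union
    return region_routing + token_attention

n, c = 56 * 56, 64  # hypothetical early-stage feature map
print(mhsa_flops(n, c) / bra_flops(n, c, s=7, k=4))  # roughly a 12x reduction
```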
BRA is highly compatible with GPU architectures, as all routing and token attention reduce to large dense matrix multiplications following a gather step. PyTorch-style pseudocode is available in (Zhu et al., 2023) and (Yang et al., 2023).
Parameter overhead consists of the three projection matrices ($W^q, W^k, W^v \in \mathbb{R}^{C \times C}$) and a small depthwise convolution module ($k_c^2 C$ parameters), less than a 10% increase over a single 8-head MHSA layer at typical channel widths.
4. Empirical Results and Performance Gains
Studies in (Zhu et al., 2023) and (Yang et al., 2023) report the following key results:
- ImageNet-1K (classification): BiFormer-S (26M params / 4.5 GFLOPs) achieves 83.8–84.3% top-1; BiFormer-B (57M params / 9.8 GFLOPs) reaches up to 85.4% with token labeling.
- COCO 2017 (object detection with RetinaNet/Mask R-CNN): BiFormer-S yields box mAP = 45.9; BiFormer-B reaches mAP = 47.1, outperforming FLOPs-matched baselines (Swin, DAT, QuadTree).
- ADE20K (semantic segmentation): BiFormer-S gives mIoU of 48.9–50.8, BiFormer-B up to 51.7 using UPerNet or FPN decoders.
- Detection (YOLOv7-BRA): On the SCB-Dataset, YOLOv7-BRA achieves mAP@0.5 = 87.1% (+2.2% absolute gain over the YOLOv7 baseline), with qualitative improvements under occlusion and clutter (Yang et al., 2023).
Performance improvement is most pronounced on dense prediction benchmarks, small-object detection, and segmentation tasks, confirming the ability of BRA to capture local-global semantics with high computational efficiency (Zhu et al., 2023, Yang et al., 2023).
5. Design Choices and Ablation Studies
Key hyperparameters and architectural design choices include:
- Region partition factor $S$: must evenly divide the spatial dimensions; $S = 7$ is used for classification, and $S = 8$ or $16$ for the larger maps in detection/segmentation (Zhu et al., 2023, Yang et al., 2023, Long et al., 2024).
- Number of routed regions $k$: typically increases in deeper/later stages to maintain a constant token budget $k \cdot \tfrac{HW}{S^2}$ per query region as the feature map shrinks.
- Single vs. multi-head: both are supported; BiFormer uses multi-head attention, while YOLOv7-BRA uses a single head.
- Local context enhancement: a depthwise convolution with kernel size $3$ or $5$ improves local feature preservation.
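The routing-degree schedule can be sanity-checked numerically. The stage resolutions and $k$ values below are a hypothetical $S = 7$ schedule chosen for illustration, not necessarily the exact configuration used in the papers:

```python
# Per-region token budget k * HW / S^2 across stages with halving resolution
# (hypothetical schedule: k grows as the feature map shrinks).
S = 7
stages = [(56, 56, 1), (28, 28, 4), (14, 14, 16), (7, 7, 49)]  # (H, W, k)
budgets = [k * (h * w) // S ** 2 for h, w, k in stages]
print(budgets)  # → [64, 64, 64, 49]: roughly constant attended tokens per query
```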
Ablations indicate:
- Too small a $k$ degrades accuracy because the interaction scope becomes insufficient.
- An excessively large $k$ introduces computational overhead and padding inefficiency.
- Local context enhancement (LCE) consistently improves results, especially in segmentation.
6. Extensions: Deformable Bi-Level Routing Attention (DBRA)
To overcome the fixed-grid limitations of BRA, DeBiFormer introduces Deformable Bi-level Routing Attention (DBRA) (Long et al., 2024), which interposes a deformable agent mechanism:
- Deformable point generation: Learned offsets are applied to a low-resolution grid, producing sampling locations that adapt to image content.
- Bilinear feature sampling: Features at deformable points are interpolated for routing, supporting better alignment with semantic object boundaries.
- Two-stage attention: first, "agent" queries sampled at the deformable points perform inner-agent attention against the top-$k$ routed region keys; this is followed by global token-level attention.
- Design enhancements: offset groups encourage diversity, while normalization and gating layers maintain stability. The implementation uses LayerNorm before each attention block, residual connections after both the inner and outer attention, and an MLP with channel expansion.
This design achieves more balanced and interpretable attention, as shown by Grad-CAMs and effective receptive field analyses (Long et al., 2024).
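The deformable point generation and bilinear sampling steps can be sketched with `F.grid_sample`. This is a heavily simplified illustration; the module name, the offset parameterization, and the `offset_range` clamp are our assumptions, not DeBiFormer's actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformablePointSampler(nn.Module):
    """Sketch of deformable point generation + bilinear sampling
    (hypothetical simplification of DBRA's agent sampling)."""
    def __init__(self, dim, grid=7, offset_range=2.0):
        super().__init__()
        self.grid, self.offset_range = grid, offset_range
        # predict per-point (dy, dx) offsets from pooled features
        self.offset_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(grid), nn.Conv2d(dim, 2, 1))

    def forward(self, x):  # x: (B, C, H, W)
        g = self.grid
        # base low-resolution reference grid in [-1, 1], (g, g, 2) as (y, x)
        ys = torch.linspace(-1, 1, g, device=x.device)
        xs = torch.linspace(-1, 1, g, device=x.device)
        base = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), -1)
        # content-dependent offsets, kept small via tanh scaling
        off = torch.tanh(self.offset_net(x)) * (self.offset_range / g)
        off = off.permute(0, 2, 3, 1)                 # (B, g, g, 2)
        grid_xy = (base + off).flip(-1)               # grid_sample wants (x, y)
        # bilinear interpolation at the deformed points -> agent features
        return F.grid_sample(x, grid_xy, align_corners=True)  # (B, C, g, g)
```

The sampled agent features would then stand in for the fixed region-level queries/keys in the routing step, letting the coarse grid bend toward semantic object boundaries.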
7. Integration and Broader Applications
BRA has been integrated into vision transformer designs (BiFormer (Zhu et al., 2023), DeBiFormer (Long et al., 2024)), CNN backbones (YOLOv7-BRA (Yang et al., 2023)), and U-Net-like pure transformer architectures for biomedical segmentation (Cai et al., 2023).
BRA’s dynamic, query-driven sparsity is shown to generalize well across tasks:
- Image classification
- Object detection (anchor-free, anchor-based, one-stage/two-stage)
- Semantic and instance segmentation
- Fine-grained behavior analysis (e.g., classroom activity detection)
- Medical imaging (Cai et al., 2023)
A plausible implication is that BRA serves as a principled framework for efficient attention in any spatially-structured neural architecture requiring both locality and adaptable long-range context.
References:
- "BiFormer: Vision Transformer with Bi-Level Routing Attention" (Zhu et al., 2023)
- "DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention" (Long et al., 2024)
- "Student Classroom Behavior Detection based on YOLOv7-BRA and Multi-Model Fusion" (Yang et al., 2023)
- "Pubic Symphysis-Fetal Head Segmentation Using Pure Transformer with Bi-level Routing Attention" (Cai et al., 2023)