Cross-Attention Message-Passing Transformers
- CrossMPT is a neural decoding method for error correcting codes that integrates Tanner-graph message passing with masked transformer attention.
- It restricts attention to variable-to-check node interactions, reducing computation and memory costs while delivering superior BER/FER performance.
- Extensions like FCrossMPT and CrossED offer code-agnostic and ensemble benefits, enabling scalability to longer codes and improved error floor mitigation.
Cross-Attention Message-Passing Transformers (CrossMPT) constitute a class of neural decoders for error correcting codes (ECCs) in communications, specifically designed to unify the inductive bias of Tanner-graph message-passing with the expressive power and scalability of deep transformers. These architectures—now including CrossMPT, its foundation model variant FCrossMPT, and ensemble extensions (CrossED, FCrossED)—employ stacked layers of masked cross-attention to propagate information exclusively along variable node (bit) to check node (syndrome) edges as dictated by the code’s parity-check matrix. CrossMPT consistently demonstrates superior decoding accuracy, reduced computational and memory costs, scalability to longer codes, and—through code-agnostic architectural choices—generalizes across code classes and parameters (Park et al., 2024, Park et al., 22 Jun 2025, Lau et al., 19 Sep 2025).
1. Motivation and Conceptual Basis
Classical ECC decoders such as belief propagation (BP) operate by iteratively exchanging messages between variable and check nodes of a bipartite Tanner graph, guided by the code’s parity-check matrix H. Early neural decoders based on transformers, exemplified by ECCT, concatenated all input features (channel magnitude, syndrome) and applied self-attention, artificially treating all pairwise interactions (magnitude–magnitude, syndrome–syndrome, magnitude–syndrome) as equally relevant. This approach lacks structural bias towards the code’s specific message-passing dynamics, resulting in inefficiencies and less effective learning (Park et al., 2024, Park et al., 22 Jun 2025).
CrossMPT addresses this by:
- Maintaining magnitude and syndrome as separate input streams.
- Employing two complementary masked cross-attention layers in each block: one propagates information from check to variable nodes (mask = Hᵀ), the other from variable to check nodes (mask = H).
- Explicitly enforcing code structure via masks matching the Tanner-graph adjacency, such that only relevant edges mediate attention.
This structure injects precise inductive bias, restricts the attention mechanism to physically meaningful interactions, and delivers substantial reductions in both computational and memory requirements.
2. Architectural Details and Mathematical Formulation
CrossMPT’s input consists of channel observations y, processed into:
- Magnitude vector |y| (bit reliabilities),
- Syndrome vector s = H·bin(y) mod 2 (parity failures of the hard decision).
Each element of |y| and s is linearly embedded into a d-dimensional vector.
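The two input streams can be sketched as follows. This is a minimal NumPy example assuming BPSK mapping (bit 0 → +1) over an AWGN channel; the function name `decoder_inputs` and the toy Hamming parity-check matrix are illustrative, not taken from the papers:

```python
import numpy as np

def decoder_inputs(y, H):
    """Split a real channel observation y into CrossMPT's two input
    streams: bit reliabilities |y| and the syndrome of the hard
    decision, s = H * bin(y) mod 2 (BPSK with bit 0 -> +1 assumed)."""
    magnitude = np.abs(y)             # per-bit reliability, length n
    hard_bits = (y < 0).astype(int)   # 0/1 hard decision on each bit
    syndrome = (H @ hard_bits) % 2    # length n - k; nonzero = failed check
    return magnitude, syndrome

# Toy (7,4) Hamming parity-check matrix
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
y = np.array([0.9, 1.1, -0.2, 0.8, 1.3, 0.7, 1.0])  # noisy all-zero codeword
mag, syn = decoder_inputs(y, H)
print(syn)  # [0 1 1]: the flipped third bit trips two parity checks
```

Each element of these two vectors is then embedded independently, preserving the separation of the streams through the decoder.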
For each of N stacked decoding layers:
- Magnitude update (CN → VN): masked cross-attention with the magnitude embeddings as queries and the syndrome embeddings as keys/values.
- Mask: Hᵀ (size n × (n−k)), so attention propagates only along Tanner-graph edges.
- Feed-forward, add, and normalize on the updated magnitude.
- Syndrome update (VN → CN): masked cross-attention with the syndrome embeddings as queries and the updated magnitude embeddings as keys/values.
- Mask: H (size (n−k) × n).
- Followed by another feed-forward, add, and normalization.
- Output Layer: After N such blocks, the final embeddings are concatenated, normalized, and projected via two fully connected layers to yield the estimated noise ẑ, with binary cross-entropy loss applied to the predicted error pattern.
This cross-graph attention mechanism, strictly masked according to H and Hᵀ, ensures model updates respect the true code structure.
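A single pass of the two masked cross-attention updates can be sketched in NumPy. The single-head, projection-free attention and the toy (7,4) Hamming matrix are simplifications for illustration; the real model uses multi-head attention with learned projections, feed-forward sublayers, and residual normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 7, 3, 8                       # code length, checks (n - k), embed dim

H = np.array([[1, 1, 0, 1, 1, 0, 0],    # toy (7,4) Hamming parity-check matrix
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]], dtype=bool)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def masked_cross_attention(Q, K, V, mask):
    # mask is additive: 0 on Tanner-graph edges, -inf elsewhere, so the
    # softmax assigns zero weight to every non-edge position.
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + mask
    return softmax(scores) @ V

mag = rng.normal(size=(n, d))           # variable-node (magnitude) embeddings
syn = rng.normal(size=(m, d))           # check-node (syndrome) embeddings

mask_c2v = np.where(H.T, 0.0, -np.inf)  # (n, m): variables attend to checks
mask_v2c = np.where(H, 0.0, -np.inf)    # (m, n): checks attend to variables

mag_new = masked_cross_attention(mag, syn, syn, mask_c2v)          # CN -> VN
syn_new = masked_cross_attention(syn, mag_new, mag_new, mask_v2c)  # VN -> CN
print(mag_new.shape, syn_new.shape)     # (7, 8) (3, 8)
```

Because each attention map is only n × (n−k) and zero off the Tanner graph, the update touches exactly the edges that BP message passing would.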
3. Code-Agnostic, Scalable, and Ensemble Variants
To support code-agnostic operation and foundation model deployment across diverse codes (lengths, rates, families), FCrossMPT removes positional encoding and uses shared input embeddings.
FCrossMPT thereby supports a “set”-structure, remaining invariant to code parameters and enabling training on multiple codes within a single network—with only the attention masks adapting per code via H and Hᵀ (Park et al., 22 Jun 2025).
The ensemble variant, CrossED, constructs several different parity-check matrices of the same code, processes each syndrome/magnitude pair with the same CrossMPT block in parallel (same weights, different masks), and fuses latent outputs by summation before the output head. This approach is especially effective in mitigating error floors on short codes: with no parameter or latency increase, CrossED achieves order-of-magnitude BER/BLER improvements over single-block CrossMPT for BCH and LDPC families.
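The shared-weight, multi-mask idea can be sketched as follows. This is a deliberately reduced illustration: `shared_block` stands in for a full CrossMPT block, the weight matrix `W` for its learned parameters, and in the real model each parity-check matrix also yields its own syndrome input:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def shared_block(mag, syn, mask, W):
    """One shared-weight cross-attention block; only `mask` differs per H."""
    scores = (mag @ W) @ syn.T / np.sqrt(mag.shape[-1]) + mask
    return softmax(scores) @ syn

def cross_ed(mag, syn, H_list, W):
    # Same weights W for every parity-check matrix; the per-matrix latent
    # outputs are fused by summation before the (omitted) output head.
    latents = [shared_block(mag, syn, np.where(Hj.T, 0.0, -np.inf), W)
               for Hj in H_list]
    return np.sum(latents, axis=0)

rng = np.random.default_rng(1)
n, m, d = 7, 3, 8
H1 = np.array([[1, 1, 0, 1, 1, 0, 0],
               [1, 0, 1, 1, 0, 1, 0],
               [0, 1, 1, 1, 0, 0, 1]], dtype=bool)
# Equivalent matrix: second row replaced by (row0 + row1) mod 2 -> same code
H2 = np.array([[1, 1, 0, 1, 1, 0, 0],
               [0, 1, 1, 0, 1, 1, 0],
               [0, 1, 1, 1, 0, 0, 1]], dtype=bool)

W = rng.normal(size=(d, d)) / np.sqrt(d)
mag = rng.normal(size=(n, d))
syn = rng.normal(size=(m, d))
fused = cross_ed(mag, syn, [H1, H2], W)
print(fused.shape)  # (7, 8)
```

Because only the masks differ, the ensemble adds no parameters, and the parallel blocks add no sequential latency.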
4. Complexity, Training Efficiency, and Scalability
Computational complexity per decoding layer:
- ECCT: O((2n−k)²·d) for self-attention over the concatenated length-(2n−k) stream.
- CrossMPT: O(2n(n−k)·d) for two cross-attention maps, each masked by H or Hᵀ.
Defining the code rate R = k/n, the cost ratio is
2n(n−k) / (2n−k)² = 2(1−R) / (2−R)².
Thus, CrossMPT reduces attention-related compute and memory usage by at least 50% for any nontrivial code, with the ratio shrinking toward zero for high-rate codes. Training time per epoch is 50–65% lower than ECCT for key code families; training and inference remain scalable to blocklengths beyond 1000 (e.g., the (1056,880) WiMAX LDPC code), whereas ECCT fails to scale beyond moderate blocklengths (Park et al., 2024, Park et al., 22 Jun 2025).
5. Performance Across Codes and Empirical Results
CrossMPT consistently demonstrates superior performance as measured by BER and FER over belief propagation, alternative neural decoders, and prior transformer approaches across BCH, LDPC, Polar, and Turbo codes.
Selected empirical results (Park et al., 2024, Table 1):
| Code | Method | −ln(BER) at fixed E_b/N_0 |
|---|---|---|
| (31,16) BCH | BP | 7.60 |
| (31,16) BCH | Hyp BP | 8.80 |
| (31,16) BCH | AR BP | 9.60 |
| (31,16) BCH | ECCT | 10.66 |
| (31,16) BCH | CrossMPT | 12.48 |
| (121,70) LDPC | ECCT | 16.11 |
| (121,70) LDPC | CrossMPT | 17.52 |
| (64,32) Polar | ECCT | 12.32 |
| (64,32) Polar | CrossMPT | 13.31 |
| (132,40) Turbo | ECCT | 9.06 |
| (132,40) Turbo | CrossMPT | 10.94 |
- On Long Codes: On the (384,320) WRAN LDPC code, CrossMPT outperforms ECCT by up to 1 dB, and it scales up to the (1056,880) WiMAX LDPC code, where ECCT is infeasible.
- Ensemble decoding (CrossED): >2 orders of magnitude BER improvement at moderate SNR for (63,30) BCH, without additional parameters or latency (Park et al., 22 Jun 2025).
- FCrossMPT and FCrossED foundation models generalize across code rate and class, achieving within 0.1–0.2 dB of per-code CrossMPT.
6. Extensions, Limitations, and Future Directions
CrossMPT’s structure (exact Tanner-graph masking, code-parameterized attention) provides a “drop-in” replacement for ECCT and similarly structured neural decoders whenever H is available. Nonetheless, several limitations remain:
- The O(n(n−k)) attention cost, though effectively linear in code length for high-rate codes, may remain a bottleneck for ultra-long or low-rate codes.
- All CrossMPT variants require explicit knowledge of H; adaptation to codes with time-varying or highly irregular parity-checks may necessitate dynamic mask update strategies.
- CrossMPT and FCrossMPT do not natively incorporate codebook-level global properties, though ensemble and loss-term innovations begin to address this.
Open research problems include: extending CrossMPT to joint code-network co-design (learning or adapting mask structure), integrating diffusion-based decoding mechanisms, developing sparsified or dynamic attention heads for ultra-long codes, and exploring domain-specific pretraining regimes (syndrome- or magnitude-only) for resource-constrained environments (Park et al., 2024, Park et al., 22 Jun 2025).
7. Related Variants and Advances
Differential-Attention Message-Passing Transformers (DiffMPT) further augment the CrossMPT principle by employing a difference of masked and unmasked softmaxes in the attention computation, suppressing background (non-Tanner-graph) interactions, and by introducing a differentiable syndrome loss incorporating soft check node consistency (Lau et al., 19 Sep 2025). DiffMPT empirically yields an additional 0.2–0.3 dB gain on short-to-medium blocklength LDPC and polar codes at relevant FER regimes.
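The difference-of-softmaxes idea can be illustrated in isolation. The fixed scalar `lam` stands in for what is presumably a learned coefficient in DiffMPT, and the single attention map shown omits the rest of the block:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def differential_attention(scores, mask, lam=0.5):
    # The masked softmax keeps weight only on Tanner-graph edges; subtracting
    # a scaled unmasked softmax drives background (non-edge) weights negative,
    # actively suppressing off-graph interactions rather than just zeroing them.
    return softmax(scores + mask) - lam * softmax(scores)

H = np.array([[1, 1, 0, 1, 1, 0, 0],    # toy (7,4) Hamming parity-check matrix
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]], dtype=bool)
mask = np.where(H.T, 0.0, -np.inf)      # (n, m): variables attending to checks
scores = np.random.default_rng(2).normal(size=(7, 3))
A = differential_attention(scores, mask)
print(A.sum(axis=-1))                   # each row sums to 1 - lam
```

Off-graph entries of `A` are strictly negative, which is the "suppression" effect; a plain masked softmax would leave them at exactly zero.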
Other neural message-passing frameworks have adopted similar masked cross-attention paradigms, now regarded as foundational for scalable transformer-based ECC decoding in next-generation communication scenarios.
References:
- CrossMPT: Cross-attention Message-Passing Transformer for Error Correcting Codes (Park et al., 2024)
- Cross-Attention Message-Passing Transformers for Code-Agnostic Decoding in 6G Networks (Park et al., 22 Jun 2025)
- Interplay Between Belief Propagation and Transformer: Differential-Attention Message Passing Transformer (Lau et al., 19 Sep 2025)