Layer-wise Dynamic Attention PINN
- The paper introduces a dynamic attention mechanism at each hidden layer that enhances PINNs’ expressivity by adaptively fusing multi-view input encodings.
- It employs layer-wise re-encoding and gating networks to resolve gradient conflicts and improve optimization in multi-task PDE solvers.
- Empirical results show that LDA-PINN achieves significantly lower errors and faster convergence compared to standard PINNs across various benchmark problems.
Layer-wise Dynamic Attention PINN (LDA-PINN) is an advanced variant of physics-informed neural networks (PINNs) designed to address representational and optimization limitations inherent in standard PINN frameworks. LDA-PINN enhances the expressivity of the neural architecture by augmenting each hidden layer with a flexible, input-aware dynamic attention mechanism, enabling feature-wise fusion of multi-view encodings at every network depth. Its design is situated within a broader architecture-optimization co-design paradigm, with the aim of improving the trainability and accuracy of PINN-based solvers for partial differential equations (PDEs) (Niu et al., 19 Jan 2026).
1. Architectural Fundamentals of LDA-PINN
At the core of LDA-PINN is the replacement of the traditional multilayer perceptron (MLP) backbone in PINNs with a layer-wise dynamic attention (LDA) mechanism. Standard PINNs approximate solutions to PDEs by optimizing a loss functional composed of the residuals of the governing equations, initial, and boundary conditions:

$$\mathcal{L}(\theta) = \lambda_r \mathcal{L}_r(\theta) + \lambda_{ic} \mathcal{L}_{ic}(\theta) + \lambda_{bc} \mathcal{L}_{bc}(\theta)$$

where, for example, $\mathcal{L}_r(\theta) = \frac{1}{N_r}\sum_{i=1}^{N_r} \left|\mathcal{R}[u_\theta](x_i)\right|^2$ is the mean-squared PDE residual evaluated at $N_r$ collocation points.
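The composite loss above can be sketched in a few lines; this is a minimal illustration, assuming mean-squared residual terms and the unit weights reported later in the experimental settings (the function name and argument names are illustrative, not the paper's):

```python
import numpy as np

def composite_pinn_loss(res_pde, res_ic, res_bc, lam=(1.0, 1.0, 1.0)):
    """Weighted sum of mean-squared residuals for the PDE interior,
    initial-condition, and boundary-condition terms."""
    l_r = np.mean(res_pde ** 2)
    l_ic = np.mean(res_ic ** 2)
    l_bc = np.mean(res_bc ** 2)
    return lam[0] * l_r + lam[1] * l_ic + lam[2] * l_bc
```

In practice each residual array would come from automatic differentiation of the network output at collocation, initial, and boundary points; here they are plain arrays to keep the sketch self-contained.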
LDA-PINN augments the forward pass as follows: at each hidden layer $\ell$, the MLP activation $a^{\rm MLP}_\ell$ is combined with two input re-encodings, $e^{(1)}_\ell$ and $e^{(2)}_\ell$, both parameterized via input-specific transformations. The combined feature set, after concatenation, is passed through a gating network producing dynamic, feature-wise softmax weights $\alpha_\ell$, which are used to modulate the two views and residually inject the attention-composed feature vector $m_\ell$ into the MLP activations:

$a_\ell = a^{\rm MLP}_\ell + m_\ell \tag{2.12}$
This process is repeated at each layer, resulting in a multi-layer architecture where input coordinates are dynamically re-encoded and adaptively fused at every depth (Niu et al., 19 Jan 2026).
2. Layer-wise Dynamic Attention Mechanism
The LDA mechanism operates as follows:
- Input Re-Encoding: For each layer $\ell$, two distinct "views" of the raw input $x$ are formed:

$e^{(1)}_\ell = \sigma\!\left(W^{(1)}_\ell x + b^{(1)}_\ell\right), \qquad e^{(2)}_\ell = \sigma\!\left(W^{(2)}_\ell x + b^{(2)}_\ell\right)$

where $W^{(k)}_\ell$ and $b^{(k)}_\ell$ are learnable, layer-specific weights and biases, and $\sigma$ is an elementwise nonlinearity (e.g., tanh).
- Gating and Normalization: The activation $a^{\rm MLP}_\ell$ and both input views are concatenated and processed by a gating network $g_\ell$, which returns per-feature unnormalized gates. A feature-wise softmax over the two views yields normalized weights $\alpha_\ell$:

$\alpha_\ell = \operatorname{softmax}\!\left(g_\ell\!\left(\left[a^{\rm MLP}_\ell;\, e^{(1)}_\ell;\, e^{(2)}_\ell\right]\right)\right)$
- Feature Fusion: The attention weights combine the two input views for each feature, producing a modulation vector $m_\ell$:

$m_\ell = \alpha^{(1)}_\ell \odot e^{(1)}_\ell + \alpha^{(2)}_\ell \odot e^{(2)}_\ell$
- Residual Injection: The modulation vector $m_\ell$ is added to the current layer activation, ensuring the original features and attended signals are both accessible.
This process enables recurrent, data-dependent coordinate re-injection and adaptively shaped representations across all layers (Niu et al., 19 Jan 2026).
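The four steps above can be sketched as a single layer update. This is a minimal NumPy illustration under assumed shapes and hypothetical parameter names (`W1`, `W2`, `Wg`, etc. are not the paper's notation); a real implementation would use a differentiable framework:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def lda_layer(a_mlp, x, params):
    """One layer-wise dynamic attention step (illustrative sketch).

    a_mlp : (d,) MLP activation of the current hidden layer
    x     : (n,) raw input coordinates
    params: dict of hypothetical learnable arrays W1/b1, W2/b2 (views)
            and Wg/bg (gating network, here a single linear map).
    """
    d = a_mlp.shape[0]
    # Input re-encoding: two tanh "views" of the raw coordinates.
    e1 = np.tanh(params["W1"] @ x + params["b1"])   # (d,)
    e2 = np.tanh(params["W2"] @ x + params["b2"])   # (d,)
    # Gating: unnormalized gates from the concatenated features,
    # one gate per (feature, view) pair, softmax-normalized per feature.
    h = np.concatenate([a_mlp, e1, e2])             # (3d,)
    gates = (params["Wg"] @ h + params["bg"]).reshape(d, 2)
    alpha = softmax(gates, axis=1)                  # each row sums to 1
    # Feature fusion and residual injection.
    m = alpha[:, 0] * e1 + alpha[:, 1] * e2
    return a_mlp + m
```

Stacking this update over all hidden layers reproduces the repeated, data-dependent coordinate re-injection described above.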
3. Training and Optimization: Multi-Task and Conflict-Resolution
LDA-PINN preserves the standard composite loss formulation but interprets the constituent losses ($\mathcal{L}_r$, $\mathcal{L}_{ic}$, $\mathcal{L}_{bc}$) as separate optimization "tasks". To address harmful gradient interference among these tasks, gradient conflict detection and resolution is applied:
- Gradient Calculation: For each task $i$, compute the parameter gradient $g_i = \nabla_\theta \mathcal{L}_i$.
- Conflict Detection and Projection: For all pairs $(i, j)$ where $g_i \cdot g_j < 0$, remove from $g_i$ the component that opposes $g_j$:

$g_i \leftarrow g_i - \frac{g_i \cdot g_j}{\lVert g_j \rVert^2}\, g_j$
- Gradient Aggregation and Update: The resolved gradients for each loss are summed, and the parameter update is performed with Adam or standard SGD.
This strategy, referred to as Projected Conflicting Gradient (PCGrad), is designed to preserve cooperative descent directions while reducing destructive gradient components, yielding improved convergence and stability (Niu et al., 19 Jan 2026).
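The projection step is small enough to sketch directly. This is a minimal PCGrad-style implementation on flat gradient vectors (each conflicting pair is projected, then the results are summed); ordering randomization from the original PCGrad procedure is omitted for clarity:

```python
import numpy as np

def pcgrad(grads):
    """Resolve pairwise gradient conflicts, then sum.

    grads: list of 1-D parameter-gradient vectors, one per loss term.
    Each gradient has the component opposing any conflicting peer
    (negative inner product) projected out.
    """
    projected = [np.asarray(g, dtype=float).copy() for g in grads]
    for g_i in projected:
        for g_j in grads:  # project against the *original* gradients
            dot = g_i @ g_j
            if dot < 0.0:  # conflict detected
                g_i -= dot / (g_j @ g_j) * g_j
    return sum(projected)
```

Note that when no pair of task gradients conflicts, `pcgrad` reduces to the plain gradient sum used by the naive multi-task baseline.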
4. Hyperparameters, Architectures, and Training Protocols
Key experimental settings for LDA-PINN are as follows:
- Optimizer: Adam with its default learning rate.
- Loss weights: all set to unity ($\lambda_r = \lambda_{ic} = \lambda_{bc} = 1$).
- Activation: tanh in all cases.
- Network architectures:
- Burgers: 4 hidden layers × 20 neurons
- Helmholtz: 4 hidden layers × 50 neurons
- Klein–Gordon: 3 hidden layers × 50 neurons
- Lid-driven cavity: 3 hidden layers × 50 neurons (outputs: streamfunction plus pressure)
- Collocation points: problem-dependent (separate counts for Burgers/KG and for the cavity problem).
- Training iterations: 40,000 (except cavity: 20,000).
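The reported architectures and iteration budgets can be collected into a configuration table; this is an illustrative sketch (the dict keys and helper are hypothetical, and collocation-point counts are omitted where the text does not state them):

```python
# Per-benchmark settings as reported: hidden layers x width, iterations.
CONFIGS = {
    "burgers":      {"hidden_layers": 4, "width": 20, "iters": 40_000},
    "helmholtz":    {"hidden_layers": 4, "width": 50, "iters": 40_000},
    "klein_gordon": {"hidden_layers": 3, "width": 50, "iters": 40_000},
    "cavity":       {"hidden_layers": 3, "width": 50, "iters": 20_000},
}

def layer_sizes(cfg, n_in, n_out):
    """Full layer-size list for the MLP backbone under a given config."""
    return [n_in] + [cfg["width"]] * cfg["hidden_layers"] + [n_out]
```

For example, the Burgers network with inputs $(x, t)$ and a scalar output would have layer sizes `[2, 20, 20, 20, 20, 1]`.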
The fully integrated ACR-PINN uses both the LDA backbone and PCGrad. LDA-PINN denotes the architecture with dynamic attention but standard (naive) summation of gradients (Niu et al., 19 Jan 2026).
5. Empirical Performance and Benchmark Results
LDA-PINN yields substantial improvements over standard PINNs in terms of accuracy and convergence speed. Across benchmark tasks, including Burgers, Helmholtz (multiple parameter regimes), Klein–Gordon, and lid-driven cavity flow PDEs, LDA-PINN achieves significantly lower mean and maximum relative errors compared to baselines.
| Model | Burgers (×10⁻³) | Helmholtz (1,4) (×10⁻²) | Klein–Gordon (×10⁻³) |
|---|---|---|---|
| Std-PINN | 9.96 ± 5.59 | 17.7 ± 5.5 | 62.3 ± 13.5 |
| LDA-PINN | 2.60 ± 1.89 | 3.06 ± 0.72 | 21.7 ± 4.1 |
LDA-PINN consistently outperforms the standard PINN and is outperformed only by the full ACR-PINN, which integrates both architecture (LDA) and optimization (PCGrad) enhancements (Niu et al., 19 Jan 2026).
6. Theoretical Insights and Synergy with Conflict-Resolved Optimization
The combination of LDA and conflict-resolved optimization (PCGrad) yields a synergistic effect. LDA modifies the geometry of task-specific gradients, improving their conditioning and separability by re-injecting coordinate-aware information throughout the network. This makes the detection and correction of destructive gradient interference more tractable for PCGrad. In turn, PCGrad ensures that architecture-enhanced representations are not hindered by optimization bias due to conflicting objectives. This co-design reduces both spectral and local representational underfitting as well as harmful inter-task optimization artifacts (Niu et al., 19 Jan 2026).
A plausible implication is that similar attention-based or dynamic input encoding mechanisms, when combined with multi-task training schemes utilizing conflict-aware optimization, can yield robust improvements across broader classes of scientific machine learning problems, provided task gradients are sufficiently structured by the architecture.
7. Relation to Extended ACR-PINN Designs
LDA-PINN constitutes one axis of an ACR-PINN—Architecture-Conflict-Resolved PINN—framework. This modularity allows integration of LDA with complementary advances targeting other failure modes. For instance, in Finite-PINN, finite geometric encoding and hybrid Euclidean–topological solution spaces are used to address geometric and boundary misalignment in solid mechanics PINNs (Li et al., 2024). This suggests the LDA methodology is generalizable and can augment various domain-specific PINN optimizations, provided the architectural design remains compatible with task-specific requirements.