NAG-LoRA: Unified Graph Reasoning in LMs
- NAG-LoRA is a parameter-efficient extension of Transformer-based language models that natively integrates graph structures using topology-aware attention and low-rank adaptation.
- It employs a multi-mask attention mechanism and structural position calibration to jointly model linguistic and graph-theoretic information without external GNNs.
- Empirical evaluations demonstrate that NAG-LoRA outperforms dual-path and prefix-based methods on both synthetic and real-world graph reasoning tasks with minimal parameter overhead.
NAG-LoRA is a parameter-efficient extension of Transformer-based LMs enabling native comprehension of structured text-graphs without relying on external Graph Neural Networks (GNNs). Developed as part of the Native Architecture for Graphs (NAG) paradigm, NAG-LoRA introduces topology-aware attention, structural position calibration, and low-rank attention adaptation through LoRA (Low-Rank Adaptation) modules. This design allows pre-trained decoder-only LMs to internalize both linguistic and graph-theoretic reasoning, handling node/edge semantics and structural topology concurrently within the model’s manifold, and outperforms established dual-path and prefix-based alternatives on both synthetic and real-world graph reasoning tasks (Gong et al., 30 Jan 2026).
1. Topology-Aware Attention and Input Construction
NAG-LoRA represents graphs with sequences in which each node and edge is encapsulated within special tags (“<n>…</n>”, “<e>…</e>”), and the ensemble is surrounded by a global tag (“<g>…</g>”). The input sequence is designed to be permutation-invariant with respect to the order of nodes and edges, enforcing that the linguistic serialization does not encode spurious sequential bias.
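A minimal sketch of this serialization, assuming a simple textual rendering of nodes and directed edges (the tag names follow the paper; the per-element text format is an illustrative assumption):

```python
def serialize_graph(nodes, edges):
    """Wrap each node/edge in element tags and the whole graph in <g>...</g>.

    Because structure is carried by attention masks and calibrated
    positions rather than token order, any permutation of `nodes`
    and `edges` is meant to yield an equivalent input.
    """
    parts = [f"<n>{name}</n>" for name in nodes]
    parts += [f"<e>{u} -> {v}</e>" for u, v in edges]
    return "<g>" + "".join(parts) + "</g>"

seq = serialize_graph(["A", "B", "C"], [("A", "B"), ("B", "C")])
```

The query text is then appended after the closing `</g>` tag, so query tokens can be masked and positioned separately from the graph block.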
To induce graph-structural dependencies during encoding, a binary attention mask $M \in \{0,1\}^{T \times T}$ is applied at every Transformer block. $M_{ij} = 1$ indicates that token $i$ may attend to token $j$, and $M$ is constructed as the logical OR of four sub-masks:
- Intra-element (causal) mask $M^{\text{intra}}$: permits causal attention within each element.
- Inter-element mask $M^{\text{inter}}$: links the closing ("hub") tokens of nodes and edges according to graph connectivity, supporting directed message passing.
- Global mask $M^{\text{glob}}$: enables global "gather-and-broadcast"—the global closing tag aggregates from all hubs, and the opening tag broadcasts to all tokens.
- Query-graph mask $M^{\text{qg}}$: allows query tokens to attend either only to all hubs ("Sparse" regime) or to all tokens ("Full" regime).
Formally, for element $e$ with token set $T_e$ and hub index $h_e$:
- $M^{\text{intra}}_{ij} = 1$ iff $i, j \in T_e$ and $j \le i$,
- $M^{\text{inter}}_{h_a h_b} = 1$ iff elements $a$ and $b$ are connected in the graph, with direction following edge orientation,
- $M^{\text{glob}}_{ij} = 1$ iff $i$ is the global closing tag and $j$ is a hub, or $j$ is the global opening tag,
- For $M^{\text{qg}}$ in Sparse mode: $M^{\text{qg}}_{ij} = 1$ iff $i$ is a query token and $j$ is a hub.
This multi-faceted masking enables precise control over token-level dependency, allowing the LM to model graph semantics and topology jointly.
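The OR-composition of the four sub-masks can be sketched as follows, with hand-labeled token metadata standing in for a real tokenizer (the sub-mask semantics follow the descriptions above; the adjacency encoding is an assumption):

```python
import numpy as np

def build_mask(elem_id, is_hub, is_query, adjacency, g_open, g_close, sparse=True):
    """Combine the four NAG-LoRA sub-masks with logical OR.

    elem_id[i]: element index of token i (-1 for global/query tokens).
    adjacency:  set of (src, dst) element pairs; the hub of dst may
                attend to the hub of src (directed message passing).
    Returns M with M[i, j] = True iff token i may attend to token j.
    """
    T = len(elem_id)
    M = np.zeros((T, T), dtype=bool)
    for i in range(T):
        for j in range(T):
            # 1) intra-element causal mask
            if elem_id[i] == elem_id[j] != -1 and j <= i:
                M[i, j] = True
            # 2) inter-element mask: hub-to-hub along graph connectivity
            if is_hub[i] and is_hub[j] and (elem_id[j], elem_id[i]) in adjacency:
                M[i, j] = True
            # 3) global mask: </g> gathers from all hubs, <g> broadcasts to all
            if i == g_close and is_hub[j]:
                M[i, j] = True
            if j == g_open:
                M[i, j] = True
            # 4) query-graph mask: hubs only (Sparse) or all tokens (Full),
            #    plus causal attention among the query tokens themselves
            if is_query[i] and (is_hub[j] or not sparse):
                M[i, j] = True
            if is_query[i] and is_query[j] and j <= i:
                M[i, j] = True
    return M
```

In a real implementation this mask would be precomputed once per batch and added to the attention scores as a large negative bias, as described in the adaptation section.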
2. Low-Rank Adaptation Mechanism
Parameter-efficient adaptation is achieved by injecting LoRA modules into the query, key, and value projections within the attention layers. For a projection weight matrix $W \in \mathbb{R}^{d \times d}$, the updated parameterization employs:

$$W' = W + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times d},$$

with separate LoRA modules for $W_Q$, $W_K$, $W_V$. The attention calculation thus becomes:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + B_M\right) V,$$

where $B_M$ injects infinite negative bias ($-\infty$) wherever $M_{ij} = 0$, enforcing precise topology-aware sparsity. Only the LoRA parameters and the new token embeddings are trainable; the backbone LM weights remain frozen.
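A minimal numpy sketch of this mechanism, assuming standard LoRA shapes and the usual $\alpha/r$ scaling (not a verbatim excerpt of the NAG-LoRA implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 8, 2, 5

W_q = rng.normal(size=(d, d))        # frozen backbone projection weight
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # B starts at zero, so W' == W initially
alpha = 16

def lora_proj(x, W, B, A, alpha, r):
    # W' x = W x + (alpha / r) * B (A x): frozen path plus low-rank update
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def masked_attention(q, k, v, mask):
    # additive bias: -inf wherever M_ij = 0, so softmax assigns zero weight
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because `B` is zero-initialized, the adapted projection exactly reproduces the frozen backbone at the start of training, a standard LoRA property that keeps early optimization stable.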
3. Structural Position Calibration
To eliminate sequence-order bias while maintaining structural awareness, NAG-LoRA employs Rotary Positional Embeddings (RoPE) with a custom "hub" indexing scheme. Given the position of "<g>" and the maximum element length:
- All element hubs are mapped to a single shared position.
- The global closing tag "</g>" is assigned its own dedicated position.
- Within elements, positions increment normally.
- Query tokens resume sequential indexing after the graph block.
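A sketch of this calibrated indexing, with illustrative offsets (the shared hub slot and query start positions here are assumptions consistent with the scheme above, not the paper's exact formulas):

```python
def calibrated_positions(token_roles, g_open_pos=0, max_elem_len=4):
    """token_roles: list of (role, elem_id) pairs, where role is one of
    'g_open', 'elem', 'hub', 'g_close', 'query'.

    All hubs share one position id, so query-hub RoPE offsets are
    invariant to the serialization order of graph elements.
    """
    hub_pos = g_open_pos + max_elem_len + 1   # one shared slot for every hub
    positions, elem_offset = [], {}
    query_pos = hub_pos + 2                   # queries resume counting here
    for role, eid in token_roles:
        if role == "g_open":
            positions.append(g_open_pos)
        elif role == "hub":
            positions.append(hub_pos)         # uniform position for all hubs
        elif role == "g_close":
            positions.append(hub_pos + 1)
        elif role == "elem":
            off = elem_offset.get(eid, 0) + 1 # restart the count per element
            elem_offset[eid] = off
            positions.append(g_open_pos + off)
        else:                                 # query tokens: sequential again
            positions.append(query_pos)
            query_pos += 1
    return positions
```

Since RoPE attention depends only on position differences, any two hubs look identical to a query token under this scheme, which is exactly the permutation invariance the section describes.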
Relative attention thus relies solely on pairwise position differences, ensuring that attention between queries and hubs remains invariant to the order of graph elements. This strategy has been empirically validated, with deviations from position calibration degrading accuracy by up to 2.95% in connected-nodes tasks (Gong et al., 30 Jan 2026).
4. Training Regime and Optimization
NAG-LoRA is trained autoregressively to minimize the negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{t} \log p_{\theta}(y_t \mid y_{<t}, x),$$

where $\theta$ encompasses only the LoRA parameters and new token embeddings. No auxiliary loss is applied; structural consistency emerges directly from masked attention and recalibrated positions.
Optimization utilizes the AdamW algorithm with a weight decay of $0.01$ on LoRA parameters, a 500-step linear learning-rate warm-up, and dropout rates inherited from the backbone LM. Training typically employs a batch size of 32 sequences per GPU (16–24 tokens per graph and query), with early stopping on validation loss or task accuracy (3–5 epochs).
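The warm-up schedule can be sketched as follows; the peak learning-rate value is left as a parameter because the source does not state it, and the linear decay to zero is an assumption consistent with the hyperparameter table below:

```python
def lr_at(step, base_lr, warmup_steps=500, total_steps=10_000):
    """Linear warm-up for `warmup_steps`, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 1.0 - frac)
```

In practice this function would be passed per-step to the AdamW optimizer that updates only the LoRA parameters and new token embeddings.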
5. Empirical Results and Comparative Evaluation
Topological Awareness Tasks
On nine synthetic graph tasks, NAG-LoRA with a Qwen3-600M backbone demonstrated near-perfect or high accuracy, including:
| Task | Accuracy | AbsErr/F1 |
|---|---|---|
| Node Count | 100.00% | 0.00 |
| Edge Count | 94.95% | 0.06 |
| Cycle Check | 99.90% | — |
| Triangle Count | 74.35% | 0.89 |
| Node Degree | 99.75% | 0.00 |
| Connected Nodes | 84.90% | F1=0.98 |
| Reachability | 99.90% | — |
| Edge Existence | 99.70% | — |
| Shortest Path | 95.00% | 0.06 |
NAG-LoRA outperformed both LoRA-tuned linearization baselines (Qwen3-LoRA) and dual-path GNN-prefix methods (GraphToken), with the largest gains observed on higher-order tasks (Triangle Count +6.35%, Shortest Path +3.75% relative to NAG-Zero).
Semantic Graph Reasoning
On ExplaGraphs, SceneGraphs, and WebQSP real-world benchmarks:
| Benchmark | NAG-LoRA | Qwen3-LoRA | GraphToken |
|---|---|---|---|
| ExplaGraphs (Acc) | 82.49% | 62.09% | — |
| SceneGraphs (Acc) | 83.82% | 83.71% | — |
| WebQSP (Hit@1) | 55.25% | 44.37% | — |
These results indicate substantial improvement over both the linearization and token-prefix baselines, particularly for challenging semantic reasoning tasks. While the zero-shot NAG-Zero variant is competitive, LoRA adaptation bridges a semantic capacity gap by enabling all attention weights to learn graph dependencies.
Ablation and Performance Analysis
- Interaction strategy: “Sparse” vs. “Full” query-graph attention schemes show task- and regime-dependent trade-offs, with no universal optimum.
- Position calibration: Standard absolute positions introduce bias and degrade performance, confirming the necessity of recalibrated hub assignment (Gong et al., 30 Jan 2026).
6. Computational Efficiency and Practical Guidance
NAG-LoRA adds only $2rd$ parameters per attention projection (approximately 0.05–0.1% of model size at the recommended rank $r = 8$). Inference throughput is minimally affected (<5% reduction relative to the base LM), and the training memory overhead is small enough to permit full batch updates on a single 24 GB GPU with FP16 mixed precision.
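A back-of-envelope check of the $2rd$ overhead claim (the hidden size and layer count below are illustrative placeholders, not Qwen3-600M's actual configuration):

```python
def lora_params(d_model, r, n_layers, projections_per_layer=3):
    """Total added parameters: each d x d projection gains a pair of
    low-rank factors A (r x d) and B (d x r), i.e. 2*r*d parameters,
    for the Q, K, V projections in every layer."""
    return 2 * r * d_model * projections_per_layer * n_layers

# hypothetical model: hidden size 1024, 28 layers, rank 8
added = lora_params(d_model=1024, r=8, n_layers=28)
```

The fraction of total model size then follows by dividing `added` by the backbone parameter count.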
Recommended hyperparameters for full reproducibility:
| Parameter | Recommended Value |
|---|---|
| Backbone | Qwen3-600M LM |
| LoRA rank | 8 |
| LoRA scaling $\alpha$ | 16 (update scaled by $\alpha / r$) |
| Learning rate schedule | 500-step linear warm-up, then linear decay |
| Weight decay | 0.01 |
| Batch size | 32 sequences/GPU |
| Training epochs | 3–5 |
| Dropout | inherited from backbone LM |
| Positional embedding | RoPE + calibrated hubs |
| Mask computation | Precompute once per batch, add in attention |
| Precision | FP16 |
These guidelines are sufficient to reproduce the reported empirical gains across both synthetic and semantic graph tasks.
7. Significance and Context
NAG-LoRA exemplifies a shift from segregated (GNN-LM) graph-text modeling to a unified, encoder-free approach. By internalizing graph structure through attention mask engineering and efficient LoRA adaptation, NAG-LoRA obviates the complexity of external structural encoders and the need for dual embedding space alignment. The result is a language-native architecture capable of robust, permutation-invariant graph reasoning with negligible increase in parameters or computational burden (Gong et al., 30 Jan 2026). The methodology introduces new opportunities for graph-structured representation learning using text-centric foundation models and demonstrates the critical role of attention masking and positional strategies in bridging the gap between graph and language modalities.