
Pathway Activation Subspaces (PAS)

Updated 26 January 2026
  • Pathway Activation Subspaces are mathematically defined subspaces that capture low-rank input activations in neural architectures, enabling precise expert specialization.
  • They underpin routing in MoE and LoRA architectures by aligning activation energies with expert pathways to improve model stability and performance.
  • PAS methods facilitate rank stabilization, interpretability interventions, and continual learning, reducing forgetting while enhancing multi-task tuning.

A Pathway Activation Subspace (PAS) is a technical construct in neural network analysis that formalizes the set of input directions most responsible for an expert’s low-rank responses within mixture-of-experts (MoE) and low-rank adaptation (LoRA) architectures. The PAS framework is applied to characterize, control, and interpret expert specialization and to design routing and stabilization protocols that enable continual multi-task tuning while mitigating catastrophic forgetting and misaligned drift between routers and experts. PASs also underpin a mathematical and empirical lens on subspace intervention and patching, specifying when an activation subspace is genuinely causally mechanistic or merely an artefact of linear algebraic correlation.

1. Formal Definition and Construction of Pathway Activation Subspaces

Within LoRA-based mechanisms, a frozen linear layer $W \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$ is decomposed via a low-rank update $\Delta W = BA$, where $A \in \mathbb{R}^{r \times d_{\mathrm{in}}}$, $B \in \mathbb{R}^{d_{\mathrm{out}} \times r}$, and rank $r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}})$. For expert $e$, the incremental output on input $h \in \mathbb{R}^{d_{\mathrm{in}}}$ is $\Delta y_e(h) = B_e(A_e h)$. The PAS of expert $e$ is defined as $\mathcal{S}_e = \mathrm{span}(A_e^\top) \subset \mathbb{R}^{d_{\mathrm{in}}}$, i.e., the row-space of $A_e$.

The basis elements $a_{e,1}^\top, \dots, a_{e,r}^\top$ span $\mathcal{S}_e$; the dimensionality is at most $r$, controlled by the LoRA rank. A PAS thus characterizes which input features, when projected onto $\mathcal{S}_e$, are directly actionable by the expert's low-rank pathway. The activation coordinates $z_e(h) = A_e h \in \mathbb{R}^r$ specify a point in this subspace, and the projection $P_e(h) = A_e^\top (A_e A_e^\top)^+ A_e h$ recovers the component of $h$ aligned with expert $e$'s PAS (Hou et al., 19 Jan 2026).
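The PAS coordinates and projection above can be sketched numerically. This is a minimal illustration with assumed toy dimensions and random data, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 4
A_e = rng.standard_normal((r, d_in))   # LoRA down-projection for expert e
B_e = rng.standard_normal((d_out, r))  # LoRA up-projection for expert e
h = rng.standard_normal(d_in)          # input activation

# Coordinates of h in the PAS (row-space of A_e):
z_e = A_e @ h                          # shape (r,)

# Orthogonal projector onto span(A_e^T) via the pseudoinverse:
P_e = A_e.T @ np.linalg.pinv(A_e @ A_e.T) @ A_e
h_parallel = P_e @ h                   # component of h the expert can act on
h_perp = h - h_parallel                # residual, invisible to this expert

# The expert's incremental output depends only on the PAS-aligned component:
assert np.allclose(B_e @ (A_e @ h), B_e @ (A_e @ h_parallel))
```

The final assertion makes the defining property concrete: the component of $h$ orthogonal to the PAS is annihilated by $A_e$ and therefore contributes nothing to $\Delta y_e(h)$.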

Subspace activation patching, as explored in interpretability contexts, similarly takes a hypothesized subspace $S$ of an activation space and intervenes by orthogonally replacing the projected component of a target activation $\mathrm{act}_B$ with that of a source activation $\mathrm{act}_A$: $\mathrm{act}_B^{\mathrm{patched}} = (I - P_S)\,\mathrm{act}_B + P_S\,\mathrm{act}_A$ (Makelov et al., 2023).
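The patching formula can be sketched directly. All dimensions and activations here are assumed toy values for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 12, 3
# Orthonormal basis for a hypothesized k-dimensional subspace S:
S = np.linalg.qr(rng.standard_normal((d, k)))[0]
P_S = S @ S.T                    # orthogonal projector onto S

act_A = rng.standard_normal(d)   # source activation
act_B = rng.standard_normal(d)   # target activation

# Replace the S-component of the target with that of the source:
act_B_patched = (np.eye(d) - P_S) @ act_B + P_S @ act_A

# Inside S the patched activation matches the source; outside S, the target:
assert np.allclose(P_S @ act_B_patched, P_S @ act_A)
assert np.allclose(act_B_patched - P_S @ act_B_patched,
                   act_B - P_S @ act_B)
```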

2. PAS-Guided Routing and Reweighting in MoE-LoRA

PASs enable routing algorithms to operate in the capability-aligned coordinate system defined by each expert's functional subspace. For an input $h$, the activation energy for expert $e$ is $s_e(h) = \frac{1}{r}\|A_e h\|_2^2 = \frac{1}{r}\sum_{k=1}^r (a_{e,k}^\top h)^2$. Mixture weights $\pi_e(h)$ are computed by a softmax over these energies: $\pi_e(h) = \frac{\exp(s_e(h))}{\sum_{e'} \exp(s_{e'}(h))}$.

Routing decisions are thereby grounded in the actual induced PAS activations, maintaining alignment between router and experts; the routed incremental output is $\Delta y(h) = \sum_e \pi_e(h)\, B_e A_e h$, consistent with the per-expert $\Delta y_e$ above. This mechanism prevents the phenomenon termed "misaligned co-drift," where the router's and experts' preferences and specializations gradually diverge due to indiscriminate joint updates (Hou et al., 19 Jan 2026).
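The energy-based routing and mixture step can be sketched as follows, with assumed toy sizes and random weights standing in for trained LoRA experts:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, r, n_experts = 16, 8, 4, 3
A = rng.standard_normal((n_experts, r, d_in))   # per-expert down-projections
B = rng.standard_normal((n_experts, d_out, r))  # per-expert up-projections
h = rng.standard_normal(d_in)

# Activation energy s_e(h) = (1/r) ||A_e h||^2 for each expert:
s = np.array([np.sum((A[e] @ h) ** 2) / r for e in range(n_experts)])

# Softmax over energies (shifted by the max for numerical stability):
pi = np.exp(s - s.max())
pi /= pi.sum()

# Routed incremental output: mixture of expert low-rank updates on h.
delta_y = sum(pi[e] * (B[e] @ (A[e] @ h)) for e in range(n_experts))
```

Because the energies are computed from the experts' own $A_e$ matrices, the router's preferences cannot drift away from what each expert's PAS actually responds to.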

3. Rank Stabilization and Anti-Forgetting via PASs

To protect against forgetting in continual learning, PAS-aware rank stabilization tracks and regularizes the directions within each expert's PAS that have been historically important across tasks. The per-task importance of direction $k$ for expert $e$ at stage $t$ is $I_{e,k}(t) = \mathbb{E}_{h \sim \mathcal{D}_t}\left[\pi_e(h)\,(a_{e,k}^\top h)^2\right]$, aggregated over prior tasks as $I_{e,k}^{\mathrm{agg}}(t-1) = \sum_{t'=1}^{t-1} I_{e,k}(t')$. A quadratic penalty is imposed on large changes to critically activated directions: $\mathcal{L}_{\mathrm{stab}} = \sum_{e,k} w_{e,k}\,\|a_{e,k}^{(t)} - a_{e,k}^{(t-1)}\|_2^2$, with weights $w_{e,k}$ normalized from $I_{e,k}^{\mathrm{agg}}$. The overall loss is $\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda\,\mathcal{L}_{\mathrm{stab}}$.
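A minimal sketch of the stabilization penalty for a single expert, under assumed toy data (random activations and a constant routing weight standing in for $\pi_e(h)$):

```python
import numpy as np

rng = np.random.default_rng(3)
r, d_in, n_samples = 4, 16, 64
A_prev = rng.standard_normal((r, d_in))                   # A_e after task t-1
A_curr = A_prev + 0.01 * rng.standard_normal((r, d_in))   # A_e during task t
H = rng.standard_normal((n_samples, d_in))  # activations h from a prior task
pi = np.full(n_samples, 0.5)                # routing weights pi_e(h), assumed

# Per-direction importance I_{e,k} = E[pi_e(h) (a_{e,k}^T h)^2]:
I = np.mean(pi[:, None] * (H @ A_prev.T) ** 2, axis=0)    # shape (r,)
w = I / I.sum()                                           # normalized weights

# Quadratic drift penalty, weighted by historical importance per direction:
L_stab = np.sum(w * np.sum((A_curr - A_prev) ** 2, axis=1))
```

Directions with high aggregated importance receive large $w_{e,k}$ and are therefore held close to their previous values, while rarely activated directions remain free to adapt to the new task.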

Retaining the previous $A_e$ matrices and the running sums $I_{e,k}^{\mathrm{agg}}$ suffices for storage; the stabilized PAS directions are those with high historical activation, preserving expert specialization and mitigating drift (Hou et al., 19 Jan 2026).

4. Interpretability, Activation Patching, and PASs

PASs connect with mechanistic interpretability paradigms, especially subspace activation patching. The procedure involves hypothesizing a subspace $S$ encoding a feature, constructing an orthogonal projector $P_S$, and patching activations as described above. The efficacy of this approach is assessed using metrics such as the fractional logit-difference decrease (FLDD) and interchange accuracy, as presented in Table 1 of (Makelov et al., 2023):

Intervention          FLDD (%)   Interchange (%)
full MLP              –8         0.0
v_MLP                 46.7       4.2
rowspace(v_MLP)       13.5       0.2
nullspace(v_MLP)      0          0.0
full residual         123.6      54.8
v_resid               140.7      74.8
rowspace(v_resid)     127.5      63.1
nullspace(v_resid)    13.9       0.4
v_grad                111.5      45.1
rowspace(v_grad)      106.5      40.6
nullspace(v_grad)     2.2        0.0

Patching along directions composed of nullspace (causally disconnected) and dormant components can produce strong but illusory interpretability signals. Removing the nullspace component eliminates the effect, indicating that genuine PAS interventions require strong mechanistic alignment with the model's output pathways (Makelov et al., 2023).

5. PASs and the Interpretability Illusion: Mechanistic and Empirical Analysis

A critical finding is that subspace interventions can be deceptive: a patching direction mixing a correlational but disconnected vector with a dormant causal vector may induce the appearance of feature localization without mechanistic faithfulness. In toy models and real tasks (such as indirect object identification and factual recall), patching along such "illusory" subspaces mimics genuine flipping of the output, but does so by activating dormant pathways rather than faithfully tracing the original feature (Makelov et al., 2023).

A dormant subspace is one that is inactive on typical data, yet can steer the output if artificially energized. A causally disconnected subspace cannot affect model output for any activation. Notably, the patch that is optimal in terms of output influence often mixes these two at $\theta = \pi/4$ (equal weights), formalized as $v = \alpha\, v_{\mathrm{disconnected}} + \beta\, v_{\mathrm{dormant}}$ (Makelov et al., 2023).
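The distinction can be made concrete in a toy linear readout. Everything here is an assumed illustration: the disconnected direction is taken from the kernel of a random $W_{\mathrm{out}}$, and a rowspace direction stands in for the dormant one:

```python
import numpy as np

rng = np.random.default_rng(4)
d_out, d_in = 3, 6
W_out = rng.standard_normal((d_out, d_in))  # toy output readout

# Right singular vectors: the last rows span ker(W_out) (rank is d_out < d_in).
_, _, Vt = np.linalg.svd(W_out)
v_disc = Vt[-1]   # causally disconnected: W_out @ v_disc = 0 for any scaling
v_dorm = Vt[0]    # in the rowspace: it can move the output if energized

# Mixing the two at theta = pi/4 (equal weights):
theta = np.pi / 4
v_patch = np.cos(theta) * v_disc + np.sin(theta) * v_dorm

# Only the dormant component contributes to the output-side effect:
effect = W_out @ v_patch
assert np.allclose(effect, np.sin(theta) * (W_out @ v_dorm))
```

The patch direction can thus correlate with a feature through its disconnected half while producing its causal effect entirely through the dormant half, which is exactly the illusion described above.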

6. PASs, Rank-1 Editing, and Subspace Equivalence

In factual recall, the connection between PASs and rank-1 editing is established via ROME, which modifies model weights by adding a rank-1 update $W_{\mathrm{out}}' = W_{\mathrm{out}} + ab^\top$ to ensure the desired output on average subject activations. The empirical and theoretical correspondence between 1-D subspace patching and rank-1 edits is quantified by matching output directions and variance budgets: $v = \alpha\, W_{\mathrm{out}}^+ a + u$ with $u \in \ker W_{\mathrm{out}}$, showing high cosine similarity and near-identical rewrite scores between patching and weight-editing methods across network layers (Makelov et al., 2023).
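The core of this correspondence can be sketched in a toy setting (assumed random readout and activations): pushing the activation along $v = W_{\mathrm{out}}^+ a$ moves the output along $a$, which is the same output-side effect a rank-1 edit targeting direction $a$ aims for.

```python
import numpy as np

rng = np.random.default_rng(5)
d_out, d_in = 3, 6
W_out = rng.standard_normal((d_out, d_in))  # toy readout, full row rank a.s.
a = rng.standard_normal(d_out)              # desired output-space direction

# Activation-space counterpart of the output direction a:
v = np.linalg.pinv(W_out) @ a

h = rng.standard_normal(d_in)
# Patching h -> h + v changes the output by exactly a, since
# W_out @ pinv(W_out) = I when W_out has full row rank:
delta_out = W_out @ (h + v) - W_out @ h
assert np.allclose(delta_out, a)
```

Any kernel component $u \in \ker W_{\mathrm{out}}$ could be added to $v$ without changing this output effect, which is why the correspondence in the paper is stated only up to such a component.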

7. Evaluating the Faithfulness of PASs: Sanity Checks and Formal Criteria

End-to-end flipping success is insufficient for asserting mechanistic faithfulness of a PAS. The following criteria are integral: (a) strong class-based activation correlation; (b) alignment with output-relevant rowspaces (outside $\ker W_{\mathrm{out}}$); (c) dormancy of the irrelevant portion of the PAS on the data distribution; (d) correct positioning within known circuit bottlenecks; (e) generalization to new instances; (f) consistency across patching, ablation, and rank-1 editing interventions. Satisfying these conditions supports treating a PAS as a genuine mechanistic variable rather than an artefactual correlation (Makelov et al., 2023).

8. Empirical Performance Gains and Impact

On a 7-task multimodal continual learning benchmark (MLLM-CTBench), PASs-based methods outperform conventional baselines in both accuracy (AP) and resistance to forgetting (BWT):

Method               AP (%)   BWT
MoELoRA (softmax)    43.36    –6.64
PASs-MoE (Ours)      48.46    –2.15

These results demonstrate the quantitative advantage and parameter efficiency of PAS-guided approaches: a +5.1 percentage point gain in average performance (AP) and a substantial reduction in forgetting (BWT improving from –6.64 to –2.15) without additional parameters, confirming that PASs support robust continual learning and stable expert specialization (Hou et al., 19 Jan 2026).


Pathway Activation Subspaces provide both a theoretical and practical foundation for routed expertise in adaptive neural architectures while also raising deep questions of mechanistic interpretability, causal faithfulness, and the distinction between functional subspaces and illusory activation phenomena. The PAS concept thus serves as a focal point for rigorous model analysis and future methodological development.
