
Graded Transformers

Updated 13 February 2026
  • Graded Transformers are sequence modeling frameworks that integrate explicit algebraic and geometric inductive biases for structured, neurosymbolic learning.
  • They employ grading operators—linear and exponential—to modulate attention and feed-forward layers, enhancing feature prioritization and model robustness.
  • They provide strong theoretical guarantees including universal approximation, reduced sample complexity, and robust internalized symbolic computation via morphisms.

A Graded Transformer is a sequence modeling framework designed to incorporate explicit algebraic and geometric inductive biases using graded structures on vector spaces and morphic composition. The key innovation is the assignment of learnable or fixed grades—scalars or tuples indicating algebraic "degree" or feature salience—to coordinates or subspaces, which are then manipulated via grading operators embedded deeply within the transformer architecture. Graded Transformers unify symbolic and continuous computation by treating algebraic grading as first-class architectural and optimization components, with robust theoretical guarantees and direct support for structured, neurosymbolic, and hierarchical learning. The dominant architectures are the Linearly Graded Transformer (LGT) and Exponentially Graded Transformer (EGT), which generalize the transformer family by fusing algebraic symmetry principles with modern attention mechanisms (Sr, 27 Jul 2025), and the graded-morphism framework supporting internalized symbolic computation as composable, differentiable blocks (Shaska, 21 Nov 2025).

1. Graded Vector Spaces and Operators

A central component is the notion of a graded vector space. Given a ground field $F$ (typically $\mathbb{R}$), a $d$-dimensional graded space is written as

$$V_q = \bigoplus_{i=0}^{d-1} V_i, \qquad V_i = F e_i, \qquad \mathrm{gr}(e_i) = q_i \in \mathbb{R}_{\ge 0}$$

where $q = (q_0, \ldots, q_{d-1})$ is a grading tuple assigning a nonnegative grade to each basis element.

Two grading operators are used:

  • Linear grading: For LGT, the diagonal scaling operator $G_\mathrm{lin}(q) = \Psi_q = \mathrm{diag}(q_0, \ldots, q_{d-1})$ acts coordinate-wise as $G_\mathrm{lin}(q)\,x = (q_0 x_0, \ldots, q_{d-1} x_{d-1})^\top$.
  • Exponential grading: For EGT, with exponential base $\lambda > 1$, the operator is $G_\mathrm{exp}(q) = \bar{\Psi}_q = \mathrm{diag}(\lambda^{q_0}, \ldots, \lambda^{q_{d-1}})$, applied as $G_\mathrm{exp}(q)\,x = (\lambda^{q_0} x_0, \ldots, \lambda^{q_{d-1}} x_{d-1})^\top$.
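Both operators are simple diagonal scalings of the feature coordinates. A minimal NumPy sketch (the grading tuple `q` and base `lam` are illustrative values, not taken from the papers):

```python
import numpy as np

def g_lin(q, x):
    """Linear grading G_lin(q): scale coordinate k of x by its grade q_k."""
    return np.asarray(q) * x

def g_exp(q, x, lam=2.0):
    """Exponential grading G_exp(q): scale coordinate k by lam**q_k (lam > 1)."""
    return (lam ** np.asarray(q)) * x

q = np.array([0.0, 1.0, 2.0])   # grades per coordinate (illustrative)
x = np.array([1.0, 1.0, 1.0])

print(g_lin(q, x))   # [0. 1. 2.]
print(g_exp(q, x))   # [1. 2. 4.]
```

Because both operators are diagonal, they compose and commute with each other, which is what lets the grading be inserted at any stage of the pipeline below.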

In graded self-attention, the bilinear forms used for similarity computation are replaced by graded versions:

$$\langle u, v \rangle_{q}^{\mathrm{lin}} = u^\top \mathrm{diag}(q)\, v, \qquad \langle u, v \rangle_{q}^{\mathrm{exp}} = u^\top \mathrm{diag}(\lambda^{q})\, v$$

This prioritizes or attenuates dimensions according to their grades, biasing attention toward features deemed structurally important in the algebraic or symbolic prior (Sr, 27 Jul 2025).

2. Architectures: LGT and EGT

Graded Transformers adopt the standard transformer pipeline: embedding plus positional encoding, followed by stacked graded self-attention blocks, graded feed-forward layers, and a decoder. Grade application precedes or follows nonlinearity and is integrated into each architectural stage:

| Stage | LGT Operation | EGT Operation |
|---|---|---|
| Input embedding | $\tilde{x}_i = G_\mathrm{lin}(q)\,x_i$ | $\tilde{x}_i = G_\mathrm{exp}(q)\,x_i$ |
| Positional encoding | $Pe_\mathrm{lin}(i) = q_t\, Pe(i)$ | $Pe_\mathrm{exp}(i) = \lambda^{q_t}\, Pe(i)$ |
| Multi-head attention | $\mathrm{softmax}\!\left(\frac{Q\,\mathrm{diag}(q^{(h)})\,K^\top}{\sqrt{d_k}}\right)V$ | $\mathrm{softmax}\!\left(\frac{Q\,\mathrm{diag}(\lambda^{q^{(h)}})\,K^\top}{\sqrt{d_k}}\right)V$ |
| Feed-forward layer | Grade inserted before or after the classical FFN | Grade inserted before or after the classical FFN |
| Residual + layer norm | Standard | Standard |

Grades may be shared or assigned per head, per coordinate, or per semantic type, and can optionally be optimized as model parameters rather than held fixed (Sr, 27 Jul 2025). Each layer thus implements a compositional graded structure, supporting explicit feature hierarchies and morphological distinctions.
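The multi-head attention row of the table can be sketched for a single head. This is a hedged illustration of the graded score $Q\,\mathrm{diag}(w)\,K^\top/\sqrt{d_k}$ with $w = q$ (LGT) or $w = \lambda^{q}$ (EGT); the function names and toy shapes are ours, not the papers':

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def graded_attention(Q, K, V, q, lam=None):
    """One graded attention head: softmax(Q diag(w) K^T / sqrt(d_k)) V,
    with w = q (LGT) or w = lam**q (EGT when lam is given)."""
    w = np.asarray(q, float) if lam is None else lam ** np.asarray(q, float)
    d_k = Q.shape[-1]
    scores = (Q * w) @ K.T / np.sqrt(d_k)   # column-wise scaling equals Q diag(w) K^T
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
q = np.array([2.0, 1.0, 0.5, 0.1])               # illustrative per-coordinate grades
out_lgt = graded_attention(Q, K, V, q)            # linearly graded head
out_egt = graded_attention(Q, K, V, q, lam=2.0)   # exponentially graded head
```

Setting all grades to 1 (LGT) recovers the standard scaled dot-product attention head, which is one way to see that the graded family strictly generalizes the classical one.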

3. Graded Loss Functions and Optimization

Optimization in graded transformers employs loss functions weighted by the grading scheme, aligning learning dynamics with the inductive biases. For supervised targets $y, \hat{y}$ and grading tuple $q$:

  • LGT (linear weights): the graded squared error $L(y,\hat{y}) = \frac{1}{n}\sum_{i=1}^n \sum_{k=1}^d q_k (y_{i,k} - \hat{y}_{i,k})^2$, or the graded cross-entropy $L(y,\hat{y}) = -\sum_{i,k} q_k\, y_{i,k} \ln \hat{y}_{i,k}$.

  • EGT (exponential weights): $L(y,\hat{y}) = \sum_{i=1}^n\sum_{k=1}^d \lambda^{q_k}\, \ell(y_{i,k}, \hat{y}_{i,k})$, with an added quadratic regularizer $L_\mathrm{total} = L + \gamma \sum_k q_k^2$. This amplifies gradient updates on high-grade coordinates and stabilizes parameter trajectories.

Grades $\{q_k\}$ are treated as differentiable parameters, updated via gradient descent:

$$q_k \leftarrow q_k - \eta\left(\frac{\partial L}{\partial q_k} + 2\gamma q_k\right)$$

with optional gradient clipping or (for EGT) annealing of the exponential base. This enables adaptive feature prioritization, alleviating the rigidity of fixed-grade structuring (Sr, 27 Jul 2025).
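For the graded squared error above, the gradient with respect to a grade has the closed form $\partial L/\partial q_k = \frac{1}{n}\sum_i (y_{i,k} - \hat{y}_{i,k})^2$, so one update step can be written out directly. A minimal sketch with illustrative values for $\eta$ and $\gamma$:

```python
import numpy as np

def graded_mse(y, y_hat, q):
    """LGT loss: mean over samples of sum_k q_k * (y_ik - y_hat_ik)^2."""
    return np.mean(np.sum(q * (y - y_hat) ** 2, axis=1))

def grade_step(y, y_hat, q, eta=0.1, gamma=0.01):
    """One gradient-descent step on the grades with regularizer gamma * sum_k q_k^2.
    For graded MSE, dL/dq_k = mean_i (y_ik - y_hat_ik)^2 exactly."""
    dL_dq = np.mean((y - y_hat) ** 2, axis=0)
    return q - eta * (dL_dq + 2 * gamma * q)

y     = np.array([[1.0, 0.0], [0.0, 1.0]])
y_hat = np.array([[0.8, 0.1], [0.2, 0.7]])
q     = np.array([1.0, 1.0])

loss  = graded_mse(y, y_hat, q)   # 0.09
q_new = grade_step(y, y_hat, q)   # [0.994, 0.993]
```

Note that the coordinate with the larger residual (the second, here) has its grade pulled down more; in a full model the loss gradient and the regularizer trade off to concentrate grades on coordinates that actually reduce the task loss.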

4. Theoretical Guarantees

Graded Transformers provide a robust set of theoretical results that distinguish them from standard transformer architectures:

  • Universal Approximation: Both LGT and EGT are universal approximators for continuous functions $f: \Omega \to \mathbb{R}^{n\times d}$ on compact domains $\Omega$, with significant parameter savings when the target function depends primarily on top-graded features; for Sobolev $f$, the parameter count reduces from $O(\epsilon^{-2/k}\, d)$ to $O(\epsilon^{-2/k}\, d_\mathrm{eff}/d)$, where $d_\mathrm{eff}$ is the number of dominant grades.
  • VC-Dimension and Sample Complexity: For transformer depth $N$, $h$ heads, and dimensionality $d$, the standard VC-dimension scales as $\leq c N h d^2 \ln(Nhd)$, while graded transformers satisfy $\leq c N h d\, d_\mathrm{eff} \ln(Nh\, d_\mathrm{eff})$, often sharply reducing sample complexity for hierarchical tasks.
  • Lipschitz Continuity and Robustness: The grading operators satisfy $\|G(q)x - G(q)y\| \leq L_G \|x - y\|$, with $L_G = \max_k q_k$ (LGT) or $L_G = \max_k \lambda^{q_k}$ (EGT), ensuring global Lipschitz continuity of the network. Robustness to adversarial or noisy inputs is enhanced: output perturbations scale as $L_\mathrm{total}\, \delta$, where $L_\mathrm{total}$ is polynomial in the maximal grade and network depth (Sr, 27 Jul 2025).
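The Lipschitz claim for the grading operator itself is easy to verify numerically, since the operator norm of a diagonal scaling is its largest entry. A quick randomized check (the grades here are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(1)
q = np.array([0.5, 2.0, 3.0])   # illustrative grades
L_G = q.max()                   # Lipschitz constant of G_lin(q): max_k q_k

# For random pairs, ||G(q)x - G(q)y|| <= L_G ||x - y|| must always hold.
ok = True
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    ok &= np.linalg.norm(q * x - q * y) <= L_G * np.linalg.norm(x - y) + 1e-12
print(ok)   # True
```

The same check with `lam ** q` in place of `q` and `L_G = (lam ** q).max()` covers the EGT case.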

5. Internalization of Symbolic Computation via Morphisms

The graded transformer framework supports the direct embedding of symbolic operations as morphisms in the graded hidden space. The decomposition $V = \bigoplus_{g \in G} V_g$ assigns distinct semantic or computational channels, such as linguistic, retriever, or arithmetic pathways, to their own homogeneous components.

Typed morphisms $\phi_{h \leftarrow g}: V_g \to V_h$, constrained to a sparse admissible set $E \subseteq G \times G$, implement operations between grades. Morphic updates are selected by a differentiable routing policy defined by bilinear or context-dependent logits, achieving sparse and interpretable symbolic invocation.

A self-supervised graded utility functional quantifies the expected LM-loss reduction $\Delta L_t(h \leftarrow g)$ from candidate morphisms and governs their activation through soft margin penalties and sparsity constraints on routing weights. This yields an architecture wherein symbolic computation, categorical composition, and continuous learning dynamics are unified and fully differentiable (Shaska, 21 Nov 2025).
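The typed-morphism-with-routing mechanism can be sketched concretely. Everything here is illustrative (the grade names, dimensions, admissible edge set, and fixed routing logits are ours; the paper's routing logits are context-dependent and learned):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
dims = {"ling": 4, "arith": 3}                  # graded subspaces V_g (illustrative)
edges = [("arith", "ling"), ("ling", "ling")]   # admissible set E: pairs (target h, source g)
phi = {e: rng.normal(size=(dims[e[0]], dims[e[1]])) for e in edges}  # typed morphisms

state = {g: rng.normal(size=d) for g, d in dims.items()}  # current graded hidden state
logits = np.array([1.5, -0.5])   # per-edge routing logits (context-dependent in the full model)
w = softmax(logits)              # differentiable routing weights over admissible edges

# Morphic update: each admissible edge maps V_g -> V_h, weighted by its routing weight.
update = {g: np.zeros(d) for g, d in dims.items()}
for (h, g), wi in zip(edges, w):
    update[h] += wi * (phi[(h, g)] @ state[g])
new_state = {g: state[g] + update[g] for g in dims}
```

Sparsity penalties on `w` push the softmax toward a near-one-hot distribution, so that at any step only a few morphisms fire, which is what makes the invocation pattern interpretable.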

Case studies on synthetic tasks (modular arithmetic, retrieval, Dyck-depth tracking) demonstrate clear separation and selective invocation of symbolic subchannels, validated by metrics such as $\Delta L_t$ histograms, routing entropy, and edge-ablation experiments.

6. Algebraic-Geometric Foundations and Applications

The graded structure is motivated by classical constructions in algebraic geometry (graded rings, modules, cohomology), category theory (internal categories, morphisms, adjunctions), and information geometry (KL divergence, Bregman projection, Fisher natural gradient). Each homogeneous subspace $V_g$ can be semantically typed, facilitating composability and interpretable channelization.

Applications span:

  • Algebraic geometry: prediction of polynomial invariants, manipulation of monomial degrees, zeta function computation.
  • Physics: encoding of multiscale physical phenomena graded by frequency or energy levels.
  • Natural language processing: hierarchical parsing, emphasizing syntactic heads and semantically key tokens.
  • Biological sequence analysis: distinguishing coding regions or active sites via grading among high-noise sequence backgrounds.
  • Emerging domains: node centrality in graph models, financial event importance, and hybrid symbolic-neural computation (Sr, 27 Jul 2025; Shaska, 21 Nov 2025).

The algebraic-morphic formalism subsumes external tool invocation (e.g., Toolformer) as a special case of functorial block selection in the internal graded category, but unlike such approaches, keeps all operations end-to-end differentiable within the network (Shaska, 21 Nov 2025).

7. Implementation and Empirical Behavior

A graded layer is implemented by maintaining parallel subspaces $V_g$, applying typed morphisms $\phi_{h \leftarrow g}$, and combining updates with learned routing coefficients. PyTorch-style modules include direct handling of block matrices, grade-specific projections, layer norms, and differentiable utility-driven update rules. The training loop integrates the standard LM loss with sparsity-inducing penalties on routing entropy.
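The entropy penalty in that training loop can be sketched as follows; the function names and the coefficient `beta` are illustrative, not the paper's notation:

```python
import numpy as np

def routing_entropy(w, eps=1e-12):
    """Shannon entropy of a routing distribution; low entropy = sparse, selective routing."""
    w = np.asarray(w)
    return -np.sum(w * np.log(w + eps))

def total_loss(lm_loss, routing_weights, beta=0.01):
    """LM loss plus an entropy penalty pushing each routing distribution toward few active edges."""
    return lm_loss + beta * sum(routing_entropy(w) for w in routing_weights)

sharp = np.array([0.98, 0.01, 0.01])   # nearly one-hot routing: low entropy
flat  = np.array([1/3, 1/3, 1/3])      # uniform routing: maximal entropy (ln 3)
```

Under this penalty, the "entropy collapse" observed empirically corresponds to the routing distributions drifting from `flat`-like toward `sharp`-like over training.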

Empirical sanity checks—such as entropy collapse, support size reduction, and loss degradation under ablation—demonstrate that graded transformers autonomously discover selective activation of interpretable computational pathways when trained on sufficiently structured or synthetic tasks (Shaska, 21 Nov 2025).


References:

  • "Graded Transformers: A Symbolic-Geometric Approach to Structured Learning" (Sr, 27 Jul 2025)
  • "Internalizing Tools as Morphisms in Graded Transformers" (Shaska, 21 Nov 2025)