Keel Architecture: Neural, Causal & Biological Models

Updated 28 January 2026
  • Keel architecture is a multi-domain construct that defines structural and algorithmic pathways to improve performance in deep learning, causal models, and biological patterning.
  • In deep neural networks, it introduces a scalar Highway-style skip connection and dual RMSNorm, resulting in stable ultra-deep training and improved benchmark accuracy.
  • For causal discovery and morphogenesis, Keel principles apply fuzzy constraints and aggregation–diffusion equations to achieve robustness, computational efficiency, and precise pattern localization.

Keel architecture encompasses a set of technical meanings across distinct domains, including deep neural network design (notably for LLMs), causal discovery in graphical models, and biological pattern formation on animal structures. In each context, "keel architecture" denotes a distinct class of structural or algorithmic principles driving critical functional improvements, typically by altering information pathways, robustness mechanisms, or pattern localization.

1. Keel Architecture in Deep Neural Networks

The Keel architecture, as defined in "Post-LayerNorm Is Back: Stable, ExpressivE, and Deep" (Chen et al., 27 Jan 2026), is a Transformer block variant engineered for stable, ultra-deep scaling. It remedies gradient vanishing, long the principal barrier to scaling Post-LayerNorm (Post-LN) Transformers, using a single-scalar, Highway-style skip connection and dual RMSNorm.

Architectural Characteristics

| Variant | Forward Update | Normalization | Skip Path |
|---|---|---|---|
| Pre-LN | $x_{l+1} = x_l + F(\mathrm{LN}(x_l))$ | Pre-residual LN | Unweighted identity |
| Post-LN | $x_{l+1} = \mathrm{LN}(x_l + F(x_l))$ | Post-residual LN | Unweighted identity |
| Keel | $x_{l+1} = \mathrm{LN}_1(\alpha x_l + F(\mathrm{LN}_2(x_l)))$ | Dual (RMSNorm) | Scalar $\alpha$ scaling (Highway) |

Keel introduces:

  • Scalar "Highway-style" skip path, scaling the skip connection by $\alpha \gg 1$, typically set as the total number of layers $L$.
  • Dual normalization: $\mathrm{LN}_2$ stabilizes the input to $F$, $\mathrm{LN}_1$ regulates the combined output.
  • All normalization is $\varepsilon$-RMSNorm with learnable $\gamma$ (no bias).
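The Keel update can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the sublayer $F$ is replaced by a hypothetical `tanh` projection, and the names `rms_norm`, `keel_block`, `gamma_outer`, and `gamma_inner` are chosen here for illustration only.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """epsilon-RMSNorm with a learnable scale gamma and no bias."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def keel_block(x, sublayer, alpha, gamma_outer, gamma_inner):
    """One Keel update: x_{l+1} = LN_1(alpha * x_l + F(LN_2(x_l)))."""
    inner = rms_norm(x, gamma_inner)         # LN_2: stabilizes the input to F
    combined = alpha * x + sublayer(inner)   # scalar Highway-style skip path
    return rms_norm(combined, gamma_outer)   # LN_1: regulates the combined output

rng = np.random.default_rng(0)
d_model, num_layers = 16, 64

# Hypothetical stand-in for F (attention or MLP in a real Transformer).
W = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
sublayer = lambda h: np.tanh(h @ W)

x = rng.normal(size=(4, d_model))
gamma = np.ones(d_model)
y = keel_block(x, sublayer, alpha=float(num_layers),
               gamma_outer=gamma, gamma_inner=gamma)
out_rms = np.sqrt(np.mean(y * y, axis=-1))   # per-token RMS after LN_1
```

With $\gamma = 1$, the outer normalization keeps the per-token RMS of the output near one regardless of how large $\alpha$ is, so the skip scaling changes the relative weight of the two paths rather than the activation scale.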

Mathematical Motivation

In Post-LN, the backward Jacobian is dominated by repeated LayerNorm factors with norm $O(1/\sqrt{2})$, producing exponential gradient vanishing: $\prod_{l=1}^L \left\|J_{\text{LN}}\right\|_2 = O(2^{-L/2})$. In contrast, with $z_l = \alpha x_l + F(\mathrm{LN}_2(x_l))$ denoting the input to $\mathrm{LN}_1$, Keel's scaling yields backward norm $\sim 1$ as $\alpha \rightarrow L \rightarrow \infty$:

$$\frac{\partial x_{l+1}}{\partial x_l}\Big|_{\mathrm{skip}} = J_{\mathrm{LN}_1}(z_l) \cdot \alpha I \implies \text{global gradient norm} \to 1.$$

This enables stable signal propagation even at depths exceeding 1000 layers, without custom initializations or regularization heuristics.
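To give a sense of scale for the $O(2^{-L/2})$ rate quoted above, the following short sketch evaluates $\log_{10} 2^{-L/2}$ at a few depths. It only tabulates the stated order of magnitude with the hidden constant taken as one (an assumption); it does not measure gradients of any real model.

```python
import math

def post_ln_log10_bound(num_layers):
    """log10 of 2^(-L/2), the stated order of the Post-LN Jacobian product."""
    return -(num_layers / 2) * math.log10(2)

# Depths chosen for illustration only.
bounds = {L: post_ln_log10_bound(L) for L in (64, 128, 512, 1024)}
for L, log10_bound in bounds.items():
    print(f"L = {L:4d}: 2^(-L/2) ~ 10^{log10_bound:.1f}")
```

At 64 layers the factor is already below $10^{-9}$, and at 1024 layers it is on the order of $10^{-154}$, which is why the decay is described as a barrier rather than a nuisance.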

Implementation and Practical Impact

  • Plug-and-play compatibility with Pre-LN pipelines: swap residual blocks, set $\alpha = L$.
  • RMSNorm with learned scale $\gamma$ and no bias ($\beta = 0$).
  • Standard AdamW optimization, batch sizes, and learning-rate schedules.
  • No learned gating is used; the scalar $\alpha$ serves as a fixed "gate."

Keel achieves consistently higher training stability, larger permissible learning rates, and improved downstream task performance. For example, at 1024 layers, zero/few-shot accuracy is improved by 3.0–8.8 points on standard benchmarks. Depth scaling outpaces width scaling at constant parameter budget (Chen et al., 27 Jan 2026).

2. Keel Architecture for Causal Discovery

KEEL ("Weakly-supervised causal discovery based on fuzzy knowledge and complex data complementarity" (Li et al., 2024)) is an algorithmic causal discovery architecture emphasizing robustness to limited and noisy domain priors in high-dimensional, small-sample settings.

Architectural Pipeline

  • Input: Observational data $X$ (mixed, possibly incomplete, possibly multi-distribution) alongside a set of fuzzy causal statements $K$.
  • Formalization Module: Each fuzzy statement is parsed into a "fuzzy causal mechanism" $Q = (P, C, F, M, L)$, where $F: P \times C \rightarrow [0,1]$ (e.g., via triangular, trapezoidal, or Gaussian kernels).
  • Weakened Constraints: Fuzzy mechanisms yield continuous constraints $C_\text{fuzzy}(i,j) = 1 - \mu_F(V_i, V_j)$ on the adjacency matrix $B$ of the causal graph.
  • Model Core: The extended linear causal model (ELCM) with

$$X_i = f_i(\beta_i^\top X_{\mathrm{PA}(i)} + \epsilon_i), \qquad \epsilon_i \sim P(\epsilon_i),$$

supporting arbitrary likelihoods for continuous or discrete variables and missing data via EM-style imputation.

  • Joint Optimization: The objective is

$$\min_{B, \theta} -\log L(B, \theta \mid X) + L_\text{know}(B; C_\text{fuzzy}) + \alpha \|B\|_1 + a\, h(B) + \frac{p}{2} h(B)^2,$$

where $h(B) = \mathrm{Tr}(e^{B \odot B}) - D$, with $D$ the number of variables, is a continuous function that enforces acyclicity; $a$ and $p$ weight its linear and quadratic penalty terms; and $L_\text{know}$ is a weighted $L_1$ penalty adapted by fuzzy constraints.
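The acyclicity term $h(B) = \mathrm{Tr}(e^{B \odot B}) - D$ can be checked numerically on small graphs. The sketch below is illustrative only: it uses a truncated Taylor series for the matrix exponential (an assumption made to keep the example dependency-free; a production implementation would more likely call a library routine), and the helper names are hypothetical.

```python
import numpy as np

def matrix_exp(A, terms=30):
    """Truncated Taylor series for exp(A); adequate for the small,
    modest-norm matrices used in this illustration."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k          # A^k / k!
        result = result + term
    return result

def acyclicity(B):
    """h(B) = Tr(exp(B ∘ B)) - D, which is zero when B encodes a DAG."""
    D = B.shape[0]
    return np.trace(matrix_exp(B * B)) - D   # B * B is the Hadamard product

# A 3-node chain 0 -> 1 -> 2 (acyclic).
B_dag = np.array([[0.0, 1.5, 0.0],
                  [0.0, 0.0, -2.0],
                  [0.0, 0.0, 0.0]])

# A 2-node graph with edges in both directions (cyclic).
B_cyc = np.array([[0.0, 1.0],
                  [1.0, 0.0]])

h_dag = acyclicity(B_dag)
h_cyc = acyclicity(B_cyc)
```

For the chain, every power of $B \odot B$ has a zero diagonal, so $h$ vanishes; for the two-cycle, $h = 2\cosh(1) - 2 > 0$, which is the quantity the penalty terms push toward zero.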

Fuzzy Knowledge Schema

KEEL operationalizes seven canonical fuzzy knowledge types (EOP, ETE, CCE, BNC, UCD, DC, FC), translating qualitative priors into differentiable constraints. Instead of enforcing binary edge inclusion/exclusion, constraint penalties are weighted by prior confidence, and the optimization is robust to erroneous or noisy statements.
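One way to picture a soft, confidence-weighted constraint is sketched below. The exact form of $L_\text{know}$ is not given here, so the code assumes the simplest reading, $L_\text{know}(B) = \sum_{i,j} C_\text{fuzzy}(i,j)\,|B_{ij}|$ with $C_\text{fuzzy} = 1 - \mu_F$; the triangular kernel parameters and the expert scores are hypothetical values chosen for illustration.

```python
import numpy as np

def triangular_membership(x, left, peak, right):
    """Triangular kernel: 0 outside [left, right], rising linearly to 1 at peak."""
    x = np.asarray(x, dtype=float)
    rising = (x - left) / (peak - left)
    falling = (right - x) / (right - peak)
    return np.clip(np.minimum(rising, falling), 0.0, 1.0)

def knowledge_penalty(B, membership):
    """Assumed form of L_know: sum_ij C_fuzzy(i, j) * |B_ij|, C_fuzzy = 1 - mu."""
    c_fuzzy = 1.0 - membership
    return float(np.sum(c_fuzzy * np.abs(B)))

# Hypothetical normalized scores for each directed pair (i -> j);
# a score of 0.5 is assumed to match the expert statement exactly.
scores = np.array([[0.0, 0.45],
                   [0.05, 0.0]])
mu = triangular_membership(scores, left=0.0, peak=0.5, right=1.0)

# Candidate weighted adjacency matrix with edges in both directions.
B = np.array([[0.0, 2.0],
              [1.0, 0.0]])
penalty = knowledge_penalty(B, mu)
```

Under this assumed form, the edge with high membership (0 → 1) is penalized lightly even though its weight is larger, while the poorly supported reverse edge carries most of the penalty. Nothing is forbidden outright, which is what lets the data override a mistaken prior.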

Robustness and Efficiency

  • Fuzzy constraints reduce the search space but are data-corrective: penalties are soft, and erroneous priors are adaptively down-weighted.
  • ELCM supports arbitrary variable types and multi-distribution regimes, automatically aligning domain marginals if required.
  • Optimization is fully continuous and amenable to L-BFGS with $O(D^3)$ scaling per iteration, avoiding the combinatorial cost of discrete DAG search.
  • Empirically, KEEL is up to $2\times$ faster and more robust than existing constraint-based and score-based baseline methods for $D \geq 30$.
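To make the mixed-variable claim concrete, the sketch below draws samples from a small model of the form $X_i = f_i(\beta_i^\top X_{\mathrm{PA}(i)} + \epsilon_i)$. It is an illustration under assumptions, not KEEL's code: the noise is taken to be standard Gaussian, the link functions (identity for continuous nodes, a threshold for a binary node) are hypothetical choices, and nodes are assumed to be listed in topological order.

```python
import numpy as np

def sample_elcm(B, links, n_samples, rng):
    """Draw samples from X_i = f_i(beta_i^T X_PA(i) + eps_i).

    B[j, i] is the weight of edge j -> i. Nodes are assumed to be indexed
    in topological order, so B is strictly upper triangular.
    """
    D = B.shape[0]
    X = np.zeros((n_samples, D))
    for i in range(D):
        eps = rng.normal(size=n_samples)        # Gaussian noise (assumption)
        X[:, i] = links[i](X @ B[:, i] + eps)   # parents are already sampled
    return X

identity = lambda z: z
threshold = lambda z: (z > 0).astype(float)     # hypothetical binary link

# Chain 0 -> 1 -> 2 with a binary leaf.
B = np.array([[0.0, 1.5, 0.0],
              [0.0, 0.0, 2.0],
              [0.0, 0.0, 0.0]])
rng = np.random.default_rng(0)
X = sample_elcm(B, [identity, identity, threshold], n_samples=2000, rng=rng)
corr_01 = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
```

The linear predictor is shared across variable types and only the link $f_i$ changes, which is what allows one continuous objective to cover continuous and discrete columns together.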

3. Keel Architecture in Biological Ridge Pattern Formation

In morphogenesis, "keel architecture" refers to the emergence and positioning of high-density linear ridges (keels) on curved biological surfaces, most canonically the midline keel of the turtle carapace. Nishihara & Ohira (Nishihara et al., 18 Apr 2025) formalize the mechanistic basis for these patterns via aggregation–diffusion equations with explicit cellular aggregation terms.

Governing Equations

Let $u(t, x, y)$ denote the surface density of keel-forming cells:

  • Density-dependent diffusion: $J^D = -\epsilon \nabla(u^{p_2+1})$ with exponent $p_2$.
  • Distance-dependent aggregation (haptotaxis): $J^A = -u \nabla \Phi(x, y)$, where $\Phi$ encodes spatially structured aggregation cues, e.g., $\Phi(x, y) = \delta_x |x|^{P_{1x}+1}/(P_{1x}+1) + \delta_y |y|^{P_{1y}+1}/(P_{1y}+1)$.
  • Evolution: $\partial_t u + \nabla \cdot (J^D + J^A) = 0$.
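A one-dimensional reduction of these equations can be integrated with a simple conservative scheme. The sketch below is a minimal illustration, not the authors' numerical framework: it assumes a 1D domain $[-1, 1]$ with no-flux boundaries, explicit time stepping, upwinding for the aggregation flux, and parameter values picked only to keep the explicit scheme stable.

```python
import numpy as np

def simulate_keel_1d(n_cells=101, eps=0.01, p2=1.0, delta=1.0, p1=1.0,
                     dt=5e-4, t_end=3.0):
    """Explicit finite-volume solve of u_t + (J^D + J^A)_x = 0 in 1D."""
    x = np.linspace(-1.0, 1.0, n_cells)       # cell centers
    dx = x[1] - x[0]
    x_face = 0.5 * (x[:-1] + x[1:])           # interior cell interfaces
    # dPhi/dx for Phi = delta * |x|^(p1 + 1) / (p1 + 1)
    dphi = delta * np.sign(x_face) * np.abs(x_face) ** p1
    velocity = -dphi                          # J^A = -u * dPhi/dx = u * velocity
    u = np.ones(n_cells)                      # uniform initial density
    for _ in range(int(round(t_end / dt))):
        um = np.maximum(u, 0.0) ** (p2 + 1.0)
        flux_diff = -eps * (um[1:] - um[:-1]) / dx            # J^D at faces
        u_upwind = np.where(velocity > 0, u[:-1], u[1:])      # upwind density
        flux = flux_diff + velocity * u_upwind                # J^D + J^A
        flux = np.concatenate(([0.0], flux, [0.0]))           # no-flux ends
        u = u - dt / dx * (flux[1:] - flux[:-1])
    return x, u, dx

x, u, dx = simulate_keel_1d()
total_mass = u.sum() * dx
peak_position = x[np.argmax(u)]
```

Because the update is written in flux form with zero boundary flux, total mass is conserved to rounding error, and with this choice of $\Phi$ the density accumulates into a single ridge at the midline $x = 0$.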

Ridge Localization and Pattern Selection

  • Linear stability: Ridge formation (pattern selection) occurs when local aggregation outcompetes diffusion, specifically when $L(x_0) = \partial_{xx}\Phi(x_0) + \partial_{yy}\Phi(x_0) < 0$.
  • Sign-change of aggregation flux: Ridges (keels) emerge at roots of the aggregation flux $F(x)$; the midline keel is robustly predicted for $x=0$ as $F'(0)<0$ for all $P_{1x}>0$.
  • Multiple keels: Lateral keels arise if $F(x)$ has additional sign changes (e.g., via piecewise or bimodal $\delta_x$), and disappear as distance-sensitivity exponents increase.
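The sign-change criterion lends itself to a quick numerical check. The sketch below locates roots of a flux profile where the sign goes from positive to negative (so that $F' < 0$ there), which is read here as the condition for a ridge. The polynomial `example_flux` is a purely hypothetical profile with extra sign changes, used only to show a midline keel plus two lateral keels; it is not taken from the paper.

```python
import numpy as np

def convergent_roots(flux, x):
    """Roots of flux on grid x where it changes sign from + to - (F' < 0)."""
    f = flux(x)
    idx = np.where((f[:-1] > 0) & (f[1:] < 0))[0]
    # Linear interpolation between the two bracketing grid points.
    return x[idx] - f[idx] * (x[idx + 1] - x[idx]) / (f[idx + 1] - f[idx])

def example_flux(x, a=0.3, b=0.6):
    """Hypothetical odd flux with roots at 0, +/-a, +/-b."""
    return -x * (x**2 - a**2) * (x**2 - b**2)

# An even number of points keeps the grid from landing exactly on x = 0.
x = np.linspace(-1.0, 1.0, 2000)
keels = convergent_roots(example_flux, x)
```

For this profile the convergent roots sit near $x = -0.6$, $0$, and $0.6$, while the roots at $\pm 0.3$ are divergent and are skipped. Removing the extra sign changes leaves only the midline root.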

Species-specificity and Parameter Sensitivity

  • Midline keel universality: Mechanistically enforced by the symmetry and monotonicity of $F(x)$ at $x=0$.
  • Lateral keels: Present for smaller $P_{1x}$ (higher aggregation sensitivity), corresponding to particular developmental or genetic conditions.
  • Diffusion strength: Higher $\epsilon$ or exponent $p_2$ flattens patterns; aggregation must strengthen to preserve multiple keels.

Numerical Framework

Finite-difference integration of the governing PDEs supports quantitative species comparisons, scenario modeling, and predictions of both ridge multiplicity and domain boundaries.

4. Comparative Analysis of Keel-inspired Architectures

While the domains differ, Keel architectures share a unifying principle: the selective construction or modulation of information pathways (through residual scaling, constraint softening, or aggregation potential shaping) to achieve superior robustness, scalability, or pattern localization. Each instantiation—Transformer blocks, causal graphical models, or morphogenetic PDEs—adapts this principle to its structural or algorithmic substrate, typically yielding:

  • Stability under high-dimensionality or depth (Keel Transformer, KEEL DAGs).
  • Tunable selectivity of critical features or interactions (Highway-skip, fuzzy constraints, aggregation sign-reversal).
  • Computational tractability by replacing combinatorial/hard constraints with continuous/soft alternatives.

5. Empirical Performance and Practical Guidelines

Deep Networks

  • Maximum stable learning rates scale with depth for Keel: e.g., for 512 layers, Keel supports $6.31\times10^{-3}$ vs. $4.67\times10^{-3}$ for Pre-LN (Chen et al., 27 Jan 2026).
  • For 1024 layers, Keel exceeds Pre-LN by $+3.0$ points in average zero/few-shot accuracy; GSM-8K improves by $+8.8$ points.

Table: Empirical Benchmarks for Keel Transformer Architecture (Chen et al., 27 Jan 2026)

| Depth (layers) | Pre-LN Accuracy (%) | Keel Accuracy (%) | $\Delta$ (points) |
|---|---|---|---|
| 64 | 37.9 | 39.6 | +1.7 |
| 128 | 45.3 | 46.5 | +1.2 |
| 512 | 54.3 | 58.1 | +3.8 |
| 1024 | 57.9 | 60.9 | +3.0 |
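As a quick arithmetic check, the $\Delta$ column can be recomputed from the two accuracy columns. The snippet below only re-derives the differences from the figures in the table; it adds no new measurements.

```python
# (depth, Pre-LN accuracy %, Keel accuracy %) as listed in the table.
rows = [
    (64, 37.9, 39.6),
    (128, 45.3, 46.5),
    (512, 54.3, 58.1),
    (1024, 57.9, 60.9),
]

# Difference in percentage points, rounded to the table's precision.
deltas = {depth: round(keel - pre_ln, 1) for depth, pre_ln, keel in rows}
largest_gain_depth = max(deltas, key=deltas.get)
```

The recomputed gaps match the listed values, with the largest gain at 512 layers.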

Causal Discovery

  • KEEL achieves higher accuracy and robustness than constraint-based and score-based methods for $D \geq 30$, and is $2\times$ faster than NOTEARS due to search-space reduction via fuzzy priors (Li et al., 2024).

Morphogenetic Patterning

  • Aggregation–diffusion models reproduce single vs. multiple keels and boundary localization via continuous parameter tuning (Nishihara et al., 18 Apr 2025).

6. Domain-specific Implementation Guidelines

  • Keel Transformer: For $L \gg 100$, set skip scaling $\alpha = L$; employ RMSNorm for all normalization, train with AdamW and cosine schedule. No learned gating or special initialization is required. For extremely wide models, $\alpha > L$ may be beneficial.
  • KEEL Causal Models: Encode all available prior causal knowledge in fuzzy schema; allow the optimization to down-weight imperfect priors. For incomplete data, employ EM-style imputation steps within continuous optimization.
  • Aggregation-Diffusion for Patterning: Select aggregation exponents and strengths according to desired ridge patterning; analyze the sign structure of the aggregation flux for predicting ridge localization.
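For the Keel Transformer guideline, the two settings that need a formula are the skip scale and the learning-rate schedule. The sketch below shows one conventional cosine schedule with linear warmup next to $\alpha = L$. The warmup length, total step count, and function names are hypothetical; the peak rate reuses the 512-layer figure quoted earlier purely as an example value.

```python
import math

def keel_skip_scale(num_layers):
    """Default skip scaling alpha = L."""
    return float(num_layers)

def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

num_layers = 512
alpha = keel_skip_scale(num_layers)

# Hypothetical training length and warmup; peak_lr is an example value.
total_steps, warmup_steps, peak_lr = 10_000, 500, 6.31e-3
schedule = [cosine_lr(s, total_steps, peak_lr, warmup_steps)
            for s in range(total_steps + 1)]
```

The schedule values would be passed to whatever optimizer wrapper a training loop already uses; nothing in the sketch is specific to Keel beyond the choice of $\alpha$.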

7. Significance and Outlook

Keel architectures exemplify the strategic reengineering of transmission or constraint pathways to unlock expressivity, robustness, and scalability in domains ranging from artificial intelligence to biological morphogenesis. In neural networks, Keel scaling makes Post-LN Transformer architectures trainable again at extreme depths. In causal discovery, KEEL softens the brittle dependency on expert priors, turning uncertainty into algorithmic resilience. In developmental biology, keel-inspired models illuminate the universality and adaptability of spatial pattern formation mechanisms. Each context demonstrates that carefully regulated, non-binary pathways (scalar skip connections, fuzzy constraints, or tunable aggregation) enable systems to perform reliably in challenging, high-dimensional, or underdetermined regimes.

References:

  • "Post-LayerNorm Is Back: Stable, ExpressivE, and Deep" (Chen et al., 27 Jan 2026)
  • "Weakly-supervised causal discovery based on fuzzy knowledge and complex data complementarity" (Li et al., 2024)
  • "Local Ridge Formation and Domain Delimitation in Aggregation-Diffusion Equations" (Nishihara et al., 18 Apr 2025)