Negligible impact of negative-activation features on depth-wise scaling in DP TI models
Determine whether, in the direct-prediction (DP) Transformer trained on the transitive-inference (TI) task, the constraints associated with hidden units that have negative readout weights have a negligible effect on the achievable depth-versus-width scaling. A positive answer would confirm that the depth-wise scaling bottleneck is governed primarily by the sign-sharing requirements of positive-readout units.
References
This mild condition is satisfied at initialization, and we conjecture that its impact on scaling is minimal.
— Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces
(2602.14404 - Tong et al., 16 Feb 2026) in Appendix: Theoretical analysis of transitive inference, Scaling relationships – Depth-wise scaling (DP model)