Uniform-attention optimality in the TI max-margin solution
Prove that, for the transitive-inference (TI) task learned by a single-layer, single-head Transformer without positional encodings and analyzed in the max-margin framework, the max-margin solution favors no position or symbol and therefore yields uniform attention weights over the input positions at the query token.
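The two ingredients of the claim can be illustrated numerically: without positional encodings, attention logits at the query token depend only on the symbol embeddings (so permuting positions permutes the weights identically), and when the solution treats every symbol symmetrically so that all logits coincide, softmax returns exactly uniform weights. A minimal NumPy sketch, where `W_QK`, `q`, and the embeddings `X` are hypothetical placeholders rather than the paper's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5                       # embedding dimension and context length (arbitrary)
W_QK = rng.normal(size=(d, d))    # stand-in combined query-key matrix
q = rng.normal(size=d)            # query-token embedding
X = rng.normal(size=(n, d))       # distinct context symbol embeddings

def attention(X):
    # No positional term: each logit is a function of the symbol embedding alone.
    logits = X @ W_QK @ q
    w = np.exp(logits - logits.max())
    return w / w.sum()

# Permutation equivariance: shuffling positions shuffles the attention
# weights in exactly the same way, so no position is intrinsically favored.
perm = rng.permutation(n)
assert np.allclose(attention(X[perm]), attention(X)[perm])

# Symmetric case: if all logits coincide (forced here by equal embeddings,
# mirroring the conjectured equal-magnitude e_t), softmax is uniform.
X_sym = np.tile(X[0], (n, 1))
print(attention(X_sym))           # each weight equals 1/n = 0.2
```

The sketch only demonstrates the mechanism (symbol-only logits plus symmetry implies uniformity); the actual proof obligation is to show that the max-margin optimum indeed enforces that symmetry.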
References
Because of this symmetry, we conjecture that the magnitude of each e_t tends to remain about the same in the max margin set, and no position or symbol is favored above any other.
— Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces
(2602.14404 - Tong et al., 16 Feb 2026) in Appendix: Theoretical analysis of transitive inference, Simplified model (paragraph “Uniform attention weights”)