Uniform-attention optimality in the TI max-margin solution

Prove that, for the transitive-inference task learned by a single-layer, single-head Transformer without positional encodings, the max-margin solution of the paper's framework favors no position or symbol and therefore yields uniform attention weights over the input positions at the query token.
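To fix notation, the claim can be stated formally as follows (the symbols below are our own illustrative notation, not necessarily the paper's):

```latex
% Token embeddings x_1, \dots, x_T; query-token embedding x_q; combined
% query-key matrix W_{QK}. With no positional encodings, the attention
% logit at position t depends only on token identity:
s_t = x_t^\top W_{QK}\, x_q, \qquad
\alpha_t = \frac{e^{s_t}}{\sum_{t'=1}^{T} e^{s_{t'}}}, \qquad t = 1, \dots, T.
% Claim: every max-margin solution satisfies s_1 = \cdots = s_T,
% and hence \alpha_t = 1/T for all t.
```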

Background

To study transitive inference (TI) analytically, the authors analyze a simplified Transformer and argue, on symmetry and margin grounds, that no position or symbol should be preferred in the max-margin solution. They support this with empirical measurements showing near-uniform attention.

They explicitly conjecture that the per-position contributions are equal in magnitude and that no symbol/position is favored, which would justify uniform attention as optimal under the max-margin objective, but they do not provide a formal proof.
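The mechanism behind the conjecture can be illustrated numerically. The sketch below is our own toy construction, not the paper's exact model: with one-hot symbol embeddings and no positional encodings, any query-key matrix that treats all symbols identically (here a constant matrix, so no symbol is favored) assigns every position the same attention logit, and the softmax is then uniform.

```python
import numpy as np

# Toy check (illustrative notation, assumed for this sketch):
# single-head attention, one-hot symbol embeddings, no positional
# encodings. A symbol-symmetric query-key matrix W = b * ones gives
# every position the same logit, so attention is uniform.
rng = np.random.default_rng(0)
n_symbols, seq_len = 7, 5

b = rng.normal()                                  # arbitrary scale
W = b * np.ones((n_symbols, n_symbols))           # favors no symbol

tokens = rng.choice(n_symbols, size=seq_len, replace=False)  # distinct symbols
E = np.eye(n_symbols)                             # one-hot embeddings
keys = E[tokens]                                  # (seq_len, n_symbols)
query = E[tokens[-1]]                             # embedding of the query token

logits = keys @ W @ query                         # every entry equals b
attn = np.exp(logits) / np.exp(logits).sum()      # softmax over positions

print(attn)                                       # every entry is 1/seq_len
assert np.allclose(attn, 1.0 / seq_len)
```

By contrast, adding a symbol-dependent term to W (e.g. a multiple of the identity) would raise the logit of positions matching the query symbol and break uniformity, which is exactly what the conjectured symmetry of the max-margin set rules out.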

References

Because of this symmetry, we conjecture that the magnitude of each e_t tends to remain about the same in the max margin set, and no position or symbol is favored above any other.

Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces  (2602.14404 - Tong et al., 16 Feb 2026) in Appendix: Theoretical analysis of transitive inference, Simplified model (paragraph “Uniform attention weights”)