Ambiguity in parameterizations of the idealized autoregressive model
Determine how ambiguous the choice of parameters is for the idealized autoregressive attention-based model that solves a given deterministic mapping from an input token sequence to an output token sequence. Assess whether selecting the parameterization with the fewest total parameters is a principled canonical choice for such tasks.
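As a minimal illustration of one elementary source of such ambiguity (not the construction used in the cited paper), the sketch below relies on the standard observation that single-head attention scores depend on the query and key matrices only through the product W_Q W_K^T. Replacing W_Q with W_Q A and W_K with W_K A^{-T}, for any invertible A, therefore leaves the attention weights unchanged. All matrices, values, and function names here are hypothetical toy choices.

```python
import math

def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*right)]
            for row in left]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(x, w_q, w_k):
    """Causal single-head attention weights for token embeddings x."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    scores = matmul(q, transpose(k))
    # Position i attends only to positions 0..i (autoregressive mask).
    return [softmax(scores[i][: i + 1]) for i in range(len(x))]

# Hypothetical toy embeddings and weights, chosen only for illustration.
x = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.5]]
w_q = [[0.3, -0.7], [1.1, 0.4]]
w_k = [[-0.2, 0.9], [0.6, 0.5]]

# An invertible matrix A (determinant 1) and its inverse transpose A^{-T}.
mix = [[2.0, 1.0], [0.0, 0.5]]
mix_inv_t = [[0.5, 0.0], [-1.0, 2.0]]

# Second parameterization: W_Q -> W_Q A, W_K -> W_K A^{-T}.
w_q_alt = matmul(w_q, mix)
w_k_alt = matmul(w_k, mix_inv_t)

original = causal_attention_weights(x, w_q, w_k)
alternative = causal_attention_weights(x, w_q_alt, w_k_alt)

# Largest elementwise gap between the two sets of attention weights.
max_diff = max(abs(p - q)
               for row_a, row_b in zip(original, alternative)
               for p, q in zip(row_a, row_b))
print(max_diff < 1e-9)
```

Both parameterizations have the same number of parameters, so in this toy setting a minimum-parameter-count criterion alone would not single out one of them. Whether that observation carries over to the idealized model in the paper is part of the open question.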
References
There might be multiple choices of parameters that lead to the correct output, since multiple Turing machines can perform a given task. One possible choice is to choose an idealized model where the total number of parameters is as small as possible. We leave further investigation of this ambiguity to future work.
— A model of errors in transformers
(2601.14175 - Raju et al., 20 Jan 2026) in Subsection “Idealized autoregressive model,” Section 2 (Error model)