Ambiguity in parameterizations of the idealized autoregressive model

Determine the extent of ambiguity in parameter choices for the idealized autoregressive attention-based model that solves a given deterministic mapping from an input token sequence to an output token sequence. Assess whether selecting the parameterization that minimizes the total number of parameters is a principled canonical choice for such tasks.

Background

The paper defines an idealized autoregressive model (an embedding layer, stacked attention and nonlinear layers, and an output layer) that solves deterministic token-sequence tasks with perfect accuracy. The authors note that, because multiple Turing machines can perform a given task, multiple choices of parameters can produce the correct output, which creates an ambiguity in choosing among them.

They suggest one possible selection principle: choosing an idealized model with the smallest total number of parameters. However, the authors explicitly leave the investigation of this ambiguity to future work, so the structure of the space of valid parameterizations, and the question of how to choose within it, remain open.
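
The ambiguity can be illustrated with a toy sketch. This is not the paper's idealized attention-based model: it assumes a hypothetical two-layer ReLU network (the names `mlp`, `count_params`, and the parameter sets `A`, `B`, `C` are invented for illustration) that computes XOR on two binary inputs. Three different parameter settings give identical outputs, and a minimum-parameter rule is then applied to them.

```python
from itertools import product


def relu(x):
    """Rectified linear unit on a scalar."""
    return x if x > 0 else 0.0


def mlp(params, x):
    """Toy network: y = sum_j v[j] * relu(sum_i W[j][i] * x[i] + b[j])."""
    W, b, v = params
    hidden = [
        relu(sum(w * xi for w, xi in zip(row, x)) + bj)
        for row, bj in zip(W, b)
    ]
    return sum(vj * hj for vj, hj in zip(v, hidden))


def count_params(params):
    """Total number of scalar parameters in (W, b, v)."""
    W, b, v = params
    return sum(len(row) for row in W) + len(b) + len(v)


# A: XOR as relu(x1 + x2) - 2 * relu(x1 + x2 - 1) on inputs in {0, 1}^2.
A = ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], [1.0, -2.0])
# B: the same two hidden units in swapped order (a permutation symmetry).
B = ([[1.0, 1.0], [1.0, 1.0]], [-1.0, 0.0], [-2.0, 1.0])
# C: A plus an extra hidden unit whose output weight is zero.
C = ([[1.0, 1.0], [1.0, 1.0], [3.0, -5.0]], [0.0, -1.0, 2.0], [1.0, -2.0, 0.0])

candidates = {"A": A, "B": B, "C": C}
inputs = list(product([0.0, 1.0], repeat=2))

# Every candidate realizes the same deterministic input-output mapping.
outputs = {name: [mlp(p, x) for x in inputs] for name, p in candidates.items()}

# Apply the minimum-parameter selection rule to the candidates.
sizes = {name: count_params(p) for name, p in candidates.items()}
smallest = min(sizes.values())
minimal = sorted(name for name, n in sizes.items() if n == smallest)

print(outputs)
print(sizes, minimal)
```

In this toy setting the rule discards the padded network `C`, but `A` and `B` tie at the minimum, so parameter count alone does not single out one of them here. Whether anything similar holds for the paper's idealized model is exactly the question left open.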

References

There might be multiple choices of parameters that lead to the correct output, since multiple Turing machines can perform a given task. One possible choice is to choose an idealized model where the total number of parameters is as small as possible. We leave further investigation of this ambiguity to future work.

A model of errors in transformers (2601.14175 - Raju et al., 20 Jan 2026) in Subsection “Idealized autoregressive model,” Section 2 (Error model)