Learning nonlinear inversion with standard transformer MLP activations

Characterize how transformers using standard MLP activations such as ReLU can learn and implement the nonlinear, inversion-like computations (e.g., the reciprocal 1/x) required by the constructions in this paper. Doing so would remove the need to explicitly assume access to a specialized 1/x activation and would provide a theoretical account of the underlying learning mechanisms.

Background

In the learning framework, the authors permit an additional activation f(x) = 1/x + ε to simplify certain constructions (e.g., inverting values output by attention). They justify this by appealing to universal approximation, but acknowledge that it sidesteps a deeper question: how such behavior arises with commonly used activations.

They explicitly note that fully explaining how transformers with standard ReLU MLPs learn these functions is an open problem and defer it to future work.
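The universal-approximation justification the authors invoke can be made concrete: on a bounded interval away from zero, a single-hidden-layer ReLU network can represent any piecewise-linear interpolant of 1/x exactly, with error shrinking quadratically in the number of knots. The sketch below is illustrative only; the interval [1, 2] and the knot count are assumptions for the example, not choices made in the paper, and the construction says nothing about whether gradient-based training finds such weights.

```python
def relu(z):
    return max(0.0, z)

def build_relu_mlp_for_reciprocal(a=1.0, b=2.0, n=32):
    """Return a callable one-hidden-layer ReLU network that exactly
    interpolates 1/x at n+1 equally spaced knots on [a, b] (a > 0)."""
    knots = [a + (b - a) * i / n for i in range(n + 1)]
    vals = [1.0 / x for x in knots]
    slopes = [(vals[i + 1] - vals[i]) / (knots[i + 1] - knots[i])
              for i in range(n)]
    # Hidden unit i computes relu(x - knots[i]); the output layer sums
    # slope *increments*, so the network is linear on each segment and
    # matches 1/x at every knot.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n)]
    intercept = vals[0]

    def mlp(x):
        hidden = [relu(x - knots[i]) for i in range(n)]
        return intercept + sum(c * h for c, h in zip(coeffs, hidden))

    return mlp

mlp = build_relu_mlp_for_reciprocal()
# Max error over a dense grid on [1, 2]; for a smooth target the
# interpolation error scales as O(1/n^2).
err = max(abs(mlp(1.0 + 0.001 * k) - 1.0 / (1.0 + 0.001 * k))
          for k in range(1001))
```

This only shows expressivity, which is exactly the gap the authors point out: representing 1/x with ReLUs is easy, but characterizing how (or whether) training discovers such a representation remains open.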

References

The prevalent use of the Universal Approximation theorem notwithstanding, precisely understanding how such functions are learned on the basis of more commonly used activation functions such as ReLU in MLPs inside transformers is largely an open and highly challenging question, which we leave to future work.

On the Ability of Transformers to Verify Plans  (2603.19954 - Sarrof et al., 20 Mar 2026) in Section: Learning Framework, subsection "Model of the transformer with Extended Alphabet" (footnote)