Learning nonlinear inversion with standard transformer MLP activations
Characterize how transformers using standard MLP activations such as ReLU can learn and implement the nonlinear inversion-like computations (e.g., the reciprocal 1/x) required by the constructions in this paper. Such a characterization would remove the need to explicitly assume access to a specialized 1/x activation, and should come with a theoretical account of the underlying learning mechanisms.
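As a point of reference for the representational side of this question (not the learning-theoretic account the problem asks for), the following sketch trains a one-hidden-layer ReLU MLP by plain full-batch gradient descent to approximate 1/x on a bounded interval. All details here (interval, width, learning rate, numpy implementation) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch: fit y = 1/x on [0.5, 2.0] with a ReLU MLP.
# Shows that inversion-like maps are approximable with standard
# activations on a bounded domain; it does not address how (or
# whether) transformer training finds such solutions.
rng = np.random.default_rng(0)
x = np.linspace(0.5, 2.0, 256).reshape(-1, 1)
y = 1.0 / x
n = len(x)

h = 64                                  # hidden width (assumed)
W1 = rng.normal(0.0, 1.0, (1, h)); b1 = np.zeros(h)
W2 = rng.normal(0.0, 0.1, (h, 1)); b2 = np.zeros(1)
lr = 1e-2

for step in range(20000):
    z = x @ W1 + b1                     # pre-activations
    a = np.maximum(z, 0.0)              # ReLU
    pred = a @ W2 + b2
    # Backprop for mean-squared error.
    gpred = 2.0 * (pred - y) / n
    gW2 = a.T @ gpred; gb2 = gpred.sum(0)
    gz = (gpred @ W2.T) * (z > 0)
    gW1 = x.T @ gz; gb1 = gz.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

mse = float(np.mean((np.maximum(x @ W1 + b1, 0.0) @ W2 + b2 - y) ** 2))
print(f"MSE of ReLU MLP fit to 1/x on [0.5, 2]: {mse:.2e}")
```

On a bounded interval away from the singularity at x = 0, a piecewise-linear ReLU network of modest width suffices for a close fit; the open question above is whether and how transformer MLPs learn such pieces inside the full architecture.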
References
The prevalent use of the Universal Approximation theorem notwithstanding, precisely understanding how such functions are learned on the basis of more commonly used activation functions such as ReLU in MLPs inside transformers is largely an open and highly challenging question, which we leave to future work.
— On the Ability of Transformers to Verify Plans
(2603.19954 - Sarrof et al., 20 Mar 2026) in Section: Learning Framework, subsection "Model of the transformer with Extended Alphabet" (footnote)