UFM in Deep Neural Networks

Updated 2 February 2026
  • UFM is a theoretical framework that treats last-layer features as free variables to simplify deep network analysis.
  • It rigorously characterizes the optimization landscape, showing all local minima are global and form an ETF structure.
  • Empirical validations demonstrate UFM's effectiveness in explaining Neural Collapse and improving training efficiency.

The Unconstrained Features Model (UFM) is a theoretical and computational framework for analyzing the geometry, optimization landscape, and emergent phenomena in deep neural networks, particularly focusing on the last-layer features and classifier structure at the terminal phase of training. By treating the output of the penultimate layer (the final feature embeddings prior to classification) as free optimization variables decoupled from any parametric dependence on prior layers, the UFM reduces complex deep network analyses to mathematically tractable problems. This simplification allows rigorous characterization of structures such as Neural Collapse and Equiangular Tight Frame (ETF) emergence, offering precise predictions about the global optimality, critical point landscape, and practical implications for classification networks (Zhu et al., 2021).

1. Mathematical Formulation of the UFM

The UFM considers a $K$-class classification problem with $n$ training samples per class. The key abstraction is to treat all last-layer feature vectors $h_{k,i}\in\mathbb{R}^d$ (for sample $i$ in class $k$) as unconstrained free variables. The model optimizes jointly over these features, a linear classifier $W\in\mathbb{R}^{K\times d}$, and an optional bias $b\in\mathbb{R}^K$, under a loss function such as the regularized cross-entropy:

$$\min_{W\in\mathbb{R}^{K\times d},\,H\in\mathbb{R}^{d\times Kn},\,b\in\mathbb{R}^K} f(W,H,b) = \frac{1}{Kn} \sum_{k=1}^K \sum_{i=1}^n \ell(W h_{k,i} + b,\, y_k) + \frac{\lambda_W}{2} \|W\|_F^2 + \frac{\lambda_H}{2} \|H\|_F^2 + \frac{\lambda_b}{2} \|b\|_2^2,$$

where $\ell(z, y_k) = -\log\left(\exp(z_k)/\sum_j \exp(z_j)\right)$ is the cross-entropy loss (Zhu et al., 2021). No further constraints are imposed on $H$, justifying the “unconstrained” nomenclature.

This abstraction leverages the over-parameterization of deep networks, allowing the “layer-peeling” of lower network layers, and is justified by the empirical observation that modern deep networks can represent essentially any last-layer activations for a finite training set.
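As a concrete illustration, the objective above can be evaluated directly in NumPy, treating $W$, $H$, and $b$ as plain arrays. The function name and default regularization values below are illustrative, not taken from the cited papers:

```python
import numpy as np

def ufm_objective(W, H, b, labels, lam_W=5e-4, lam_H=5e-4, lam_b=5e-4):
    """Regularized cross-entropy UFM objective f(W, H, b).

    W: (K, d) classifier, H: (d, N) free last-layer features,
    b: (K,) bias, labels: (N,) integer class labels.
    """
    N = H.shape[1]
    Z = W @ H + b[:, None]                        # (K, N) logits
    Z = Z - Z.max(axis=0, keepdims=True)          # numerical stability
    log_probs = Z - np.log(np.exp(Z).sum(axis=0, keepdims=True))
    ce = -log_probs[labels, np.arange(N)].mean()  # mean cross-entropy
    reg = 0.5 * (lam_W * (W ** 2).sum()
                 + lam_H * (H ** 2).sum()
                 + lam_b * (b ** 2).sum())
    return ce + reg
```

At the all-zeros point the regularizers vanish and every class is predicted uniformly, so the objective equals $\log K$, a useful sanity check.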

2. Optimization Problem and Landscape Analysis

The core problem is nonconvex but possesses a highly structured landscape. The main theoretical results for the regularized, balanced case ($d \ge K$) are as follows:

  • Global minimum: simplex Equiangular Tight Frame (ETF) geometry. Any global minimizer $(W^*, H^*, b^*)$ exhibits:

    • Variability collapse (Neural Collapse NC1): for all $k, i$, $h_{k,i}^* = \alpha\, w_k^*$, where $w_k^*$ is the $k$-th row of $W^*$ and the scale factor $\alpha$ is determined by the regularization parameters and sample count.
    • Classifier and bias symmetry: all norms $\|w_k^*\|$ are equal, and $b_1^* = \cdots = b_K^*$.
    • ETF structure: up to rotation, the columns of $M = W^{*\top}/\|w_1^*\|$ form a $K$-point simplex ETF:

    $$M^\top M = \frac{K}{K-1}\left(I_K - \frac{1}{K}\mathbf{1}_K\mathbf{1}_K^\top\right).$$

(Zhu et al., 2021, Zhou et al., 2022)

  • Benign optimization landscape: for $d > K$, all local minima are global and correspond to ETF solutions; every other critical point is a strict saddle, i.e., its Hessian has at least one negative eigenvalue. Consequently, first-order methods such as gradient descent or stochastic gradient descent (almost surely) escape all saddles and converge to a global minimizer realizing the ETF geometry (Zhu et al., 2021).
  • Nuclear norm perspective: the joint regularization on $W$ and $H$ can be re-expressed as a nuclear norm penalty on the logit matrix $Z = WH$, tightly linking the low-rank geometry of the solution to the ETF structure through convex duality (Zhu et al., 2021).
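The benign-landscape claim can be probed numerically. The sketch below (all hyperparameters are illustrative choices, not from the cited papers) runs plain gradient descent on the regularized cross-entropy UFM objective from a random initialization and checks that the centered class means become equiangular with pairwise cosine $-1/(K-1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 3, 5, 10
N = K * n
lam = 5e-3                           # lambda_W = lambda_H = lambda_b
labels = np.repeat(np.arange(K), n)
Y = np.eye(K)[:, labels]             # (K, N) one-hot targets

W = 0.1 * rng.standard_normal((K, d))
H = 0.1 * rng.standard_normal((d, N))   # free features, optimized directly
b = np.zeros(K)

lr = 0.2
for _ in range(30000):
    Z = W @ H + b[:, None]                     # logits
    Z = Z - Z.max(axis=0, keepdims=True)       # numerical stability
    P = np.exp(Z)
    P /= P.sum(axis=0, keepdims=True)          # softmax probabilities
    G = (P - Y) / N                            # d(mean CE)/d(logits)
    gW = G @ H.T + lam * W
    gH = W.T @ G + lam * H
    gb = G.sum(axis=1) + lam * b
    W -= lr * gW
    H -= lr * gH
    b -= lr * gb

# Centered class means: pairwise cosines should approach -1/(K-1)
means = np.stack([H[:, labels == k].mean(axis=1) for k in range(K)], axis=1)
means = means - means.mean(axis=1, keepdims=True)
Mn = means / np.linalg.norm(means, axis=0, keepdims=True)
cos = Mn.T @ Mn
```

For $K=3$ the target off-diagonal cosine is $-0.5$; vanilla gradient descent reaches it from generic initializations, consistent with the strict-saddle analysis.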

3. Simplex Equiangular Tight Frames (ETFs) and Their Emergence

The ETF is central to the UFM's global minimizers. A $K$-point simplex ETF in $\mathbb{R}^d$ ($d \ge K$) is a set of unit-norm vectors $\{m_k\}_{k=1}^K$ satisfying $m_k^\top m_j = -1/(K-1)$ for every $k \neq j$; the vectors form a tight frame for the $(K-1)$-dimensional subspace they span.

This arrangement constitutes the maximally symmetric and maximally separated configuration of $K$ vectors on the unit sphere under the equal-angle constraint. At the global optimum, the cross-entropy loss with weight decay drives both the classifier weights and the last-layer class means to occupy the ETF configuration (Zhu et al., 2021). The result is maximal margin between classes given the regularization-induced norm constraint.
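One explicit construction scales the centered identity $I_K - \tfrac{1}{K}\mathbf{1}_K\mathbf{1}_K^\top$ by $\sqrt{K/(K-1)}$ and embeds it in $\mathbb{R}^d$ via a partial orthogonal map (plain zero-padding below). The helper name is illustrative:

```python
import numpy as np

def simplex_etf(K, d):
    """Return a (d, K) matrix whose columns form a K-point simplex ETF."""
    assert d >= K, "simplex ETF needs d >= K"
    # Centered identity, rescaled so each column has unit norm
    M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)
    P = np.eye(d, K)              # partial orthogonal embedding R^K -> R^d
    return P @ M
```

Since $I_K - \tfrac{1}{K}\mathbf{1}\mathbf{1}^\top$ is an orthogonal projector, the Gram matrix of the result is exactly $\frac{K}{K-1}(I_K - \tfrac{1}{K}\mathbf{1}\mathbf{1}^\top)$, matching the optimality condition stated above.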

4. Geometric Implications and Symmetry

The landscape structure of the UFM is “benign,” meaning all non-global-minimum critical points are strict saddles. Invariance under the orthogonal group $O(d)$ induces a degeneracy at the optimum: any orthogonal transformation of the ETF frame is also optimal, resulting in a flat manifold of global minimizers. This degeneracy ensures that the ETF geometry, rather than any particular realization of it, is strongly favored under overparameterization and regularized loss, explaining the prevalence of Neural Collapse phenomena in observed networks (Zhu et al., 2021).

Algorithmically, these properties guarantee that any optimization scheme capable of escaping saddles (e.g., with randomness in initialization or explicit perturbations) will converge to an ETF solution, independent of optimization algorithm details.

5. Extension to Other Losses, Imbalanced Regimes, and Richer Architectures

  • Mean-Squared Error Loss: The ETF global optimality and landscape benignity have also been demonstrated for regularized mean-squared error losses, using singular-value thresholding and similar nuclear-norm reductions. Critically, these results are robust to various settings, including bias variation and scaling of target outputs (Zhou et al., 2022, Tirer et al., 2022).
  • Imbalanced datasets and generalizations: For imbalanced class sizes, the core within-class collapse (NC1) persists, but the ideal ETF geometry is distorted. Class means no longer form an ETF but exhibit cluster-dependent block structure, with precise collapse and “minority collapse” thresholds analytically characterized. In the limit of large sample sizes, the classic ETF geometry is asymptotically restored (Hong et al., 2023, Dang et al., 2024).
  • Deep and Nonlinear Models: Recent work generalizes the UFM to incorporate multiple layers (deep UFM). In these settings, “Deep Neural Collapse” (DNC)—variability collapse, orthogonal frame structure at each layer, and alignment (self-duality) between weights and features—persist as unique global minima, at least in the binary case and in the MSE regime (Súkeník et al., 2023, Garrod et al., 2024). Minimal nonlinear extensions (e.g., ReLU inserted between layers) retain these conclusions under mild assumptions (Tirer et al., 2022).
  • Other loss functions: UFM has been extended to analyze supervised contrastive loss, with analogous results: all local minima are global and unique up to rotation, and the UFM geometry again collapses to a simplex ETF or its label-geometry analog (Behnia et al., 2024).

6. Empirical Validation and Practical Implications

Empirical studies verify that these theoretical characterizations robustly manifest in real networks:

  • Training neural nets (e.g., ResNet-18) with SGD, Adam, or L-BFGS on datasets such as MNIST and CIFAR-10 consistently drives the evaluated NC metrics (within-class scatter, ETF fit, bias collapse) to zero, in line with the UFM predictions (Zhu et al., 2021).
  • Fixing the classifier to a simplex ETF and reducing the feature dimension from $d=512$ to $K=10$ yields significant memory savings ($\sim$20% for ResNet-18 on CIFAR-10) with no loss of train or test accuracy, confirming that the ETF geometry is sufficient for linear classification performance (Zhu et al., 2021).
  • Metrics quantifying within-class feature variance, ETF or OF frame distance, and classifier alignment all collapse as predicted in synthetic UFM optimization and in trained deep networks (Tirer et al., 2022, Zhou et al., 2022).
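The metrics used in such studies can be sketched as follows: an NC1 statistic $\mathrm{tr}(\Sigma_W \Sigma_B^{\dagger})/K$ for within-class variability and a Frobenius distance between the normalized Gram matrix of centered class means and the ideal simplex-ETF Gram. The function name and exact normalizations are illustrative; the papers differ in details:

```python
import numpy as np

def nc_metrics(H, labels):
    """Neural Collapse diagnostics for last-layer features.

    H: (d, N) features; labels: (N,) integer class labels.
    Returns (nc1, etf_dist); both approach 0 under Neural Collapse.
    """
    classes = np.unique(labels)
    K, N = len(classes), H.shape[1]
    mu_G = H.mean(axis=1, keepdims=True)                  # global mean
    means = np.stack([H[:, labels == c].mean(axis=1) for c in classes],
                     axis=1)                              # (d, K) class means
    Mc = means - mu_G                                     # centered means

    # NC1: within-class scatter relative to between-class scatter
    Sigma_B = (Mc @ Mc.T) / K
    Sigma_W = np.zeros((H.shape[0], H.shape[0]))
    for i, c in enumerate(classes):
        D = H[:, labels == c] - means[:, [i]]
        Sigma_W += D @ D.T
    Sigma_W /= N
    nc1 = np.trace(Sigma_W @ np.linalg.pinv(Sigma_B)) / K

    # ETF fit: normalized Gram of centered means vs. ideal simplex Gram
    G = Mc.T @ Mc
    ideal = np.eye(K) - np.ones((K, K)) / K
    etf_dist = np.linalg.norm(G / np.linalg.norm(G)
                              - ideal / np.linalg.norm(ideal))
    return nc1, etf_dist
```

On a perfectly collapsed configuration (every sample sitting exactly at its class's simplex vertex) both diagnostics are zero, which is the limit the empirical studies report approaching during the terminal phase of training.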

The UFM thus provides a powerful tool for predicting how architectural and regularization decisions shape feature/weight geometry in deep networks, and for explaining the observed universality of Neural Collapse phenomena in practice.


References

  • “A Geometric Analysis of Neural Collapse with Unconstrained Features” (Zhu et al., 2021)
  • “On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features” (Zhou et al., 2022)
  • “Neural collapse with unconstrained features” (Mixon et al., 2020)
  • “Extended Unconstrained Features Model for Exploring Deep Neural Collapse” (Tirer et al., 2022)
  • “Deep Neural Collapse Is Provably Optimal for the Deep Unconstrained Features Model” (Súkeník et al., 2023)
  • “Neural Collapse for Unconstrained Feature Model under Cross-entropy Loss with Imbalanced Data” (Hong et al., 2023)
  • “Neural Collapse for Cross-entropy Class-Imbalanced Learning with Unconstrained ReLU Feature Model” (Dang et al., 2024)
  • “Supervised Contrastive Representation Learning: Landscape Analysis with Unconstrained Features” (Behnia et al., 2024)
  • “Unifying Low Dimensional Observations in Deep Learning Through the Deep Linear Unconstrained Feature Model” (Garrod et al., 2024)
