Optimality of diagonal scaling for cosine similarity in regularized matrix factorization

Determine whether the unique diagonal scaling matrix diag(√(σ_i (1 − λ/σ_i)_+)), defined over the top-k singular values σ_i of X, yields the best possible semantic similarities in practice when cosine similarity is applied to the resulting user and item embeddings. This scaling arises in the closed-form solution to the regularized matrix factorization objective min_{A,B} ||X − XAB^T||_F^2 + λ(||XA||_F^2 + ||B||_F^2).

Background

The paper analyzes cosine similarity applied to embeddings learned by linear matrix factorization models under two different regularization schemes. The first scheme regularizes the product AB^T and is invariant to arbitrary diagonal rescalings of latent dimensions, making cosine similarities non-unique and potentially arbitrary. The second scheme regularizes XA and B separately (analogous to weight decay), yielding a unique closed-form solution up to rotations and therefore unique cosine similarities.
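The invariance under the first scheme can be checked numerically: rescaling the latent dimensions of A by a positive diagonal matrix D and those of B by D^{-1} leaves the product AB^T (and hence the first training objective) unchanged, while the cosine similarities between embedding rows change. A minimal sketch with hypothetical random embeddings A and B:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small factor matrices: A (items x k), B (users x k), k = 3.
A = rng.normal(size=(5, 3))
B = rng.normal(size=(4, 3))

# Arbitrary positive diagonal rescaling D of the latent dimensions.
D = np.diag([0.1, 1.0, 10.0])
A_scaled = A @ D
B_scaled = B @ np.linalg.inv(D)

# The product A B^T -- and hence the first objective -- is unchanged.
assert np.allclose(A @ B.T, A_scaled @ B_scaled.T)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# But cosine similarity between rows of A is not invariant to D.
print(cosine(A[0], A[1]), cosine(A_scaled[0], A_scaled[1]))
```

Since D can be chosen arbitrarily without affecting the learned model, the cosine similarities under this scheme are equally arbitrary, which is the paper's point.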

For the second objective, the closed-form solution involves a specific diagonal scaling that depends on the singular values of X: diag(√(σ_i (1 − λ/σ_i)_+)). This symmetric scaling appears for both the user embeddings (via XÂ) and the item embeddings. Although this yields uniqueness, the authors explicitly state that it is an open question whether this particular scaling leads to the best semantic similarities when cosine similarity is used in practice.

References

While this solution is unique, it remains an open question if this unique diagonal matrix $(\ldots, \sqrt{\sigma_i \cdot (1-\frac{\lambda}{\sigma_i})_+}, \ldots)_k$ regarding the user and item embeddings yields the best possible semantic similarities in practice.

Is Cosine-Similarity of Embeddings Really About Similarity?  (2403.05440 - Steck et al., 2024) in Section 2.3, Details on Second Objective