Sublinear Time Quantum Algorithm for Attention Approximation
Abstract: Given the query, key and value matrices $Q, K, V\in \mathbb{R}{n\times d}$, the attention module is defined as $\mathrm{Att}(Q, K, V)=D{-1}AV$ where $A=\exp(QK\top/\sqrt{d})$ with $\exp(\cdot)$ applied entrywise, $D=\mathrm{diag}(A{\bf 1}n)$. The attention module is the backbone of modern transformers and LLMs, but explicitly forming the softmax matrix $D{-1}A$ incurs $Ω(n2)$ time, motivating numerous approximation schemes that reduce runtime to $\widetilde O(nd)$ via sparsity or low-rank factorization. We propose a quantum data structure that approximates any row of $\mathrm{Att}(Q, K, V)$ using only row queries to $Q, K, V$. Our algorithm preprocesses these matrices in $\widetilde{O}\left( ε{-1} n{0.5} \left( sλ{2.5} + s_λ{1.5} d + α{0.5} d \right) \right)$ time, where $ε$ is the target accuracy, $s_λ$ is the $λ$-statistical dimension of the exponential kernel defined by $Q$ and $K$, and $α$ measures the row distortion of $V$ that is at most $d/{\rm srank}(V)$, the stable rank of $V$. Each row query can be answered in $\widetilde{O}(s_λ2 + s_λd)$ time. To our knowledge, this is the first quantum data structure that approximates rows of the attention matrix in sublinear time with respect to $n$. Our approach relies on a quantum Nyström approximation of the exponential kernel, quantum multivariate mean estimation for computing $D$, and quantum leverage score sampling for the multiplication with $V$.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.