Query-Aware Scalable Softmax (QASSMax)
- The paper introduces QASSMax, a novel query-aware softmax that dynamically scales attention to counteract attention-fading in long-context Transformer models.
- It leverages two lightweight MLPs for sequence length-based scaling and query-dependent gating, improving focus and model adaptability.
- Empirical evaluations in TabICLv2 demonstrate enhanced performance and robustness, achieving near-perfect accuracy in high negative-sample tasks.
Query-Aware Scalable Softmax (QASSMax) is a differentiable attention modification designed to maintain sharp focus in Transformer-style models as the attention context length grows, while retaining compatibility with dense (quadratic) attention mechanisms and existing hardware-accelerated kernels. QASSMax generalizes previous scalable softmax approaches by incorporating both sequence-length dependence and query-aware adaptation through neural network parameterizations. Originally introduced for tabular foundation models in "TabICLv2" (Qu et al., 11 Feb 2026), QASSMax addresses the degradation of attention selectivity ("attention-fading") in long-context regimes that arises with conventional softmax-based attention, and enables efficient generalization to large-scale, out-of-distribution contexts.
1. Motivation and Problem Statement
Standard dot-product attention, as used in Transformers, applies a softmax across the dot products between queries and keys. As the attention context size increases, the denominator in the softmax
sums over an increasing number of terms, which causes the resulting distribution to flatten. This "attention-fading" effect has the consequence that models pretrained on shorter contexts cannot effectively focus their attention when applied to much longer contexts, leading to deterioration in retrieval and reasoning tasks that require sharp, selective focus.
Previous methods such as Scalable-Softmax (SSMax) (Nakanishi, 31 Jan 2025) addressed this by uniformly scaling attention logits with a global, learnable factor proportional to $\log n$, restoring concentration as the context length $n$ grows. However, SSMax applies the same scaling to all queries within a head and does not account for the content of individual queries, potentially limiting adaptivity.
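The fading effect and the $\log n$ remedy can be illustrated numerically. The sketch below (NumPy; the fixed logit gap of 2.0 is an arbitrary illustrative choice, not a value from either paper) tracks the probability mass placed on a single high-scoring "needle" key as the context grows: under plain softmax the mass decays toward zero, while SSMax-style scaling of the logits by $\log n$ keeps it concentrated.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def needle_mass(n, scale=1.0, gap=2.0):
    """Probability on one high-scoring 'needle' key among n-1 zero-score keys."""
    logits = np.zeros(n)
    logits[0] = gap          # the needle stands out by a fixed margin
    return softmax(scale * logits)[0]

for n in (100, 10_000, 1_000_000):
    plain = needle_mass(n)                       # standard softmax: mass fades
    scaled = needle_mass(n, scale=np.log(n))     # SSMax-style log(n) scaling
    print(f"n={n:>9}: plain={plain:.6f}  log-n scaled={scaled:.6f}")
```

Running this shows the plain-softmax needle mass shrinking roughly as $1/n$, while the $\log n$-scaled version stays near 1, which is the concentration behavior QASSMax builds on.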
QASSMax extends this remedy by introducing a two-component, query-dependent scaling, allowing the model to adapt both to the global context length and the local query semantics.
2. Algorithmic Formulation
QASSMax modifies the computation of query vectors in multi-head attention modules by applying two forms of learned, multiplicative scaling:
- Base (sequence length-dependent) scaling: An MLP transforms $\log n$ into a per-head, per-dimension scaling vector $b^{(h)} \in \mathbb{R}^{d_h}$, where $n$ is the current context size, $h$ indexes the attention head, and $d_h$ is the head dimension.
- Query-aware gating: A second MLP processes each query vector $q_i$ to produce a gate $g_i$, with elementwise output in $(0, 2)$ via $g_i = 1 + \tanh(\mathrm{MLP}_{\mathrm{gate}}(q_i))$.
These are combined elementwise:

$$\tilde{q}_{i,j} = b^{(h)}_j \, g_{i,j} \, q_{i,j},$$

where $i$ indexes tokens and $j$ indexes dimensions. The scaled queries are then used in ordinary dot-product attention:

$$\mathrm{Attention}(\tilde{Q}, K, V) = \mathrm{softmax}\!\left(\frac{\tilde{Q} K^\top}{\sqrt{d_h}}\right) V.$$
3. Implementation and Pseudocode
The essential steps for one multi-head attention block with QASSMax are:
```
L = log(n)
b_base = MLP_base(L)                        # (H, d_h)
for h in 1..H:
    B = repeat(b_base[h], times=n)          # (n, d_h)
    G = 1 + tanh(MLP_gate(Q[:, h, :]))      # (n, d_h)
    Q_tilde[:, h, :] = Q[:, h, :] * B * G
    logits = (Q_tilde[:, h, :] @ K[:, h, :].T) / sqrt(d_h)
    attn = softmax(logits, dim=1)
    out[:, h, :] = attn @ V[:, h, :]
```
The two MLPs (MLP_base and MLP_gate) are both lightweight; MLP_base is called once per layer (on $\log n$), while MLP_gate is applied per token per head.
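The steps above can be made concrete in a self-contained NumPy sketch. The single-hidden-layer MLPs, their hidden width, the tanh activation, and the random initialization below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, W2):
    """Single-hidden-layer MLP with tanh activation (illustrative choice)."""
    return np.tanh(x @ W1) @ W2

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def qassmax_attention(Q, K, V, params):
    """One multi-head attention pass with QASSMax-style query scaling.

    Q, K, V: arrays of shape (n, H, d_h).
    """
    n, H, d_h = Q.shape
    L = np.array([[np.log(n)]])                        # scalar input to MLP_base
    b_base = mlp(L, *params["base"]).reshape(H, d_h)   # (H, d_h), one call per layer
    out = np.empty_like(Q)
    for h in range(H):
        B = np.broadcast_to(b_base[h], (n, d_h))             # length-dependent scale
        G = 1.0 + np.tanh(mlp(Q[:, h, :], *params["gate"]))  # gate in (0, 2)
        Q_tilde = Q[:, h, :] * B * G
        logits = Q_tilde @ K[:, h, :].T / np.sqrt(d_h)
        out[:, h, :] = softmax(logits, axis=1) @ V[:, h, :]
    return out

n, H, d_h, hidden = 32, 2, 8, 16
params = {
    "base": (rng.normal(size=(1, hidden)) * 0.1,
             rng.normal(size=(hidden, H * d_h)) * 0.1),
    "gate": (rng.normal(size=(d_h, hidden)) * 0.1,
             rng.normal(size=(hidden, d_h)) * 0.1),
}
Q, K, V = (rng.normal(size=(n, H, d_h)) for _ in range(3))
out = qassmax_attention(Q, K, V, params)
print(out.shape)  # (32, 2, 8)
```

In a trained model the MLP weights would be learned end-to-end; this sketch only verifies the shapes and the dataflow of the scaling path.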
4. Computational Complexity and Memory Considerations
The core cost of scaled dot-product attention remains $O(n^2 d_h)$ per head due to the matrix multiplications and softmax operations; the query scaling introduced by QASSMax adds negligible overhead:
- MLP_base: $O(m \, H \, d_h)$ per forward pass, where $m$ is the MLP hidden dimension.
- MLP_gate: $O(m \, d_h)$ per token per head.
Since $m \ll n$, the added compute is dominated by the quadratic cost of attention. Memory overhead from the MLPs is negligible relative to key/query/value projections and attention maps. Notably, QASSMax maintains full compatibility with hardware accelerators and optimized softmax kernels (e.g., FlashAttention), and does not alter the $O(n^2)$ regime.
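For concreteness, a back-of-envelope FLOP comparison makes the asymptotics tangible. All sizes below ($n$, $H$, $d_h$, the MLP hidden width $m$) are illustrative assumptions, not values from the paper:

```python
# Back-of-envelope FLOP comparison (illustrative sizes, not from the paper).
n, H, d_h, m = 100_000, 4, 64, 128

# Dense attention: QK^T and attn @ V matmuls, per layer (factor 2 per multiply-add).
attention = 2 * H * n * n * d_h
# MLP_base: one call per layer on the scalar log(n).
mlp_base = 2 * (1 * m + m * H * d_h)
# MLP_gate: applied per token, per head (d_h -> m -> d_h).
mlp_gate = 2 * n * H * (d_h * m + m * d_h)

print(f"attention : {attention:.3e} FLOPs")
print(f"MLP_base  : {mlp_base:.3e} FLOPs")
print(f"MLP_gate  : {mlp_gate:.3e} FLOPs")
print(f"QASSMax overhead: {(mlp_base + mlp_gate) / attention:.2%}")
```

Under these assumed sizes the QASSMax MLPs contribute well under one percent of the attention FLOPs, consistent with the "negligible overhead" claim above.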
5. Empirical Evaluation
Empirical studies in "TabICLv2" (Qu et al., 11 Feb 2026) demonstrate the impact of QASSMax in large-context and foundation model settings:
- Needle-in-haystack classification: When the number of negative samples increases from 1,000 to 15,000, attention entropy with standard softmax approaches 1 (uniform), and accuracy drops, indicating a loss of focus. QASSMax, in contrast, maintains nearly zero entropy and 100% accuracy up to 15,000 negatives, outperforming both vanilla softmax and SSMax.
- Ablation on 60 development datasets: Removing QASSMax worsens validation log-loss by approximately 0.02. Framed as Elo scores, including QASSMax yields a roughly 100-point Elo advantage and a 64% win-rate over the baseline without it.
- End-to-end tasks: On million-scale tasks from TabArena and TALENT, QASSMax enables TabICLv2 to generalize without retraining or distillation, preventing catastrophic attention-fading even when deployed on out-of-distribution, large real-world tables.
6. Integration and Architectural Role in TabICLv2
TabICLv2 leverages QASSMax in two main architectural locations:
- Column-wise embeddings: During the induced-attention SetTransformer phase, QASSMax supports construction of column features from up to tens of thousands of rows.
- Cross-attention ICL: In final inference phases, QASSMax is used within the cross-attention block where test rows attend over all training rows ("dataset-wise in-context learning").
By preventing attention-fading as $n$ increases, QASSMax enables TabICLv2 models pretrained only on moderate context lengths (e.g., 60K) to be deployed on tables with millions of rows, maintaining both runtime efficiency and predictive performance. In controlled ablation, QASSMax is one of three major architectural contributions (alongside target-aware embeddings and the Muon optimizer) required for TabICLv2 to surpass the state of the art (RealTabPFN-2.5) without downstream tuning or ensembling (Qu et al., 11 Feb 2026).
7. Comparative Context and Theoretical Significance
QASSMax builds upon and generalizes the principles of Scalable-Softmax (Nakanishi, 31 Jan 2025):
- SSMax attenuates attention-fading using a per-head, per-layer scalar multiplication of the logits by a learned factor proportional to $\log n$.
- QASSMax replaces this with learned, per-dimension scaling and an additional query-aware gate, providing greater representational flexibility.
- Unlike sparse or kernel-based linear attention variants (which reduce per-head memory and computation to $O(n)$ at the cost of approximating the attention map), QASSMax preserves full-rank dense attention and remains hardware-compatible.
A plausible implication is that QASSMax's adaptivity may also benefit arbitrarily long-context Transformer models beyond tabular data, wherever context-length generalization and sharp focus are limiting factors. Its demonstrated effectiveness across both synthetic benchmarks and real-world datasets establishes it as a core architectural mechanism in modern scalable tabular foundation models (Qu et al., 11 Feb 2026).