Query-Aware Scalable Softmax (QASSMax)
- The paper introduces QASSMax, a novel query-aware softmax that dynamically scales attention to counteract attention-fading in long-context Transformer models.
- It leverages two lightweight MLPs for sequence length-based scaling and query-dependent gating, improving focus and model adaptability.
- Empirical evaluations in TabICLv2 demonstrate enhanced performance and robustness, achieving near-perfect accuracy in high negative-sample tasks.
Query-Aware Scalable Softmax (QASSMax) is a differentiable attention modification designed to maintain sharp focus in Transformer-style models as the attention context length grows, while retaining compatibility with dense (quadratic) attention mechanisms and existing hardware-accelerated kernels. QASSMax generalizes previous scalable softmax approaches by incorporating both sequence-length dependence and query-aware adaptation through neural network parameterizations. Originally introduced for tabular foundation models in "TabICLv2" (Qu et al., 11 Feb 2026), QASSMax addresses the degradation of attention selectivity ("attention-fading") in long-context regimes that arises with conventional softmax-based attention, and enables efficient generalization to large-scale, out-of-distribution contexts.
1. Motivation and Problem Statement
Standard dot-product attention, as used in Transformers, applies a softmax across the dot products between queries and keys. As the attention context size increases, the denominator in the softmax
sums over an increasing number of terms, which causes the resulting distribution to flatten. This "attention-fading" effect has the consequence that models pretrained on shorter contexts cannot effectively focus their attention when applied to much longer contexts, leading to deterioration in retrieval and reasoning tasks that require sharp, selective focus.
Previous methods such as Scalable-Softmax (SSMax) (Nakanishi, 31 Jan 2025) addressed this by uniformly scaling attention logits with a global, learnable factor proportional to $\log n$, restoring concentration as the context length $n$ grows. However, SSMax applies the same scaling to all queries within a head and does not account for the content of individual queries, potentially limiting adaptivity.
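The fading effect and the $\log n$ remedy can be illustrated numerically. The sketch below (NumPy; the fixed logit gap of 2.0 is an arbitrary illustrative choice, not a value from either paper) tracks the probability mass placed on a single high-scoring "needle" key as the context grows: under plain softmax the mass decays toward zero, while SSMax-style scaling of the logits by $\log n$ keeps it concentrated.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def needle_mass(n, scale=1.0, gap=2.0):
    """Probability on one high-scoring 'needle' key among n-1 zero-score keys."""
    logits = np.zeros(n)
    logits[0] = gap          # the needle stands out by a fixed margin
    return softmax(scale * logits)[0]

for n in (100, 10_000, 1_000_000):
    plain = needle_mass(n)                       # standard softmax: mass fades
    scaled = needle_mass(n, scale=np.log(n))     # SSMax-style log(n) scaling
    print(f"n={n:>9}: plain={plain:.6f}  log-n scaled={scaled:.6f}")
```

Running this shows the plain-softmax needle mass shrinking roughly as $1/n$, while the $\log n$-scaled version stays near 1, which is the concentration behavior QASSMax builds on.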
QASSMax extends this remedy by introducing a two-component, query-dependent scaling, allowing the model to adapt both to the global context length and the local query semantics.
2. Algorithmic Formulation
QASSMax modifies the computation of query vectors in multi-head attention modules by applying two forms of learned, multiplicative scaling:
- Base (sequence length-dependent) scaling: An MLP transforms $\log n$ into a per-head, per-dimension scaling vector $b^{(h)} \in \mathbb{R}^{d_h}$, where $n$ is the current context size, $h$ indexes the attention head, and $d_h$ is the head dimension.
- Query-aware gating: A second MLP processes each query vector $q_i$ to produce a gate $g_i$, with elementwise output in $(0, 2)$ via $g_i = 1 + \tanh(\mathrm{MLP}_{\mathrm{gate}}(q_i))$.
These are combined elementwise:

$$\tilde{q}_{i,j} = b^{(h)}_j \, g_{i,j} \, q_{i,j},$$

where $i$ indexes tokens and $j$ indexes dimensions. The scaled queries are then used in ordinary dot-product attention:

$$\mathrm{Attention}(\tilde{Q}, K, V) = \mathrm{softmax}\!\left(\frac{\tilde{Q} K^\top}{\sqrt{d_h}}\right) V.$$
3. Implementation and Pseudocode
The essential steps for one multi-head attention block with QASSMax are:
```
L = log(n)
b_base = MLP_base(L)                        # (H, d_h)
for h in 1..H:
    B = repeat(b_base[h], times=n)          # (n, d_h)
    G = 1 + tanh(MLP_gate(Q[:, h, :]))      # (n, d_h)
    Q_tilde[:, h, :] = Q[:, h, :] * B * G
    logits = (Q_tilde[:, h, :] @ K[:, h, :].T) / sqrt(d_h)
    attn = softmax(logits, dim=1)
    out[:, h, :] = attn @ V[:, h, :]
```
The two MLPs (MLP_base and MLP_gate) are both lightweight; MLP_base is called once per layer (on $\log n$), while MLP_gate is applied per token per head.
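The steps above can be made concrete in a self-contained NumPy sketch. The single-hidden-layer MLPs, their hidden width, the tanh activation, and the random initialization below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, W2):
    """Single-hidden-layer MLP with tanh activation (illustrative choice)."""
    return np.tanh(x @ W1) @ W2

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def qassmax_attention(Q, K, V, params):
    """One multi-head attention pass with QASSMax-style query scaling.

    Q, K, V: arrays of shape (n, H, d_h).
    """
    n, H, d_h = Q.shape
    L = np.array([[np.log(n)]])                        # scalar input to MLP_base
    b_base = mlp(L, *params["base"]).reshape(H, d_h)   # (H, d_h), one call per layer
    out = np.empty_like(Q)
    for h in range(H):
        B = np.broadcast_to(b_base[h], (n, d_h))             # length-dependent scale
        G = 1.0 + np.tanh(mlp(Q[:, h, :], *params["gate"]))  # gate in (0, 2)
        Q_tilde = Q[:, h, :] * B * G
        logits = Q_tilde @ K[:, h, :].T / np.sqrt(d_h)
        out[:, h, :] = softmax(logits, axis=1) @ V[:, h, :]
    return out

n, H, d_h, hidden = 32, 2, 8, 16
params = {
    "base": (rng.normal(size=(1, hidden)) * 0.1,
             rng.normal(size=(hidden, H * d_h)) * 0.1),
    "gate": (rng.normal(size=(d_h, hidden)) * 0.1,
             rng.normal(size=(hidden, d_h)) * 0.1),
}
Q, K, V = (rng.normal(size=(n, H, d_h)) for _ in range(3))
out = qassmax_attention(Q, K, V, params)
print(out.shape)  # (32, 2, 8)
```

In a trained model the MLP weights would be learned end-to-end; this sketch only verifies the shapes and the dataflow of the scaling path.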
4. Computational Complexity and Memory Considerations
The core cost of scaled dot-product attention remains $O(n^2 d_h)$ per head due to the matrix multiplications and softmax operations; the query scaling introduced by QASSMax adds negligible overhead:
- MLP_base: $O(m \, H \, d_h)$ per forward pass, where $m$ is the MLP hidden dimension.
- MLP_gate: $O(m \, d_h)$ per token per head.
Since $m \ll n$, the added compute is dominated by the quadratic cost of attention. Memory overhead from the MLPs is negligible relative to key/query/value projections and attention maps. Notably, QASSMax maintains full compatibility with hardware accelerators and optimized softmax kernels (e.g., FlashAttention), and does not alter the $O(n^2)$ regime.
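For concreteness, a back-of-envelope FLOP comparison makes the asymptotics tangible. All sizes below ($n$, $H$, $d_h$, the MLP hidden width $m$) are illustrative assumptions, not values from the paper:

```python
# Back-of-envelope FLOP comparison (illustrative sizes, not from the paper).
n, H, d_h, m = 100_000, 4, 64, 128

# Dense attention: QK^T and attn @ V matmuls, per layer (factor 2 per multiply-add).
attention = 2 * H * n * n * d_h
# MLP_base: one call per layer on the scalar log(n).
mlp_base = 2 * (1 * m + m * H * d_h)
# MLP_gate: applied per token, per head (d_h -> m -> d_h).
mlp_gate = 2 * n * H * (d_h * m + m * d_h)

print(f"attention : {attention:.3e} FLOPs")
print(f"MLP_base  : {mlp_base:.3e} FLOPs")
print(f"MLP_gate  : {mlp_gate:.3e} FLOPs")
print(f"QASSMax overhead: {(mlp_base + mlp_gate) / attention:.2%}")
```

Under these assumed sizes the QASSMax MLPs contribute well under one percent of the attention FLOPs, consistent with the "negligible overhead" claim above.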
5. Empirical Evaluation
Empirical studies in "TabICLv2" (Qu et al., 11 Feb 2026) demonstrate the impact of QASSMax in large-context and foundation model settings:
- Needle-in-haystack classification: When the number of negative samples increases from 1,000 to 15,000, attention entropy with standard softmax approaches 1 (uniform), and accuracy drops, indicating a loss of focus. QASSMax, in contrast, maintains nearly zero entropy and 100% accuracy up to 15,000 negatives, outperforming both vanilla softmax and SSMax.
- Ablation on 60 development datasets: Removing QASSMax worsens validation log-loss by approximately 0.02. Framed as Elo scores, including QASSMax yields a roughly 100-point Elo advantage and a 64% win-rate over the baseline without it.
- End-to-end tasks: On million-scale tasks from TabArena and TALENT, QASSMax enables TabICLv2 to generalize without retraining or distillation, preventing catastrophic attention-fading even when deployed on out-of-distribution, large real-world tables.
6. Integration and Architectural Role in TabICLv2
TabICLv2 leverages QASSMax in two main architectural locations:
- Column-wise embeddings: During the induced-attention SetTransformer phase, QASSMax supports construction of column features from up to tens of thousands of rows.
- Cross-attention ICL: In final inference phases, QASSMax is used within the cross-attention block where test rows attend over all training rows ("dataset-wise in-context learning").
By preventing attention-fading as $n$ increases, QASSMax enables TabICLv2 models pretrained only on moderate context lengths (e.g., 60K) to be deployed on tables with millions of rows, maintaining both runtime efficiency and predictive performance. In controlled ablation, QASSMax is one of three major architectural contributions (alongside target-aware embeddings and the Muon optimizer) required for TabICLv2 to surpass the state of the art (RealTabPFN-2.5) without downstream tuning or ensembling (Qu et al., 11 Feb 2026).
7. Comparative Context and Theoretical Significance
QASSMax builds upon and generalizes the principles of Scalable-Softmax (Nakanishi, 31 Jan 2025):
- SSMax attenuates attention-fading using a per-head, per-layer scalar multiplication of the logits by a learned factor proportional to $\log n$.
- QASSMax replaces this with learned, per-dimension scaling and an additional query-aware gate, providing greater representational flexibility.
- Unlike sparse or kernel-based linear attention variants (which reduce per-head memory and computation to $O(n)$ at the cost of approximating the attention map), QASSMax preserves full-rank dense attention and remains hardware-compatible.
A plausible implication is that QASSMax's adaptivity may also benefit arbitrarily long-context Transformer models beyond tabular data, wherever context-length generalization and sharp focus are limiting factors. Its demonstrated effectiveness across both synthetic benchmarks and real-world datasets establishes it as a core architectural mechanism in modern scalable tabular foundation models (Qu et al., 11 Feb 2026).