GRAB: Generative Ranking for Ads at Baidu
- The paper introduces GRAB, a generative, sequence-first CTR model that treats user-ad interactions as autoregressive events to capture temporal dependencies.
- It leverages a fusion of sparse feature engineering and dense Transformer modules, including a causal action-aware multi-channel attention mechanism.
- Empirical results on billion-scale data show steady AUC improvements, enhanced CTR/CPM, and production-level efficiency with optimized inference.
Generative Ranking for Ads at Baidu (GRAB) is an end-to-end, sequence-centric click-through rate (CTR) prediction and ranking framework inspired by LLM scaling principles. GRAB integrates sparse feature engineering, autoregressive modeling of behavioral event streams, and a causal action-aware multi-channel attention mechanism, resulting in substantial gains in both prediction accuracy and monetization when deployed at industrial scale in Baidu’s ads system (Chen et al., 2 Feb 2026).
1. Motivation and Paradigm Shift
Traditional Deep Learning Recommendation Models (DLRMs) have historically relied on large, sparse embedding tables coupled with MLPs to encode user and item features. While effective at memorizing high-cardinality ID features, these models face diminishing marginal returns with increased capacity and often generalize poorly in the presence of fast-evolving user behavior or novel ads. Empirical evidence indicates that increasing model size or depth in DLRMs yields sharply diminishing returns and fails to leverage longer behavior sequences efficiently.
In contrast, LLMs demonstrate smooth, monotonic improvements when scaling data, model width, and sequence length. GRAB reframes CTR ranking from a static, pointwise prediction task to a generative, sequence-first paradigm, treating the interaction between user behavior and candidate ads as an autoregressive process. This approach enables the model to exploit temporal and contextual dependencies at any scale, overcoming the generalization and expressivity limitations of classical DLRMs.
2. Sequence-First Generative Objective
The core objective of GRAB is to model the probability of a sequence of click events $y_{1:T}$ given the corresponding sequence of context-rich event tokens $x_{1:T}$:

$$P(y_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} P\bigl(y_t \mid x_{\le t}\bigr)$$
Each candidate ad is evaluated in the behavioral context up to that impression. The model is trained by maximizing the conditional log-likelihood (or, equivalently, minimizing binary cross-entropy) at each autoregressive step:

$$\mathcal{L} = -\sum_{t=1}^{T} \Bigl[ y_t \log \hat{p}_t + (1 - y_t) \log\bigl(1 - \hat{p}_t\bigr) \Bigr], \qquad \hat{p}_t = P\bigl(y_t = 1 \mid x_{\le t}\bigr)$$
This paradigm allows the CTR model to predict in an online, sequential manner, naturally incorporating evolving user context and candidate heterogeneity.
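The per-step objective above can be sketched in a few lines (a minimal NumPy illustration of binary cross-entropy summed over autoregressive steps; function and variable names are illustrative, not from the paper):

```python
import numpy as np

def autoregressive_bce(logits, clicks):
    """Per-step binary cross-entropy over an event sequence.

    logits: (T,) model scores, one per impression token
    clicks: (T,) binary click labels y_t
    Conditioning on the packed history x_{<=t} is handled by the causal
    decoder that produces the logits; the loss itself is a sum over steps.
    """
    p = 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> p_hat_t
    eps = 1e-12                          # numerical guard for log(0)
    return -np.sum(clicks * np.log(p + eps)
                   + (1.0 - clicks) * np.log(1.0 - p + eps))
```

With zero logits each step contributes $\log 2$, which makes the reduction easy to sanity-check.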
3. Input Representation and Sequence Packing
Each user-event tuple is represented by embedding discrete fields (item/ad IDs, user attributes, contextual signals, interaction types) via a DLRM-style large sparse parameter server, followed by fusion through a “GateMLP” to produce dense $d$-dimensional tokens. At each timestamp $t$:
- Partial tokens $x_t^{\text{partial}}$: encode temporally varying user history.
- Full tokens $x_t^{\text{full}}$: concatenate all features necessary for scoring a candidate ad impression at $t$.
GRAB’s input for a user session is a concatenated sequence:

$$S = \bigl(x_1^{\text{partial}},\, x_1^{\text{full}},\, x_2^{\text{partial}},\, x_2^{\text{full}},\, \dots,\, x_T^{\text{partial}},\, x_T^{\text{full}}\bigr)$$
To eliminate padding inefficiencies, sequences are “packed” so that multiple contiguous impressions of the same user reside within each mini-batch, governed by a block-diagonal causal mask. The heterogeneous visibility mask ensures that partial tokens attend only to prior partials, while full tokens may also attend to themselves and to prior partials.
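The packing and visibility rules can be sketched as a boolean mask (a minimal reference implementation of the rule as described; a production kernel would be vectorized and fused, and all names here are illustrative):

```python
import numpy as np

PARTIAL, FULL = 0, 1

def visibility_mask(token_types, user_ids):
    """Heterogeneous block-diagonal causal mask for packed sequences.

    token_types[i] in {PARTIAL, FULL}; user_ids[i] groups the tokens of
    one packed user session. mask[i, j] == True means position i may
    attend to position j.
    """
    n = len(token_types)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if user_ids[i] != user_ids[j]:
                continue  # block-diagonal: no cross-user attention
            if token_types[i] == PARTIAL:
                # partial tokens see prior partials (self included)
                mask[i, j] = token_types[j] == PARTIAL and j <= i
            else:
                # full tokens see prior partials plus themselves,
                # but never other candidates' full tokens
                mask[i, j] = (token_types[j] == PARTIAL and j < i) or j == i
    return mask
```

Keeping candidate (full) tokens invisible to one another prevents information leakage between co-scored ads within a packed batch.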
4. Causal Action-Aware Multi-Channel Attention (CamA)
GRAB utilizes CamA to explicitly capture action-type and channel-specific temporal dynamics in user behavior. User event streams are split into disjoint channels (e.g., clicks, views). For each channel $c$, the channel input $X^{(c)}$ and the shared target token are processed through a stack of causal, sliding-window Transformer layers.
The attention weight between positions $i$ and $j$ within a layer uses a query-aware, relative bias:

$$a_{ij} \propto q_i^{\top} k_j + q_i^{\top} B^{\text{pos}}_{\rho(i,j)} + q_i^{\top} B^{\text{act}}_{\alpha(j)} + q_i^{\top} B^{\text{time}}_{\delta(i,j)}$$

where $B^{\text{pos}}$, $B^{\text{act}}$, and $B^{\text{time}}$ are bucketized, query-projected relative position, action, and time embedding banks, respectively, indexed by the relative offset $\rho(i,j)$, the action type $\alpha(j)$, and the bucketized time gap $\delta(i,j)$. After channel-specific self-attention and feedforward layers, per-channel representations $h^{(c)}$ at the target position are fused via a lightweight gated mixer:

$$h = \sum_{c=1}^{C} g_c \, h^{(c)}, \qquad g = \operatorname{softmax}\bigl(W_g\,[\,h^{(1)}; \dots; h^{(C)}\,]\bigr)$$
The concatenated, fused target representations are provided to a final logistic output head for CTR prediction.
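The gated fusion step can be sketched as follows (a sketch only: the paper specifies a “lightweight gated mixer,” and the softmax-over-channels parameterization here is an assumption; all names are illustrative):

```python
import numpy as np

def gated_channel_mix(channel_reprs, W_g, b_g):
    """Fuse per-channel target representations with a softmax gate.

    channel_reprs: (C, d) target-position outputs, one per action channel
    W_g: (C, C*d), b_g: (C,) parameters of the gating network
    Returns a single fused (d,) representation.
    """
    concat = channel_reprs.reshape(-1)        # [h^(1); ...; h^(C)]
    gate_logits = W_g @ concat + b_g          # one logit per channel
    g = np.exp(gate_logits - gate_logits.max())
    g = g / g.sum()                           # softmax over channels
    return (g[:, None] * channel_reprs).sum(axis=0)
```

With zero gate parameters the mixer degrades gracefully to a uniform average of the channels, which is a convenient initialization check.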
5. Model Architecture and Scaling Behavior
GRAB integrates the following pipeline in production:
- Sparse Feature Layer: Field lookup via hashing and embedding table.
- Dense Tokenizer: Fusion of features to form $d$-dimensional event tokens.
- Sequence Decoder: Stack of CamA-equipped Transformer blocks (typically 4–8 layers) with multi-head attention and a sliding window of length $w$.
- CTR Prediction Head: Logistic regression on the sequence output.
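The decoder's sliding-window causal attention pattern can be sketched as a mask (a minimal illustration; the exact window semantics beyond "sliding window length $w$" are an assumption):

```python
import numpy as np

def sliding_window_causal_mask(n, w):
    """Causal mask restricted to a window of length w: position i may
    attend to positions j with i - w < j <= i."""
    idx = np.arange(n)
    causal = idx[None, :] <= idx[:, None]          # no future positions
    windowed = idx[:, None] - idx[None, :] < w     # bounded lookback
    return causal & windowed
```

Bounding the lookback keeps per-layer attention cost linear in sequence length, which is one of the levers behind the production latency numbers reported below.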
Empirically, scaling the model’s depth, width, or the context sequence length yields steady, monotonic increases in AUC, with no observed performance saturation up to several hundred tokens. This echoes the scaling phenomena seen in LLMs.
6. Training and Inference Protocol
GRAB is trained using AdamW on binary cross-entropy, combined with regularization and the sparse+dense Sequence-Then-Sparse (STS) training schedule. This decoupled approach involves:
Stage I (Sequence Phase): User sequences are packed and fed into the model. Sparse embeddings are frozen; only dense tokenizer and Transformer weights are updated.
Stage II (Sparse Phase): Decorrelated user–ad exposure tuples are sampled. The dense components are frozen, and only the sparse embedding table is updated.
This regimen prevents intra-user correlation from degrading embedding learning. Typical hyperparameters: batch size ≈ 2,000 instances, learning rate ≈ 1e-4, linear warmup with cosine decay, and context window lengths up to 256.
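The alternating freeze/update pattern of the STS schedule can be sketched as a simple parameter-group selector (group names are illustrative, not from the paper):

```python
def sts_trainable_params(stage, params):
    """Sequence-Then-Sparse schedule: pick which parameter groups
    receive gradient updates in each stage.

    params: dict with keys 'sparse_embeddings', 'dense_tokenizer',
    'transformer' mapping to their parameter tensors.
    """
    if stage == "sequence":   # Stage I: freeze sparse, train dense
        return {k: v for k, v in params.items()
                if k in ("dense_tokenizer", "transformer")}
    if stage == "sparse":     # Stage II: freeze dense, train sparse
        return {k: v for k, v in params.items()
                if k == "sparse_embeddings"}
    raise ValueError(f"unknown stage: {stage}")
```

In practice the returned groups would be handed to the optimizer for that stage, leaving the remaining parameters untouched.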
For inference, user history is encoded and cached, so that for each candidate ad, only the last Transformer layer and logistic head must be evaluated, maintaining inference cost comparable to legacy DLRM baselines.
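This caching pattern can be sketched as follows (a structural illustration only; the interfaces and the cache policy are assumptions, not the production design):

```python
class CachedScorer:
    """Inference sketch: encode a user's history once, cache the result,
    and score each candidate with only the cheap final step."""

    def __init__(self, encode_history, score_candidate):
        self.encode_history = encode_history    # expensive: full decoder stack
        self.score_candidate = score_candidate  # cheap: last layer + logistic head
        self._cache = {}

    def score(self, user_id, history, candidate):
        if user_id not in self._cache:
            self._cache[user_id] = self.encode_history(history)
        state = self._cache[user_id]
        return self.score_candidate(state, candidate)
```

Because the expensive encoder runs once per user rather than once per candidate, per-candidate cost stays close to that of a pointwise DLRM head.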
7. Empirical Performance and Production Impact
On a billion-scale Baidu ad dataset, GRAB outperforms production DLRMs and prior deep sequential models:
| Model | AUC |
|---|---|
| DIN | 0.83309 |
| SIM | 0.83520 |
| TWIN | 0.83556 |
| HSTU | 0.83590 |
| LONGER | 0.83615 |
| GRAB-small | 0.83661 |
| GRAB-standard | 0.83772 |
Key ablation results for GRAB include:
- Remove multi-channel module: −0.00029 AUC
- Disable relative action bias: −0.00048 AUC
- No STS schedule: −0.00123 AUC
- Only partial tokens: −0.00280 AUC
Online A/B tests over one month (10% Baidu traffic) reported:
- +3.49% CTR
- +3.05% CPM (revenue)
- Inference latency matching the previous DLRM system
Despite the compute-bound nature of Transformer attention, optimizations such as sliding-window blocks, operator fusion, and KV caching kept latency within service-level targets.
GRAB exemplifies a sequence-first generative strategy for industrial ad ranking, fusing DLRM-style sparse feature engineering with LLM-inspired sequential modeling and multi-channel temporal attention, and demonstrates robust scaling and production readiness in billion-scale deployment (Chen et al., 2 Feb 2026).