GRAB: Generative Ranking for Ads at Baidu
- The paper introduces GRAB, a generative, sequence-first CTR model that treats user-ad interactions as autoregressive events to capture temporal dependencies.
- It leverages a fusion of sparse feature engineering and dense Transformer modules, including a causal action-aware multi-channel attention mechanism.
- Empirical results on billion-scale data show steady AUC improvements, enhanced CTR/CPM, and production-level efficiency with optimized inference.
Generative Ranking for Ads at Baidu (GRAB) is an end-to-end, sequence-centric click-through rate (CTR) prediction and ranking framework inspired by LLM scaling principles. GRAB integrates sparse feature engineering, autoregressive modeling of behavioral event streams, and a causal action-aware multi-channel attention mechanism, resulting in substantial gains in both prediction accuracy and monetization when deployed at industrial scale in Baidu’s ads system (Chen et al., 2 Feb 2026).
1. Motivation and Paradigm Shift
Traditional Deep Learning Recommendation Models (DLRMs) have historically relied on large, sparse embedding tables coupled with MLPs to encode user and item features. While effective at memorizing high-cardinality ID features, these models face diminishing marginal returns with increased capacity and often generalize poorly in the presence of fast-evolving user behavior or novel ads. Empirical evidence indicates that increasing model size or depth in DLRMs yields sharply diminishing returns and fails to leverage longer behavior sequences efficiently.
In contrast, LLMs demonstrate smooth, monotonic improvements when scaling data, model width, and sequence length. GRAB reframes CTR ranking from a static, pointwise prediction task to a generative, sequence-first paradigm, treating the interaction between user behavior and candidate ads as an autoregressive process. This approach enables the model to exploit temporal and contextual dependencies at any scale, overcoming the generalization and expressivity limitations of classical DLRMs.
2. Sequence-First Generative Objective
The core objective of GRAB is to model the probability of a sequence of click events $y_{1:T}$ given the corresponding sequence of context-rich event tokens $x_{1:T}$:

$$P(y_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} P\bigl(y_t \mid x_{\le t}\bigr)$$
Each candidate ad is evaluated in the behavioral context up to that impression. The model is trained by maximizing the conditional log-likelihood (or, equivalently, minimizing binary cross-entropy) at each autoregressive step:

$$\mathcal{L} = -\sum_{t=1}^{T} \Bigl[ y_t \log \hat{p}_t + (1 - y_t) \log\bigl(1 - \hat{p}_t\bigr) \Bigr], \qquad \hat{p}_t = P\bigl(y_t = 1 \mid x_{\le t}\bigr)$$
This paradigm allows the CTR model to predict in an online, sequential manner, naturally incorporating evolving user context and candidate heterogeneity.
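The per-step objective above can be sketched in a few lines (a minimal NumPy illustration of binary cross-entropy summed over autoregressive steps; function and variable names are illustrative, not from the paper):

```python
import numpy as np

def autoregressive_bce(logits, clicks):
    """Per-step binary cross-entropy over an event sequence.

    logits: (T,) model scores, one per impression token
    clicks: (T,) binary click labels y_t
    Conditioning on the packed history x_{<=t} is handled by the causal
    decoder that produces the logits; the loss itself is a sum over steps.
    """
    p = 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> p_hat_t
    eps = 1e-12                          # numerical guard for log(0)
    return -np.sum(clicks * np.log(p + eps)
                   + (1.0 - clicks) * np.log(1.0 - p + eps))
```

With zero logits each step contributes $\log 2$, which makes the reduction easy to sanity-check.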
3. Input Representation and Sequence Packing
Each user-event tuple is represented by embedding discrete fields (item/ad IDs, user attributes, contextual signals, interaction types) via a DLRM-style large sparse parameter server, followed by fusion through a “GateMLP” to produce dense $d$-dimensional tokens. At each timestamp $t$:
- Partial tokens $x_t^{\text{partial}}$: encode temporally varying user history.
- Full tokens $x_t^{\text{full}}$: concatenate all features necessary for scoring a candidate ad impression at $t$.
GRAB’s input for a user session is a concatenated sequence:

$$S = \bigl(x_1^{\text{partial}},\, x_1^{\text{full}},\, x_2^{\text{partial}},\, x_2^{\text{full}},\, \dots,\, x_T^{\text{partial}},\, x_T^{\text{full}}\bigr)$$
To eliminate padding inefficiencies, sequences are “packed” so that multiple contiguous impressions of the same user reside within each mini-batch, governed by a block-diagonal causal mask. The heterogeneous visibility mask ensures that partial tokens attend only to prior partials, while full tokens may also attend to themselves and to prior partials.
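The packing and visibility rules can be sketched as a boolean mask (a minimal reference implementation of the rule as described; a production kernel would be vectorized and fused, and all names here are illustrative):

```python
import numpy as np

PARTIAL, FULL = 0, 1

def visibility_mask(token_types, user_ids):
    """Heterogeneous block-diagonal causal mask for packed sequences.

    token_types[i] in {PARTIAL, FULL}; user_ids[i] groups the tokens of
    one packed user session. mask[i, j] == True means position i may
    attend to position j.
    """
    n = len(token_types)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if user_ids[i] != user_ids[j]:
                continue  # block-diagonal: no cross-user attention
            if token_types[i] == PARTIAL:
                # partial tokens see prior partials (self included)
                mask[i, j] = token_types[j] == PARTIAL and j <= i
            else:
                # full tokens see prior partials plus themselves,
                # but never other candidates' full tokens
                mask[i, j] = (token_types[j] == PARTIAL and j < i) or j == i
    return mask
```

Keeping candidate (full) tokens invisible to one another prevents information leakage between co-scored ads within a packed batch.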
4. Causal Action-Aware Multi-Channel Attention (CamA)
GRAB utilizes CamA to explicitly capture action-type and channel-specific temporal dynamics in user behavior. User event streams are split into disjoint channels (e.g., clicks, views). For each channel $c$, the channel input $X^{(c)}$ and the shared target token are processed through a stack of causal, sliding-window Transformer layers.
The attention weight between positions $i$ and $j$ within a layer uses a query-aware, relative bias:

$$a_{ij} \propto q_i^{\top} k_j + q_i^{\top} B^{\text{pos}}_{\rho(i,j)} + q_i^{\top} B^{\text{act}}_{\alpha(j)} + q_i^{\top} B^{\text{time}}_{\delta(i,j)}$$

where $B^{\text{pos}}$, $B^{\text{act}}$, and $B^{\text{time}}$ are bucketized, query-projected relative position, action, and time embedding banks, respectively, indexed by the relative offset $\rho(i,j)$, the action type $\alpha(j)$, and the bucketized time gap $\delta(i,j)$. After channel-specific self-attention and feedforward layers, per-channel representations $h^{(c)}$ at the target position are fused via a lightweight gated mixer:

$$h = \sum_{c=1}^{C} g_c \, h^{(c)}, \qquad g = \operatorname{softmax}\bigl(W_g\,[\,h^{(1)}; \dots; h^{(C)}\,]\bigr)$$
The concatenated, fused target representations are provided to a final logistic output head for CTR prediction.
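The gated fusion step can be sketched as follows (a sketch only: the paper specifies a “lightweight gated mixer,” and the softmax-over-channels parameterization here is an assumption; all names are illustrative):

```python
import numpy as np

def gated_channel_mix(channel_reprs, W_g, b_g):
    """Fuse per-channel target representations with a softmax gate.

    channel_reprs: (C, d) target-position outputs, one per action channel
    W_g: (C, C*d), b_g: (C,) parameters of the gating network
    Returns a single fused (d,) representation.
    """
    concat = channel_reprs.reshape(-1)        # [h^(1); ...; h^(C)]
    gate_logits = W_g @ concat + b_g          # one logit per channel
    g = np.exp(gate_logits - gate_logits.max())
    g = g / g.sum()                           # softmax over channels
    return (g[:, None] * channel_reprs).sum(axis=0)
```

With zero gate parameters the mixer degrades gracefully to a uniform average of the channels, which is a convenient initialization check.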
5. Model Architecture and Scaling Behavior
GRAB integrates the following pipeline in production:
- Sparse Feature Layer: Field lookup via hashing and embedding table.
- Dense Tokenizer: Fusion of features to form $d$-dimensional event tokens.
- Sequence Decoder: Stack of CamA-equipped Transformer blocks (typically 4–8 layers) with multi-head attention and a sliding window of length $w$.
- CTR Prediction Head: Logistic regression on the sequence output.
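The decoder's sliding-window causal attention pattern can be sketched as a mask (a minimal illustration; the exact window semantics beyond "sliding window length $w$" are an assumption):

```python
import numpy as np

def sliding_window_causal_mask(n, w):
    """Causal mask restricted to a window of length w: position i may
    attend to positions j with i - w < j <= i."""
    idx = np.arange(n)
    causal = idx[None, :] <= idx[:, None]          # no future positions
    windowed = idx[:, None] - idx[None, :] < w     # bounded lookback
    return causal & windowed
```

Bounding the lookback keeps per-layer attention cost linear in sequence length, which is one of the levers behind the production latency numbers reported below.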
Empirically, scaling the model’s depth, width, or the context sequence length yields steady, monotonic increases in AUC, with no observed performance saturation up to several hundred tokens. This echoes the scaling phenomena seen in LLMs.
6. Training and Inference Protocol
GRAB is trained using AdamW on binary cross-entropy, combined with regularization and the sparse+dense Sequence-Then-Sparse (STS) training schedule. This decoupled approach involves:
Stage I (Sequence Phase): User sequences are packed and fed into the model. Sparse embeddings are frozen; only dense tokenizer and Transformer weights are updated.
Stage II (Sparse Phase): Decorrelated user–ad exposure tuples are sampled. The dense components are frozen, and only the sparse embedding table is updated.
This regimen prevents intra-user correlation from degrading embedding learning. Typical hyperparameters: batch size ≈ 2,000 instances, learning rate ≈ 1e-4, linear warmup with cosine decay, and context window lengths up to 256.
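The alternating freeze/update pattern of the STS schedule can be sketched as a simple parameter-group selector (group names are illustrative, not from the paper):

```python
def sts_trainable_params(stage, params):
    """Sequence-Then-Sparse schedule: pick which parameter groups
    receive gradient updates in each stage.

    params: dict with keys 'sparse_embeddings', 'dense_tokenizer',
    'transformer' mapping to their parameter tensors.
    """
    if stage == "sequence":   # Stage I: freeze sparse, train dense
        return {k: v for k, v in params.items()
                if k in ("dense_tokenizer", "transformer")}
    if stage == "sparse":     # Stage II: freeze dense, train sparse
        return {k: v for k, v in params.items()
                if k == "sparse_embeddings"}
    raise ValueError(f"unknown stage: {stage}")
```

In practice the returned groups would be handed to the optimizer for that stage, leaving the remaining parameters untouched.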
For inference, user history is encoded and cached, so that for each candidate ad, only the last Transformer layer and logistic head must be evaluated, maintaining inference cost comparable to legacy DLRM baselines.
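This caching pattern can be sketched as follows (a structural illustration only; the interfaces and the cache policy are assumptions, not the production design):

```python
class CachedScorer:
    """Inference sketch: encode a user's history once, cache the result,
    and score each candidate with only the cheap final step."""

    def __init__(self, encode_history, score_candidate):
        self.encode_history = encode_history    # expensive: full decoder stack
        self.score_candidate = score_candidate  # cheap: last layer + logistic head
        self._cache = {}

    def score(self, user_id, history, candidate):
        if user_id not in self._cache:
            self._cache[user_id] = self.encode_history(history)
        state = self._cache[user_id]
        return self.score_candidate(state, candidate)
```

Because the expensive encoder runs once per user rather than once per candidate, per-candidate cost stays close to that of a pointwise DLRM head.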
7. Empirical Performance and Production Impact
On a billion-scale Baidu ad dataset, GRAB outperforms production DLRMs and prior deep sequential models:
| Model | AUC |
|---|---|
| DIN | 0.83309 |
| SIM | 0.83520 |
| TWIN | 0.83556 |
| HSTU | 0.83590 |
| LONGER | 0.83615 |
| GRAB-small | 0.83661 |
| GRAB-standard | 0.83772 |
Key ablation results for GRAB include:
- Remove multi-channel module: −0.00029 AUC
- Disable relative action bias: −0.00048 AUC
- No STS schedule: −0.00123 AUC
- Only partial tokens: −0.00280 AUC
Online A/B tests over one month (10% Baidu traffic) reported:
- +3.49% CTR
- +3.05% CPM (revenue)
- Inference latency matching the previous DLRM system
Despite the compute-bound nature of Transformer attention, optimizations such as sliding-window blocks, operator fusion, and KV caching kept latency within service-level targets.
GRAB exemplifies a sequence-first generative strategy for industrial ad ranking, fusing DLRM-style sparse feature engineering with LLM-inspired sequential modeling and multi-channel temporal attention, and demonstrates robust scaling and production readiness in billion-scale deployment (Chen et al., 2 Feb 2026).