
History-Enhanced Two-Stage Transformer (HETT)

Updated 23 December 2025
  • The paper introduces explicit history tokens and spatial grid memories to integrate long-range context in transformer architectures for complex sequence tasks.
  • HETT employs a two-stage design where a global transformer aggregates historical data and a local transformer refines predictions using current observations.
  • Empirical results demonstrate significant performance gains over baselines in both event classification and vision-and-language navigation tasks.

The History-Enhanced Two-Stage Transformer (HETT) framework is a class of transformer-based models designed to address limitations in sequence processing tasks that require a robust integration of historical context for long-horizon, complex decision-making. HETT has been applied both to event sequence modeling—primarily in classification and prediction tasks for temporal point processes (Karpukhin et al., 2 Aug 2025)—and to aerial vision-and-language navigation, where spatial and historical information must be fused for effective policy generation (Ding et al., 16 Dec 2025). Core innovations include: explicit memory constructs (history tokens or grid-based spatial memories), architectural decomposition into global and local reasoning stages, and sparsely structured attention mechanisms that prioritize relevant temporal or spatial context.

1. Architectures: History-Enhanced Memory and Two-Stage Processing

Both event modeling and vision-and-language navigation variants of HETT introduce structured mechanisms for representing and integrating historical information.

In event sequence classification, the model augments transformer input sequences with learnable "history tokens"—vectors inserted at controlled frequencies and positions—which, through sparse attention masking, serve as explicit bottlenecks to accumulate prefix information. This design compensates for the absence of a single hidden state vector in causal transformers, enabling sequence-to-vector summarization analogous to RNN hidden states for downstream tasks (Karpukhin et al., 2 Aug 2025).

In aerial navigation, HETT applies a "historical grid map": an $S^H \times S^H$ structured memory of aggregated feature vectors mapped to discretized agent positions in the environment. Features from past steps are dynamically aggregated within each cell, leveraging instruction-dependent relevance weights. This spatial memory is flattened and fused with current visual, pose, instruction, and landmark embeddings to propagate both global scene knowledge and fine-grained, temporally informed context (Ding et al., 16 Dec 2025).

Both domains apply a two-stage transformer policy:

  • Global/Coarse Stage: Predicts overall targets (sequence-level classification or coarse map positions) by integrating historical context.
  • Local/Fine Stage: Refines predictions or actions, based on up-to-date local observations, spatial history, or both.

2. Formalisms and Core Mechanisms

Event Sequence Modeling

Let $\{e_1,\dots,e_L\}$ denote an event sequence. A set of $H$ learnable history tokens $\{\mathrm{HT}_1,\dots,\mathrm{HT}_H\}$, each $h_k \in \mathbb{R}^d$, is inserted at chosen positions $\{t_1, \dots, t_H\}$. Attention masks are modified:

  • Each $\mathrm{HT}_k$ attends to all events preceding $t_k$, but not to other history tokens.
  • Each event $e_i$ in $(t_{k-1}, t_k]$ attends to events within that segment and to exactly one prior history token (selected as the last one, or at random).
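These two masking rules can be sketched as a boolean attention mask. The layout below (history tokens interleaved with events in a single index space, with the "last prior HT" selection rule) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def hett_attention_mask(seq_len, ht_positions):
    """Boolean mask (True = may attend) for a combined sequence of events
    and history tokens (HTs), following the two rules above.

    `ht_positions` indexes HTs in the combined (events + HTs) sequence.
    Illustrative sketch; index conventions are assumptions.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    ht_set = set(ht_positions)
    sorted_hts = sorted(ht_positions)
    for i in range(seq_len):
        if i in ht_set:
            # Rule 1: an HT attends to all earlier events, not other HTs.
            for j in range(i):
                if j not in ht_set:
                    mask[i, j] = True
            mask[i, i] = True
        else:
            # Rule 2: an event attends causally within its segment,
            # plus exactly one prior HT (here: the last one).
            prev_hts = [h for h in sorted_hts if h < i]
            seg_start = (prev_hts[-1] + 1) if prev_hts else 0
            for j in range(seg_start, i + 1):
                if j not in ht_set:
                    mask[i, j] = True
            if prev_hts:
                mask[i, prev_hts[-1]] = True
    return mask
```

Because events after an HT attend only to their own segment plus that HT, the HT becomes the sole channel for prefix information, which is the intended bottleneck effect.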

During pretraining, the objective is next-token prediction, fieldwise (time, numerical, categorical). In fine-tuning, a single history token at sequence-end produces a compact embedding for classification, either via an external classifier or with an end-to-end head (Karpukhin et al., 2 Aug 2025).
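The fieldwise objective can be sketched as a sum of per-field losses; the field names, dict layout, and equal weighting below are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def fieldwise_ntp_loss(pred, target):
    """Next-token prediction loss summed over fields (sketch).

    Assumed layout: regression outputs for 'time' and 'num' fields,
    and 'cat_logits' (predictions) vs. 'cat' (integer labels) for the
    categorical field. Equal field weighting is an assumption.
    """
    # Squared error for the continuous fields (time delta, numeric value).
    loss = np.mean((pred["time"] - target["time"]) ** 2)
    loss += np.mean((pred["num"] - target["num"]) ** 2)
    # Cross-entropy for the categorical field.
    logits = pred["cat_logits"]
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    loss += -np.mean(logp[np.arange(len(target["cat"])), target["cat"]])
    return loss
```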

Aerial Vision-and-Language Navigation

Inputs at step $t$ comprise:

  • Instruction tokens $E \in \mathbb{R}^{N \times D}$,
  • A spatial landmark token $L \in \mathbb{R}^D$,
  • Aggregated historical-grid features $F_t \in \mathbb{R}^D$,
  • Visual tokens $V_t \in \mathbb{R}^D$ (RGB-D observations cross-attended to instructions),
  • A pose embedding $P_t \in \mathbb{R}^D$ (the agent's 3D position).

The multi-layer transformer (MLT) processes these inputs to produce:

  • A global target estimate $g_t \in [0,1]^2$, through a head applied to $G_t$,
  • A local action refinement output: progress $r_t \in [0,1]$ and angle $a_t \in (-\pi, \pi]$ from $[R_t; A_t]$ (Ding et al., 16 Dec 2025).
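The input assembly and the two output heads can be illustrated with a minimal NumPy sketch. The random features, the instruction length `N`, and the toy linear heads are stand-ins for the real transformer, whose internals the source does not specify:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 768, 12  # D = 768 from the paper; N (instruction length) assumed

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random stand-ins for the real step-t features.
E   = rng.normal(size=(N, D))  # instruction tokens
L_  = rng.normal(size=(D,))    # landmark token
F_t = rng.normal(size=(D,))    # aggregated historical-grid features
V_t = rng.normal(size=(D,))    # visual token
P_t = rng.normal(size=(D,))    # pose embedding

# Stack everything into one token sequence for the multi-layer transformer.
tokens = np.vstack([E, L_[None], F_t[None], V_t[None], P_t[None]])  # (N+4, D)

# Hypothetical readout: pretend the transformer returned contextualized
# tokens and apply toy linear heads to the last one.
W_g = rng.normal(size=(D, 2)) * 0.01
g_t = sigmoid(tokens[-1] @ W_g)               # global target in [0, 1]^2

W_r = rng.normal(size=(D,)) * 0.01
W_a = rng.normal(size=(D,)) * 0.01
r_t = sigmoid(tokens[-1] @ W_r)               # progress in [0, 1]
a_t = np.pi * np.tanh(tokens[-1] @ W_a)       # angle in (-pi, pi]
```

The sigmoid and scaled-tanh squashing ensure the heads respect the stated output ranges by construction.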

The historical grid map updates as follows: for each cell $(x, y)$, features $\{m_{t,j}\}$ from prior steps are stored; their instruction affinity is computed as $K_{(x,y)} = \mathrm{Softmax}(M^H_{t,(x,y)} \cdot E^\top)$. These affinities weight the aggregation of cell features into $F_{t,(x,y)}$, which is subsequently pooled into $F_t$ for the current step.
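A per-cell aggregation step might look as follows; the choice of max-pooling affinities over instruction tokens and the normalization are assumptions, since the source only specifies the softmax affinity itself:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_cell(M_cell, E):
    """Aggregate stored step features in one grid cell by instruction affinity.

    M_cell: (J, D) features m_{t,j} stored in cell (x, y) from prior steps.
    E:      (N, D) instruction token embeddings.
    Sketch of the update described above; the pooling choices (max over
    instruction tokens, renormalized weights) are assumptions.
    """
    K = softmax(M_cell @ E.T)   # (J, N) instruction affinities
    w = K.max(axis=1)           # per-feature relevance (assumed reduction)
    w = w / w.sum()
    return w @ M_cell           # (D,) aggregated cell feature F_{t,(x,y)}
```

Because the weights are non-negative and sum to one, the aggregated feature is a convex combination of the stored step features, keeping it in the same feature range.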

3. Training and Optimization Protocols

Event Sequences

HETT training follows a two-stage regime:

  1. Next-Token Prediction Pretraining: Sequences are augmented with history tokens at a frequency $f$ (values $0.05 \leq f \leq 0.2$ are robust), and attention is controlled with sparse masks. For each batch, history tokens are inserted with probability $p = 0.5$, and sequences are embedded and passed through the transformer with either the sparse or the standard causal mask. The Adam optimizer with learning rate $\mathrm{lr} = 10^{-3}$ is used, with early stopping on validation loss.
  2. Downstream Classification: Either the pretrained transformer is frozen and its final history-token embedding is fed to a LightGBM classifier, or the architecture is fine-tuned end-to-end with a new classification head (cross-entropy loss, optional LoRA adapters, $E_\mathrm{SFT} = 20$ epochs) (Karpukhin et al., 2 Aug 2025).
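The stochastic insertion step from the pretraining regime can be sketched as follows. Bias-End-style placement is approximated here by always keeping one history token at the sequence end; the exact sampling scheme for the remaining positions is an assumption:

```python
import numpy as np

def sample_ht_positions(seq_len, f=0.1, p=0.5, rng=None):
    """Choose history-token insert positions for one training sequence.

    f: HT frequency relative to sequence length (0.05-0.2 reported robust).
    p: per-sequence probability of inserting HTs at all (paper uses 0.5).
    Returns positions in (0, seq_len]; an empty list means the sequence
    is trained with the standard causal mask instead.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() > p:
        return []  # standard causal mask for this sequence
    n_ht = max(1, int(f * seq_len))
    inner = []
    if n_ht > 1:
        # Remaining HTs scattered uniformly over interior positions (assumed).
        inner = sorted(rng.choice(np.arange(1, seq_len),
                                  size=n_ht - 1, replace=False))
    return list(inner) + [seq_len]  # one HT always after the last event
```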

Training for the navigation HETT employs DAgger imitation learning, supervising global target, local progress, and turning angle, using:

$$\mathcal{L} = \alpha_1 \mathcal{L}^G + \alpha_2 \mathcal{L}^A + \alpha_3 \mathcal{L}^R$$

with weights $(\alpha_1, \alpha_2, \alpha_3) = (2.0, 1.5, 0.1)$ balancing the objectives. The transformer backbone uses $L = 4$ layers, $h = 8$ heads, and an FFN dimension of $2048$; batch and inference procedures are tailored for GPU deployment. Input feature dimensionalities are standardized at $D = 768$ (Ding et al., 16 Dec 2025).
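The weighted objective is a straightforward linear combination; the per-term loss values below are placeholders:

```python
# Weighted DAgger imitation loss with the paper's coefficients
# (global target, turning angle, progress). Term values are placeholders.
ALPHA = (2.0, 1.5, 0.1)

def hett_loss(L_G, L_A, L_R, a=ALPHA):
    """Combine global, angle, and progress losses with fixed weights."""
    return a[0] * L_G + a[1] * L_A + a[2] * L_R
```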

4. Empirical Results and Benchmarks

Event Sequences

HT-Transformer (the event-sequence instantiation of HETT) achieves state-of-the-art or near state-of-the-art results on five benchmarks:

| Dataset | Metric | Best HETT Result | Baseline Comparison |
|---|---|---|---|
| Churn | ROC AUC | $83.76 \pm 0.50$ | $82.80 \pm 0.40$ (NTP RNN+SFT) |
| AgePred | Accuracy | $64.26 \pm 0.30$ | $64.09 \pm 0.31$ (NTP Transformer+SFT) |
| AlfaBattle | ROC AUC | $81.63 \pm 0.05$ | $81.70 \pm 0.17$ (NTP Transformer+SFT) |
| MIMIC-III | ROC AUC | $92.97 \pm 0.07$ | $92.91 \pm 0.15$ (NTP RNN+SFT) |
| Taobao | ROC AUC | $87.29 \pm 0.52$ | $86.40 \pm 2.64$ (NTP RNN+SFT) |

Ablation studies confirm that the combination of Bias-End placement, Last history-token selection, and a moderate application probability $p$ yields the best results. Even a single final history token ($f = 0$) gives competitive performance, but additional tokens improve summarization of local context (Karpukhin et al., 2 Aug 2025).

On the refined CityNav dataset, the navigation HETT achieves:

| Split | NE (m) ↓ | SR (%) ↑ | OSR (%) ↑ | Baseline SR (%) |
|---|---|---|---|---|
| Val-Seen | 37.2 | 31.09 | 51.86 | 16.93 (MGP) |
| Val-Unseen | 51.3 | 19.10 | 34.78 | 8.35 (MGP) |
| Test-Unseen | — | 28.90 | — | 10.90 (MGP) |

Ablations isolate the effects of dataset refinement (+3–4% SR), the two-stage policy (+1.2–1.4% SR), and the historical grid map (+4.3% / +1.7% SR); together these gains produce overall SR improvements of +11.7% (Val-Seen) and +9.3% (Val-Unseen) over a single-stage baseline without explicit history (Ding et al., 16 Dec 2025).

5. Design Considerations and Limitations

Both HETT variants rely on engineering choices that affect performance and deployment:

  • History Token/Memory Placement: The event HETT requires placement policies (e.g., Bias-End strategy) for history tokens; future work could use learnable or adaptive selection (Karpukhin et al., 2 Aug 2025).
  • Sparse Attention: The event model implements sparsity by masking; further speed-up and scaling may require custom CUDA kernels.
  • Spatial Memory: The navigation HETT's grid is static and bounded; it cannot accommodate dynamic scene changes or growing memory footprints.
  • Landmark Handling: The navigation variant presupposes pre-defined landmark contours; it cannot handle unknown landmarks in novel environments without extension.
  • Real-Time Deployment: Transformer computational cost remains a challenge for embedded UAV use (Ding et al., 16 Dec 2025).

6. Extensions and Future Directions

Proposed enhancements include:

  • For Event Modeling: Dynamic or learnable token placement, optimized sparse attention implementation, or integration with contrastive/global objectives for explicitly global properties (Karpukhin et al., 2 Aug 2025).
  • For Navigation: Addition of online semantic segmentation/object detection (to discover new landmarks), dynamic/learnable spatial memory (e.g., Neural SLAM), policy distillation for hardware deployment, or multi-modal temporal encoding overlays on memory tokens (Ding et al., 16 Dec 2025).

This suggests that HETT's foundational principle—the explicit encoding and structured integration of historical context, whether for event or spatial-temporal tasks—remains central as the architectures adapt to broader settings and higher data complexity.

7. Software and Best Practices

The HT-Transformer codebase for event sequence modeling is publicly available at https://github.com/ivan-chai/pretpp. Best practices for deployment in navigation include careful calibration of pose mapping into the historical grid, tuning grid size to environment granularity, pre-filtering landmarks, conservative progress thresholds, and off-board GPU inference to manage computational load (Karpukhin et al., 2 Aug 2025, Ding et al., 16 Dec 2025).
