Temporal-aware Matryoshka Representation Learning
- The paper introduces TMRL, which explicitly dedicates a temporal subspace in embeddings using targeted contrastive and self-distillation techniques.
- It leverages a nested Matryoshka design that enables dynamic truncation and tunable efficiency–accuracy trade-offs in retrieval and RAG applications.
- Empirical results show TMRL maintains competitive semantic retrieval while reducing storage and latency, outperforming traditional MRL approaches.
Temporal-aware Matryoshka Representation Learning (TMRL) is a framework for equipping text embedding models (TEMs) with a dedicated temporal subspace, enabling efficient, flexible retrieval of temporally relevant context—particularly in Retrieval-Augmented Generation (RAG) systems. TMRL leverages the nested structure of Matryoshka Representation Learning, explicitly reserves dimensions for temporal encoding, and integrates targeted contrastive learning and self-distillation. The approach yields competitive performance for temporal information retrieval and temporal RAG compared with prior methods, while offering controllable efficiency–accuracy trade-offs (Huynh et al., 9 Jan 2026).
1. Background: Matryoshka Embeddings and Temporal Motivation
Conventional text embedding models encode a query or passage as a single $d$-dimensional vector. Matryoshka Representation Learning (MRL) augments this paradigm by training the encoder such that any prefix of $m \le d$ dimensions forms a performant embedding:
- Full embedding: $z = f(x) \in \mathbb{R}^d$
- Truncated embedding at level $m$: $z_{1:m} = (z_1, \dots, z_m) \in \mathbb{R}^m$
Vanilla MRL relies on semantic InfoNCE losses summed across truncation levels, $\mathcal{L}_{\mathrm{MRL}} = \sum_{m \in \mathcal{M}} \mathcal{L}_{\mathrm{InfoNCE}}(z_{1:m})$, but does not guarantee any explicit temporal signal in the embedding subspaces. Temporal retrieval demands embeddings that encode both "when" and "what." TMRL addresses this by explicitly designating the first $k$ dimensions as a temporal-aware subspace.
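The prefix-truncation property above can be illustrated with a minimal sketch (toy 8-dimensional vectors stand in for real $d=768$ model outputs; the `truncate` helper is illustrative, not the paper's code):

```python
import math

def truncate(z, m):
    """Keep the first m dimensions of an embedding and L2-renormalize,
    as in Matryoshka Representation Learning (MRL)."""
    prefix = z[:m]
    norm = math.sqrt(sum(x * x for x in prefix)) or 1.0
    return [x / norm for x in prefix]

def cosine(a, b):
    # Inputs are assumed unit-norm, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dim embeddings (stand-ins for d=768 model outputs).
q = truncate([0.9, 0.1, 0.4, 0.2, 0.3, 0.1, 0.0, 0.2], 8)
p = truncate([0.8, 0.2, 0.5, 0.1, 0.2, 0.2, 0.1, 0.1], 8)

# Any prefix m <= d is itself a usable embedding.
for m in (8, 4, 2):
    sim = cosine(truncate(q, m), truncate(p, m))
    print(f"m={m}: cos = {sim:.3f}")
```

In a trained MRL model the similarity ranking stays approximately stable as $m$ shrinks; vanilla MRL, however, says nothing about which dimensions carry temporal information.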
2. TMRL Model Architecture
TMRL adapts a frozen base TEM using lightweight Low-Rank Adaptation (LoRA) and introduces a temporal projection module.
2.1 Matryoshka Embedding Split
The representation is split as $z = [\, z^{\mathrm{temp}} \,\|\, z^{\mathrm{sem}} \,]$, where the temporal component $z^{\mathrm{temp}} = z_{1:k}$ occupies the first $k$ dimensions.
For a sequence of hidden states $h_1, \dots, h_L$, TMRL identifies temporal token positions $\mathcal{T} \subseteq \{1, \dots, L\}$, as tagged by tools like SUTime. The corresponding vectors are passed through a 2-layer temporal projector $g$ and then mean-pooled: $z^{\mathrm{temp}} = \frac{1}{|\mathcal{T}|} \sum_{i \in \mathcal{T}} g(h_i)$.
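The pooling step can be sketched as follows (a minimal illustration: the identity function stands in for the paper's 2-layer projector, and the temporal positions are given directly rather than produced by SUTime):

```python
def temporal_pool(hidden_states, temporal_positions, projector):
    """Mean-pool projected hidden states at temporal token positions.
    `projector` stands in for TMRL's 2-layer temporal projection MLP;
    `temporal_positions` would come from a tagger such as SUTime."""
    projected = [projector(hidden_states[i]) for i in temporal_positions]
    dim = len(projected[0])
    n = len(projected)
    return [sum(v[j] for v in projected) / n for j in range(dim)]

# Toy example: 4 tokens with 3-dim hidden states; tokens 1 and 3 are dates.
H = [[0.1, 0.2, 0.3],
     [1.0, 0.0, 0.0],   # e.g. "1984"
     [0.2, 0.1, 0.4],
     [0.0, 1.0, 0.0]]   # e.g. "March"
identity = lambda h: h  # placeholder projector
z_temp = temporal_pool(H, [1, 3], identity)
print(z_temp)  # [0.5, 0.5, 0.0]
```

The result is a fixed-size temporal summary regardless of how many date tokens the passage contains.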
2.2 Temporal Subspace Contrastive Learning
Temporal retrieval is supervised using positive and negative queries generated through data augmentation and LLM prompting (Qwen3-4B). Contrastive InfoNCE losses in the $k$-dimensional subspace are defined for both query-to-passage and passage-to-query alignments:
- Query-to-passage: $\mathcal{L}_{q \to p} = -\log \frac{\exp(s(q, p^{+})/\tau)}{\sum_{p'} \exp(s(q, p')/\tau)}$
- Passage-to-query: $\mathcal{L}_{p \to q} = -\log \frac{\exp(s(p^{+}, q)/\tau)}{\sum_{q'} \exp(s(p^{+}, q')/\tau)}$
The full temporal contrastive loss is $\mathcal{L}_{\mathrm{temp}} = \frac{1}{2}\left(\mathcal{L}_{q \to p} + \mathcal{L}_{p \to q}\right)$, where $s(\cdot, \cdot)$ denotes cosine similarity over the temporal subspace and $\tau$ is a temperature parameter (0.02–0.05).
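The query-to-passage direction can be sketched in a few lines (a toy numerically-stable InfoNCE restricted to the first $k$ dims; the 4-dim vectors and single negative are illustrative):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(query, positive, negatives, k, tau=0.05):
    """Query-to-passage InfoNCE restricted to the first k (temporal) dims.
    tau is the temperature; TMRL tunes it in 0.02-0.05."""
    sims = [cos_sim(query[:k], p[:k]) for p in [positive] + negatives]
    logits = [s / tau for s in sims]
    mx = max(logits)  # subtract max for numerical stability
    log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return -(logits[0] - log_z)

q = [0.9, 0.1, 0.3, 0.5]
p_pos = [0.8, 0.2, -0.4, 0.1]      # temporally matching passage
p_negs = [[-0.7, 0.3, 0.2, 0.6]]   # temporally shifted passage
loss = info_nce(q, p_pos, p_negs, k=2)
print(f"{loss:.4f}")
```

Because only the first $k$ coordinates enter the loss, gradients shape the temporal subspace without disturbing the semantic remainder of the embedding.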
2.3 Self-Distillation Regularization
To enforce consistency across dimensional truncations, TMRL includes:
- Local similarity preservation: the top-$N$ neighbor similarity distribution at each truncation level is aligned (KL divergence) with that of the full embedding.
- Global geometry alignment via linear CKA: $\mathrm{CKA}(X, Y) = \frac{\|Y^{\top} X\|_F^2}{\|X^{\top} X\|_F \, \|Y^{\top} Y\|_F}$, computed between truncated and full-dimensional batch representations.
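A pure-Python sketch of the linear CKA computation (the toy 4-sample batch and the prefix-truncation at $m=2$ are illustrative):

```python
import math

def linear_cka(X, Y):
    """Linear CKA between two feature matrices (rows = samples),
    e.g. truncated vs. full-dimension embeddings of the same batch.
    Returns 1.0 when the two geometries match exactly."""
    def center(M):
        n = len(M)
        mu = [sum(r[j] for r in M) / n for j in range(len(M[0]))]
        return [[r[j] - mu[j] for j in range(len(r))] for r in M]

    def cross_frob2(A, B):
        # Squared Frobenius norm of A^T B.
        return sum(
            sum(A[t][i] * B[t][j] for t in range(len(A))) ** 2
            for i in range(len(A[0])) for j in range(len(B[0]))
        )

    Xc, Yc = center(X), center(Y)
    num = cross_frob2(Xc, Yc)
    den = math.sqrt(cross_frob2(Xc, Xc)) * math.sqrt(cross_frob2(Yc, Yc))
    return num / den

# Truncated embeddings should stay geometrically aligned with full ones:
full = [[1.0, 0.2, 0.1], [0.1, 1.0, 0.3], [0.2, 0.1, 1.0], [0.5, 0.5, 0.1]]
trunc = [row[:2] for row in full]  # m = 2 prefix
print(f"CKA = {linear_cka(full, trunc):.3f}")
```

Maximizing this score during training keeps the truncated batch geometry close to the full-dimensional one.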
2.4 Unified Training Objective
The final loss combines semantic, temporal, and regularization objectives:
$\mathcal{L} = \mathcal{L}_{\mathrm{sem}} + \lambda_{\mathrm{temp}} \mathcal{L}_{\mathrm{temp}} + \lambda_{\mathrm{local}} \mathcal{L}_{\mathrm{local}} + \lambda_{\mathrm{global}} \mathcal{L}_{\mathrm{global}}$,
with $\lambda_{\mathrm{temp}} \le 0.25$ and $\lambda_{\mathrm{local}} = \lambda_{\mathrm{global}} = 0.1$ typical.
The high-level training pipeline freezes the base TEM, applies LoRA adapters and the temporal projector, extracts temporal tokens and computes all loss terms per batch, and backpropagates only through the LoRA adapters and the projector. At convergence, the LoRA weights are merged into the base model and the projector is discarded.
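The weighted combination of the objectives can be sketched as follows (weight names are illustrative; the default values reflect the reported ranges, with the temporal weight tuned up to 0.25 and the self-distillation weights fixed at 0.1):

```python
def tmrl_loss(l_sem, l_temp, l_local, l_global,
              lam_temp=0.1, lam_local=0.1, lam_global=0.1):
    """Weighted sum of TMRL's objectives. Weight names are illustrative;
    defaults follow the ranges reported in the paper."""
    return (l_sem
            + lam_temp * l_temp
            + lam_local * l_local
            + lam_global * l_global)

# Per-batch loss terms (placeholder values):
total = tmrl_loss(l_sem=1.2, l_temp=0.8, l_local=0.3, l_global=0.2)
print(round(total, 3))  # 1.33
```

Only this scalar is backpropagated, and only the LoRA and projector parameters receive gradients.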
3. Training Protocols and Hyperparameter Choices
Optimization uses AdamW (batch size 256, four hard negatives, 5 epochs, or 1 for Nomic); the learning rate and LoRA rank and scaling factor are tuned per base model, with LoRA dropout 0.1 applied to all linear layers. The temperature $\tau$ is likewise tuned per base model, and the temporal loss weight $\lambda_{\mathrm{temp}}$ and subspace dimension $k$ are adapted per TEM and dataset (e.g., different settings for Contriever on TNP/TimeQA than for GTE). The self-distillation weights $\lambda_{\mathrm{local}}$ and $\lambda_{\mathrm{global}}$ are fixed at 0.1.
4. Evaluation Suites, Data, and Metrics
Datasets and Preprocessing
TMRL is benchmarked on:
- Temporal Nobel Prize (TNP): Paragraph-level, TemporalQA-style queries with single temporal anchors; passages split, multi-date sentences excluded, and queries augmented (explicit, implicit, temporal-answer variants) using Qwen3-4B.
- TimeQA: Single Wikipedia snapshot, chunked to paragraphs and augmented similarly.
Passage indices comprise millions of precomputed representations, queried with FAISS.
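A tiny pure-Python stand-in for the truncated-index search (FAISS performs the same inner-product search at scale; the toy 4-dim passages and $m=2$ truncation are illustrative):

```python
import math

def topk(query, index, m, k=3):
    """Exact cosine search over prefix-truncated embeddings: a minimal
    stand-in for a FAISS flat index. Truncating to m < d shrinks the
    index m/d-fold at some accuracy cost."""
    def unit_prefix(v):
        p = v[:m]
        n = math.sqrt(sum(x * x for x in p)) or 1.0
        return [x / n for x in p]

    q = unit_prefix(query)
    scored = [(sum(a * b for a, b in zip(q, unit_prefix(v))), i)
              for i, v in enumerate(index)]
    return [i for _, i in sorted(scored, reverse=True)[:k]]

# Toy passage index (d = 4); retrieve using only the first m = 2 dims.
index = [[0.9, 0.1, 0.1, 0.0],
         [0.1, 0.9, 0.0, 0.1],
         [0.8, 0.2, 0.1, 0.1]]
print(topk([1.0, 0.0, 0.1, 0.0], index, m=2, k=2))  # [0, 2]
```

In production, the same prefix-truncated vectors would be stored in a FAISS index, with $m$ chosen per deployment.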
Metrics
- Retrieval: nDCG@10, Recall@100 (TNP, TimeQA)
- Semantic Generality: nDCG@10 on BEIR NQ
- RAG Outcome: F1 for answers using Qwen3-8B (FlashRAG, top-5 context)
Baseline Comparisons
- Sparse: BM25 variants
- Zero-shot: Off-the-shelf TEMs
- Supervised Temporal: Ts-Retriever
- Inference Fusion: TempRetriever (joint semantic/temporal encoding by fusion at inference)
- Matryoshka-Adaptor (M-Adaptor), LoRA-only, LoRA-based MRL (Matryoshka w/o temp. subspace)
5. Quantitative Performance and Ablation Analyses
Table: Highlighted Empirical Findings
| Retrieval Scenario | Best TMRL nDCG@10 (Contriever, TNP) | Latency/Storage Impact |
|---|---|---|
| Full-dim (768) Retrieval | 61.26 (vs 56.91, MRL baseline) | Zero inference overhead |
| m=256 Matryoshka Truncation | F1 within 1 pp of full-dim model | storage cut |
| m=64 Matryoshka Truncation | nDCG@10 ≈ 39 (vs ≈35 for MRL only) | smaller index |
Retrieval and RAG performance is competitive or superior across all truncation levels, particularly for smaller embedding models. TMRL raises nDCG@10 at every truncation level $m$ (e.g., Contriever at $m=64$, +6 pp over MRL). Recall@100 is typically maintained or modestly reduced at the maximal dimension, a trade-off deemed acceptable for RAG. Semantic robustness (BEIR NQ) is preserved.
Ablation studies indicate:
- Temporal loss weight: values of $\lambda_{\mathrm{temp}}$ up to 0.25 are optimal; higher values favor temporal recall at the expense of semantics.
- Temporal subspace dimension $k$: Contriever benefits up to $k=128$; BGE requires at least $k=64$.
- Self-distillation regularization yields marginal gains at moderate weights (0.1); increasing them further is detrimental.
Retrieval quality directly correlates with RAG F1; at $m=256$, TMRL matches or exceeds fully fine-tuned baselines with proportional storage savings and halved latency.
6. Flexibility in Accuracy–Efficiency Trade-offs
The nested Matryoshka design allows retrieval at any truncation level $m$ without retraining:
- At $m=64$, Contriever-TMRL achieves nDCG@10 ≈ 39 on TNP (vs ≈ 35 for semantic MRL) using a 12× smaller index.
- At $m=256$, RAG F1 is within 1 percentage point of the full-dimension result.
This enables real-time or large-scale retrieval scenarios where footprint and latency are critical, while quality can be maintained by adjusting $m$. A plausible implication is that TMRL uniquely combines plug-and-play fine-tuning with explicit temporal encoding, mixed dimensionality, and strong semantic retention, all within a single, flexible model instance.
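The storage side of this trade-off is straightforward arithmetic; the sketch below assumes a flat float32 index over a hypothetical million-passage corpus and ignores index metadata overhead:

```python
def index_size_mb(n_passages, dim, bytes_per_float=4):
    """Footprint of a flat float32 vector index, in megabytes."""
    return n_passages * dim * bytes_per_float / 1e6

n = 1_000_000  # hypothetical million-passage corpus
for m in (768, 256, 64):
    print(f"m={m:>3}: {index_size_mb(n, m):,.0f} MB")
```

Truncating 768 → 256 dims cuts the index 3×, and 768 → 64 cuts it 12×, matching the ratios reported above.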
7. Distinguishing Features and Contributions
TMRL is the first model to:
- Efficiently fine-tune existing TEMs (using LoRA) to support Matryoshka truncation.
- Explicitly dedicate a $k$-dimensional subspace to temporal signals, learned via targeted contrastive objectives.
- Leverage systematically augmented positive/negative temporal training pairs.
- Retain semantic retrieval capacity, as evidenced by stable BEIR NQ results.
- Provide a continuous accuracy–efficiency frontier for temporal retrieval and RAG, with no need to retrain per configuration.
These properties make TMRL a novel, unified approach for efficient, flexible, and temporally-aware information retrieval tasks (Huynh et al., 9 Jan 2026).