
Temporal-aware Matryoshka Representation Learning

Updated 12 January 2026
  • The paper introduces TMRL, which explicitly dedicates a temporal subspace in embeddings using targeted contrastive and self-distillation techniques.
  • It leverages a nested Matryoshka design that enables dynamic truncation and tunable efficiency–accuracy trade-offs in retrieval and RAG applications.
  • Empirical results show TMRL maintains competitive semantic retrieval while reducing storage and latency, outperforming traditional MRL approaches.

Temporal-aware Matryoshka Representation Learning (TMRL) is a framework for equipping text embedding models (TEMs) with a dedicated temporal subspace, enabling efficient, flexible retrieval of temporally relevant context—particularly in Retrieval-Augmented Generation (RAG) systems. TMRL leverages the nested structure of Matryoshka Representation Learning, explicitly reserves dimensions for temporal encoding, and integrates targeted contrastive learning and self-distillation. The approach yields competitive performance for temporal information retrieval and temporal RAG compared with prior methods, while offering controllable efficiency–accuracy trade-offs (Huynh et al., 9 Jan 2026).

1. Background: Matryoshka Embeddings and Temporal Motivation

Conventional text embedding models encode a query or passage as a single $d$-dimensional vector. Matryoshka Representation Learning (MRL) augments this paradigm by training the encoder such that any prefix of $m$ dimensions, where $m \in \mathcal{M} = \{64, 128, \ldots, d\}$, forms a performant embedding:

  • Full embedding: $f_\theta(x) \in \mathbb{R}^d$
  • Truncated embedding at level $m$: $f_\theta(x)_{1:m} \in \mathbb{R}^m$

Vanilla MRL relies on semantic InfoNCE losses summed across truncation levels,

$$\mathcal{L}_{\mathrm{MRL}} = \sum_{m \in \mathcal{M}} w_m\,\mathcal{L}_{\mathrm{InfoNCE}}^{(m)}(q, p^+, \mathcal{N}_q)$$

but does not guarantee any explicit temporal signal in the embedding subspaces. Temporal retrieval demands embeddings that encode both "when" and "what." TMRL addresses this by explicitly designating the first $t$ dimensions as a temporal-aware subspace.
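To make the MRL objective concrete, the following is a minimal numpy sketch of the truncation-level loss sum; the prefix levels, weights, and negatives are illustrative placeholders, not the paper's exact configuration.

```python
import numpy as np

def infonce(q, p_pos, p_negs, tau=0.05):
    """InfoNCE with cosine similarity; q, p_pos: (m,), p_negs: (k, m)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(q, p_pos)] + [cos(q, n) for n in p_negs]) / tau
    logits -= logits.max()                      # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

def mrl_loss(q, p_pos, p_negs, levels=(64, 128, 256), weights=None):
    """Sum the semantic InfoNCE loss over Matryoshka prefix lengths m."""
    weights = weights or [1.0] * len(levels)
    return sum(w * infonce(q[:m], p_pos[:m], p_negs[:, :m])
               for w, m in zip(weights, levels))

rng = np.random.default_rng(0)
q = rng.standard_normal(256)
p_pos = q + 0.1 * rng.standard_normal(256)      # near-duplicate positive
p_negs = rng.standard_normal((4, 256))
loss = mrl_loss(q, p_pos, p_negs)
```

Because every prefix appears in the sum, the encoder is pushed to pack discriminative information into early dimensions, which is what later permits truncation without retraining.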

2. TMRL Model Architecture

TMRL adapts a frozen base TEM using lightweight Low-Rank Adaptation (LoRA) and introduces a temporal projection module.

2.1 Matryoshka Embedding Split

The representation is split as

$$f_\theta(x)_{1:m} = \underbrace{f_\theta(x)_{1:t}}_{\text{temporal subspace}} \oplus \underbrace{f_\theta(x)_{t+1:m}}_{\text{semantic subspace}}, \quad (m \ge t)$$

For a sequence of hidden states $f_\theta(q) = [h_1, \ldots, h_L]$, TMRL identifies temporal token positions $\mathcal{T}(q)$, as tagged by tools like SUTime. The corresponding vectors are passed through a 2-layer temporal projector $P$ and then mean-pooled:

$$\bar q_{\mathrm{T}} = \frac{1}{L_{\mathrm{T}}}\sum_{i\in\mathcal{T}(q)} P(h_i) \in \mathbb{R}^t$$
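The pooling step above can be sketched in numpy as follows; the hidden size, projector widths, and the tagged positions are made-up illustrative values, and the random-weight projector stands in for the learned 2-layer module $P$.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d_model, t = 12, 32, 8                  # sequence length, hidden size, temporal dims

hidden = rng.standard_normal((L, d_model))  # h_1 .. h_L from the frozen TEM
temporal_pos = [3, 4, 9]                    # positions tagged by a SUTime-style tagger

# Hypothetical 2-layer projector P: d_model -> t, with a ReLU in between
W1 = rng.standard_normal((d_model, 16))
W2 = rng.standard_normal((16, t))
def project(h):
    return np.maximum(h @ W1, 0.0) @ W2

# Mean-pool the projected temporal-token states into the t-dim subspace vector
q_bar_T = np.mean([project(hidden[i]) for i in temporal_pos], axis=0)
```

The result is a single $t$-dimensional vector summarizing the query's temporal tokens, ready to be contrasted against passage representations in the temporal subspace.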

2.2 Temporal Subspace Contrastive Learning

Temporal retrieval is supervised using positive and negative queries generated through data augmentation and LLM prompting (Qwen3-4B). Contrastive InfoNCE losses in the $t$-dimensional subspace are defined for both query-to-passage and passage-to-query alignments:

  • Query-to-passage:

$$\mathcal{L}^{q}_{\mathrm{Temp}} = -\log \frac{\exp(\mathrm{sim}^{(t)}(\bar p_{\mathrm{T}},\, f_\theta(q^+))/\tau)}{\exp(\mathrm{sim}^{(t)}(\bar p_{\mathrm{T}},\, f_\theta(q^+))/\tau) + \sum_{i}\exp(\mathrm{sim}^{(t)}(\bar p_{\mathrm{T}},\, f_\theta(q_i^-))/\tau)}$$

  • Passage-to-query:

$$\mathcal{L}^{p}_{\mathrm{Temp}} = -\log \frac{\exp(\mathrm{sim}^{(t)}(f_\theta(p),\, \bar q^+_{\mathrm{T}})/\tau)}{\exp(\mathrm{sim}^{(t)}(f_\theta(p),\, \bar q^+_{\mathrm{T}})/\tau) + \sum_{i}\exp(\mathrm{sim}^{(t)}(f_\theta(p),\, \bar q^-_{i,\mathrm{T}})/\tau)}$$

The full temporal contrastive loss is

$$\mathcal{L}_{\mathrm{Temp}} = \mathcal{L}^{q}_{\mathrm{Temp}} + \mathcal{L}^{p}_{\mathrm{Temp}}$$

where $\mathrm{sim}^{(t)}$ denotes cosine similarity over the temporal subspace, and $\tau$ is a temperature parameter (0.02–0.05).
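A minimal sketch of the subspace-restricted contrastive loss follows; `sim_t` computes cosine similarity over only the first $t$ dimensions, and the toy vectors and temperature are placeholders rather than trained embeddings.

```python
import numpy as np

def sim_t(a, b, t):
    """Cosine similarity restricted to the first t (temporal) dimensions."""
    a, b = a[:t], b[:t]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def temporal_infonce(anchor, pos, negs, t=64, tau=0.05):
    """InfoNCE in the t-dim temporal subspace (one direction of the loss)."""
    logits = np.array([sim_t(anchor, pos, t)] +
                      [sim_t(anchor, n, t) for n in negs]) / tau
    logits -= logits.max()                       # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(2)
p_bar = rng.standard_normal(128)                 # pooled temporal passage vector
q_pos = p_bar + 0.05 * rng.standard_normal(128)  # temporally matching query
q_negs = rng.standard_normal((4, 128))           # temporally mismatched queries

L_q = temporal_infonce(p_bar, q_pos, q_negs)     # query-to-passage direction
L_p = temporal_infonce(q_pos, p_bar, q_negs)     # passage-to-query direction
L_temp = L_q + L_p
```

Summing both directions mirrors the symmetric $\mathcal{L}^{q}_{\mathrm{Temp}} + \mathcal{L}^{p}_{\mathrm{Temp}}$ formulation.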

2.3 Self-Distillation Regularization

To enforce consistency across dimensional truncations, TMRL includes:

  • Local similarity preservation (top-$k$ neighbors, $L_1$ alignment):

$$\mathcal{L}_{\mathrm{Dist}} = \sum_{m<d}\sum_i\sum_{j\in\mathrm{top}_k^{(d)}(i)} \bigl\lvert \mathrm{sim}^{(d)}(x_i,x_j) - \mathrm{sim}^{(m)}(x_i,x_j)\bigr\rvert$$

  • Global geometry alignment (linear CKA):

$$\mathcal{L}_{\mathrm{CKA}} = \sum_{m<d}\bigl(1 - \mathrm{CKA}(X^{(d)}, X^{(m)})\bigr), \qquad \mathrm{CKA}(X,Y) = \frac{\lVert \mathrm{cov}(X,Y)\rVert_F^2}{\lVert \mathrm{cov}(X,X)\rVert_F\, \lVert \mathrm{cov}(Y,Y)\rVert_F}$$
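Both regularizers can be sketched on a random batch as below; this is an assumption-laden illustration (toy batch, $k=2$, the standard linear-CKA form written with centered Gram products), not the paper's exact implementation.

```python
import numpy as np

def cos_matrix(X):
    """Pairwise cosine similarities within a batch of row vectors."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def dist_loss(X, levels, d, k=2):
    """L1 gap between full-dim and truncated similarities on full-dim top-k neighbors."""
    S_full = cos_matrix(X[:, :d])
    loss = 0.0
    for m in levels:
        S_m = cos_matrix(X[:, :m])
        for i in range(len(X)):
            nbrs = np.argsort(-S_full[i])[1:k + 1]   # top-k, excluding self
            loss += np.abs(S_full[i, nbrs] - S_m[i, nbrs]).sum()
    return loss

def cka_loss(X, levels, d):
    """1 - linear CKA between full-dim and truncated batch representations."""
    def cka(A, B):
        A = A - A.mean(0)
        B = B - B.mean(0)
        return np.linalg.norm(B.T @ A, 'fro')**2 / (
            np.linalg.norm(A.T @ A, 'fro') * np.linalg.norm(B.T @ B, 'fro'))
    return sum(1.0 - cka(X[:, :d], X[:, :m]) for m in levels)

rng = np.random.default_rng(3)
X = rng.standard_normal((8, 256))                    # a toy batch of embeddings
total = dist_loss(X, levels=(64, 128), d=256) + cka_loss(X, levels=(64, 128), d=256)
```

The local term keeps each point's nearest-neighbor structure stable under truncation, while the CKA term aligns the global geometry of the truncated and full-dimensional batches.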

2.4 Unified Training Objective

The final loss combines semantic, temporal, and regularization objectives:

$$\mathcal{L}_{\mathrm{TMRL}} = \mathcal{L}_{\mathrm{MRL}} + \alpha\,\mathcal{L}_{\mathrm{Temp}} + \beta\,\mathcal{L}_{\mathrm{Dist}} + \gamma\,\mathcal{L}_{\mathrm{CKA}}$$

with $\alpha \in [0.1, 0.25]$, $\beta = \gamma = 0.1$, and $t \in \{64, 128\}$ typical.

The high-level training pipeline freezes the base TEM, attaches LoRA adapters and the temporal projector, extracts temporal tokens and computes all loss terms batchwise, and backpropagates only through the LoRA parameters and the projector. At convergence, the LoRA weights are merged into the base model and the projector is discarded.

3. Training Protocols and Hyperparameter Choices

Optimization uses AdamW (learning rate $1\times10^{-4}$, batch size 256, four hard negatives, 5 epochs; 1 for Nomic), with LoRA rank $r=4$, scaling $\alpha_{\mathrm{scale}}=4$, and dropout 0.1 applied to all linear layers. The temperature $\tau$ is tuned per base model. Key hyperparameters are adapted per TEM and dataset: e.g., Contriever uses $t=64/128$, $\alpha=0.25/0.1$ on TNP/TimeQA, while GTE uses $t=64/64$, $\alpha=0.1/0.1$. The self-distillation weights ($\beta$, $\gamma$) are fixed at 0.1.
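The reported LoRA settings could be expressed as a Hugging Face PEFT configuration along the following lines; this is a hypothetical fragment matching the stated hyperparameters, not a config published with the paper.

```python
from peft import LoraConfig

# Hypothetical adapter config matching the reported settings:
# rank 4, scaling alpha 4, dropout 0.1, applied to all linear layers.
lora_cfg = LoraConfig(
    r=4,
    lora_alpha=4,
    lora_dropout=0.1,
    target_modules="all-linear",
)
```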

4. Evaluation Suites, Data, and Metrics

Datasets and Preprocessing

TMRL is benchmarked on:

  • Temporal Nobel Prize (TNP): Paragraph-level, TemporalQA-style queries with single temporal anchors; passages split, multi-date sentences excluded, and queries augmented (explicit, implicit, temporal-answer variants) using Qwen3-4B.
  • TimeQA: Single Wikipedia snapshot, chunked to paragraphs and augmented similarly.

Passage indices comprise millions of precomputed representations, queried with FAISS.
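The retrieval setup can be illustrated with a small numpy stand-in for a FAISS inner-product index; the corpus size, truncation level, and query construction here are toy values, and brute-force matrix multiplication replaces the actual FAISS search.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, n = 768, 64, 1000                   # full dim, truncation level, corpus size

corpus = rng.standard_normal((n, d)).astype(np.float32)

def truncate(X, m):
    """Keep the first m (Matryoshka) dims and L2-normalize for cosine search."""
    X = X[:, :m]
    return X / np.linalg.norm(X, axis=1, keepdims=True)

index = truncate(corpus, m)               # in production this would back a FAISS IndexFlatIP
query = truncate(corpus[:1] + 0.01 * rng.standard_normal((1, d)).astype(np.float32), m)

scores = (index @ query.T).ravel()        # inner product == cosine after normalization
top10 = np.argsort(-scores)[:10]          # nDCG@10 / Recall@100 are computed over such rankings
```

Because both sides are normalized after truncation, the same precomputed full-dimensional index can serve any truncation level $m$ by slicing, which is what makes the storage/latency trade-off tunable at query time.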

Metrics

  • Retrieval: nDCG@10, Recall@100 (TNP, TimeQA)
  • Semantic Generality: nDCG@10 on BEIR NQ
  • RAG Outcome: F1 for answers using Qwen3-8B (FlashRAG, top-5 context)

Baseline Comparisons

  • Sparse: BM25 variants
  • Zero-shot: Off-the-shelf TEMs
  • Supervised Temporal: Ts-Retriever
  • Inference Fusion: TempRetriever (joint semantic/temporal encoding by fusion at inference)
  • Matryoshka-Adaptor (M-Adaptor), LoRA-only, LoRA-based MRL (Matryoshka w/o temp. subspace)

5. Quantitative Performance and Ablation Analyses

Table: Highlighted Empirical Findings

| Retrieval Scenario | Best TMRL Result (Contriever, TNP) | Latency/Storage Impact |
|---|---|---|
| Full-dim (768) retrieval | nDCG@10 61.26 (vs 56.91 for the MRL baseline) | Zero inference overhead |
| $m=256$ Matryoshka truncation | RAG F1 within 1 pp of the full-dim model | $3\times$ storage cut |
| $m=64$ Matryoshka truncation | nDCG@10 ≈ 39 (vs ≈ 35 for MRL only) | $5\times$ smaller index |

Retrieval and RAG performance is competitive or superior across all truncation levels, particularly for smaller embedding models. TMRL raises nDCG@10 at all $m$ (e.g., Contriever $m=64$: +6 pp over MRL). Recall@100 is typically maintained or modestly reduced at maximal dimension, a trade-off deemed acceptable for RAG. Semantic robustness (BEIR NQ) is preserved.

Ablation studies indicate:

  • Temporal loss weight $\alpha \approx 0.1$–$0.25$ is optimal; higher values favor temporal recall at the expense of semantics.
  • Temporal subspace dimension $t$: Contriever benefits up to 128; BGE requires at least 64.
  • Self-distillation regularization yields marginal gains at moderate values (0.1); increasing further is detrimental.

Retrieval performance directly correlates with RAG F1; at $m=256$, TMRL matches or exceeds fully fine-tuned baselines with $3\times$ storage savings and halved latency.

6. Flexibility in Accuracy–Efficiency Trade-offs

The nested Matryoshka design allows retrieval at any $m \in \{64, 128, 256, 512, 768\}$ without retraining:

  • At $m=64$, Contriever-TMRL achieves nDCG@10 $\approx 39$ on TNP (vs $\approx 35$ for semantic MRL) using a $5\times$ smaller index.
  • At $m=256$, RAG F1 is within 1 percentage point of the full $d=768$ result.

This enables real-time or large-scale retrieval scenarios where footprint and latency are critical, while quality can be maintained by adjusting $m$. A plausible implication is that TMRL uniquely combines plug-and-play fine-tuning with explicit temporal encoding, mixed dimensionality, and strong semantic retention, all within a single, flexible model instance.

7. Distinguishing Features and Contributions

TMRL is the first model to:

  1. Efficiently fine-tune existing TEMs (using LoRA) to support Matryoshka truncation.
  2. Explicitly dedicate a $t$-dimensional subspace to temporal signals, learned via targeted contrastive objectives.
  3. Leverage systematically augmented positive/negative temporal training pairs.
  4. Retain semantic retrieval capacity, as evidenced by stable BEIR NQ results.
  5. Provide a continuous accuracy–efficiency frontier for temporal retrieval and RAG, with no need to retrain per configuration.

These properties make TMRL a novel, unified approach for efficient, flexible, and temporally-aware information retrieval tasks (Huynh et al., 9 Jan 2026).
