Learning the Signature of Memorization in Autoregressive Language Models

Published 3 Apr 2026 in cs.CL, cs.CR, and cs.LG | (2604.03199v1)

Abstract: All prior membership inference attacks for fine-tuned LLMs use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning LLMs produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8$\times$ higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier available at https://github.com/JetBrains-Research/learned-mia.


Authors (4)

Summary

The paper introduces LT-MIA, a learned attack that reframes membership inference as a supervised sequence classification task using distributional feature sequences.
It transfers zero-shot across architectures and domains, reaching AUCs of up to 0.972 on unseen architecture families, and on transformers achieves 2.8$\times$ higher TPR at 0.1% FPR than the strongest baseline.
The work highlights that the memorization signature is an architecture-agnostic byproduct of gradient descent and cross-entropy fine-tuning, raising key privacy concerns.

Learning the Signature of Memorization in Autoregressive LLMs

Introduction

"Learning the Signature of Memorization in Autoregressive LLMs" (2604.03199) addresses the fundamental and technically unresolved issue of membership inference in LLMs. The work introduces a learned attack, LT-MIA, that generalizes MIA across diverse neural architectures, leveraging the observation that fine-tuning via gradient descent and cross-entropy inherently imprints an architecture-agnostic “memorization signature.” Unlike earlier heuristic approaches bounded by statistic design and often limited in transferability, LT-MIA reframes membership inference as a supervised sequence classification task over distributional features extracted from both a fine-tuned and its corresponding pre-trained reference model, achieving zero-shot generalization to new architectures and data domains.

Methodological Framework

LT-MIA replaces heuristic membership statistics with data-driven learning, exploiting the key property that known membership labels can be generated ad infinitum by synthetic fine-tuning runs. Under a black-box threat model, the adversary has query access to both a fine-tuned model and its reference checkpoint, which shares the same architecture, but requires no white-box access, dataset knowledge, or expensive shadow models. This shift brings the standard deep learning advantages: scalable training by accumulating labeled feature data from model–dataset combinations, and generalization forced through diversity.
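The "labels known by construction" idea can be illustrated with a minimal sketch. The function name `make_membership_split` and its parameters are hypothetical and not taken from the paper's code: a corpus is split at random, one half would be used to fine-tune a model (members, label 1), and the other half is held out (non-members, label 0).

```python
import random

def make_membership_split(texts, member_fraction=0.5, seed=0):
    """Split a corpus into members and non-members with known labels.

    Hypothetical helper: the member texts would be used to fine-tune a model,
    so every (text, label) pair is labeled without any manual annotation.
    """
    rng = random.Random(seed)
    shuffled = list(texts)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * member_fraction)
    members, non_members = shuffled[:cut], shuffled[cut:]
    labeled = [(t, 1) for t in members] + [(t, 0) for t in non_members]
    return members, labeled

corpus = [f"doc {i}" for i in range(10)]
fine_tune_set, labeled_examples = make_membership_split(corpus)
```

Repeating this over many model and dataset pairs is what would supply the attack classifier with training data in this setup.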

Feature extraction forms the core of LT-MIA. For an input sequence $x$, 154 per-token features are computed, including per-token losses from the target/reference models, difference statistics, rank and logit features for ground-truth and top/bottom tokens, and cross-distribution rank comparisons, all normalized for scale invariance. These features are encoded as a sequence (not mean pooled), preserving the positional structure that can be exploited by a 2-layer transformer classifier, which adaptively weights the tokens most informative for membership detection.
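As a rough illustration of this kind of feature extraction, the sketch below computes six per-token features from the logits of a target and a reference model. It is an assumption-laden toy, not the paper's 154-feature set: the function names, the choice of features, and the absence of normalization are all illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def token_features(target_logits, ref_logits, token_id):
    """Hypothetical feature subset for a single position."""
    t_loss = -log_softmax(target_logits)[token_id]
    r_loss = -log_softmax(ref_logits)[token_id]
    # Rank 0 means the ground-truth token is the model's top prediction.
    t_rank = sum(1 for v in target_logits if v > target_logits[token_id])
    r_rank = sum(1 for v in ref_logits if v > ref_logits[token_id])
    # Comparison features contrast the reference with the fine-tuned target.
    return [t_loss, r_loss, r_loss - t_loss, t_rank, r_rank, r_rank - t_rank]

def sequence_features(target_logits_seq, ref_logits_seq, token_ids):
    """Keep one feature vector per token; no pooling over positions."""
    return [
        token_features(t, r, y)
        for t, r, y in zip(target_logits_seq, ref_logits_seq, token_ids)
    ]
```

The output is a list with one row per token, which is the shape a sequence classifier would consume.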

Figure 1: The LT-MIA pipeline extracts distributional feature sequences contrasting fine-tuned and pre-trained models and applies a lightweight transformer as membership predictor.

This reframing brings MIA into the regime of deep learning, with features, classifier capacity, and diversity all scalable for further performance improvements.

Evaluation and Key Results

Evaluation comprises extensive held-out transfer: LT-MIA is trained solely on transformer models and evaluated on never-seen transformer combinations as well as non-transformer architectures including Mamba (selective state-space), RWKV (linear attention), and RecurrentGemma (gated recurrence), excluding all overlap between training and test splits at the model, family, and data levels.

LT-MIA achieves the following salient results:

Architecture transferability: On all non-transformer families, LT-MIA outperforms its performance on held-out transformers, with AUCs of 0.963 (Mamba), 0.972 (RWKV), and 0.936 (RecurrentGemma), all exceeding the transformer-only AUC of 0.908, despite the classifier never seeing non-transformer data in training.
Low-FPR regime: At stringent operating points (TPR at 1% and 0.1% FPR), LT-MIA exhibits substantially higher recall than EZ-MIA and reference-based heuristics. On transformers, it achieves 1.6$\times$ higher TPR at 1% FPR and 2.8$\times$ higher TPR at 0.1% FPR than the best baseline.
Domain transfer: LT-MIA achieves 0.865 AUC on code datasets, demonstrating transfer beyond natural language.
Figure 2: LT-MIA transfer performance; AUC on every out-of-family architecture exceeds the held-out transformer AUC.
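The two metrics reported above, AUC and TPR at a fixed FPR, can be computed from raw attack scores as in the following minimal sketch. The function names are illustrative, higher scores are assumed to indicate membership, and the quadratic-time AUC is chosen for clarity rather than speed.

```python
def auc(member_scores, non_member_scores):
    """Probability that a random member outscores a random non-member."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(member_scores) * len(non_member_scores))

def tpr_at_fpr(member_scores, non_member_scores, max_fpr):
    """Fraction of members detected while flagging at most max_fpr of non-members."""
    neg = sorted(non_member_scores, reverse=True)
    allowed = int(max_fpr * len(neg))  # number of false positives permitted
    # Predict "member" only for scores strictly above the threshold, so at most
    # `allowed` non-members can be flagged.
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    hits = sum(1 for s in member_scores if s > threshold)
    return hits / len(member_scores)
```

With this thresholding rule, an operating point of 0.1% FPR permits a false positive only when at least 1,000 non-member examples are available; with fewer, the threshold falls back to the highest non-member score.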

Likewise, all likelihood-based statistics (Loss, RefLoss, EZ-MIA) transfer across architectures: the signal is not specific to any detection method, though LT-MIA captures it most effectively.
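For orientation, the two simplest likelihood-based statistics can be sketched as follows, using their common definitions in the membership inference literature; the exact variants evaluated in the paper may differ, and EZ-MIA is omitted because its definition is not given here. Both functions take per-token cross-entropy losses and return a score where higher means "more likely a member".

```python
def loss_score(target_token_losses):
    """Loss attack: low average loss under the fine-tuned model suggests membership."""
    return -sum(target_token_losses) / len(target_token_losses)

def ref_loss_score(target_token_losses, ref_token_losses):
    """Reference-calibrated loss: how much fine-tuning lowered the average loss."""
    mean_target = sum(target_token_losses) / len(target_token_losses)
    mean_ref = sum(ref_token_losses) / len(ref_token_losses)
    return mean_ref - mean_target
```

Both collapse the whole sequence into a single scalar, which is the design choice a learned attack over per-token sequences relaxes.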

Analysis: Invariances and Enablers

Ablation studies identify why LT-MIA achieves cross-architecture transfer. Feature importance analysis demonstrates that the dominant membership signal is relational—comparison features contrasting target and reference model predictions—rather than absolute model outputs. This results in invariant importance hierarchies across transformer, state-space, linear attention, and gated recurrent families.

Figure 3: Ablation-based feature importance illustrates the primacy of comparison features for all model families, underscoring the universality of the relational signal.
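One way an ablation-based importance score of this kind could be computed is sketched below. This is an assumption rather than the paper's procedure: each feature group is replaced by its dataset mean, and the resulting drop in AUC is taken as that group's importance. The names `ablate` and `group_importance` and the toy scoring function are hypothetical.

```python
def auc(pos, neg):
    """Pairwise AUC; ties count as half a win."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ablate(rows, columns):
    """Replace the listed feature columns with their mean over the dataset."""
    means = {c: sum(r[c] for r in rows) / len(rows) for c in columns}
    return [[means[c] if c in means else v for c, v in enumerate(r)] for r in rows]

def group_importance(score_fn, rows, labels, groups):
    """AUC drop when each named group of columns is ablated."""
    def evaluate(data):
        scores = [score_fn(r) for r in data]
        pos = [s for s, y in zip(scores, labels) if y == 1]
        neg = [s for s, y in zip(scores, labels) if y == 0]
        return auc(pos, neg)
    base = evaluate(rows)
    return {name: base - evaluate(ablate(rows, cols)) for name, cols in groups.items()}

# Toy data: column 0 plays the role of a comparison feature, column 1 an absolute one.
rows = [[1.0, 0.2], [0.8, 0.5], [0.1, 0.4], [0.2, 0.3]]
labels = [1, 1, 0, 0]
importance = group_importance(
    lambda r: r[0] + 0.1 * r[1], rows, labels, {"comparison": [0], "absolute": [1]}
)
```

In this toy setup, ablating the comparison column removes the separation between members and non-members, while ablating the absolute column leaves the ranking unchanged.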

Training diversity analysis reveals that increasing the number of model–dataset combinations (while keeping total data fixed) closes the generalization gap: artifacts specific to individual models (e.g., tokenizer quirks, logit scale) are filtered, yielding a classifier sensitive only to the architecture-agnostic memorization shift.

Figure 4: Effect of diversity on generalization—scaling combinations, not just data size, diminishes train–eval gap.

Sequence modeling outperforms aggregation: preserving positional structure yields a 5-point AUC improvement over mean pooling or flat MLPs.
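A small sketch of why pooling can lose information: two feature sequences that differ only in where the signal occurs are indistinguishable after mean pooling, while any position-aware summary separates them. The fixed position weights below are a hypothetical stand-in for what a sequence classifier could learn, not part of LT-MIA.

```python
def mean_pool(seq):
    """Average each feature over all token positions."""
    n, d = len(seq), len(seq[0])
    return [sum(row[j] for row in seq) / n for j in range(d)]

def position_weighted(seq, weights):
    """Position-aware summary: each position contributes with its own weight."""
    d = len(seq[0])
    return [sum(w * row[j] for w, row in zip(weights, seq)) for j in range(d)]

# One feature per token, e.g. a reference-minus-target loss gap.
early = [[3.0], [0.0], [0.0], [0.0]]  # gap concentrated at the first token
late = [[0.0], [0.0], [0.0], [3.0]]   # same gap, concentrated at the last token
weights = [0.1, 0.2, 0.3, 0.4]

same_after_pooling = mean_pool(early) == mean_pool(late)
differ_with_position = position_weighted(early, weights) != position_weighted(late, weights)
```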

Implications and Future Directions

Several implications arise from these findings:

Memorization as a universal phenomenon: The persistence of the membership signal across architecture families, together with the failure of architectural migration (e.g., to SSMs or linear-RNN hybrids) to mitigate attack success, implicates cross-entropy minimization via gradient descent as the fundamental cause. The leakage is output-distributional, independent of the underlying computation.
Scalability of learned attacks: With classifier capacity, training diversity, and feature expressiveness all open to scaling, the ceiling for learned MIA performance is an open question. Scaling laws governing attack generalization across families and domains remain to be elucidated.
Defensive considerations: Differential privacy is the only known defense offering theoretical guarantees, but requires noise magnitudes that destroy utility for LLMs. Architectural defenses are contraindicated by architecture-invariant attack success; any practical defense would need to fundamentally alter the training (e.g., by undermining the output shift that results from memorization).
Beyond supervised fine-tuning: The extent to which RLHF, DPO, or large-scale pretraining (which imprints weaker per-example shifts) manifest the same signature remains an open research direction. Non-cross-entropy training paradigms may offer partial mitigation, but this hypothesis remains to be tested empirically.

Limitations

Reference required: The attack presumes access to both the fine-tuned checkpoint and the matching pre-trained checkpoint, which is generally feasible for open-weight models but not for closed APIs.
Pretraining data detection: Performance degrades when the reference model already assigns high likelihood to the target text, limiting applicability to strong memorization settings (fine-tuning or rare pretraining data).
Unexplored regimes: Application to reinforcement learning, instruction tuning, and other optimization regimes has not been empirically validated.

Conclusion

This work characterizes the architecture-invariant signature of memorization induced by autoregressive LLM fine-tuning. LT-MIA is a transferable learned MIA that sets a new state of the art in both architecture and domain transfer, moving membership inference into the deep learning paradigm. The findings stress that privacy risks from memorization are an intrinsic product of the loss function and optimization, not of architectural choice. This calls for shifting defense efforts toward the training objective and motivates future research into the limits of learned privacy attacks and their mitigations under diverse optimization schemes.
