Persistence of the memorization signature under non–cross-entropy training paradigms

Ascertain whether the architecture-invariant memorization signature detectable after supervised fine-tuning with cross-entropy loss persists when language models are trained or adapted using alternative paradigms such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), instruction tuning, or continual pretraining.

Background

The authors demonstrate that a detectable, architecture-invariant signature of memorization emerges from cross-entropy fine-tuning, enabling strong transfer of a learned membership-inference attack across model families. This suggests the signal arises from the optimization objective rather than from architecture-specific computational mechanisms.
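
To make the transfer test concrete, a minimal sketch is given below. It is illustrative only: the paper's actual signature features are not reproduced here, and the checkpoint name ("gpt2"), the feature set, and the loss_features() helper are hypothetical stand-ins. Summary statistics of per-token negative log-likelihood are used as a common baseline feature choice for membership inference.

```python
# Hypothetical sketch: membership-inference features from token-level losses.
# The paper's actual signature features are not specified here; the model
# name ("gpt2"), the feature set, and loss_features() are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def loss_features(model, tokenizer, text):
    """Summary statistics of per-token cross-entropy under a fine-tuned model."""
    enc = tokenizer(text, return_tensors="pt")
    ids = enc["input_ids"]  # shape (1, seq_len)
    with torch.no_grad():
        logits = model(**enc).logits  # shape (1, seq_len, vocab)
    # Next-token log-probs: token t is scored by the logits at position t-1.
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    nll = -logp.gather(-1, ids[:, 1:, None]).squeeze(-1)[0]  # (seq_len - 1,)
    return torch.stack([nll.mean(), nll.std(), nll.min(), nll.max()])

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in checkpoint
tok = AutoTokenizer.from_pretrained("gpt2")
feats = loss_features(model, tok, "a candidate training example")
# A classifier fit on such features for member/non-member examples of one
# model family would then be evaluated zero-shot on another family.
```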

However, it remains uncertain whether the same signature appears when models are trained or aligned using other objectives and procedures (e.g., RLHF, DPO, instruction tuning, continual pretraining), motivating investigation beyond the cross-entropy setting.
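
Of these alternatives, DPO has the most self-contained objective, which makes the contrast with token-level cross-entropy easy to state. Below is a minimal sketch of the standard DPO loss (Rafailov et al., 2023) in PyTorch; the beta value and the dummy log-probabilities are illustrative, and the summed response log-probabilities are assumed to be computed elsewhere.

```python
# Minimal DPO loss sketch (Rafailov et al., 2023); beta and the dummy
# log-probabilities below are illustrative values, not from the paper.
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Each argument: summed log-prob of a response under policy/reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Two dummy preference pairs; the policy already prefers the chosen responses.
loss = dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-15.0, -14.0]),
                torch.tensor([-11.0, -12.5]), torch.tensor([-13.0, -13.5]))
print(float(loss))
```

Testing persistence would then amount to re-running the signature-detection pipeline on checkpoints adapted with an objective like this one rather than with cross-entropy fine-tuning.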

References

Ilić et al., "Learning the Signature of Memorization in Autoregressive Language Models" (arXiv:2604.03199, 3 Apr 2026), Discussion, "Beyond Cross-Entropy Fine-Tuning" subsection: "Whether it persists under other training paradigms remains open."