Scaling behavior of LLM-based deanonymization to internet-scale candidate pools
Prove that the large-language-model (LLM) based deanonymization pipeline maintains non-trivial recall at high precision (e.g., 90% precision) when the candidate pool grows to internet scale (on the order of one million users), consistent with the log-linear scaling observed on smaller pools. The pipeline extracts micro-data from user comments as structured summaries, retrieves candidates via cosine similarity over dense text embeddings, selects among the top-k candidates using an LLM, and calibrates match decisions via pairwise LLM comparisons.
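The retrieval stage of the pipeline described above can be sketched as a cosine-similarity top-k search over an embedding matrix. This is a minimal illustration, not the paper's implementation; the function name `cosine_topk`, the embedding dimension, and the toy candidate pool are all assumptions for demonstration.

```python
import numpy as np

def cosine_topk(query_emb, candidate_embs, k=5):
    """Return indices and scores of the k candidates most similar to the query."""
    # Normalizing both sides makes the dot product equal cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    C = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = C @ q
    topk = np.argsort(-sims)[:k]  # highest-similarity candidates first
    return topk, sims[topk]

# Hypothetical toy pool: 1,000 candidates with 64-dim embeddings.
rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 64))
# A query that is a slightly perturbed copy of candidate 42,
# standing in for a user's summary embedding.
query = pool[42] + 0.01 * rng.normal(size=64)
idx, scores = cosine_topk(query, pool, k=5)
```

In the attack model under study, only this retrieval step scales directly with pool size; the LLM selection and pairwise calibration stages operate on the fixed-size top-k shortlist, which is what makes the internet-scale (million-user) regime plausible to analyze.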
References
Nevertheless, we conjecture that LLM deanonymization scales to internet-scale candidate pools with non-trivial success.
— Large-scale online deanonymization with LLMs
(2602.16800 - Lermen et al., 18 Feb 2026) in Section 6.2 (Comparing difficulty parameters of our attack model)