CanaryBench: Reproducible Privacy Leakage Evaluation
- CanaryBench is a reproducible evaluation suite that quantifies privacy leakage by detecting synthetic canary tokens in cluster-based summaries of LLM interactions.
- It employs synthetic data injection, TF-IDF feature embedding, and K-means clustering to measure the exposure of PII using exact-match metrics.
- Experimental results show that basic defenses, an aggregation threshold plus regex redaction, eliminate all measured leakage while maintaining cluster coherence.
CanaryBench is a reproducible evaluation suite for quantifying privacy leakage in cluster-level conversation summaries published from user–LLM interactions. It specifically measures the propensity of derived analytics artifacts—summaries produced via clustering and extractive reporting of conversation text—to expose personally identifying information (PII) and uniquely traceable strings ("canaries") introduced as synthetic secrets. The framework does not claim differential privacy or formal guarantees, but provides rigorous, worst-case leakage measurement under intentionally adversarial summarization scenarios, with the aim of guiding deployers toward safer practices (Mehta, 25 Jan 2026).
1. Motivation and Privacy Threat Model
CanaryBench addresses privacy vulnerabilities inherent to aggregate analytics workflows in LLM-based systems, where raw conversational data is typically withheld in favor of publishing short cluster-level summaries. While this approach is presumed to protect user privacy, if summarization methods quote text verbatim or employ extractive techniques, uniquely identifying fragments inserted in the original dialogues may still be revealed. Such leakage allows adversaries to re-identify users, correlate sensitive topics (e.g., medical, legal, or stigmatizing content) with individuals, and causes potential chilling effects or targeted harm. CanaryBench operationalizes a stress test to measure the frequency and conditions under which verbatim secrets bypass aggregation defenses and appear in published summaries.
2. Synthetic Data Generation and Canary Injection
Experimental runs in CanaryBench synthesize single-turn conversations across 24 diverse topics. Each conversation instance contains:
- A natural-language prompt or question.
- A canary string drawn from a fixed canary set, injected with a fixed per-conversation probability, modeling unique identifiers such as synthetic emails ([email protected]), phone numbers (+1-415-555-YYYY), or proprietary phrases.
- Additional PII-like strings (emails, phones, ZIP codes) injected with probability 0.20 to test regex-based detection and redaction pipelines.
The explicit planting of deterministic canary tokens facilitates exact-match ground truth for leakage detection. Every published summary can thus be automatically audited for the presence or absence of governed secrets.
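The injection step can be sketched as follows; the topic text, canary format, and probability are illustrative assumptions, not the benchmark's actual configuration:

```python
import random

def make_conversation(topic, canaries, p_canary=0.5, rng=None):
    """Build one synthetic single-turn conversation, planting a canary
    string with probability p_canary. All names here are illustrative."""
    rng = rng or random.Random()
    text = f"User question about {topic}."
    planted = None
    if rng.random() < p_canary:
        planted = rng.choice(canaries)
        text += f" My reference code is {planted}."
    return {"text": text, "canary": planted}

rng = random.Random(0)
convos = [make_conversation("billing", ["CANARY-0001", "CANARY-0002"],
                            p_canary=0.5, rng=rng) for _ in range(100)]
```

Because every planted canary is recorded alongside its conversation, leakage auditing later reduces to exact substring matching.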
3. Embedding, Clustering, and Summarization Pipeline
Conversations are mapped to TF–IDF feature vectors x_i, where the weight of each term t in document d is w(t, d) = tf(t, d) · log(N / df(t)), with tf(t, d) the term frequency, df(t) the number of documents containing t, and N the corpus size.
Clustering is performed using k-means, with k = 54 clusters in the baseline experiment. Each conversation is assigned to a cluster so as to minimize within-cluster squared Euclidean distances. For cluster j with member set C_j, the centroid is computed as μ_j = (1/|C_j|) Σ_{x_i ∈ C_j} x_i.
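A minimal TF-IDF weighting sketch, using raw term counts and no normalization (the benchmark's exact vectorizer settings are not specified here):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized documents to sparse TF-IDF weight dicts:
    w(t, d) = tf(t, d) * log(N / df(t)).
    Terms appearing in every document get weight 0 (log 1)."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vecs

docs = [["refund", "invoice"], ["invoice", "late"], ["refund", "refund"]]
vecs = tfidf_vectors(docs)
```

In practice a library vectorizer with sublinear scaling and L2 normalization would be used; the dict-of-weights form above is only to make the formula concrete.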
Summarization follows two main approaches:
- Keyword-based (non-extractive): Abstractive selection of high–TF–IDF terms or topics ensures no verbatim content from dialogues is reused.
- Extractive example-based: Selection and concatenation of real conversation snippets (sentences or spans) models commonly deployed "quote-like" reporting, which is intentionally susceptible to leaking canaries.
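The contrast between the two summarizers can be sketched directly; token frequency stands in for TF-IDF weight for brevity, and the sample texts are invented:

```python
from collections import Counter

def keyword_summary(cluster_texts, top_k=3):
    """Non-extractive: report only high-frequency lowercased terms,
    never raw spans from any single conversation."""
    counts = Counter(t for text in cluster_texts for t in text.lower().split())
    return ", ".join(t for t, _ in counts.most_common(top_k))

def extractive_summary(cluster_texts, n_examples=2):
    """Extractive: quote raw conversation snippets verbatim -- the
    leakage-prone style the benchmark is designed to catch."""
    return " | ".join(cluster_texts[:n_examples])

texts = ["Reset my password CANARY-0042", "How do I reset my password?"]
```

The extractive variant reproduces any planted canary verbatim, while the keyword variant only emits aggregate terms.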
4. Leakage Metrics and Evaluation Methodology
Leakage is characterized using strict string-matching criteria. For each cluster c containing conversation set S_c, the canary set K_c is the set of canary strings planted in the conversations of S_c. A canary κ ∈ K_c is said to leak in the published cluster summary T_c iff κ appears verbatim as a substring of T_c. Two principal metrics are reported:
- Per-canary leak rate: the fraction of planted canary instances in published clusters whose canary string appears verbatim in the corresponding summary.
- Cluster-level leak rate: the fraction of published clusters with at least one planted canary whose summary leaks at least one of them.
In addition, regex patterns are applied to summaries to count second-order PII exposures (e.g., emails, phone numbers, ZIP codes), serving as a supplementary indicator of privacy risk.
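The two metrics can be computed with exact substring matching as described above; the dict schema ("canaries", "summary") is an illustrative assumption:

```python
def leak_rates(clusters):
    """clusters: list of dicts with 'canaries' (planted strings) and
    'summary' (published text). Returns (per_canary, cluster_level)
    leak rates computed by exact substring match."""
    total = leaked = 0
    canary_clusters = leaking_clusters = 0
    for c in clusters:
        hits = [k for k in c["canaries"] if k in c["summary"]]
        total += len(c["canaries"])
        leaked += len(hits)
        if c["canaries"]:
            canary_clusters += 1
            leaking_clusters += bool(hits)
    per_canary = leaked / total if total else 0.0
    cluster_level = leaking_clusters / canary_clusters if canary_clusters else 0.0
    return per_canary, cluster_level
```

Exact matching makes the metric deterministic and auditable, at the cost of missing paraphrased or partially quoted secrets.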
5. Experimental Results and Baseline Vulnerabilities
In the extractive example-based setting, without defenses:
- 54 clusters published
- 1,835 total canary instances in published clusters
- 51 leaked canary instances (per-canary leak rate 51/1,835 ≈ 2.8%)
- Of 52 clusters containing at least one canary, 50 leak at least one (cluster-level leak rate 50/52 ≈ 96.2%)
- Regex PII hits: 17 emails, 20 phones, 27 ZIPs
- Cluster coherence (proxy for utility): 0.653
This demonstrates that the extractive summarization protocol leaks planted secrets in nearly all canary-containing clusters, confirming the risk that even aggregated publication can lead to privacy compromise.
6. Minimal Defenses: k-Min Threshold and Regex Redaction
The minimal defenses consist of:
- k-min threshold: Summaries are published only for clusters whose size meets a minimum threshold, i.e., |S_c| ≥ k_min.
- Regex redaction: For all published summaries, substrings matching email, phone, ZIP, or canary-like patterns are replaced via regular expression substitution (e.g., [REDACTED_EMAIL], [REDACTED_PHONE]).
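The two defenses can be sketched together; the regex patterns, the k_min value, and the cluster schema are illustrative, not the benchmark's exact configuration:

```python
import re

# Illustrative patterns; a production pipeline would use stricter ones
# plus learned PII detectors.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"), "[REDACTED_PHONE]"),
    (re.compile(r"\b\d{5}(?:-\d{4})?\b"), "[REDACTED_ZIP]"),
    (re.compile(r"CANARY-\d+"), "[REDACTED_CANARY]"),
]

def publish(clusters, k_min=10):
    """Suppress clusters below the k_min size threshold, then redact
    PII-like substrings in the surviving summaries."""
    out = []
    for c in clusters:
        if len(c["members"]) < k_min:
            continue  # k-min threshold: do not publish small clusters
        summary = c["summary"]
        for pat, token in PATTERNS:
            summary = pat.sub(token, summary)
        out.append(summary)
    return out
```

Suppression removes the highest-risk small clusters outright; redaction catches pattern-matching secrets in whatever is still published.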
Combined, this pipeline suppresses high-risk small clusters and redacts remaining canary or PII tokens. In the defended run:
- 32 clusters published (22 suppressed, 41% reduction)
- 1,699 canary instances
- 0 leaked canary instances (per-canary leak rate 0/1,699 = 0%)
- 0 clusters leaking canaries (cluster-level leak rate 0%)
- 0 regex PII hits
- Cluster coherence: 0.662 (no degradation in utility proxy)
7. Societal Implications and Mitigation Strategies
Verbatim leakage of identifiers adjacent to sensitive topics enables adversarial re-identification, phishing, discrimination, and doxing. Awareness of these privacy failures can deter vulnerable populations (e.g., those seeking help for mental health or LGBTQ+ issues) from using conversational AI systems. Broad recommendations derived from the evaluation include:
- Refraining from extractive summarization in any public analytics context.
- Enforcing aggregation thresholds (a minimum cluster size k_min, set higher for sensitive domains).
- Layered redaction pipelines (regex, learned PII detectors, manual oversight).
- Treating analytic outputs as sensitive: access controls, auditing, limited retention.
- Considering differential privacy for quantitative aggregates.
- Implementing user opt-out and data-deletion options.
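For the differential-privacy recommendation on quantitative aggregates, a minimal Laplace-mechanism sketch for a counting query; ε, the sensitivity, and the function name are illustrative assumptions, not part of CanaryBench:

```python
import random

def noisy_count(true_count, epsilon=1.0, sensitivity=1.0, rng=None):
    """Laplace mechanism for a counting query: add noise of scale
    sensitivity/epsilon so the released count is epsilon-DP.
    The difference of two iid exponentials with mean `scale`
    is Laplace(scale)-distributed."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise
```

Unlike redaction, this gives a formal guarantee for numeric outputs (e.g., per-cluster conversation counts), but it does not protect free-text summaries.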
These practices align analytical utility with robust privacy standards, supporting safer deployment of LLM analytics.
Summary and Benchmark Positioning
CanaryBench establishes that extractive cluster-level summaries propagate planted secrets with near-universal cluster-level leakage, reliably detected via exact-matching metrics. Simple mitigations—aggregation thresholds and regex-based redaction—completely eliminate measured leakage while preserving content utility. The benchmark is intended for continuous-integration privacy checks over any system that publishes cluster-level text analytics derived from sensitive conversational data (Mehta, 25 Jan 2026).