CanaryBench: Reproducible Privacy Leakage Evaluation
- CanaryBench is a reproducible evaluation suite that quantifies privacy leakage by detecting synthetic canary tokens in cluster-based summaries of LLM interactions.
- It employs synthetic data injection, TF-IDF feature embedding, and K-means clustering to measure the exposure of PII using exact-match metrics.
- Experimental results show that basic defenses, an aggregation threshold plus regex redaction, eliminate all measured leakage while maintaining cluster coherence.
CanaryBench is a reproducible evaluation suite for quantifying privacy leakage in cluster-level conversation summaries published from user–LLM interactions. It specifically measures the propensity of derived analytics artifacts—summaries produced via clustering and extractive reporting of conversation text—to expose personally identifying information (PII) and uniquely traceable strings ("canaries") introduced as synthetic secrets. The framework does not claim differential privacy or formal guarantees, but provides rigorous, worst-case leakage measurement under intentionally adversarial summarization scenarios, with the aim of guiding deployers toward safer practices (Mehta, 25 Jan 2026).
1. Motivation and Privacy Threat Model
CanaryBench addresses privacy vulnerabilities inherent to aggregate analytics workflows in LLM-based systems, where raw conversational data is typically withheld in favor of publishing short cluster-level summaries. While this approach is presumed to protect user privacy, if summarization methods quote text verbatim or employ extractive techniques, uniquely identifying fragments inserted in the original dialogues may still be revealed. Such leakage allows adversaries to re-identify users, correlate sensitive topics (e.g., medical, legal, or stigmatizing content) with individuals, and causes potential chilling effects or targeted harm. CanaryBench operationalizes a stress test to measure the frequency and conditions under which verbatim secrets bypass aggregation defenses and appear in published summaries.
2. Synthetic Data Generation and Canary Injection
Experimental runs in CanaryBench synthesize single-turn conversations across 24 diverse topics. Each conversation instance contains:
- A natural-language prompt or question.
- A canary string drawn from a fixed canary set, injected with a fixed per-conversation probability, modeling unique identifiers such as synthetic emails ([email protected]), phone numbers (+1-415-555-YYYY), or proprietary phrases.
- Additional PII-like strings (emails, phones, ZIP codes) injected with probability 0.20 to test regex-based detection and redaction pipelines.
The explicit planting of deterministic canary tokens facilitates exact-match ground truth for leakage detection. Every published summary can thus be automatically audited for the presence or absence of governed secrets.
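The injection step can be sketched as follows; the topic text, canary format, and probability are illustrative assumptions, not the benchmark's actual configuration:

```python
import random

def make_conversation(topic, canaries, p_canary=0.5, rng=None):
    """Build one synthetic single-turn conversation, planting a canary
    string with probability p_canary. All names here are illustrative."""
    rng = rng or random.Random()
    text = f"User question about {topic}."
    planted = None
    if rng.random() < p_canary:
        planted = rng.choice(canaries)
        text += f" My reference code is {planted}."
    return {"text": text, "canary": planted}

rng = random.Random(0)
convos = [make_conversation("billing", ["CANARY-0001", "CANARY-0002"],
                            p_canary=0.5, rng=rng) for _ in range(100)]
```

Because every planted canary is recorded alongside its conversation, leakage auditing later reduces to exact substring matching.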
3. Embedding, Clustering, and Summarization Pipeline
Conversations are mapped to TF–IDF feature vectors x_i, where the weight of each term t in document d is w(t, d) = tf(t, d) · log(N / df(t)), with tf(t, d) the term frequency, df(t) the number of documents containing t, and N the corpus size.
Clustering is performed using k-means, with k = 54 clusters in the baseline experiment. Each conversation is assigned to a cluster so as to minimize within-cluster squared Euclidean distances. For cluster j with member set C_j, the centroid is computed as μ_j = (1/|C_j|) Σ_{x_i ∈ C_j} x_i.
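A minimal TF-IDF weighting sketch, using raw term counts and no normalization (the benchmark's exact vectorizer settings are not specified here):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized documents to sparse TF-IDF weight dicts:
    w(t, d) = tf(t, d) * log(N / df(t)).
    Terms appearing in every document get weight 0 (log 1)."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vecs

docs = [["refund", "invoice"], ["invoice", "late"], ["refund", "refund"]]
vecs = tfidf_vectors(docs)
```

In practice a library vectorizer with sublinear scaling and L2 normalization would be used; the dict-of-weights form above is only to make the formula concrete.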
Summarization follows two main approaches:
- Keyword-based (non-extractive): Abstractive selection of high–TF–IDF terms or topics ensures no verbatim content from dialogues is reused.
- Extractive example-based: Selection and concatenation of real conversation snippets (sentences or spans) models commonly deployed "quote-like" reporting, which is intentionally susceptible to leaking canaries.
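The contrast between the two summarizers can be sketched directly; token frequency stands in for TF-IDF weight for brevity, and the sample texts are invented:

```python
from collections import Counter

def keyword_summary(cluster_texts, top_k=3):
    """Non-extractive: report only high-frequency lowercased terms,
    never raw spans from any single conversation."""
    counts = Counter(t for text in cluster_texts for t in text.lower().split())
    return ", ".join(t for t, _ in counts.most_common(top_k))

def extractive_summary(cluster_texts, n_examples=2):
    """Extractive: quote raw conversation snippets verbatim -- the
    leakage-prone style the benchmark is designed to catch."""
    return " | ".join(cluster_texts[:n_examples])

texts = ["Reset my password CANARY-0042", "How do I reset my password?"]
```

The extractive variant reproduces any planted canary verbatim, while the keyword variant only emits aggregate terms.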
4. Leakage Metrics and Evaluation Methodology
Leakage is characterized using strict string-matching criteria. For each cluster c containing conversation set S_c, the canary set K_c is the set of canary strings planted in the conversations of S_c. A canary κ ∈ K_c is said to leak in the published cluster summary T_c iff κ appears verbatim as a substring of T_c. Two principal metrics are reported:
- Per-canary leak rate: the fraction of planted canary instances in published clusters whose canary string appears verbatim in the corresponding summary.
- Cluster-level leak rate: the fraction of published clusters with at least one planted canary whose summary leaks at least one of them.
In addition, regex patterns are applied to summaries to count second-order PII exposures (e.g., emails, phone numbers, ZIP codes), serving as a supplementary indicator of privacy risk.
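The two metrics can be computed with exact substring matching as described above; the dict schema ("canaries", "summary") is an illustrative assumption:

```python
def leak_rates(clusters):
    """clusters: list of dicts with 'canaries' (planted strings) and
    'summary' (published text). Returns (per_canary, cluster_level)
    leak rates computed by exact substring match."""
    total = leaked = 0
    canary_clusters = leaking_clusters = 0
    for c in clusters:
        hits = [k for k in c["canaries"] if k in c["summary"]]
        total += len(c["canaries"])
        leaked += len(hits)
        if c["canaries"]:
            canary_clusters += 1
            leaking_clusters += bool(hits)
    per_canary = leaked / total if total else 0.0
    cluster_level = leaking_clusters / canary_clusters if canary_clusters else 0.0
    return per_canary, cluster_level
```

Exact matching makes the metric deterministic and auditable, at the cost of missing paraphrased or partially quoted secrets.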
5. Experimental Results and Baseline Vulnerabilities
In the extractive example-based setting, without defenses:
- 54 clusters published
- 1,835 total canary instances in published clusters
- 51 leaked canary instances (per-canary leak rate 51/1,835 ≈ 2.8%)
- Of 52 clusters containing at least one canary, 50 leak at least one (cluster-level leak rate 50/52 ≈ 96.2%)
- Regex PII hits: 17 emails, 20 phones, 27 ZIPs
- Cluster coherence (proxy for utility): 0.653
This demonstrates that the extractive summarization protocol leaks planted secrets in nearly all canary-containing clusters, confirming the risk that even aggregated publication can lead to privacy compromise.
6. Minimal Defenses: k-Min Threshold and Regex Redaction
The minimal defenses consist of:
- k-min threshold: Summaries are published only for clusters whose size meets a minimum threshold, i.e., |S_c| ≥ k_min.
- Regex redaction: For all published summaries, substrings matching email, phone, ZIP, or canary-like patterns are replaced via regular expression substitution (e.g., [REDACTED_EMAIL], [REDACTED_PHONE]).
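The two defenses can be sketched together; the regex patterns, the k_min value, and the cluster schema are illustrative, not the benchmark's exact configuration:

```python
import re

# Illustrative patterns; a production pipeline would use stricter ones
# plus learned PII detectors.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"), "[REDACTED_PHONE]"),
    (re.compile(r"\b\d{5}(?:-\d{4})?\b"), "[REDACTED_ZIP]"),
    (re.compile(r"CANARY-\d+"), "[REDACTED_CANARY]"),
]

def publish(clusters, k_min=10):
    """Suppress clusters below the k_min size threshold, then redact
    PII-like substrings in the surviving summaries."""
    out = []
    for c in clusters:
        if len(c["members"]) < k_min:
            continue  # k-min threshold: do not publish small clusters
        summary = c["summary"]
        for pat, token in PATTERNS:
            summary = pat.sub(token, summary)
        out.append(summary)
    return out
```

Suppression removes the highest-risk small clusters outright; redaction catches pattern-matching secrets in whatever is still published.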
Combined, this pipeline suppresses high-risk small clusters and redacts remaining canary or PII tokens. In the defended run:
- 32 clusters published (22 suppressed, 41% reduction)
- 1,699 canary instances
- 0 leaked canary instances (per-canary leak rate 0/1,699 = 0%)
- 0 clusters leaking canaries (cluster-level leak rate 0%)
- 0 regex PII hits
- Cluster coherence: 0.662 (no degradation in utility proxy)
7. Societal Implications and Mitigation Strategies
Verbatim leakage of identifiers adjacent to sensitive topics enables adversarial re-identification, phishing, discrimination, and doxing. Awareness of these privacy failures can deter vulnerable populations (e.g., those seeking help for mental health or LGBTQ+ issues) from using conversational AI systems. Broad recommendations derived from the evaluation include:
- Refraining from extractive summarization in any public analytics context.
- Enforcing aggregation thresholds (a minimum cluster size k_min, set higher for sensitive domains).
- Layered redaction pipelines (regex, learned PII detectors, manual oversight).
- Treating analytic outputs as sensitive: access controls, auditing, limited retention.
- Considering differential privacy for quantitative aggregates.
- Implementing user opt-out and data-deletion options.
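For the differential-privacy recommendation on quantitative aggregates, a minimal Laplace-mechanism sketch for a counting query; ε, the sensitivity, and the function name are illustrative assumptions, not part of CanaryBench:

```python
import random

def noisy_count(true_count, epsilon=1.0, sensitivity=1.0, rng=None):
    """Laplace mechanism for a counting query: add noise of scale
    sensitivity/epsilon so the released count is epsilon-DP.
    The difference of two iid exponentials with mean `scale`
    is Laplace(scale)-distributed."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise
```

Unlike redaction, this gives a formal guarantee for numeric outputs (e.g., per-cluster conversation counts), but it does not protect free-text summaries.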
These practices align analytical utility with robust privacy standards, supporting safer deployment of LLM analytics.
Summary and Benchmark Positioning
CanaryBench establishes that extractive cluster-level summaries propagate planted secrets with near-universal cluster-level leakage, reliably detected via exact-matching metrics. Simple mitigations—aggregation thresholds and regex-based redaction—completely eliminate measured leakage while preserving content utility. The benchmark is intended for continuous-integration privacy checks over any system that publishes cluster-level text analytics derived from sensitive conversational data (Mehta, 25 Jan 2026).