Robustness of distilled privacy sensitivity classifiers

Establish the robustness of encoder-based privacy sensitivity classifiers distilled from Mistral Large 3 by calibrating their 1–5 Likert-scale scores, evaluating and improving performance on out-of-domain inputs, and auditing domain- and demographic-dependent failure modes to ensure safe deployment in automated pipelines.

Background

Although the distilled Ettin-150M model aligns strongly with aggregated human judgments and even surpasses the teacher model on the benchmark, the authors emphasize that reliability across settings is not yet demonstrated. They specifically highlight the need for calibration, out-of-domain evaluation, and systematic audits of domain- and demographic-dependent failures.
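The paper does not prescribe a calibration recipe, but a standard first check is expected calibration error (ECE): bin predictions by confidence and compare confidence to accuracy within each bin. The sketch below is illustrative only; `probs` and `labels` are made-up stand-ins for the distilled classifier's 5-way softmax outputs and aggregated human Likert scores (classes 0–4 mapping to scores 1–5).

```python
# Illustrative sketch: expected calibration error (ECE) for a 5-class
# Likert-score classifier. All data here is hypothetical, not from the paper.

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by top-class confidence; ECE is the weighted
    average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)                          # top-class confidence
        pred = p.index(conf)                   # predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, pred == y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += (len(b) / n) * abs(avg_conf - acc)
    return ece

# Toy inputs (classes 0..4 correspond to Likert scores 1..5):
probs = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.10, 0.70],
]
labels = [0, 2, 4]
print(round(expected_calibration_error(probs, labels), 3))  # → 0.4
```

A low ECE on in-domain data does not imply calibration holds out of domain, which is why the two checks are listed separately above.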

This open problem is critical for practical use: without appropriate robustness checks, uncalibrated or biased sensitivity scores could misinform de-identification evaluation or downstream privacy-aware applications.
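The audit the authors call for can start with a simple disaggregated error report: score each evaluation example, tag it with a domain (or demographic) label, and compare per-group error. The sketch below assumes hypothetical `(group, predicted, human)` triples; the group labels and scores are invented for illustration.

```python
# Illustrative sketch: disaggregated error audit across domains.
# `records` is a hypothetical stand-in for tagged evaluation examples.
from collections import defaultdict

def per_group_mae(records):
    """Mean absolute error of predicted vs. human Likert scores, per group."""
    errs = defaultdict(list)
    for group, pred, human in records:
        errs[group].append(abs(pred - human))
    return {g: sum(v) / len(v) for g, v in errs.items()}

records = [
    ("medical", 5, 5), ("medical", 4, 5),   # (domain, predicted, human score)
    ("forum",   2, 4), ("forum",   3, 4),
]
print(per_group_mae(records))  # → {'medical': 0.5, 'forum': 1.5}
```

A large gap between groups (here, the made-up "forum" domain errs three times as much as "medical") is the kind of domain-dependent failure mode such an audit is meant to surface before automated deployment.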

References

"Finally, robustness remains open: calibrating scores, dealing with out-of-domain inputs, and auditing domain- and demographic-dependent failure modes are essential before deploying the model as part of automated pipelines."

Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models (2603.29497 - Loiseau et al., 31 Mar 2026) in Section: Discussion — Future Work (final paragraph)