
Generative Data Refinement: Just Ask for Better Data

Published 10 Sep 2025 in cs.LG and cs.CL | (2509.08653v2)

Abstract: For a fixed parameter size, the capabilities of large models are primarily determined by the quality and quantity of their training data. Consequently, training datasets now grow faster than the rate at which new data is indexed on the web, leading to projected data exhaustion over the next decade. Much more data exists as user-generated content that is not publicly indexed, but incorporating such data comes with considerable risks, such as leaking private information and other undesirable content. We introduce a framework, Generative Data Refinement (GDR), for using pretrained generative models to transform a dataset with undesirable content into a refined dataset that is more suitable for training. Our experiments show that GDR can outperform industry-grade solutions for dataset anonymization, as well as enable direct detoxification of highly unsafe datasets. Moreover, we show that by generating synthetic data that is conditioned on each example in the real dataset, GDR's refined outputs naturally match the diversity of web-scale datasets, and thereby avoid the often challenging task of generating diverse synthetic data via model prompting. The simplicity and effectiveness of GDR make it a powerful tool for scaling up the total stock of training data for frontier models.

Summary

  • The paper introduces a framework that leverages pretrained generative models to sanitize datasets by removing PII, toxic language, and sensitive data while keeping essential information.
  • Methodology involves prompt engineering, model adaptation, and verification functions, achieving superior results with a recall of 0.99 and precision of 0.80 compared to rule-based methods.
  • Empirical evaluations across text, code, and detoxification tasks demonstrate that GDR salvages unusable data and enhances model training by preserving diversity and reducing noise.

Generative Data Refinement: A Framework for Dataset Sanitization and Augmentation

Motivation and Problem Statement

The paper introduces Generative Data Refinement (GDR), a framework leveraging pretrained generative models to transform datasets containing undesirable content—such as personally identifiable information (PII), toxic language, or sensitive facts—into refined datasets suitable for model training. The motivation stems from the observation that the scaling laws governing large model performance are increasingly constrained by the availability and quality of training data. As web-indexed data approaches exhaustion, vast quantities of user-generated and proprietary data remain untapped due to privacy, safety, and copyright risks. Existing synthetic data generation and differential privacy (DP) approaches either fail to preserve data utility or suffer from mode collapse and overfitting, limiting diversity and realism.

GDR Framework and Methodology

GDR reframes synthetic data generation as a grounded process: each real data sample x_i is transformed by a generative process g(·|x_i), producing y_i that satisfies a semantic constraint h(y_i) = 1 (e.g., no PII, low toxicity) while minimizing a distance metric A(x_i, y_i). This approach anchors synthetic data to real examples, preserving diversity and realism. The generative model (typically an LLM) is prompted or fine-tuned to rewrite each sample, selectively removing or replacing undesirable content while retaining useful information.
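The refinement step above can be sketched as constrained candidate selection. Everything below is a hypothetical illustration, not the paper's implementation: the generator g, the constraint h, and the distance A are toy stand-ins (a regex-based PII check and a character-level similarity ratio in place of an LLM and a learned metric).

```python
import re
from difflib import SequenceMatcher

# Toy stand-ins for the paper's components (hypothetical, for illustration only):
#   g(x): generative rewriter proposing candidates y ~ g(.|x); in practice an LLM
#   h(y): verification function, 1 iff y satisfies the constraint (here: no PII)
#   A(x, y): distance metric; smaller means more information retained

PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # toy rule: US-SSN-like tokens

def h(y: str) -> int:
    """Constraint check: 1 if the sample contains no PII-like tokens."""
    return 0 if PII_PATTERN.search(y) else 1

def A(x: str, y: str) -> float:
    """Distance: 1 - similarity ratio between original and rewrite."""
    return 1.0 - SequenceMatcher(None, x, y).ratio()

def g(x: str) -> list[str]:
    """Mock generator proposing deterministic candidate rewrites."""
    return [
        PII_PATTERN.sub("[ID]", x),  # replace the PII span with a placeholder
        "Redacted.",                 # over-aggressive rewrite, loses everything
    ]

def refine(x: str) -> str:
    """GDR step: among candidates with h(y) = 1, keep the y closest to x."""
    valid = [y for y in g(x) if h(y) == 1]
    return min(valid, key=lambda y: A(x, y)) if valid else x

print(refine("Contact Alice, SSN 123-45-6789, about the Q3 report."))
# → Contact Alice, SSN [ID], about the Q3 report.
```

The distance term is what keeps the rewrite anchored: both candidates satisfy the constraint, but the placeholder rewrite preserves far more of the original and therefore wins.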

Key implementation details include:

  • Prompt Engineering: Zero-shot and few-shot prompts are designed for specific domains (text, code, JSON) and constraints (PII removal, detoxification).
  • Model Adaptation: Performance can be improved via few-shot prompting and supervised fine-tuning (SFT) on domain-specific examples, enabling smaller models to match or surpass larger ones.
  • Verification Functions: Criteria for refinement are encoded as indicator functions h, which can be implemented via rule-based, classifier, or API-based methods (e.g., Perspective API for toxicity).
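To make the prompt-engineering bullet concrete, a zero-shot refinement prompt might be assembled as below. This template is a hypothetical sketch; the paper's actual prompts are not reproduced here.

```python
# Hypothetical zero-shot prompt for PII removal (illustrative, not the
# paper's prompt text).
def build_prompt(sample: str) -> str:
    return (
        "Rewrite the following text so that it contains no personally "
        "identifiable information (names, emails, IDs, keys). Replace each "
        "removed item with a generic placeholder such as [NAME] or [EMAIL]. "
        "Preserve all other content, structure, and style exactly.\n\n"
        f"Text:\n{sample}\n\nRewritten text:"
    )

prompt = build_prompt("Email bob@example.com about the outage.")
print(prompt)
```

A few-shot variant would prepend worked input/output pairs before the target sample, which is the adaptation route the paper reports for smaller models.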

Empirical Evaluation

PII Anonymization

GDR is benchmarked against a commercial Detector-based Information Removal Service (DIRS) across 20k sentences and 108 PII categories. GDR, using a single zero-shot prompt with Gemini Pro 1.5, achieves higher recall and precision than DIRS, which relies on brittle rule-based and statistical detectors. Notably, GDR generalizes across PII types and contexts, salvaging data that DIRS would otherwise discard.

  • Recall: GDR achieves 0.99 vs. DIRS's 0.53.
  • Precision: GDR achieves 0.80 vs. DIRS's 0.52.
  • F-score: GDR achieves 0.88.
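As a sanity check, the reported F-score follows directly from the harmonic mean of the reported precision and recall:

```python
# Verify the reported F-score from the reported precision and recall.
precision, recall = 0.80, 0.99
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # → 0.88
```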

Smaller models (Flash 8B, Gemma 9B/27B) approach Gemini Pro 1.5's recall but lag in precision. Few-shot prompting and SFT on 10k examples enable Flash 8B to surpass Gemini Pro 1.5, demonstrating that compute cost can be amortized by adapting smaller models.

Utility of Anonymized Data

Models trained on GDR-refined datasets retain the ability to answer questions about public facts while failing to recite private facts, confirming that GDR preserves utility without leaking sensitive information. In contrast, DIRS-redacted datasets suffer from low precision, indiscriminately removing both private and public information.

Codebase Anonymization

GDR is applied to 1.2M lines of code from 479 repositories, outperforming DIRS in agreement with human expert annotations at both document and line levels. GDR's generative rewrites accurately identify and replace PII in code comments, strings, and configuration files, minimizing false positives and negatives. Some failure modes include over-conservative rewrites and missed hash values, but these are rare and can be mitigated via static analysis and prompt refinement.

Content Detoxification

GDR is used to detoxify 100k message pairs from the /pol/ board of 4chan, notorious for toxic content. Using Gemini Pro 1.5 and a zero-shot prompt, GDR reduces mean toxicity scores (Perspective API) from 0.19 (raw) to 0.13 (refined), outperforming synthetic chat baselines. Extracted question-answer pairs from detoxified data demonstrate that world knowledge is preserved. Models fine-tuned on GDR-refined data achieve higher accuracy on knowledge quizzes and produce responses less likely to be detected as LLM-generated, indicating improved human-likeness.
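The detoxification pass described above pairs a generative rewrite with a toxicity verifier. A minimal sketch follows, assuming a hypothetical `rewrite(text)` LLM call and `toxicity(text)` scorer standing in for the Perspective API; the toy blocklist scorer below is illustrative only and does not reflect the paper's setup.

```python
# Sketch of a detoxification pass with verification (hypothetical components).
def detoxify(corpus, rewrite, toxicity, threshold=0.5, max_rounds=3):
    """Re-rewrite each sample until its toxicity score falls below threshold."""
    refined = []
    for text in corpus:
        for _ in range(max_rounds):
            if toxicity(text) < threshold:
                break
            text = rewrite(text)
        refined.append(text)
    return refined

# Toy stand-ins: score by presence of a blocklisted word; rewrite drops it.
BLOCKLIST = {"stupid"}
toy_toxicity = lambda t: 0.9 if BLOCKLIST & set(t.lower().split()) else 0.1
toy_rewrite = lambda t: " ".join(w for w in t.split() if w.lower() not in BLOCKLIST)

raw = ["that idea is stupid", "the capital of France is Paris"]
clean = detoxify(raw, toy_rewrite, toy_toxicity)
means = (sum(map(toy_toxicity, raw)) / 2, sum(map(toy_toxicity, clean)) / 2)
print(means)  # mean toxicity drops after refinement; factual content survives
```

Note the second sample passes through untouched, mirroring the paper's finding that world knowledge in the source data is preserved.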

Diversity Analysis

GDR-refined datasets exhibit greater diversity than synthetic datasets generated via direct model prompting, as measured by ROUGE-2 and embedding-based metrics. UMAP visualizations confirm that GDR avoids mode collapse, maintaining coverage of the latent space comparable to or exceeding the original data.
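A minimal version of this kind of diversity measurement can be sketched with pairwise bigram overlap, in the spirit of ROUGE-2 self-similarity; this is an illustrative proxy, not the paper's exact metric.

```python
from itertools import combinations

def bigrams(text):
    toks = text.lower().split()
    return set(zip(toks, toks[1:]))

def mean_pairwise_overlap(corpus):
    """Mean Jaccard overlap of bigram sets over all pairs; lower = more diverse."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    pairs = list(combinations(corpus, 2))
    return sum(jaccard(bigrams(x), bigrams(y)) for x, y in pairs) / len(pairs)

diverse = ["the cat sat on the mat",
           "stocks rallied after the report",
           "rain is likely tomorrow"]
collapsed = ["the cat sat on the mat",
             "the cat sat on a mat",
             "the cat sat on the rug"]
print(mean_pairwise_overlap(diverse) < mean_pairwise_overlap(collapsed))  # → True
```

A mode-collapsed corpus scores high on this self-similarity measure, while grounded rewrites inherit the spread of the real data they condition on.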

Theoretical and Practical Implications

GDR addresses key limitations of DP and synthetic data generation:

  • Selective Content Removal: Unlike DP, which injects noise and degrades utility, GDR uses LLMs as intelligent noising operators, selectively rewriting only problematic content.
  • Data Salvaging: GDR enables the recovery and reuse of otherwise unusable data, increasing the effective stock of training tokens for frontier models.
  • Scalability: While GDR incurs significant compute cost (up to one-third of a full training run), this cost is amortized by dataset reuse and can be reduced via model adaptation.
  • Generalizability: GDR is applicable across domains (text, code, structured data) and constraints (privacy, safety, copyright), and can be integrated into composite data pipelines.

Future Directions

Potential extensions include:

  • On-policy Distillation and RL Fine-tuning: Leveraging reward models for both risk detection and information preservation.
  • Corpus-level Risk Mitigation: Addressing indirect leakage via cross-document inference.
  • Multimodal Data Refinement: Applying GDR to images, audio, and other modalities.
  • Automated Prompt Optimization: Systematic search for optimal prompts and verification functions.

Conclusion

Generative Data Refinement provides a principled, empirically validated framework for dataset sanitization and augmentation using pretrained generative models. By anchoring synthetic data generation to real examples and leveraging the world knowledge of LLMs, GDR achieves superior performance in privacy, safety, and diversity, with broad applicability to scaling and curating training data for large models. The approach is complementary to existing synthetic data and privacy-preserving methods, and its effectiveness is contingent on continued advances in generative modeling and prompt engineering.

Explain it Like I'm 14

What is this paper about?

This paper introduces a simple idea called Generative Data Refinement (GDR). It’s a way to ask a powerful AI to rewrite existing text or code to remove “bad” parts (like private details or rude language) while keeping the “good” parts (useful facts, structure, and style). The goal is to safely unlock lots more data for training future AI models.

Why this matters: Big AI models get better when they train on more and better data. But clean, safe data on the public web is running out, and lots of non-public data can be risky to use because it may include private information or toxic content. GDR helps turn risky data into safe, useful data.

What questions did the researchers ask?

In simple terms, they asked:

  • Can an AI carefully rewrite real data so it’s safe to use, without throwing away useful information?
  • Is this better than current tools that only detect and redact (black out) sensitive pieces?
  • Can this work at large scale for different types of data, like text and computer code?
  • Will the cleaned data still help train good models?
  • Can this approach keep the variety and “realness” of natural data better than fully made-up (synthetic) data?

How did they do it?

The key idea: Generative Data Refinement (GDR)

Think of having a messy school essay that includes personal details and some rude phrases. Instead of crossing things out with a marker (which can ruin the flow), you ask a smart editor to rewrite the essay: remove private names and rude words, but keep the main ideas and facts. That’s GDR.

  • “Generative” means the AI writes new text.
  • “Refinement” means it edits each original example to be safe and clean, not inventing something unrelated.
  • “Grounded” means each rewrite is based on the specific original example, so the results stay realistic and diverse.

What they used GDR for

  • Anonymizing text: removing personally identifiable information (PII), like real names, ID numbers, emails, or keys.
  • Anonymizing code: replacing secrets in software (API keys, passwords, private URLs) with safe placeholders.
  • Detoxifying text: rewriting toxic or hateful messages into safe versions while keeping any useful facts.

How they checked success (explained simply)

  • Precision: Of the things the system changed or flagged, how many truly needed changing? (Fewer false alarms is better.)
  • Recall: Of all the bad or sensitive things present, how many did the system actually fix? (Missing fewer problems is better.)
  • Toxicity scores: Independent tools rated how toxic a text is; lower is better.
  • Usefulness tests: They trained small models on the refined datasets to see if the models learned the public facts but not the private ones.
  • Diversity: They compared how varied the refined data is versus data that’s purely made up by prompting a model from scratch.
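For the first two measures, here is a tiny worked example with made-up items (not the paper's data):

```python
# Toy example: items the system flagged vs. items that were truly sensitive.
flagged = {"alice@example.com", "555-0123", "the Q3 report"}      # last one is a false alarm
truly_sensitive = {"alice@example.com", "555-0123", "api_key_abc"}  # last one was missed

true_hits = flagged & truly_sensitive
precision = len(true_hits) / len(flagged)        # 2/3: one false alarm
recall = len(true_hits) / len(truly_sensitive)   # 2/3: one miss
print(precision, recall)
```

A detector-only tool tends to trade these off (flag everything and precision drops; flag cautiously and recall drops), which is the failure mode the paper's rewriting approach avoids.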

What did they find?

Here are the main results and why they matter:

  • Better anonymization than common industry tools:
    • GDR outperformed a commercial detector-and-redactor service (they call it DIRS) on both catching sensitive info (higher recall) and avoiding false alarms (higher precision) across 108 different PII types.
    • Why it matters: Instead of throwing away or over-redacting data, GDR can safely rewrite it so it remains useful for training.
  • Smaller models can work with a little help:
    • With a few examples in the prompt (few-shot) or a small amount of fine-tuning, a smaller AI matched or beat a larger one for anonymization.
    • Why it matters: This keeps costs down and makes the method practical at scale.
  • The refined data stays useful:
    • After anonymizing, models trained on the refined data still learned the public facts but did not memorize the private ones.
    • Why it matters: You get safety without losing the value of the data.
  • Works on real code at scale:
    • On 1.2 million lines of code, GDR more accurately identified and rewrote sensitive pieces at the line level, while the detector-based method had many false positives (marking safe lines as unsafe).
    • Why it matters: You can safely reuse large codebases for training without breaking them or losing tons of useful code.
  • Strong detoxification while keeping information:
    • GDR rewrote a large set of very toxic forum messages into safer versions with much lower toxicity scores than the original—and even lower than fully synthetic chat data from the same model.
    • A model trained on the detoxified set learned real-world facts present in the original data, and its answers sounded more human-like.
    • Why it matters: You can learn from messy, real sources without spreading harmful content.
  • More natural diversity than fully synthetic data:
    • Because GDR rewrites real examples instead of inventing them from scratch, the refined data kept the variety and “feel” of real-world data, avoiding mode collapse (where generated data becomes repetitive).
    • Why it matters: Diverse training data helps models be more robust and capable.

Why does this matter?

  • Safe scaling of training data: GDR can turn risky, non-public, or messy data into safe, high-quality training material, easing the “data shortage” problem.
  • Privacy and safety: It removes private details and toxic content directly in the data before training, reducing the chance a model will memorize and repeat them.
  • Practical and flexible: It works across many types of sensitive content and is compatible with other methods (like using reward models or additional filtering).
  • Cost-aware: Although rewriting data costs compute, the refined datasets can be reused many times. Smaller models can be adapted to cut costs further.

Key takeaways

  • GDR is like asking an AI editor to clean and anonymize real data, not to invent new data from scratch.
  • It removes private and harmful content while keeping useful information and natural variety.
  • It beats common detector-only approaches, works on text and code at scale, and keeps data useful for training good, safer models.
  • This approach can significantly expand the amount of safe, high-quality data available for future AI systems.
