Generative Data Refinement: Just Ask for Better Data
Abstract: For a fixed parameter size, the capabilities of large models are primarily determined by the quality and quantity of their training data. Consequently, training datasets now grow faster than the rate at which new data is indexed on the web, leading to projected data exhaustion over the next decade. Much more data exists as user-generated content that is not publicly indexed, but incorporating such data comes with considerable risks, such as leaking private information and other undesirable content. We introduce a framework, Generative Data Refinement (GDR), for using pretrained generative models to transform a dataset with undesirable content into a refined dataset that is more suitable for training. Our experiments show that GDR can outperform industry-grade solutions for dataset anonymization, as well as enable direct detoxification of highly unsafe datasets. Moreover, we show that by generating synthetic data that is conditioned on each example in the real dataset, GDR's refined outputs naturally match the diversity of web-scale datasets, and thereby avoid the often challenging task of generating diverse synthetic data via model prompting. The simplicity and effectiveness of GDR make it a powerful tool for scaling up the total stock of training data for frontier models.
Explain it Like I'm 14
What is this paper about?
This paper introduces a simple idea called Generative Data Refinement (GDR). It’s a way to ask a powerful AI to rewrite existing text or code to remove “bad” parts (like private details or rude language) while keeping the “good” parts (useful facts, structure, and style). The goal is to safely unlock lots more data for training future AI models.
Why this matters: Big AI models get better when they train on more and better data. But clean, safe data on the public web is running out, and lots of non-public data can be risky to use because it may include private information or toxic content. GDR helps turn risky data into safe, useful data.
What questions did the researchers ask?
In simple terms, they asked:
- Can an AI carefully rewrite real data so it’s safe to use, without throwing away useful information?
- Is this better than current tools that only detect and redact (black out) sensitive pieces?
- Can this work at large scale for different types of data, like text and computer code?
- Will the cleaned data still help train good models?
- Can this approach keep the variety and “realness” of natural data better than fully made-up (synthetic) data?
How did they do it?
The key idea: Generative Data Refinement (GDR)
Imagine you have a messy school essay that includes personal details and some rude phrases. Instead of crossing things out with a marker (which can ruin the flow), you ask a smart editor to rewrite the essay: remove private names and rude words, but keep the main ideas and facts. That's GDR.
- “Generative” means the AI writes new text.
- “Refinement” means it edits each original example to be safe and clean, not inventing something unrelated.
- “Grounded” means each rewrite is based on the specific original example, so the results stay realistic and diverse.
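At its core, GDR is just a carefully worded instruction to a language model, applied once per real example. A minimal sketch of what that prompt-building step might look like (the wording and function name below are illustrative, not taken from the paper):

```python
def build_gdr_prompt(example: str) -> str:
    """Build a refinement prompt for one real example (illustrative wording)."""
    return (
        "Rewrite the following text so that it contains no private details "
        "or toxic language, while preserving all other facts, structure, "
        "and style. Output only the rewritten text.\n\n"
        f"Text:\n{example}"
    )

# Each real example gets its own prompt, so the refined dataset stays
# grounded in (and as varied as) the original data.
prompt = build_gdr_prompt("Call Alice Smith at 555-0142 about the Q3 report.")
```

Because every prompt is conditioned on one specific original example, the model edits rather than invents, which is what keeps the output realistic and diverse.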
What they used GDR for
- Anonymizing text: removing personally identifiable information (PII), like real names, ID numbers, emails, or keys.
- Anonymizing code: replacing secrets in software (API keys, passwords, private URLs) with safe placeholders.
- Detoxifying text: rewriting toxic or hateful messages into safe versions while keeping any useful facts.
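For code, the target transformation is swapping real secrets for safe placeholders while keeping the line syntactically valid. A toy regex sketch of that outcome (in the paper a model does the rewriting; the pattern and placeholder here are purely illustrative):

```python
import re

# Hypothetical pattern: AWS-style access key IDs ("AKIA" + 16 chars).
SECRET_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")

def redact_secrets(line: str) -> str:
    """Replace matched secrets with a placeholder, keeping the code runnable."""
    return SECRET_PATTERN.sub("<API_KEY>", line)

refined = redact_secrets('aws_key = "AKIAABCDEFGHIJKLMNOP"')
# refined == 'aws_key = "<API_KEY>"'
```

A fixed pattern like this only catches secrets with a known format; the point of GDR is that a model can recognize and rewrite sensitive content that no hand-written rule anticipates.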
How they checked success (explained simply)
- Precision: Of the things the system changed or flagged, how many truly needed changing? (Fewer false alarms is better.)
- Recall: Of all the bad or sensitive things present, how many did the system actually fix? (Missing fewer problems is better.)
- Toxicity scores: Independent tools rated how toxic a text is; lower is better.
- Usefulness tests: They trained small models on the refined datasets to see if the models learned the public facts but not the private ones.
- Diversity: They compared how varied the refined data is versus data that’s purely made up by prompting a model from scratch.
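Precision and recall, as used above, come from counting correct flags, false alarms, and missed items. A minimal sketch of the arithmetic:

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Precision: fraction of flagged items that truly needed changing.
    Recall: fraction of truly sensitive items that were actually flagged."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# E.g. 8 correct flags, 2 false alarms, 2 missed items:
p, r = precision_recall(true_pos=8, false_pos=2, false_neg=2)
# p == 0.8, r == 0.8
```

A detector that over-redacts gets low precision (many false alarms); one that misses secrets gets low recall. A good refiner needs both to be high.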
What did they find?
Here are the main results and why they matter:
- Better anonymization than common industry tools:
- GDR outperformed a commercial detector-and-redactor service (they call it DIRS) on both catching sensitive info (higher recall) and avoiding false alarms (higher precision) across 108 different PII types.
- Why it matters: Instead of throwing away or over-redacting data, GDR can safely rewrite it so it remains useful for training.
- Smaller models can work with a little help:
- With a few examples in the prompt (few-shot) or a small amount of fine-tuning, a smaller AI matched or beat a larger one for anonymization.
- Why it matters: This keeps costs down and makes the method practical at scale.
- The refined data stays useful:
- After anonymizing, models trained on the refined data still learned the public facts but did not memorize the private ones.
- Why it matters: You get safety without losing the value of the data.
- Works on real code at scale:
- On 1.2 million lines of code, GDR more accurately identified and rewrote sensitive pieces at the line level, while the detector-based method had many false positives (marking safe lines as unsafe).
- Why it matters: You can safely reuse large codebases for training without breaking them or losing tons of useful code.
- Strong detoxification while keeping information:
- GDR rewrote a large set of very toxic forum messages into safer versions with much lower toxicity scores than the original, and even lower than fully synthetic chat data from the same model.
- A model trained on the detoxified set learned real-world facts present in the original data, and its answers sounded more human-like.
- Why it matters: You can learn from messy, real sources without spreading harmful content.
- More natural diversity than fully synthetic data:
- Because GDR rewrites real examples instead of inventing them from scratch, the refined data kept the variety and “feel” of real-world data, avoiding mode collapse (where generated data becomes repetitive).
- Why it matters: Diverse training data helps models be more robust and capable.
Why does this matter?
- Safe scaling of training data: GDR can turn risky, non-public, or messy data into safe, high-quality training material, easing the “data shortage” problem.
- Privacy and safety: It removes private details and toxic content directly in the data before training, reducing the chance a model will memorize and repeat them.
- Practical and flexible: It works across many types of sensitive content and is compatible with other methods (like using reward models or additional filtering).
- Cost-aware: Although rewriting data costs compute, the refined datasets can be reused many times. Smaller models can be adapted to cut costs further.
Key takeaways
- GDR is like asking an AI editor to clean and anonymize real data, not to invent new data from scratch.
- It removes private and harmful content while keeping useful information and natural variety.
- It beats common detector-only approaches, works on text and code at scale, and keeps data useful for training good, safer models.
- This approach can significantly expand the amount of safe, high-quality data available for future AI systems.