Context Sanitization & Provenance Filtering
- Context sanitization and provenance filtering are complementary strategies for reducing irrelevant, redundant, or sensitive data while maintaining system utility, privacy, and integrity.
- Techniques include formal models using provenance graphs, nonconformity scoring in RAG systems, and policy-based filtering to ensure efficient data management.
- Recent advances leverage graph-based segmentation, multimedia forensic filtering, and adversarial sanitization methods to improve accuracy, scalability, and security in data systems.
Context sanitization and provenance filtering are complementary strategies for reducing irrelevant, redundant, or sensitive information in data systems—often with the goal of preserving utility, privacy, or explainability. These techniques span domains from large-scale data management to AI-centric retrieval-augmented generation (RAG), workflow lineage, whole-system provenance, and multimedia forensics. The central aim is to present a distilled representation of the "context" or "history" of data, while satisfying constraints on privacy, efficiency, statistical coverage, or security. This entry surveys the main formal models, system architectures, algorithms, and empirical results underlying context sanitization and provenance filtering.
1. Formal Models and Security Objectives
Sanitization and filtering operate primarily on provenance graphs: directed acyclic graphs whose nodes may represent data items (entities), processes (activities), and agents. Two central security objectives—obfuscation and disclosure—govern transformations:
- Obfuscation: Ensures that, given a sanitized view S(G) of a provenance graph G, an adversary cannot infer certain secrets expressible as predicates φ over G. Schematically, there exist graphs G1, G2 with S(G1) = S(G2) such that φ(G1) holds but φ(G2) does not.
- Disclosure: Guarantees that the sanitized view preserves all information needed to decide designated predicates: any two graphs with the same sanitized view agree on those predicates.
Classic integrity constraints maintained by sanitization/filtering include acyclicity, bipartiteness, port-typing (in workflow graphs), and preservation or reflection of dependency paths. Quantitative privacy—such as Γ-privacy for workflow modules—gives adversarial uncertainty guarantees: for any visible port set, a module is Γ-private if an adversary's success at inferring hidden outputs is no better than 1/Γ (Cheney et al., 2014).
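The two security objectives can be stated schematically as follows (the notation here is ours, following the survey's informal definitions; S denotes the sanitization map):

```latex
% Obfuscation of a secret predicate \varphi: the sanitized view does not
% determine the secret, i.e., two graphs with the same view disagree on it.
\exists\, G_1, G_2 :\quad S(G_1) = S(G_2)
  \;\wedge\; \varphi(G_1) \;\wedge\; \neg\varphi(G_2)

% Disclosure of a designated predicate \psi: the view fully determines it.
\forall\, G_1, G_2 :\quad S(G_1) = S(G_2)
  \;\Rightarrow\; \bigl(\psi(G_1) \Leftrightarrow \psi(G_2)\bigr)
```

Obfuscation thus asks that the view be consistent with at least two worlds that differ on the secret, while disclosure asks the opposite for the designated predicates.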
2. Coverage-Controlled Context Filtering in RAG Systems
Retrieval-Augmented Generation, which integrates retrieved evidence into LLM prompts, suffers when the prompt context is overly long or noisy. Principled context filtering using conformal prediction addresses this via the following pipeline (Chakraborty et al., 22 Nov 2025):
- Nonconformity Scoring: For each candidate snippet c, assign a nonconformity score s(c) via either an embedding-based metric or an LLM-based rating.
- Threshold Calibration: On labeled calibration data, select a quantile threshold τ so that at least a 1 − α fraction of relevant snippets is retained, using split conformal prediction.
- Filtering: For each query, retain only those snippets with s(c) ≤ τ.
- Statistical Guarantees: Across values of α, empirical coverage matches or slightly exceeds the 1 − α target on NeuCLIR/RAGTIME datasets.
- Downstream Accuracy: Strict filtering improves factual accuracy; moderate filtering maintains it, with a 2–3× reduction in token count.
This coverage-controlled filtering is model-agnostic and balances a user-specified tradeoff between context length and recall of supporting evidence.
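The calibrate-then-filter pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scores are toy values, and the finite-sample quantile correction is the standard split-conformal recipe.

```python
import numpy as np

def calibrate_threshold(cal_scores, alpha):
    """Split conformal calibration: choose the empirical quantile of
    nonconformity scores on relevant calibration snippets so that at
    least a (1 - alpha) fraction of relevant snippets is retained."""
    n = len(cal_scores)
    # Finite-sample-corrected quantile level, capped at 1.0.
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

def filter_context(snippets, scores, tau):
    """Retain only snippets whose nonconformity score is <= tau."""
    return [s for s, sc in zip(snippets, scores) if sc <= tau]

# Toy calibration scores for known-relevant snippets (lower = better).
cal_scores = np.array([0.1, 0.2, 0.25, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95])
tau = calibrate_threshold(cal_scores, alpha=0.2)
kept = filter_context(["a", "b", "c"], [0.15, 0.85, 0.99], tau)
```

Raising α makes the filter stricter (shorter context, lower recall of supporting evidence); the coverage guarantee holds regardless of how the score s(c) is produced.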
3. Policy-Based and Algorithmic Filtering in Provenance Systems
System provenance often results in voluminous, noisy records. Efficient filtering and context sanitization can be realized via declarative or algorithmic mechanisms:
- CamFlow's Policy API (Pasquier et al., 2017): Defines a policy tuple partitioning objects into Tracked, Opaque, and Propagating sets, augmented with node- and edge-type filters. Per-tenant isolation uses kernel security contexts and cgroups, ensuring that graphs for different tenants are disjoint; cross-tenant flows can be made opaque at capture time. Selective policies are enforced in-kernel, reducing captured data volume by orders of magnitude, with microbenchmark overheads of only 18–28%.
- OneProvenance Filtering Suite (Psallidas et al., 2022): Log-based provenance extraction suffers from spurious events (loops, system traffic, irrelevant users). Optimizations include:
- Loop Compression: Retain only the most recent occurrences of repeated queries.
- Predicate Filtering: Drop nodes/events by SQL fingerprint, user, or client app.
- Connection/Activity Filtering: Prune activities from uninteresting connections.
- Aggregation Control: Optionally emit only high-level (procedure) edges.
- Cost Metrics: Substantial extraction speedups and graph-size reductions on TPC-C workloads.
These strategies are placed as early as possible in the pipeline to minimize downstream data-processing costs.
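The loop-compression and predicate-filtering steps above can be sketched over a simple event stream. This is an illustrative reimplementation, not OneProvenance's code; the event fields and the `keep_last` parameter are hypothetical.

```python
from collections import defaultdict

def compress_loops(events, keep_last=1):
    """Loop compression: for each repeated query fingerprint, retain
    only the last `keep_last` occurrences, preserving stream order."""
    by_fp = defaultdict(list)
    for ev in events:
        by_fp[ev["fingerprint"]].append(ev)
    kept = {id(ev) for evs in by_fp.values() for ev in evs[-keep_last:]}
    return [ev for ev in events if id(ev) in kept]

def predicate_filter(events, drop_users=(), drop_apps=()):
    """Predicate filtering: drop events by user or client application."""
    return [ev for ev in events
            if ev.get("user") not in drop_users
            and ev.get("app") not in drop_apps]

events = [
    {"fingerprint": "Q1", "user": "etl"},
    {"fingerprint": "Q1", "user": "etl"},      # repeated query in a loop
    {"fingerprint": "Q2", "user": "monitor"},  # system traffic
]
events = predicate_filter(events, drop_users={"monitor"})
events = compress_loops(events, keep_last=1)
```

Applying the cheap predicate filter before loop compression mirrors the source's point that filters belong as early in the pipeline as possible.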
4. Graph-Based Segmentation and Summarization for Provenance Sanitization
Graph query operators offer fine-grained, user-driven context sanitization:
- Segmentation (SEG) (Miao et al., 2018): Given a provenance property graph, segments are induced subgraphs extracted between user-specified source and destination node sets under user-defined boundary criteria. SEG supports path-pattern grammars, node/edge exclusion, and iterative expansion. Sanitization is achieved by setting exclusion filters to mask sensitive node/edge types (e.g., hiding Agents with blacklisted names or redacting files matching "*.cfg").
- Summarization (SUM): Multiple segmented runs can be merged into an abstracted summary graph by collapsing similar nodes (as determined by k-hop isomorphism), aggregating properties, and simulation-based greedy merging. The result preserves workflow shape but not sensitive details.
This paradigm supports modular, policy-driven provenance sanitization: select sources/destinations, specify filters, segment, and optionally summarize.
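A minimal sketch of the segment-with-exclusion step, assuming the provenance graph is a DAG stored as a node-to-successors map (the graph, node names, and filter are hypothetical; real SEG also supports path-pattern grammars):

```python
def segment(graph, sources, dests, exclude=lambda n: False):
    """SEG-style segmentation sketch: return the induced subgraph of
    nodes reachable forward from `sources` and backward from `dests`,
    skipping nodes matched by the exclusion filter (the sanitization
    step). `graph` maps node -> set of successor nodes."""
    def reach(starts, edges):
        seen, stack = set(), [n for n in starts if not exclude(n)]
        while stack:
            n = stack.pop()
            if n in seen or exclude(n):
                continue
            seen.add(n)
            stack.extend(edges.get(n, ()))
        return seen

    reverse = {}  # invert edges for the backward reachability pass
    for u, vs in graph.items():
        for v in vs:
            reverse.setdefault(v, set()).add(u)

    nodes = reach(sources, graph) & reach(dests, reverse)
    return {u: graph.get(u, set()) & nodes for u in nodes}

prov = {"fileA": {"proc1"}, "proc1": {"secret.cfg", "fileB"},
        "secret.cfg": {"proc2"}, "fileB": {"proc2"}, "proc2": {"out"}}
seg = segment(prov, {"fileA"}, {"out"},
              exclude=lambda n: n.endswith(".cfg"))
```

Here the "*.cfg" exclusion both removes the sensitive node and severs any dependency paths that pass only through it, which is exactly the masking behavior the operator is meant to provide.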
5. Multimedia Phylogeny and Context-Guided Provenance Filtering
Multimedia provenance filtering isolates the ancestral components of forgeries or composites in large image galleries (Pinto et al., 2017):
- Two-Tier Retrieval:
  1. Tier 1 uses approximate nearest-neighbor voting on SURF descriptors to retrieve the likely "host" donor.
  2. Tier 2 computes a contextual mask between the query and major donor (align, difference, morphological filtering), then restricts patch-based voting to mask-derived regions to capture small ("secondary") donors.
- Scalability & Accuracy: On NC2016, Tier 1 finds the background donor with high recall in million-image galleries; Tier 2 improves small-donor Recall@10 by 4.7%.
- Contextuality: Focusing the search on contextual differences, rather than the whole image, increases precision and reduces false positives, illustrating the value of context-driven filtering.
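The contextual-mask step in Tier 2 can be sketched in simplified form. This assumes the query and donor are already aligned (real pipelines align via keypoint matching first) and uses a crude 3×3 morphological opening built from shifts rather than a forensic-grade filter:

```python
import numpy as np

def contextual_mask(query, donor, thresh=30):
    """Simplified Tier-2 mask: absolute per-pixel difference between
    aligned images, thresholded, then cleaned by a morphological
    opening (erosion followed by dilation with a 3x3 box) to suppress
    isolated noise pixels."""
    diff = np.abs(query.astype(int) - donor.astype(int)) > thresh

    def erode(mask):  # pixel survives only if its whole 3x3 patch is set
        out = mask.copy()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out &= np.roll(np.roll(mask, dy, 0), dx, 1)
        return out

    def dilate(mask):  # pixel is set if anything in its 3x3 patch is set
        out = mask.copy()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out |= np.roll(np.roll(mask, dy, 0), dx, 1)
        return out

    return dilate(erode(diff))

# Synthetic example: a 4x4 spliced region differs from the donor.
query = np.zeros((10, 10), dtype=np.uint8)
donor = np.zeros_like(query)
query[3:7, 3:7] = 100
mask = contextual_mask(query, donor)
```

Restricting the Tier-2 patch search to `mask` regions is what lets small secondary donors surface instead of being drowned out by the dominant host content.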
6. Post-Retrieval Context Sanitization in Adversarial RAG Pipelines
Recent adversarial models expose subtle vulnerabilities in context construction, particularly in RAG databases (Wu et al., 30 Nov 2025):
- Bias Injection Attacks: Attackers inject semantically-biased yet factually-correct passages engineered to have high relevance and perspective shift, thus crowding out opposing views in retrieval.
- BiasDef Filtering: A post-retrieval filter, BiasDef, detects dense clusters of high-relevance, high-polarization-score passages via Kullback-Leibler divergence analysis and Mahalanobis-distance-based recovery, filtering those with adversarial characteristics.
- Performance: BiasDef achieves a 15% absolute reduction in adversarial passage recall and a marked drop in perspective shift, while preserving 62% more benign content than prior defenses, all measured on realistic RAG QA benchmarks.
The absence of explicit provenance-aware filtering in current LLM-based RAG pipelines indicates an opportunity for integrating provenance scores directly into context sanitization objectives.
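The Mahalanobis-distance component of such a defense can be illustrated in isolation. This is not BiasDef itself: the two-dimensional features (e.g., relevance and polarization scores), the benign reference set, and the distance threshold are all illustrative assumptions.

```python
import numpy as np

def mahalanobis_flags(feats, ref, threshold=3.0):
    """Flag candidate passages whose feature vectors lie far from a
    benign reference distribution, by Mahalanobis distance. `ref` is
    an (n, d) array of benign features; the threshold is illustrative."""
    mu = ref.mean(axis=0)
    cov = np.cov(ref, rowvar=False)
    # Small ridge term keeps the inverse stable for near-singular cov.
    inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    d = feats - mu
    dist = np.sqrt(np.einsum("ij,jk,ik->i", d, inv, d))
    return dist > threshold

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(500, 2))          # benign passages
candidates = np.vstack([np.zeros((1, 2)),             # typical passage
                        np.full((1, 2), 8.0)])        # high-relevance,
                                                      # high-polarization
flags = mahalanobis_flags(candidates, benign)
```

A passage that is simultaneously extreme in relevance and polarization is flagged, while a typical passage passes through, matching the intuition that bias-injection attacks occupy a distinctive region of feature space.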
7. Comparative Evaluation and Open Challenges
A systematic survey (Cheney et al., 2014) places provenance sanitization approaches on an axis of expressiveness, integrity, confidentiality, support for formal policy, and scalability:
| System | Expressiveness | Integrity | Confidentiality | Context Hidden |
|---|---|---|---|---|
| ZOOM | ≈ | ✔ | – | block-level |
| SecurityViews | – | ≈ | ✔ | port-level |
| Surrogates | ≈ | – | ≈ | node/edge |
| ProPub | ✔ | ✔ | ✔ | arbitrary |
| ProvViews | ≈ | – | ✔ (quantitative) | attribute/port |
| ProvAbs | ≈ | ✔ | ≈ | subgraph |
Major open challenges include:
- Achieving general, compositional semantic foundations that link provenance views and sanitizations back to workflow semantics.
- Quantitative leakage measurement (beyond Γ-privacy, opacity, or empirical utility).
- Scalable enforcement of global integrity constraints (acyclicity, typing) following sanitization.
- Real-time, dynamic, and multi-user policy support; provenance of provenance (meta-provenance) for tracing sanitization activity.
These directions underline the critical need for theoretically-founded, efficient, and flexible context sanitization and provenance-filtering mechanisms across data-intensive domains.