Personal Data Flow Graph Overview
- Personal Data Flow Graphs are formal models representing the lifecycle of personal data as nodes and directed edges, enabling detailed privacy compliance and risk analysis.
- They integrate legal requirements, such as GDPR, through precise node and edge annotations that enforce consent, data minimization, and purpose limitations.
- Construction methods range from static program analysis to automated privacy policy parsing, with applications in telehealth, Android apps, and breach risk evaluation.
A Personal Data Flow Graph (PDFG) is a formal, graph-theoretic model for representing, analyzing, and reasoning about the lifecycle of personal data (its collection, processing, storage, transmission, and sharing) within and across sociotechnical systems. PDFGs underpin privacy engineering, risk assessment, regulatory compliance (notably GDPR), and automated or semi-automated privacy policy analysis. They abstract system or organizational activities as nodes (with fine-grained typing and annotations) and data flows as labelled, directed edges, which supports policy integration, risk propagation, and graph-theoretic control and optimization.
1. Formal Models and Core Definitions
The canonical formalization defines a PDFG as a labelled directed multigraph $G = (V, E, \alpha, \lambda)$, where:
- $V$ is a set of nodes, typically partitioned as follows (Azam et al., 2024, Alshareef et al., 2020, Yuan et al., 2023):
  - $V_P$: Processes (data processors/algorithms)
  - $V_D$: Data Stores (databases, logs)
  - $V_S$: Data Subjects (individuals)
  - $V_T$: Third Parties (external recipients)
- $E$: directed edges, capturing data flows, dependencies, or relationships
- $\alpha$: assigns attribute sets to each node, to encode roles, actions, and implementation artifacts (e.g., Role $\in$ {DS, DC, DP, SA}; Properties such as ConsentRequestForm, CleanData)
- $\lambda$: assigns labels to edges, often reflecting GDPR/legal or system-specific semantics (ConsentProvided, PurposeAssigned, Encryption, SharingType)
This schema can be extended for program-level flows (nodes = code statements or program variables; edges = data/control dependencies) (Khedkar et al., 20 Mar 2025, Tang et al., 2022), or organization-level flows (nodes = actors/services/data categories; edges = collection/sharing relationships) (Yuan et al., 2023, Yuan et al., 15 Jan 2026).
For empirical privacy risk modeling, the Identity Ecosystem graph (Niu et al., 6 Aug 2025) treats each node as a specific PII attribute, with edges weighted by empirical disclosure probability.
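The labelled directed multigraph described above can be sketched as a small data structure. This is a minimal illustration, not the representation used in any of the cited works: the class names `PDFG` and `Edge`, the node-kind strings, and the example nodes are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field

# Node kinds follow the four-way partition in the text; the strings are illustrative.
NODE_KINDS = {"process", "data_store", "data_subject", "third_party"}


@dataclass(frozen=True)
class Edge:
    """One directed data flow; parallel edges between the same nodes are allowed."""
    source: str
    target: str
    labels: frozenset = frozenset()


@dataclass
class PDFG:
    """Labelled directed multigraph: typed nodes, attribute sets, labelled edges."""
    kinds: dict = field(default_factory=dict)       # node id -> node kind
    attributes: dict = field(default_factory=dict)  # node id -> set of attributes
    edges: list = field(default_factory=list)       # a list keeps parallel edges

    def add_node(self, node_id, kind, attributes=()):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[node_id] = kind
        self.attributes[node_id] = set(attributes)

    def add_edge(self, source, target, labels=()):
        if source not in self.kinds or target not in self.kinds:
            raise KeyError("both endpoints must be added as nodes first")
        edge = Edge(source, target, frozenset(labels))
        self.edges.append(edge)
        return edge

    def out_edges(self, node_id):
        return [e for e in self.edges if e.source == node_id]


# Hypothetical example: a patient sends data to a booking process.
g = PDFG()
g.add_node("patient", "data_subject", {"Role:DS"})
g.add_node("booking", "process", {"Role:DC", "ConsentRequestForm"})
g.add_node("records", "data_store")
g.add_edge("patient", "booking", {"ConsentProvided"})
g.add_edge("patient", "booking", {"PurposeAssigned"})  # parallel edge
g.add_edge("booking", "records", {"Encryption"})
```

Storing edges in a list rather than a set of node pairs is what makes this a multigraph: two flows between the same endpoints can carry different labels.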
2. Integration of Legal and Privacy Principles
PDFGs are foundational for encoding data protection principles and regulatory requirements. GDPR compliance is encoded via node and edge annotations:
- Lawfulness, Fairness, Transparency: Each process node carries a legal basis (consent, contract, etc.), and edges have ConsentProvided labels (Azam et al., 2024, Alshareef et al., 2020).
- Purpose Limitation: Each data-flow edge is annotated with its purpose, and no process may legally repurpose data without an explicit new purpose annotation (Azam et al., 2024).
- Data Minimization: Each process node specifies an input data schema, constrained to the minimal set required for its purpose, ensuring that only necessary data flows (Azam et al., 2024).
- Storage Limitation: Data store nodes have a retention period, and edges can trigger erasure sub-processes within deadline windows (Azam et al., 2024, Alshareef et al., 2020).
- Accountability: Nodes for Supervisory Authority (SA) or Reporting Mechanisms (RM) are embedded, and processing steps are linked with responsibility attributions (data controller/processor) (Azam et al., 2024).
- Integrity and Confidentiality: Edges encode technical controls, e.g., a security level (high, medium, low) and a protection mechanism (encryption, signing) (Azam et al., 2024).
Privacy-aware DFDs (PA-DFDs) further elaborate fine-grained enforcement via insertion of “Limit”, “Reason”, “Request”, “PolicyDB”, “Log”, and “Cleaner” nodes, ensuring that every personal-data flow passes through policy enforcement checkpoints, and that all actions are logged for accountability (Alshareef et al., 2020).
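The checkpoint idea behind PA-DFDs can be illustrated with a much-simplified rewrite pass. This sketch is not the transformation rule set of Alshareef et al.: it only routes each personal-data flow through a per-flow "Limit" node that consults "PolicyDB" and writes to "Log". The function name, the triple format, and the `Limit_<index>` naming are assumptions.

```python
def insert_checkpoints(flows, personal):
    """Route each personal-data flow through its own Limit node and log it.

    flows:    list of (source, target, data_item) triples
    personal: set of data items treated as personal data
    Returns a new list of flows; non-personal flows are left unchanged.
    """
    rewritten = []
    for index, (source, target, item) in enumerate(flows):
        if item not in personal:
            rewritten.append((source, target, item))
            continue
        limit = f"Limit_{index}"                                  # hypothetical node naming
        rewritten.append((source, limit, item))                   # flow enters the checkpoint
        rewritten.append(("PolicyDB", limit, f"policy({item})"))  # policy lookup
        rewritten.append((limit, target, item))                   # flow continues past the checkpoint
        rewritten.append((limit, "Log", f"log({item})"))          # record kept for accountability
    return rewritten


# Hypothetical flows: only "email" is treated as personal data here.
flows = [("patient", "booking", "email"), ("booking", "stats", "page_views")]
checked = insert_checkpoints(flows, personal={"email"})
```

After the rewrite, no direct edge carries a personal item between two original nodes, which is the structural property the checkpoint pattern is meant to guarantee.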
3. Construction Methodologies
Programmatic and Policy-based Extraction
- Static Program Analysis: PDFGs are synthesized via taint analysis and data-flow tracking on source code or bytecode. Nodes represent statements, methods, or variables; edges track control/data dependencies. PFGs (Privacy Flow-Graphs) in this setting model the propagation and transformation of private values from input (“source”) through processing to output (“sink”) (Tang et al., 2022, Khedkar et al., 20 Mar 2025).
- Manual and Automated Policy Parsing: PDFGs can be constructed by extracting collection, usage, and sharing statements from privacy policies (Yuan et al., 2023). Automated frameworks such as LADFA use LLMs and retrieval-augmented generation to segment policy text, identify entities and flows, and map them to a standardized graph model (Yuan et al., 15 Jan 2026).
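The source-to-sink idea behind static extraction can be sketched as reachability over a data-dependency graph. Real taint analyses work on an intermediate representation and handle aliasing, calls, and sanitizers; the version below assumes the dependency edges are already given, and the statement names are invented for illustration.

```python
from collections import deque


def tainted_paths(dependencies, sources, sinks):
    """Return one shortest source-to-sink path per reachable (source, sink) pair.

    dependencies: dict mapping a statement/variable to those that depend on it
    """
    found = []
    for source in sources:
        parent = {source: None}
        queue = deque([source])
        while queue:  # breadth-first search from the privacy source
            node = queue.popleft()
            for succ in dependencies.get(node, ()):
                if succ not in parent:
                    parent[succ] = node
                    queue.append(succ)
        for sink in sinks:
            if sink in parent:  # the sink is tainted by this source
                path, node = [], sink
                while node is not None:
                    path.append(node)
                    node = parent[node]
                found.append(path[::-1])
    return found


# Hypothetical dependency edges for a tiny Android-style snippet.
deps = {
    "getDeviceId()": ["id"],
    "id": ["hash(id)"],
    "hash(id)": ["payload"],
    "payload": ["http.post(payload)"],
    "getLocale()": ["lang"],
}
paths = tainted_paths(deps, sources=["getDeviceId()"],
                      sinks=["http.post(payload)", "lang"])
# paths == [["getDeviceId()", "id", "hash(id)", "payload", "http.post(payload)"]]
```

Each returned path is a candidate subgraph of the PDFG: its nodes are the statements that touch the private value on its way from source to sink.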
Graph Algorithms and Enforcement Computation
PDFGs also serve as the substrate for automated enforcement. In consent management, for example, the goal is to disconnect user-data nodes from forbidden purpose nodes to enforce fine-grained privacy constraints. Formally, this involves identifying edge cuts that remove all paths from sensitive user-data vertices to purpose vertices while maximizing retained utility, which reduces to a variant of the Minimum Multicut problem (Filipczuk et al., 2024).
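The cut formulation can be made concrete with an exhaustive search on a toy graph. This is not the algorithm of Filipczuk et al.: it tries every edge subset, so it is exponential in the number of edges and only useful for illustration. The per-edge utility values and node names are made up, and "maximizing retained utility" is modelled here as minimizing the total utility of removed edges.

```python
from itertools import combinations


def reachable(edges, start, goal):
    """Depth-first search over a list of (u, v) edges."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for u, v in edges:
            if u == node and v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def cheapest_cut(utility, forbidden):
    """Exhaustively find the edge set of least total utility whose removal
    disconnects every forbidden (data, purpose) pair. Exponential in |E|."""
    edges = list(utility)
    best, best_cost = None, float("inf")
    for size in range(len(edges) + 1):
        for removed in combinations(edges, size):
            cost = sum(utility[e] for e in removed)
            if cost >= best_cost:
                continue  # cannot improve on the best cut found so far
            kept = [e for e in edges if e not in removed]
            if not any(reachable(kept, s, t) for s, t in forbidden):
                best, best_cost = set(removed), cost
    return best, best_cost


# Edge -> utility lost if that flow is removed (made-up numbers).
utility = {
    ("location", "profile"): 3,
    ("profile", "ads"): 1,
    ("profile", "recommendations"): 4,
    ("email", "profile"): 2,
}
cut, cost = cheapest_cut(utility, forbidden=[("location", "ads"), ("email", "ads")])
# cut == {("profile", "ads")}, cost == 1
```

In this example a single cheap edge separates both forbidden pairs, so the recommendation flow is retained in full; cutting the two inbound edges instead would cost 5.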
4. Analysis, Compliance Check, and Risk Propagation
Automated Reasoning and Threat Detection
PDFGs, when enriched with rule-driven knowledge bases (e.g., the SWRL and RuleML rules of Azam et al., 2024), support forward-chaining inference to automatically detect non-compliance threats:
- NonConsent: flagged if no ConsentProvided/RequestFormProvided edge exists for a DS→DC flow.
- NonPurposeLimitation: triggered by repurposing without proper annotation.
- NonDataMinimization: triggered if a process's input data exceeds the minimal set for its purpose.
- NonStorageLimitation: flagged if data is retained beyond declared retention period.
- NonIntegrityConfidentiality and NonAccountability: detected by missing technical or procedural controls.
Automated reasoning outputs a report of all fact patterns matching GDPR principle violations, as demonstrated in telehealth and retail systems (Azam et al., 2024, Alshareef et al., 2020).
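Three of the rules listed above can be sketched as plain checks over annotated flows and stores. This stands in for the rule-engine approach only loosely: the cited work uses SWRL/RuleML with forward chaining, whereas the function below is a single pass in Python. All field names (`source_role`, `labels`, `retention_days`, and so on) and the example data are assumptions.

```python
from datetime import date


def detect_threats(flows, stores, minimal_sets, today):
    """Return (threat, subject) pairs for three of the listed rules.

    flows:        dicts with source, target, source_role, target_role,
                  labels (set), purpose, fields (set)
    stores:       dicts with name, stored_on (date), retention_days
    minimal_sets: purpose -> set of fields considered necessary
    """
    threats = []
    for flow in flows:
        name = f"{flow['source']}->{flow['target']}"
        ds_to_dc = flow["source_role"] == "DS" and flow["target_role"] == "DC"
        consent = {"ConsentProvided", "RequestFormProvided"} & flow["labels"]
        if ds_to_dc and not consent:
            threats.append(("NonConsent", name))
        allowed = minimal_sets.get(flow["purpose"], set())
        if not flow["fields"] <= allowed:  # flow carries fields beyond the minimal set
            threats.append(("NonDataMinimization", name))
    for store in stores:
        if (today - store["stored_on"]).days > store["retention_days"]:
            threats.append(("NonStorageLimitation", store["name"]))
    return threats


# Hypothetical telehealth-style data.
flows = [
    {"source": "patient", "target": "clinic", "source_role": "DS",
     "target_role": "DC", "labels": {"ConsentProvided"},
     "purpose": "appointment", "fields": {"name", "phone"}},
    {"source": "patient", "target": "portal", "source_role": "DS",
     "target_role": "DC", "labels": set(),
     "purpose": "appointment", "fields": {"name", "phone", "religion"}},
]
stores = [{"name": "call_log", "stored_on": date(2024, 1, 1), "retention_days": 30}]
minimal_sets = {"appointment": {"name", "phone"}}
report = detect_threats(flows, stores, minimal_sets, today=date(2024, 3, 1))
# report == [("NonConsent", "patient->portal"),
#            ("NonDataMinimization", "patient->portal"),
#            ("NonStorageLimitation", "call_log")]
```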
Privacy Risk Estimation
In empirical setups, PDFGs/identity graphs encode observed relationships between types of PII attributes. Edge weights model the conditional probability that disclosure of one attribute leads to risk for another; contextual PageRank and graph neural network link-prediction allow prioritization of high-risk attributes for protection (Niu et al., 6 Aug 2025).
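Risk propagation over a weighted attribute graph can be sketched with personalized PageRank, used here as a stand-in for the contextual PageRank variant in the cited work (whose exact formulation is not reproduced). The edge weights below are made up, and normalizing them per node turns disclosure probabilities into transition probabilities, which is a simplification.

```python
def personalized_pagerank(weights, seeds, damping=0.85, iterations=100):
    """Power iteration on a weighted directed graph.

    weights: dict mapping attribute -> {neighbour: edge weight}
    seeds:   non-empty set of attributes assumed already exposed (restart set)
    """
    nodes = set(weights) | {v for nbrs in weights.values() for v in nbrs}
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for u in nodes:
            out = weights.get(u, {})
            total = sum(out.values())
            if total == 0:
                # Dangling node: return its mass to the restart distribution.
                for n in nodes:
                    nxt[n] += damping * rank[u] * restart[n]
                continue
            for v, w in out.items():
                nxt[v] += damping * rank[u] * w / total
        rank = nxt
    return rank


# Hypothetical attribute graph; weights are illustrative, not measured.
weights = {
    "email": {"name": 0.6, "phone": 0.2},
    "name": {"address": 0.5},
    "phone": {"address": 0.1},
    "ssn": {"name": 0.9, "address": 0.8},
}
scores = personalized_pagerank(weights, seeds={"email"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Attributes that cannot be reached from the seed set (here `ssn`) receive a score of zero, so the ranking reflects exposure relative to what is assumed to be already disclosed.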
Metrics
Key metrics supported by PDFGs include degree centrality, betweenness, path length, risk-propagation scores, slice depth (for program slices), and clustering to highlight hubs and regulatory bottlenecks (Niu et al., 6 Aug 2025, Khedkar et al., 20 Mar 2025, Yuan et al., 15 Jan 2026, Yuan et al., 2023).
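Two of these metrics, degree centrality and path length, are simple enough to sketch directly on an edge list. The normalization by (n - 1) follows the common convention for degree centrality; the example graph is hypothetical.

```python
from collections import deque


def degree_centrality(edges):
    """In- and out-degree per node, normalised by (n - 1); parallel edges count."""
    nodes = {n for edge in edges for n in edge}
    scale = max(len(nodes) - 1, 1)
    indeg = {n: 0 for n in nodes}
    outdeg = {n: 0 for n in nodes}
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return ({n: indeg[n] / scale for n in nodes},
            {n: outdeg[n] / scale for n in nodes})


def hop_distance(edges, start, goal):
    """Length of a shortest directed path, or None if goal is unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for u, v in edges:
            if u == node and v not in dist:
                dist[v] = dist[node] + 1
                queue.append(v)
    return None


# Hypothetical flows between four actors.
edges = [("user", "app"), ("app", "analytics"),
         ("app", "ad_network"), ("analytics", "ad_network")]
in_c, out_c = degree_centrality(edges)
hops = hop_distance(edges, "user", "ad_network")  # 2
```

A node with high out-degree centrality (here `app`) is a distribution hub, while a short hop distance from a data subject to a third party indicates how directly personal data can reach it.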
5. Applications and Case Studies
PDFGs have been instantiated in multiple contexts:
- Telehealth systems: Systematic detection of missing erasure flows, default purposes, and accountability gaps (Azam et al., 2024).
- Android apps: SliceViz employs PDFGs to visualize program slices from privacy sources to sinks, supporting developer audits aligned with GDPR (Khedkar et al., 20 Mar 2025).
- Empirical privacy impact analysis: The Identity Ecosystem graph dissects more than 5,000 real-world breach cases, exposing high-risk PII nodes (e.g., SSN as a high out-degree hub) and multihop compromise pathways (Niu et al., 6 Aug 2025).
- Privacy policy analysis: The Booking.com case study constructs a PDFG from publicly available policy text to map explicit and implicit collections and complex third-party sharing (Yuan et al., 2023). LADFA enables scalable, automated generation and evaluation of PDFGs from privacy policy corpora (Yuan et al., 15 Jan 2026).
- Consent management: PDFGs are optimized to disconnect forbidden user-purpose pairs under utility constraints via advanced graph cut algorithms (Filipczuk et al., 2024).
6. Visualization, Abstraction, and Usability
Due to the inherent complexity of real-world PDFGs, abstraction and visualization techniques are deployed:
- Node Clustering and Role-based Coloring: Combine non-critical or functionally similar nodes, differentiate roles by shape/color, and accentuate hubs or critical edges (Yuan et al., 2023, Khedkar et al., 20 Mar 2025).
- Layout Algorithms: Force-directed, radial, and partitioned layouts enable the analysis of structural bottlenecks, hubs, and privacy-relevant subgraphs.
- Summary Views: Abstract “flow storyboard” models retain only critical paths for non-technical stakeholders or regulatory reporting (Tang et al., 2022).
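Role-based styling can be sketched by emitting Graphviz DOT text, which a layout engine such as `dot` or `neato` can then render. The shape and colour choices, the function name `to_dot`, and the example nodes are assumptions; none of the cited tools is claimed to work this way.

```python
ROLE_STYLE = {  # illustrative shape/colour choices per node kind
    "data_subject": ("ellipse", "lightblue"),
    "process": ("box", "lightyellow"),
    "data_store": ("cylinder", "lightgrey"),
    "third_party": ("hexagon", "salmon"),
}


def to_dot(kinds, edges):
    """Render typed nodes and labelled edges as Graphviz DOT text."""
    lines = ["digraph PDFG {", "  rankdir=LR;"]
    for node, kind in sorted(kinds.items()):
        shape, colour = ROLE_STYLE[kind]
        lines.append(f'  "{node}" [shape={shape}, style=filled, fillcolor={colour}];')
    for source, target, label in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-node graph.
kinds = {"patient": "data_subject", "booking": "process", "partner": "third_party"}
edges = [("patient", "booking", "email"), ("booking", "partner", "email")]
dot = to_dot(kinds, edges)
print(dot)
```

Keeping the styling in one lookup table makes it easy to swap in a different visual convention, for example one that highlights third-party recipients for a regulatory report.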
7. Limitations, Open Challenges, and Directions
Open challenges include data completeness (policy-based graphs often omit actual flows), scaling to industrial-size systems (manual DFD annotation is costly), ambiguity in natural-language sources, and the computational hardness of optimal enforcement (the directed minimum multicut problem is NP-hard) (Yuan et al., 2023, Filipczuk et al., 2024).
Advances in LLM-based extraction, standardized ontologies, and hybrid static/dynamic analysis pipelines are underway. Automation of PDFG construction and analysis is trending toward sector-wide, interoperable frameworks that bridge software engineering, regulatory compliance, and privacy management (Yuan et al., 15 Jan 2026).
References:
- (Azam et al., 2024): Modelling Technique for GDPR-compliance: Toward a Comprehensive Solution
- (Niu et al., 6 Aug 2025): Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape
- (Yuan et al., 2023): Visualising Personal Data Flows: Insights from a Case Study of Booking.com
- (Alshareef et al., 2020): Transforming Data Flow Diagrams for Privacy Compliance (Long Version)
- (Tang et al., 2022): Assessing Software Privacy using the Privacy Flow-Graph
- (Khedkar et al., 20 Mar 2025): Visualizing Privacy-Relevant Data Flows in Android Applications
- (Yuan et al., 15 Jan 2026): LADFA: A Framework of Using LLMs and Retrieval-Augmented Generation for Personal Data Flow Analysis in Privacy Policies
- (Filipczuk et al., 2024): Graph Theory for Consent Management: A New Approach for Complex Data Flows
- (Al-Fedaghi, 2018): Privacy Things: Systematic Approach to Privacy and Personal Identifiable Information