Pattern-Centric Sankey Diagram
- A pattern-centric Sankey diagram is a visualization technique that anchors flows on recurring patterns, enabling detailed analysis of sequential and hierarchical data.
- It employs systematic extraction, pattern mining, and aggregation methods to uncover structural insights in random forests and behavioral sequences.
- The design integrates multivariate encoding to visually represent flow frequency, uncertainty, and transition dynamics for enhanced analytic clarity.
A pattern-centric Sankey diagram is a layered flow visualization in which the constituent ribbons or flows are structured with respect to discrete or frequent "patterns" rather than regular time bins or pre-defined categories. Pattern-centric Sankeys have been developed to reveal, in one compact figure, the structural organization or behavioral context of complex sequential or hierarchical data, such as random forest path structures, mined user behavior patterns, or flows with compositional uncertainty. They focus user attention on subsequences or motifs of analytic interest and their immediate context (Fitzpatrick et al., 2017, Sheng et al., 3 Dec 2025, He et al., 4 Aug 2025). Instead of restricting flows to uniform progressions, these diagrams aggregate and align flows on occurrences of a chosen subsequence ("pattern"), or encode multivariate attributes through structured patterns within each ribbon section.
1. Formal Definitions and Motivating Use Cases
A pattern-centric Sankey diagram generalizes the standard Sankey diagram by changing the basis of alignment from fixed categorical or temporal axes to pattern-anchored or pattern-encoded axes. In the context of random forests, pattern-centric Sankeys summarize the entire spectrum of root-to-leaf split-sequences (“patterns”) as a single layered graph, capturing frequencies and hierarchies of covariate interactions beyond aggregate measures (Fitzpatrick et al., 2017). In behavioral sequence mining, such as in online mental health communities, the pattern-centric Sankey is "docked" at a specific behavioral stage pattern, aligning all sampled sequences relative to that motif to reveal the pre- and post-pattern context and thereby directly support context-focused analysis requests (Sheng et al., 3 Dec 2025).
The need for a pattern-centric approach arises when:
- Key analytic questions concern the context of certain event patterns rather than overall group-level progression.
- The data exhibits rich hierarchical or compositional structure not reducible to time or simple category sequences.
- Experts require immediate visualization of the most common antecedents and consequences surrounding a subsequence or interaction motif.
2. Data Extraction, Pattern Mining, and Aggregation Methods
The construction of a pattern-centric Sankey begins with extracting relevant compositional or sequential patterns from the underlying data.
In random forests (Fitzpatrick et al., 2017), the extraction algorithm comprises:
- Enumeration of all root-to-leaf paths per tree, encoded as ordered sequences of covariate splits.
- Aggregation of identical path-patterns across the forest via counts, yielding link weights $w^{(r)}_{ij}$ for transitions from covariate $i$ at depth $r$ to covariate $j$ at depth $r+1$, and node sizes $s^{(r)}_{i}$ for each variable $i$ per rank $r$.
- Construction of the global network model as a layered directed graph, where each layer indexed by rank corresponds to split depth.
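The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-dict tree encoding, the function names `root_to_leaf_paths` and `aggregate_forest`, and the choice to count each split once per leaf beneath it are hypothetical and are not taken from Fitzpatrick et al.

```python
from collections import Counter

# Hypothetical tree encoding (an assumption for this sketch): an internal
# node is a dict with a split covariate "var" and two children; a leaf is None.
def root_to_leaf_paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of split covariates."""
    if node is None:  # leaf reached: emit the accumulated splits
        yield prefix
        return
    path = prefix + (node["var"],)
    yield from root_to_leaf_paths(node["left"], path)
    yield from root_to_leaf_paths(node["right"], path)

def aggregate_forest(forest):
    """Count path patterns, per-rank node sizes, and adjacent-rank link weights."""
    patterns, node_size, link_weight = Counter(), Counter(), Counter()
    for tree in forest:
        for path in root_to_leaf_paths(tree):
            patterns[path] += 1
            # Node size: appearances of a covariate at a given rank (depth).
            for rank, var in enumerate(path):
                node_size[(rank, var)] += 1
            # Link weight: transitions from rank r to rank r + 1.
            for rank, (src, dst) in enumerate(zip(path, path[1:])):
                link_weight[(rank, src, dst)] += 1
    return patterns, node_size, link_weight

forest = [
    {"var": "x1", "left": {"var": "x2", "left": None, "right": None}, "right": None},
    {"var": "x1", "left": None, "right": {"var": "x3", "left": None, "right": None}},
]
patterns, node_size, link_weight = aggregate_forest(forest)
```

The keys of `node_size` and `link_weight` carry the rank, so they map directly onto the layers of the layered directed graph.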
In mined behavioral sequences (Sheng et al., 3 Dec 2025):
- Raw event streams (e.g., two-week windows of Reddit posts) are converted to event vectors via behavioral and mental health assessments, quantized by clustering.
- Event sequences per user are segmented into "behavior stages" using Greedy Gaussian Segmentation, with feature vectors further clustered into stage types.
- All stage-label sequences are searched for frequent patterns (subsequences), which are then used as focal points for Sankey alignment.
- Contextual impact of any given pattern is computed via pre–post comparison metrics, leveraging frequencies in recovery and deterioration groups.
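The frequent-pattern search over stage-label sequences can be illustrated with a small sketch. The source does not specify the mining algorithm, so the choices here are assumptions: patterns are contiguous subsequences, support is counted once per sequence, and the name `frequent_patterns` with its `min_len`, `max_len`, and `min_support` parameters is hypothetical.

```python
from collections import Counter

def frequent_patterns(sequences, min_len=2, max_len=3, min_support=2):
    """Return contiguous stage subsequences found in at least min_support sequences."""
    support = Counter()
    for seq in sequences:
        seen = set()  # count each pattern at most once per sequence
        for n in range(min_len, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(tuple(seq[i:i + n]))
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}

# Toy stage-label sequences, one per user (illustrative data only).
sequences = [
    ["A", "B", "C", "D"],
    ["B", "C", "A"],
    ["A", "B", "C"],
]
frequent = frequent_patterns(sequences)
```

Any pattern returned here could then serve as the focal point on which the Sankey is docked.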
General table of pattern-centric Sankey pipeline approaches:
| Domain | Extraction unit | Aggregation | Alignment/focus |
|---|---|---|---|
| Random forest | Root–leaf paths | Covariate sequences | By path-pattern/rank |
| User behavior (EMINDS) | Behavior stages | Stage patterns | Docked on motif |
3. Diagram Construction and Mathematical Encoding
The core of a pattern-centric Sankey is a layered directed network $G = (V, E)$, where:
- Each node $v \in V$ represents covariate $i$ at position (rank) $r$ in a pattern (random forest), or a stage at relative offset $k$ from a docked pattern (sequence mining).
- Each link $e \in E$ encodes the frequency (flow volume) $w_e$ of transitions between nodes in adjacent layers, with width proportional to $w_e$.
- Layers correspond to split depths, relative windows around a pattern, or user-defined compositional axes.
For context-focused pattern Sankeys (Sheng et al., 3 Dec 2025), the layout is as follows:
- Sequences containing the focal pattern $p$ are realigned such that $p$ appears at offset $k = 0$ (central column).
- Nodes to the left ($k < 0$) and right ($k > 0$) display stage frequencies pre- and post-pattern across sequences, with flows connecting adjacent positions.
- Node heights are proportional to normalized or raw occurrence counts, and node ordering within columns sorts by an auxiliary metric (e.g., positivity).
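The docking layout described above can be sketched as follows. This is a minimal illustration under assumptions that are not from Sheng et al.: the function names `find_pattern` and `dock` are hypothetical, each sequence is aligned on its first contiguous occurrence of the pattern, and the context is limited to a fixed `window` of stages on each side.

```python
from collections import Counter

def find_pattern(seq, pattern):
    """Return the start index of the first contiguous occurrence, or None."""
    n = len(pattern)
    for i in range(len(seq) - n + 1):
        if tuple(seq[i:i + n]) == tuple(pattern):
            return i
    return None

def dock(sequences, pattern, window=2):
    """Align sequences on the pattern and count nodes and flows per relative offset."""
    nodes, flows = Counter(), Counter()
    n = len(pattern)
    for seq in sequences:
        start = find_pattern(seq, pattern)
        if start is None:
            continue  # sequences without the pattern are excluded
        # Column 0 holds the whole pattern; negative offsets precede it,
        # positive offsets follow it.
        columns = {0: tuple(pattern)}
        for k in range(1, window + 1):
            if start - k >= 0:
                columns[-k] = seq[start - k]
            if start + n - 1 + k < len(seq):
                columns[k] = seq[start + n - 1 + k]
        for k, label in columns.items():
            nodes[(k, label)] += 1
            if k + 1 in columns:  # flow to the adjacent column on the right
                flows[(k, label, columns[k + 1])] += 1
    return nodes, flows

sequences = [
    ["A", "B", "C", "D"],
    ["E", "B", "C", "A"],
    ["A", "B", "C"],
]
nodes, flows = dock(sequences, ("B", "C"), window=1)
```

Node counts keyed by offset give the column heights, and flow counts keyed by source offset give the ribbon widths between adjacent columns.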
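Link and node formulas (random forest context (Fitzpatrick et al., 2017)):
$$w^{(r)}_{ij} = \sum_{t=1}^{T} c^{(r)}_{t,ij}, \qquad s^{(r)}_{i} = \sum_{t=1}^{T} a^{(r)}_{t,i},$$
where $w^{(r)}_{ij}$ is the weight of the link from covariate $i$ at rank $r$ to covariate $j$ at rank $r+1$, $s^{(r)}_{i}$ is the size of the node for covariate $i$ at rank $r$, $T$ is the number of trees, and $c^{(r)}_{t,ij}$ and $a^{(r)}_{t,i}$ are the local counts in tree $t$ of adjacent covariate transitions and per-rank appearances, respectively.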
For ribbon internals and multivariate encoding, pattern-based filling is formally expressed as:
- A 2D lattice determining primitive arrangement, parameterized by two spacing values, the internal lattice angle, and the overall ribbon orientation (He et al., 4 Aug 2025).
- Group assignments for composite or subflow-encoded ribbons, with per-group ratios and retinal variables (color, shape, size).
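As a rough illustration of the two ingredients above, the following sketch places primitives on a skewed, rotated lattice inside a rectangular ribbon segment and then interleaves group labels according to target ratios. Everything named here is an assumption for illustration (`lattice_points`, `assign_groups`, the parameter names, the rectangular clipping region, and the interleaving rule); it is not the specification given by He et al.

```python
import math

def lattice_points(width, height, spacing_u, spacing_v, lattice_angle, orientation):
    """Return primitive centers of a skewed, rotated lattice inside a width x height box."""
    eps = 1e-9  # tolerance for floating-point error at the box edges
    # Basis vector u follows the ribbon orientation; v sits at lattice_angle from u.
    ux = spacing_u * math.cos(orientation)
    uy = spacing_u * math.sin(orientation)
    vx = spacing_v * math.cos(orientation + lattice_angle)
    vy = spacing_v * math.sin(orientation + lattice_angle)
    skew = abs(math.sin(lattice_angle))
    if skew < eps:
        raise ValueError("lattice_angle must not be a multiple of pi")
    # Number of lattice steps sufficient to cover the box diagonal.
    reach = int(math.hypot(width, height) / (min(spacing_u, spacing_v) * skew)) + 1
    points = []
    for i in range(-reach, reach + 1):
        for j in range(-reach, reach + 1):
            x, y = i * ux + j * vx, i * uy + j * vy
            if -eps <= x <= width + eps and -eps <= y <= height + eps:
                points.append((x, y))
    return points

def assign_groups(n_points, ratios):
    """Interleave group labels so that running counts track the target ratios."""
    total = sum(ratios)
    counts = [0] * len(ratios)
    labels = []
    for k in range(1, n_points + 1):
        # Pick the group furthest below its target share after k primitives.
        g = max(range(len(ratios)), key=lambda i: ratios[i] / total * k - counts[i])
        counts[g] += 1
        labels.append(g)
    return labels

points = lattice_points(20, 10, 10, 10, math.pi / 2, 0.0)
labels = assign_groups(len(points), [2, 1])
```

Each label would then select the retinal variables (for example hue or shape) used to draw the primitive at the corresponding point. A clustered or randomized assignment style would replace `assign_groups` with a different ordering rule.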
4. Visual Encoding, Pattern Theory, and Best Design Practices
Recent formalizations of pattern as a visual variable extend the expressive range of pattern-centric Sankeys (He et al., 4 Aug 2025). In this framework:
- Spatial arrangement of primitives, group assignment style (clustered, interleaved, randomized), and retinal variables (hue, shape, size, orientation) determine the legibility, discriminability, and perceptual salience of multivariate flow ribbons.
- Category, flow quantity, and auxiliary attributes (such as uncertainty) are mapped to pattern parameters (ribbon width, primitive density, group composition, hue and shape assignments).
- Design guidelines mandate a visible minimum primitive size, controlled primitive spacing (specified in pixels), and category separation by a minimum hue difference.
Patterns should not encode more than two variable dimensions simultaneously, and aligning the lattice orientation with the flow direction optimizes preattentive association of pattern with semantic meaning.
Practical implementations provide for:
- Normalization options for node and link sizing.
- Interactive inspection of node and edge metrics (e.g., raw count, positivity, d(p)).
- Color mappings for key semantic indicators (e.g., green for positive, orange for negative progression).
- Flexible filtering and windowing to enable exploration at both coarse and fine contextual resolutions.
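Two of these options, normalization of node sizes and filtering, are simple enough to sketch. The helper names `normalize_columns` and `top_k_per_column`, and the convention that nodes are keyed by `(column, label)`, are assumptions made for this example rather than features of any cited implementation.

```python
def normalize_columns(node_counts):
    """Scale node counts so that each column (offset or rank) sums to 1."""
    totals = {}
    for (column, _label), count in node_counts.items():
        totals[column] = totals.get(column, 0) + count
    return {key: count / totals[key[0]] for key, count in node_counts.items()}

def top_k_per_column(node_counts, k):
    """Keep only the k most frequent nodes in every column."""
    by_column = {}
    for (column, label), count in node_counts.items():
        by_column.setdefault(column, []).append((count, label))
    kept = {}
    for column, entries in by_column.items():
        # Sort by descending count; ties keep their original insertion order.
        for count, label in sorted(entries, key=lambda e: -e[0])[:k]:
            kept[(column, label)] = count
    return kept

# Toy node counts keyed by (column, label); illustrative data only.
node_counts = {(-1, "A"): 2, (-1, "E"): 1, (0, "BC"): 3}
normalized = normalize_columns(node_counts)
filtered = top_k_per_column(node_counts, k=1)
```

Normalized values make columns with different numbers of sequences comparable, while raw counts preserve absolute volume, which is why exposing both as options is useful.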
5. Applications and Case Studies
Pattern-centric Sankeys have demonstrated utility in three research areas:
Random Forest Interpretation (Fitzpatrick et al., 2017)
- Revealing all root-to-leaf split-sequence frequencies as aggregated multi-layer flows.
- Disclosing the most common covariate interaction hierarchies, supporting transparent model auditing.
- Software implementation in R ("forestviews") automates extraction, aggregation, and visualization of pattern-centric Sankeys.
Behavioral Pattern Analysis in Social Media (Sheng et al., 3 Dec 2025)
- Allowing experts to "dock" the visualization on a focus pattern (e.g., a risky or recovery stage sequence) and immediately inspect the most common pre- and post-contexts.
- Exposing subtle differences in trajectory polarity (computed via pattern-level metrics) for cohorts moving toward mental health improvement or decline.
- Qualitatively, domain experts have reported improved ability to flag risky motifs and prioritize patterns for deeper investigation.
Composite Visual Variable Encoding (He et al., 4 Aug 2025)
- Utilizing pattern-centric Sankey ribbons as a vehicle for high-density, multivariate categorical data visualization.
- Algorithmic and perceptual guidelines enable encoding of uncertainty, subflow structure, and secondary measures within flow diagrams.
6. Comparative Analysis and Limitations
Pattern-centric Sankey diagrams differ fundamentally from standard Sankey diagrams in the following respects:
- Alignment is by occurrence or structure of user- or analyst-chosen patterns, not by fixed bins or categories.
- Variable-length and irregular context windows can be naturally represented, providing greater analytic flexibility.
- Multivariate attributes may be carried visually within ribbon patterns, not only via width, color, or endpoint categorization.
Standard Sankeys remain appropriate when global transition rates between fixed classes or intervals are required. Pattern-centric variants, however, substantially enhance the exploratory power when context, motif embedding, or compositional uncertainty are the analytic focus.
A plausible implication is that the added multivariate encoding capacity of pattern-centric designs requires careful tuning and perceptual validation to avoid exceeding users' ability to discriminate patterns, particularly when multiple composite variables and window sizes are presented simultaneously (He et al., 4 Aug 2025).
7. Future Directions and Best Practice Recommendations
Current research suggests several promising directions and best practice lessons:
- Consistent use of color semantics (e.g., green–orange for polarity) and visual ordering across all coordinated views is recommended to reduce cognitive overhead (Sheng et al., 3 Dec 2025).
- Exposing high-level pattern metrics alongside raw frequencies supports expert trust and allows verification.
- Flexible windowing, top-$k$ filtering, and interactive docking empower diverse analytic workflows.
- Aggregation on arbitrary subsequences, and use of pattern-centric axes, may generalize to other domains involving variable-length event, topic, or structure motifs.
- Pattern-theoretic specification of ribbon internals should be further validated in relation to perceptual limits and cross-modal interpretation (He et al., 4 Aug 2025).
The pattern-centric Sankey paradigm establishes a mathematical, algorithmic, and design foundation for exploratory and explanatory visualization of complex sequential, hierarchical, and compositional data, expanding the toolset for high-dimensional visual analytics.