Pattern-Centric Sankey Diagram

Updated 11 December 2025

A pattern-centric Sankey diagram is a visualization technique that anchors flows on recurring patterns, enabling detailed analysis of sequential and hierarchical data.
It employs systematic extraction, pattern mining, and aggregation methods to uncover structural insights in random forests and behavioral sequences.
The design integrates multivariate encoding to visually represent flow frequency, uncertainty, and transition dynamics for enhanced analytic clarity.

A pattern-centric Sankey diagram is a class of layered flow visualizations in which the constituent ribbons or flows are structured with respect to discrete or frequent "patterns" rather than regular time bins or pre-defined categories. Pattern-centric Sankeys have been developed as a means to reveal, in one compact figure, the precise structural organization or behavioral context of complex sequential or hierarchical data—such as random forest path structures, mined user behavior patterns, or flows with compositional uncertainty—focusing user attention on subsequences or motifs of analytic interest and their immediate context (Fitzpatrick et al., 2017, Sheng et al., 3 Dec 2025, He et al., 4 Aug 2025). Instead of restricting flows to uniform progressions, these diagrams aggregate and align flows on occurrences of a chosen subsequence ("pattern"), or encode multivariate attributes by richly structured patterns within each ribbon section.

1. Formal Definitions and Motivating Use Cases

A pattern-centric Sankey diagram generalizes the standard Sankey diagram by changing the basis of alignment from fixed categorical or temporal axes to pattern-anchored or pattern-encoded axes. In the context of random forests, pattern-centric Sankeys summarize the entire spectrum of root-to-leaf split-sequences (“patterns”) as a single layered graph, capturing frequencies and hierarchies of covariate interactions beyond aggregate measures (Fitzpatrick et al., 2017). In behavioral sequence mining, such as in online mental health communities, the pattern-centric Sankey is "docked" at a specific behavioral stage pattern, aligning all sampled sequences relative to that motif to reveal the pre- and post-pattern context and thereby directly support context-focused analysis requests (Sheng et al., 3 Dec 2025).

The need for a pattern-centric approach arises when:

Key analytic questions concern the context of certain event patterns rather than overall group-level progression.
The data exhibits rich hierarchical or compositional structure not reducible to time or simple category sequences.
Experts require immediate visualization of the most common antecedents and consequences surrounding a subsequence or interaction motif.

2. Data Extraction, Pattern Mining, and Aggregation Methods

The construction of a pattern-centric Sankey begins with extracting relevant compositional or sequential patterns from the underlying data.

In random forests (Fitzpatrick et al., 2017), the extraction algorithm comprises:

Enumeration of all root-to-leaf paths per tree, encoded as ordered sequences of covariate splits.
Aggregation of identical path-patterns across the forest via counts, yielding link weights $w_{i,j}^{(r)}$ for transitions from $x_i$ at depth $r$ to $x_j$ at depth $r+1$, and node sizes $s_i^{(r)}$ for each variable per rank.
Construction of the global network model as a layered directed graph, where each layer indexed by rank $r$ corresponds to split depth.
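
The enumeration and aggregation steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tree encoding (nested tuples with `None` as a leaf), the function names `root_to_leaf_paths` and `aggregate_forest`, and the choice to count each covariate once per root-to-leaf path are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical tree encoding (not from the source): an internal node is a
# tuple (covariate_name, left_subtree, right_subtree); a leaf is None.

def root_to_leaf_paths(tree, prefix=()):
    """Yield each root-to-leaf path as a tuple of split covariates."""
    if tree is None:  # reached a leaf: emit the accumulated path
        yield prefix
        return
    covariate, left, right = tree
    yield from root_to_leaf_paths(left, prefix + (covariate,))
    yield from root_to_leaf_paths(right, prefix + (covariate,))

def aggregate_forest(forest):
    """Return (w, s): link weights keyed by (r, x_i, x_j) and node sizes
    keyed by (r, x_i), counted once per path, with rank r starting at 1."""
    w, s = Counter(), Counter()
    for tree in forest:
        for path in root_to_leaf_paths(tree):
            for r, covariate in enumerate(path, start=1):
                s[(r, covariate)] += 1
            for r, (x_i, x_j) in enumerate(zip(path, path[1:]), start=1):
                w[(r, x_i, x_j)] += 1
    return w, s

# Two toy trees with made-up covariate names.
forest = [
    ("age", ("income", None, None), ("height", None, None)),
    ("age", ("income", None, None), None),
]
w, s = aggregate_forest(forest)
```

The resulting `w` and `s` counters map directly onto link widths and node heights of the layered graph, with one layer per rank.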

In mined behavioral sequences (Sheng et al., 3 Dec 2025):

Raw event streams (e.g., two-week windows of Reddit posts) are converted to event vectors via behavioral and mental health assessments, quantized by clustering.
Event sequences per user are segmented into "behavior stages" using Greedy Gaussian Segmentation, with feature vectors further clustered into stage types.
All stage-label sequences are searched for frequent patterns (subsequences), which are then used as focal points for Sankey alignment.
Contextual impact of any given pattern $p$ is computed via pre–post comparison metrics $d(p)$, leveraging frequencies in recovery and deterioration groups.
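
The frequent-pattern search over stage-label sequences can be illustrated with a brute-force sketch. The source does not state which mining algorithm is used or whether patterns may contain gaps, so this example assumes contiguous subsequences and a simple per-sequence support count; the function name `frequent_patterns` and its parameters are hypothetical. The metric $d(p)$ is not implemented here because its formula is not given above.

```python
from collections import Counter

def frequent_patterns(sequences, min_len=2, max_len=3, min_support=2):
    """Count contiguous subsequences by the number of sequences containing
    them, and keep those that reach min_support."""
    support = Counter()
    for seq in sequences:
        seen = set()
        for n in range(min_len, max_len + 1):
            for start in range(len(seq) - n + 1):
                seen.add(tuple(seq[start:start + n]))
        support.update(seen)  # each sequence counts at most once per pattern
    return {p: c for p, c in support.items() if c >= min_support}

# Toy stage-label sequences (labels are placeholders, one list per user).
sequences = [
    ["A", "B", "C", "A"],
    ["B", "C", "D"],
    ["A", "B", "C"],
]
patterns = frequent_patterns(sequences)
```

Any pattern returned this way could then serve as the focal motif on which the diagram is docked.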

General table of pattern-centric Sankey pipeline approaches:

Domain	Extraction unit	Aggregation	Alignment/focus
Random forest	Root–leaf paths	Covariate sequences	By path-pattern/rank
User behavior (EMINDS)	Behavior stages	Stage patterns	Docked on motif $p$

3. Diagram Construction and Mathematical Encoding

The core of a pattern-centric Sankey is a layered directed network $G = (V,E)$, where:

Each node $v_{i,r}$ represents covariate $x_i$ at position (rank) $r$ in a pattern (random forest), or a stage at relative offset $t$ from a docked pattern (sequence mining).
Each link $(v_{i,r} \to v_{j,r+1})$ encodes the frequency (flow volume) of transitions between nodes in adjacent layers, with width proportional to $w_{i,j}^{(r)}$.
Layers correspond to split depths, relative windows around a pattern, or user-defined compositional axes.

For context-focused pattern Sankeys (Sheng et al., 3 Dec 2025), the layout is as follows:

Sequences containing focal pattern $p$ are realigned such that $p$ appears at $t=0$ (central column).
Nodes to the left ($t<0$) and right ($t>0$) display stage frequencies pre- and post-pattern across sequences, with flows connecting adjacent positions.
Node heights are proportional to normalized or raw occurrence counts, and node ordering within columns sorts by an auxiliary metric (e.g., positivity).
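
The docking layout above can be sketched in a few lines. This is an illustrative reading of the description rather than the published procedure: it assumes alignment on the first occurrence of the pattern, treats the whole pattern as a single node at $t=0$, and uses a hypothetical `window` parameter for the number of context columns on each side.

```python
from collections import Counter

def dock(sequences, pattern, window=2):
    """Align sequences on the first occurrence of `pattern` and count
    node and link frequencies by relative offset t."""
    pattern = tuple(pattern)
    n = len(pattern)
    nodes, links = Counter(), Counter()
    for seq in sequences:
        seq = tuple(seq)
        hit = next((i for i in range(len(seq) - n + 1)
                    if seq[i:i + n] == pattern), None)
        if hit is None:
            continue  # sequence does not contain the focal pattern
        # Columns: context before (t<0), the pattern (t=0), context after (t>0).
        column = {0: pattern}
        for k in range(1, window + 1):
            if hit - k >= 0:
                column[-k] = seq[hit - k]
            if hit + n - 1 + k < len(seq):
                column[k] = seq[hit + n - 1 + k]
        for t, label in column.items():
            nodes[(t, label)] += 1
        for t in sorted(column):
            if t + 1 in column:  # link only between adjacent columns
                links[(t, column[t], t + 1, column[t + 1])] += 1
    return nodes, links

# Toy stage sequences; the last one lacks the pattern and is skipped.
sequences = [["A", "B", "C", "D"], ["E", "B", "C", "D"], ["B", "C"], ["A", "D"]]
nodes, links = dock(sequences, ["B", "C"], window=1)
```

Here `nodes` gives the raw counts behind node heights per column, and `links` gives the flow volumes between adjacent columns; normalization and within-column ordering would be applied afterwards.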

Link and node formulas (random forest context (Fitzpatrick et al., 2017)):

$w_{i,j}^{(r)} = \sum_{t=1}^T f_{i,j}^{(t),r}$

$s_i^{(r)} = \sum_{t=1}^T g_{i,r}^{(t)}$

with $f_{i,j}^{(t),r}$ and $g_{i,r}^{(t)}$ being the local counts, within tree $t$ of the $T$ trees in the forest, of adjacent covariate transitions and per-rank appearances, respectively.

For ribbon internals and multivariate encoding, pattern-based filling is formally expressed as:

A 2D lattice $L = \{ a, b, \theta, \alpha \}$ determining primitive arrangement, where $a$ and $b$ control spacing, $\theta$ is the internal lattice angle, and $\alpha$ is the overall ribbon orientation (He et al., 4 Aug 2025).
Group assignments $G$ for composite or subflow-encoded ribbons, with ratios $r_i$ and retinal variables (color, shape, size) per group.
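
One way to read the lattice and group parameters is sketched below. The geometric interpretation is an assumption, not taken from the cited work: basis vectors $(a, 0)$ and $(b\cos\theta, b\sin\theta)$ are rotated as a whole by $\alpha$, and groups are assigned in contiguous (clustered) blocks in proportion to the ratios $r_i$. The function names `lattice_points` and `assign_groups` are hypothetical.

```python
import math

def lattice_points(a, b, theta_deg, alpha_deg, rows, cols):
    """Return primitive centers on a 2D lattice rotated by alpha."""
    theta = math.radians(theta_deg)
    alpha = math.radians(alpha_deg)
    points = []
    for m in range(rows):
        for n in range(cols):
            # Unrotated lattice: basis vectors (a, 0) and (b cos θ, b sin θ).
            x = n * a + m * b * math.cos(theta)
            y = m * b * math.sin(theta)
            # Rotate the whole lattice by the ribbon orientation α.
            xr = x * math.cos(alpha) - y * math.sin(alpha)
            yr = x * math.sin(alpha) + y * math.cos(alpha)
            points.append((xr, yr))
    return points

def assign_groups(n_points, ratios):
    """Split n_points primitives into contiguous (clustered) groups by ratio."""
    total = sum(ratios)
    labels = []
    for g, r in enumerate(ratios):
        labels += [g] * round(n_points * r / total)
    # Rounding may leave the list short or long; trim, then pad with the last group.
    labels = labels[:n_points] + [len(ratios) - 1] * (n_points - len(labels))
    return labels

points = lattice_points(a=8, b=8, theta_deg=90, alpha_deg=0, rows=2, cols=3)
labels = assign_groups(len(points), ratios=[2, 1])
```

Each label would then select the retinal variables (color, shape, size) for the primitive drawn at the corresponding point; interleaved or randomized assignment styles would permute `labels` instead of keeping them in blocks.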

4. Visual Encoding, Pattern Theory, and Best Design Practices

Recent formalizations of pattern as a visual variable extend the expressive range of pattern-centric Sankeys (He et al., 4 Aug 2025). In this framework:

Spatial arrangement of primitives, group assignment style (clustered, interleaved, randomized), and retinal variables (hue, shape, size, orientation) determine the legibility, discriminability, and perceptual salience of multivariate flow ribbons.
Category, flow quantity, and auxiliary attributes (such as uncertainty) are mapped to pattern parameters (ribbon width, primitive density, group composition, hue and shape assignments).
Design guidelines call for a visible minimum primitive size, controlled spacing ($a \in [4,12]$ px), and category separation by hue difference ($\Delta H \geq 30^\circ$).

Patterns should encode no more than two variable dimensions at once, and aligning the lattice orientation with the flow direction optimizes preattentive association of pattern with semantic meaning.
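
The numeric guidelines above lend themselves to a simple automated check. The sketch below is hypothetical (the function names and the message wording are invented for illustration) and covers only the thresholds stated in the text: spacing in $[4,12]$ px, pairwise hue difference of at least $30^\circ$, and at most two overlaid variables. The minimum primitive size is omitted because no numeric value is given.

```python
def hue_distance(h1, h2):
    """Smallest angular difference between two hues, in degrees."""
    d = abs(h1 - h2) % 360
    return min(d, 360 - d)

def check_pattern_design(spacing_px, hues_deg, n_encoded_variables):
    """Return a list of guideline violations (empty if none)."""
    problems = []
    if not 4 <= spacing_px <= 12:
        problems.append("spacing outside [4, 12] px")
    for i in range(len(hues_deg)):
        for j in range(i + 1, len(hues_deg)):
            if hue_distance(hues_deg[i], hues_deg[j]) < 30:
                problems.append(
                    f"hues {hues_deg[i]} and {hues_deg[j]} closer than 30 degrees"
                )
    if n_encoded_variables > 2:
        problems.append("more than two variables overlaid in one pattern")
    return problems

ok = check_pattern_design(8, [0, 120, 240], 2)
bad = check_pattern_design(3, [350, 10], 3)
```

Hue is treated as circular, so 350° and 10° are 20° apart and fail the separation check.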

Practical implementations provide for:

Normalization options for node and link sizing.
Interactive inspection of node and edge metrics (e.g., raw count, positivity, $d(p)$).
Color mappings for semantic indicators (e.g., green for positive, orange for negative progression).
Flexible filtering and windowing to enable exploration at both coarse and fine contextual resolutions.

5. Applications and Case Studies

Pattern-centric Sankeys have demonstrated significant utility in three research areas:

Random Forest Interpretation (Fitzpatrick et al., 2017)

Revealing all root-to-leaf split-sequence frequencies as aggregated multi-layer flows.
Disclosing the most common covariate interaction hierarchies, supporting transparent model auditing.
Software implementation in R ("forestviews") automates extraction, aggregation, and visualization of pattern-centric Sankeys.

Behavioral Pattern Analysis in Social Media (Sheng et al., 3 Dec 2025)

Allowing experts to "dock" the visualization on a focus pattern (e.g., a risky or recovery stage sequence) and immediately inspect the most common pre- and post-contexts.
Exposing subtle differences in trajectory polarity (computed via $w(p)$ and $d(p)$) for cohorts moving toward mental health improvement or decline.
Qualitatively, domain experts have reported improved ability to flag risky motifs and prioritize patterns for deeper investigation.

Composite Visual Variable Encoding (He et al., 4 Aug 2025)

Utilizing pattern-centric Sankey ribbons as a vehicle for high-density, multivariate categorical data visualization.
Algorithmic and perceptual guidelines enable encoding of uncertainty, subflow structure, and secondary measures within flow diagrams.

6. Comparative Analysis and Limitations

Pattern-centric Sankey diagrams differ fundamentally from standard Sankey diagrams in the following respects:

Alignment is by occurrence or structure of user- or analyst-chosen patterns, not by fixed bins or categories.
Variable-length and irregular context windows can be naturally represented, providing greater analytic flexibility.
Multivariate attributes may be carried visually within ribbon patterns, not only via width, color, or endpoint categorization.

Standard Sankeys remain appropriate when global transition rates between fixed classes or intervals are required. Pattern-centric variants, however, substantially enhance the exploratory power when context, motif embedding, or compositional uncertainty are the analytic focus.

A plausible implication is that the steep increase in multivariate encoding afforded by pattern-centric design may require careful tuning and perceptual validation to avoid exceeding users' capacity to discriminate between patterns, particularly when multiple composite variables and window sizes are presented simultaneously (He et al., 4 Aug 2025).

7. Future Directions and Best Practice Recommendations

Current research suggests several promising directions and best practice lessons:

Consistent use of color semantics (e.g., green–orange for polarity) and visual ordering across all coordinated views is recommended to reduce cognitive overhead (Sheng et al., 3 Dec 2025).
Exposing high-level pattern metrics (e.g., $w(p)$, $d(p)$) alongside raw frequencies supports expert trust and allows verification.
Flexible windowing, top-$k$ filtering, and interactive docking empower diverse analytic workflows.
Aggregation on arbitrary subsequences, and use of pattern-centric axes, may generalize to other domains involving variable-length event, topic, or structure motifs.
Pattern-theoretic specification of ribbon internals should be further validated in relation to perceptual limits and cross-modal interpretation (He et al., 4 Aug 2025).

The pattern-centric Sankey paradigm establishes a mathematical, algorithmic, and design foundation for exploratory and explanatory visualization of complex sequential, hierarchical, and compositional data, expanding the toolset for high-dimensional visual analytics.


References (3)

A network flow approach to visualising the roles of covariates in random forests (2017)

EMINDS: Understanding User Behavior Progression for Mental Health Exploration on Social Media (2025)

Reframing Pattern: A Comprehensive Approach to a Composite Visual Variable (2025)
