
Pattern-Centric Sankey Diagram

Updated 11 December 2025
  • A pattern-centric Sankey diagram is a visualization technique that anchors flows on recurring patterns, enabling detailed analysis of sequential and hierarchical data.
  • It employs systematic extraction, pattern mining, and aggregation methods to uncover structural insights in random forests and behavioral sequences.
  • The design integrates multivariate encoding to visually represent flow frequency, uncertainty, and transition dynamics for enhanced analytic clarity.

A pattern-centric Sankey diagram is a class of layered flow visualizations in which the constituent ribbons or flows are structured with respect to discrete or frequent "patterns" rather than regular time bins or pre-defined categories. Pattern-centric Sankeys have been developed as a means to reveal, in one compact figure, the precise structural organization or behavioral context of complex sequential or hierarchical data—such as random forest path structures, mined user behavior patterns, or flows with compositional uncertainty—focusing user attention on subsequences or motifs of analytic interest and their immediate context (Fitzpatrick et al., 2017, Sheng et al., 3 Dec 2025, He et al., 4 Aug 2025). Instead of restricting flows to uniform progressions, these diagrams aggregate and align flows on occurrences of a chosen subsequence ("pattern"), or encode multivariate attributes by richly structured patterns within each ribbon section.

1. Formal Definitions and Motivating Use Cases

A pattern-centric Sankey diagram generalizes the standard Sankey diagram by changing the basis of alignment from fixed categorical or temporal axes to pattern-anchored or pattern-encoded axes. In the context of random forests, pattern-centric Sankeys summarize the entire spectrum of root-to-leaf split-sequences (“patterns”) as a single layered graph, capturing frequencies and hierarchies of covariate interactions beyond aggregate measures (Fitzpatrick et al., 2017). In behavioral sequence mining, such as in online mental health communities, the pattern-centric Sankey is "docked" at a specific behavioral stage pattern, aligning all sampled sequences relative to that motif to reveal the pre- and post-pattern context and thereby directly support context-focused analysis requests (Sheng et al., 3 Dec 2025).

The need for a pattern-centric approach arises when:

  • Key analytic questions concern the context of certain event patterns rather than overall group-level progression.
  • The data exhibits rich hierarchical or compositional structure not reducible to time or simple category sequences.
  • Experts require immediate visualization of the most common antecedents and consequences surrounding a subsequence or interaction motif.

2. Data Extraction, Pattern Mining, and Aggregation Methods

The construction of a pattern-centric Sankey begins with extracting relevant compositional or sequential patterns from the underlying data. The steps differ by domain.

For random forests (Fitzpatrick et al., 2017):

  • Enumeration of all root-to-leaf paths per tree, encoded as ordered sequences of covariate splits.
  • Aggregation of identical path-patterns across the forest via counts, yielding link weights $w_{i,j}^{(r)}$ for transitions from $x_i$ at depth $r$ to $x_j$ at depth $r+1$, and node sizes $s_i^{(r)}$ for each variable per rank.
  • Construction of the global network model as a layered directed graph, where each layer, indexed by rank $r$, corresponds to a split depth.

For behavioral sequences (Sheng et al., 3 Dec 2025):

  • Raw event streams (e.g., two-week windows of Reddit posts) are converted to event vectors via behavioral and mental health assessments, quantized by clustering.
  • Event sequences per user are segmented into "behavior stages" using Greedy Gaussian Segmentation, with feature vectors further clustered into stage types.
  • All stage-label sequences are searched for frequent patterns (subsequences), which are then used as focal points for Sankey alignment.
  • The contextual impact of any given pattern $p$ is computed via a pre–post comparison metric $d(p)$, which draws on the pattern's frequencies in recovery and deterioration groups.
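The random forest aggregation step can be illustrated with a minimal Python sketch. It assumes each tree has already been reduced to a list of root-to-leaf paths, each path an ordered list of covariate names; the function name `aggregate_paths` and the toy forest are hypothetical. It also assumes counts are taken once per path, so a split shared by several leaves is counted once for each leaf below it. The counting convention of the original method may differ.

```python
from collections import Counter

def aggregate_paths(forest_paths):
    """Aggregate root-to-leaf covariate paths into Sankey weights.

    forest_paths: list of trees; each tree is a list of paths; each path is
    an ordered list of covariate names (hypothetical input format).
    Returns (link_weights, node_sizes), keyed by rank r (0 = root).
    Assumption: every path contributes one count per split it traverses.
    """
    link_weights = Counter()  # (r, x_i, x_j) -> w_{i,j}^{(r)}
    node_sizes = Counter()    # (r, x_i)      -> s_i^{(r)}
    for tree_paths in forest_paths:
        for path in tree_paths:
            for r, covariate in enumerate(path):
                node_sizes[(r, covariate)] += 1
                if r + 1 < len(path):
                    link_weights[(r, covariate, path[r + 1])] += 1
    return link_weights, node_sizes

# Toy forest with two trees (made-up covariate names).
toy_forest = [
    [["age", "income"], ["age", "region", "income"]],
    [["age", "income"], ["income"]],
]
w, s = aggregate_paths(toy_forest)
print(w[(0, "age", "income")], s[(0, "age")])
```

The resulting dictionaries map directly onto a layered graph: each key `(r, x_i)` is a node in layer `r`, and each key `(r, x_i, x_j)` is a link from layer `r` to layer `r + 1`.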

General table of pattern-centric Sankey pipeline approaches:

| Domain | Extraction unit | Aggregation | Alignment/focus |
| Random forest | Root-to-leaf paths | Covariate sequences | By path-pattern/rank |
| User behavior (EMINDS) | Behavior stages | Stage patterns | Docked on motif $p$ |
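The frequent-pattern search over stage-label sequences described in this section could be sketched as follows. This is not the EMINDS mining procedure, whose details are not given here. The sketch assumes that patterns are contiguous subsequences and that support is the number of sequences containing the pattern at least once; `frequent_patterns`, its parameters, and the stage labels are all hypothetical.

```python
from collections import Counter

def frequent_patterns(sequences, min_len=2, max_len=3, min_support=2):
    """Return contiguous subsequences that occur in >= min_support sequences.

    Assumptions (not from the source): patterns are contiguous n-grams, and
    each sequence contributes at most one count per pattern.
    """
    support = Counter()
    for seq in sequences:
        seen = set()  # patterns found in this sequence, deduplicated
        for n in range(min_len, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(tuple(seq[i:i + n]))
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}

# Made-up stage-label sequences, one per user.
stage_sequences = [
    ["A", "B", "C", "A"],
    ["B", "C", "D"],
    ["A", "B", "C"],
]
patterns = frequent_patterns(stage_sequences)
print(patterns)
```

Any pattern returned here is a candidate focal motif $p$ on which the Sankey can then be aligned.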

3. Diagram Construction and Mathematical Encoding

The core of a pattern-centric Sankey is a layered directed network $G = (V, E)$, where:

  • Each node $v_{i,r}$ represents covariate $x_i$ at position (rank) $r$ in a pattern (random forest), or a stage at relative offset $t$ from a docked pattern (sequence mining).
  • Each link $(v_{i,r} \to v_{j,r+1})$ encodes the frequency (flow volume) of transitions between nodes in adjacent layers, with width proportional to $w_{i,j}^{(r)}$.
  • Layers correspond to split depths, relative windows around a pattern, or user-defined compositional axes.

For context-focused pattern Sankeys (Sheng et al., 3 Dec 2025), the layout is as follows:

  • Sequences containing the focal pattern $p$ are realigned such that $p$ appears at $t = 0$ (central column).
  • Nodes to the left ($t < 0$) and right ($t > 0$) display stage frequencies pre- and post-pattern across sequences, with flows connecting adjacent positions.
  • Node heights are proportional to normalized or raw occurrence counts, and node ordering within columns sorts by an auxiliary metric (e.g., positivity).
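The docking layout above can be sketched in Python as a realignment followed by counting. The sketch makes several assumptions that are not stated in the source: each sequence is aligned on the first occurrence of the pattern, the pattern is collapsed into a single node at $t = 0$, and only a fixed window of stages on each side is kept. The names `find_pattern`, `dock`, and `window` are hypothetical.

```python
from collections import Counter

def find_pattern(seq, pattern):
    """Index of the first contiguous occurrence of pattern in seq, or None."""
    n = len(pattern)
    for i in range(len(seq) - n + 1):
        if list(seq[i:i + n]) == list(pattern):
            return i
    return None

def dock(sequences, pattern, window=2):
    """Align sequences on a focal pattern and count nodes and flows.

    Returns (nodes, links): nodes[(t, label)] is the number of sequences
    showing `label` at relative offset t; links[(t, src, dst)] is the flow
    from column t to column t + 1. Sequences without the pattern are skipped.
    """
    nodes, links = Counter(), Counter()
    focal = "+".join(pattern)  # the whole pattern becomes one node at t = 0
    for seq in sequences:
        start = find_pattern(seq, pattern)
        if start is None:
            continue
        end = start + len(pattern)
        before = seq[max(0, start - window):start]
        after = seq[end:end + window]
        column = {0: focal}
        for k, label in enumerate(reversed(before), start=1):
            column[-k] = label  # t = -1 is the stage just before the pattern
        for k, label in enumerate(after, start=1):
            column[k] = label   # t = +1 is the stage just after the pattern
        for t, label in column.items():
            nodes[(t, label)] += 1
        for t in sorted(column):
            if t + 1 in column:
                links[(t, column[t], column[t + 1])] += 1
    return nodes, links

# Made-up stage sequences docked on the pattern A, B.
seqs = [["X", "A", "B", "Y"], ["Z", "A", "B", "Y"], ["A", "B", "W"], ["Q", "R"]]
nodes, links = dock(seqs, ["A", "B"], window=1)
print(nodes[(0, "A+B")], links[(0, "A+B", "Y")])
```

Node heights would then be drawn from `nodes` (raw or normalized), and ribbon widths from `links`.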

Link and node formulas in the random forest context (Fitzpatrick et al., 2017):

$$w_{i,j}^{(r)} = \sum_{t=1}^{T} f_{i,j}^{(t),r}$$

$$s_i^{(r)} = \sum_{t=1}^{T} g_{i,r}^{(t)}$$

with $f$ and $g$ being local counts of adjacent covariate transitions and per-rank appearances, respectively.

For ribbon internals and multivariate encoding, pattern-based filling is formally expressed as:

  • A 2D lattice $L = \{a, b, \theta, \alpha\}$ determining primitive arrangement, where $a$ and $b$ control spacing, $\theta$ is the internal lattice angle, and $\alpha$ is the overall ribbon orientation (He et al., 4 Aug 2025).
  • Group assignments $G$ for composite or subflow-encoded ribbons, with ratios $r_i$ and retinal variables (color, shape, size) per group.
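One way to read the lattice parameters is as two basis vectors that are then rotated as a whole. The sketch below assumes a first basis vector of length $a$ along the x-axis, a second of length $b$ at angle $\theta$ to the first, and a rotation of the whole lattice by $\alpha$. It also shows one possible "clustered" group assignment by ratios $r_i$. These interpretations and the names `lattice_points` and `assign_groups` are assumptions for illustration, not the specification in He et al.

```python
import math

def lattice_points(a, b, theta_deg, alpha_deg, n_rows, n_cols):
    """Primitive positions on a 2D lattice (assumed parameterization).

    a, b: spacing along the two lattice directions.
    theta_deg: angle between the two lattice directions.
    alpha_deg: rotation applied to the whole lattice (ribbon orientation).
    """
    theta, alpha = math.radians(theta_deg), math.radians(alpha_deg)
    ux, uy = a, 0.0                                      # first basis vector
    vx, vy = b * math.cos(theta), b * math.sin(theta)    # second basis vector
    cos_a, sin_a = math.cos(alpha), math.sin(alpha)
    points = []
    for i in range(n_rows):
        for j in range(n_cols):
            x, y = j * ux + i * vx, j * uy + i * vy
            points.append((x * cos_a - y * sin_a, x * sin_a + y * cos_a))
    return points

def assign_groups(n_points, ratios):
    """Clustered assignment: consecutive primitives share a group, with
    group sizes approximately proportional to the given ratios."""
    total = sum(ratios)
    bounds, acc = [], 0.0
    for r in ratios:
        acc += r / total
        bounds.append(acc)
    groups = []
    for k in range(n_points):
        pos = (k + 0.5) / n_points  # midpoint of the k-th slot in [0, 1]
        g = next((i for i, b in enumerate(bounds) if pos <= b), len(ratios) - 1)
        groups.append(g)
    return groups

points = lattice_points(a=8, b=8, theta_deg=90, alpha_deg=0, n_rows=2, n_cols=2)
groups = assign_groups(len(points), [0.75, 0.25])
print(points, groups)
```

Each group index can then be mapped to its own color, shape, or size when the primitives are drawn inside the ribbon.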

4. Visual Encoding, Pattern Theory, and Best Design Practices

Recent formalizations of pattern as a visual variable extend the expressive range of pattern-centric Sankeys (He et al., 4 Aug 2025). In this framework:

  • Spatial arrangement of primitives, group assignment style (clustered, interleaved, randomized), and retinal variables (hue, shape, size, orientation) determine the legibility, discriminability, and perceptual salience of multivariate flow ribbons.
  • Category, flow quantity, and auxiliary attributes (such as uncertainty) are mapped to pattern parameters (ribbon width, primitive density, group composition, hue and shape assignments).
  • Design guidelines mandate a visible minimum primitive size, controlled spacing ($a \in [4, 12]$ px), and category separation by hue difference ($\Delta H \geq 30^\circ$).
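The numeric guidelines lend themselves to an automated check. The sketch below assumes hue is measured in degrees on a circular scale, so that 350° and 0° are 10° apart, and that every pair of category hues must satisfy the threshold. The minimum primitive size is omitted because no value is given. The names `hue_distance` and `check_design` are hypothetical.

```python
def hue_distance(h1, h2):
    """Smallest angular difference between two hues, in degrees."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)

def check_design(spacing_px, hues_deg, spacing_range=(4.0, 12.0), min_hue_gap=30.0):
    """Return a list of guideline violations (empty if none).

    spacing_px: lattice spacing a, in pixels.
    hues_deg: one hue per category, in degrees.
    Assumption: the hue threshold applies to every pair of categories.
    """
    issues = []
    lo, hi = spacing_range
    if not lo <= spacing_px <= hi:
        issues.append(f"spacing {spacing_px} px outside [{lo}, {hi}] px")
    for i in range(len(hues_deg)):
        for j in range(i + 1, len(hues_deg)):
            gap = hue_distance(hues_deg[i], hues_deg[j])
            if gap < min_hue_gap:
                issues.append(
                    f"hues {hues_deg[i]} and {hues_deg[j]} differ by only {gap} degrees"
                )
    return issues

ok = check_design(8, [0, 120, 240])
bad = check_design(2, [0, 40, 350])
print(ok, bad)
```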

Patterns should not encode more than two variable dimensions at once. Aligning the lattice orientation with the flow direction helps viewers preattentively associate each pattern with its semantic meaning.

Practical implementations provide for:

  • Normalization options for node and link sizing.
  • Interactive inspection of node and edge metrics (e.g., raw count, positivity, d(p)).
  • Color mappings for key semantic indicators (e.g., green for positive, orange for negative progression).
  • Flexible filtering and windowing to enable exploration at both coarse and fine contextual resolutions.
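Two of these options, normalization and filtering, can be sketched on top of a dictionary of node counts keyed by `(offset, label)`. That data layout, the per-column normalization, and the function names are assumptions made for illustration; an actual implementation may normalize or filter differently.

```python
from collections import defaultdict

def normalize_by_column(nodes):
    """Convert raw node counts to shares of their column total."""
    totals = defaultdict(int)
    for (t, _label), count in nodes.items():
        totals[t] += count
    return {(t, label): count / totals[t] for (t, label), count in nodes.items()}

def top_k_per_column(nodes, k):
    """Keep the k most frequent nodes in each column (ties broken by label)."""
    by_column = defaultdict(list)
    for (t, label), count in nodes.items():
        by_column[t].append((count, label))
    kept = {}
    for t, items in by_column.items():
        for count, label in sorted(items, key=lambda x: (-x[0], x[1]))[:k]:
            kept[(t, label)] = count
    return kept

# Made-up node counts: two stages before the pattern, the pattern itself at 0.
raw_nodes = {(-1, "X"): 3, (-1, "Z"): 1, (0, "P"): 4}
shares = normalize_by_column(raw_nodes)
top1 = top_k_per_column(raw_nodes, k=1)
print(shares, top1)
```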

5. Applications and Case Studies

Pattern-centric Sankeys have demonstrated utility in three research areas.

Random forest interpretation (Fitzpatrick et al., 2017):

  • Revealing all root-to-leaf split-sequence frequencies as aggregated multi-layer flows.
  • Disclosing the most common covariate interaction hierarchies, supporting transparent model auditing.
  • Software implementation in R ("forestviews") automates extraction, aggregation, and visualization of pattern-centric Sankeys.

Behavioral sequence analysis in online mental health communities (Sheng et al., 3 Dec 2025):

  • Allowing experts to "dock" the visualization on a focus pattern (e.g., a risky or recovery stage sequence) and immediately inspect the most common pre- and post-contexts.
  • Exposing subtle differences in trajectory polarity (computed via $w(p)$ and $d(p)$) for cohorts moving toward mental health improvement or decline.
  • Qualitatively, domain experts have reported improved ability to flag risky motifs and prioritize patterns for deeper investigation.

Multivariate pattern encoding (He et al., 4 Aug 2025):

  • Utilizing pattern-centric Sankey ribbons as a vehicle for high-density, multivariate categorical data visualization.
  • Algorithmic and perceptual guidelines enable encoding of uncertainty, subflow structure, and secondary measures within flow diagrams.

6. Comparative Analysis and Limitations

Pattern-centric Sankey diagrams differ fundamentally from standard Sankey diagrams in the following respects:

  • Alignment is by occurrence or structure of user- or analyst-chosen patterns, not by fixed bins or categories.
  • Variable-length and irregular context windows can be naturally represented, providing greater analytic flexibility.
  • Multivariate attributes may be carried visually within ribbon patterns, not only via width, color, or endpoint categorization.

Standard Sankeys remain appropriate when global transition rates between fixed classes or intervals are required. Pattern-centric variants, however, substantially enhance the exploratory power when context, motif embedding, or compositional uncertainty are the analytic focus.

A plausible implication is that the steep increase in multivariate encoding afforded by pattern-centric design may require careful tuning and perceptual validation to avoid overloading users' discriminatory capabilities, particularly when multiple composite variables and window sizes are presented simultaneously (He et al., 4 Aug 2025).

7. Future Directions and Best Practice Recommendations

Current research suggests several promising directions and best practice lessons:

  • Consistent use of color semantics (e.g., green–orange for polarity) and visual ordering across all coordinated views is recommended to reduce cognitive overhead (Sheng et al., 3 Dec 2025).
  • Exposing high-level pattern metrics (e.g., $w(p)$, $d(p)$) alongside raw frequencies supports expert trust and allows verification.
  • Flexible windowing, top-$k$ filtering, and interactive docking empower diverse analytic workflows.
  • Aggregation on arbitrary subsequences, and use of pattern-centric axes, may generalize to other domains involving variable-length event, topic, or structure motifs.
  • Pattern-theoretic specification of ribbon internals should be further validated in relation to perceptual limits and cross-modal interpretation (He et al., 4 Aug 2025).

The pattern-centric Sankey paradigm establishes a mathematical, algorithmic, and design foundation for exploratory and explanatory visualization of complex sequential, hierarchical, and compositional data, expanding the toolset for high-dimensional visual analytics.
