Persistence Silhouettes in TDA
- Persistence Silhouettes are functional summaries of persistence diagrams, constructed as weighted averages of piecewise-linear tent functions that capture topological feature lifetimes.
- They offer a highly regularized summary with provable stability guarantees and Lipschitz continuity, enabling robust statistical inference and hypothesis testing.
- They are efficiently computed and seamlessly integrated into machine learning pipelines, supporting applications in classification, graph analysis, and functional data analysis.
A persistence silhouette is a functional summary of a persistence diagram—a central construct in topological data analysis (TDA) that encodes the birth and death of topological features as a multiset of points in the plane. The silhouette transforms this diagram into a single real-valued, piecewise-linear function using a weighted average of tent functions associated with each persistence pair, controlled by an explicit weighting parameter. This construction offers a highly regularized summary of topological information, with provable statistical properties and stability guarantees, and serves as a foundational tool for statistical inference and machine learning pipelines in TDA (Chazal et al., 2013, Berry et al., 2018, Segovia-Dominguez et al., 2024).
1. Formal Definition and Construction
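Given a persistence diagram $D = \{(b_j, d_j)\}_{j=1}^{m}$ comprising $m$ points with birth times $b_j$ and death times $d_j$, the tent (triangle) function associated to $(b_j, d_j)$ is
$$\Lambda_j(t) = \max\{0,\ \min(t - b_j,\ d_j - t)\}.$$
Each $\Lambda_j$ is continuous, piecewise-linear, and $1$-Lipschitz with respect to $t$.
A nonnegative weight function $w$, commonly of the form $w_j = |d_j - b_j|^p$ for $p > 0$, is assigned to each persistence pair. The general weighted silhouette is defined as
$$\phi_w(t) = \frac{\sum_{j=1}^{m} w_j\, \Lambda_j(t)}{\sum_{j=1}^{m} w_j}.$$
The most frequent choice is the power-weighted $p$-silhouette:
$$\phi^{(p)}(t) = \frac{\sum_{j=1}^{m} |d_j - b_j|^p\, \Lambda_j(t)}{\sum_{j=1}^{m} |d_j - b_j|^p}.$$
This function is always $1$-Lipschitz in $t$, and the weights allow continuous interpolation between emphasizing all features equally ($p$ small) and focusing on the most persistent ($p$ large) (Chazal et al., 2013, Berry et al., 2018, Segovia-Dominguez et al., 2024).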
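As a small worked example (the diagram below is made up purely for illustration), take the tent of a pair $(b, d)$ to be $\max\{0, \min(t - b, d - t)\}$ and weight each pair by its lifetime, i.e. $p = 1$. For $D = \{(0, 4), (1, 2)\}$ evaluated at $t = 1.5$:
$$\Lambda_1(1.5) = \min(1.5 - 0,\ 4 - 1.5) = 1.5, \qquad \Lambda_2(1.5) = \min(1.5 - 1,\ 2 - 1.5) = 0.5,$$
$$\phi^{(1)}(1.5) = \frac{4 \cdot 1.5 + 1 \cdot 0.5}{4 + 1} = \frac{6.5}{5} = 1.3.$$
The unweighted mean of the two tents at the same point would be $(1.5 + 0.5)/2 = 1.0$, so the lifetime weighting pulls the value toward the more persistent feature.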
2. The Silhouette in the Landscape-Unification Framework
The silhouette belongs to the class of functional summaries $F \colon \mathcal{D} \to \mathcal{F}$, mapping the space of diagrams $\mathcal{D}$ into a Banach space $\mathcal{F}$ of real functions (Berry et al., 2018). Unlike the persistence landscape, which forms a sequence $\lambda_k$, $k = 1, 2, \dots$, where $\lambda_k(t)$ is the $k$th-largest tent value at $t$, the silhouette is a single function, which makes it the minimal-dimensional continuous functional summary of a diagram. This property enables direct application of functional data analysis and machine learning techniques, including averaging, hypothesis testing, and classification in the space of functions.
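The weight exponent $p$ gives fine control over the information retained:
- As $p \to 0$, the $p$-silhouette $\phi^{(p)}$ approaches the unweighted mean of the tent functions $\Lambda_j$.
- As $p \to \infty$, $\phi^{(p)}$ converges to the tent associated to the persistence pair with the longest lifetime (assuming that pair is unique).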
This construction provides a consistent interface for statistical learning, clustering, and permutation testing on diagrams via their functional image (Berry et al., 2018).
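The two limiting regimes of the weight exponent can be read off by rescaling the weights. Writing $\ell_j = d_j - b_j > 0$ for the lifetime of the $j$th pair and $\ell_{\max} = \max_j \ell_j$ (notation introduced here for convenience), dividing numerator and denominator by $\ell_{\max}^p$ gives
$$\phi^{(p)}(t) = \frac{\sum_{j=1}^{m} (\ell_j / \ell_{\max})^p\, \Lambda_j(t)}{\sum_{j=1}^{m} (\ell_j / \ell_{\max})^p}.$$
As $p \to 0$, every ratio $(\ell_j / \ell_{\max})^p$ tends to $1$, so $\phi^{(p)}(t) \to \frac{1}{m} \sum_{j} \Lambda_j(t)$. As $p \to \infty$, the ratio tends to $0$ unless $\ell_j = \ell_{\max}$, so $\phi^{(p)}(t)$ tends to the average of the tents attaining the maximal lifetime, which is a single tent when the maximizer is unique.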
3. Statistical Properties and Stochastic Convergence
The silhouette enjoys rigorous stochastic-process theory under mild conditions:
- Uniform boundedness: the silhouette $\phi(t)$ is uniformly bounded over $t$ whenever lifetimes are bounded, since each tent is bounded by half the corresponding lifetime (Berry et al., 2018).
- Lipschitz equicontinuity: Since each tent $\Lambda_j$ is $1$-Lipschitz and a convex combination of $1$-Lipschitz functions is again $1$-Lipschitz, the family of silhouettes is equicontinuous. This ensures strong uniform laws of large numbers and consistency:
$$\sup_{t} \left| \bar{\phi}_n(t) - \mu(t) \right| \to 0 \quad \text{almost surely},$$
where $\bar{\phi}_n = \frac{1}{n} \sum_{i=1}^{n} \phi_i$ is the sample mean silhouette of $n$ i.i.d. diagrams and $\mu(t) = \mathbb{E}[\phi(t)]$ is the population mean.
- Central limit theorem (CLT): The empirical process
$$\mathbb{G}_n(t) = \sqrt{n}\, \bigl( \bar{\phi}_n(t) - \mu(t) \bigr)$$
converges weakly to a mean-zero Gaussian process indexed by $t$, with explicit rates of convergence (Chazal et al., 2013).
- Bootstrap consistency: Bootstrap samples of $\bar{\phi}_n$ yield valid $(1 - \alpha)$-confidence bands for $\mu$, with a coverage error that vanishes asymptotically at an explicit rate (Chazal et al., 2013, Berry et al., 2018). Both uniform and studentized (variable-width) confidence bands may be constructed.
These results allow direct application of permutation tests and prediction regions in the function space, using classical metrics such as $L^p$ and uniform ($L^\infty$) distances (Berry et al., 2018).
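A uniform confidence band of the kind described above can be sketched with a Gaussian multiplier bootstrap, one standard way to build such bands. This is a minimal illustration that assumes the silhouettes have already been evaluated on a shared grid; the function name `bootstrap_band` and its parameters are hypothetical, and whether it matches the cited construction in every detail is an assumption.

```python
import numpy as np


def bootstrap_band(silhouettes, alpha=0.05, n_boot=1000, seed=0):
    """Uniform confidence band for the mean silhouette on a grid.

    silhouettes: array of shape (n, N); row i is silhouette i on the grid.
    Returns (mean, lower, upper), each of shape (N,).
    """
    rng = np.random.default_rng(seed)
    phi = np.asarray(silhouettes, dtype=float)
    n = phi.shape[0]
    mean = phi.mean(axis=0)
    centered = phi - mean

    # Multiplier bootstrap: perturb the centered curves with N(0, 1)
    # multipliers and record the sup-norm of the rescaled sum per replicate.
    xi = rng.standard_normal((n_boot, n))
    sup_stats = np.abs(xi @ centered).max(axis=1) / np.sqrt(n)

    # The (1 - alpha) quantile, divided by sqrt(n), is the band half-width.
    half_width = np.quantile(sup_stats, 1.0 - alpha) / np.sqrt(n)
    return mean, mean - half_width, mean + half_width
```

The band has constant width across the grid; a studentized variant would additionally divide the centered curves by a pointwise standard-deviation estimate before taking the supremum.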
4. Stability and Robustness
The silhouette inherits strong stability properties from persistence landscapes:
- Lipschitz stability: For diagrams $D_1, D_2$ with silhouettes $\phi_{D_1}, \phi_{D_2}$, the uniform distance between their silhouettes is bounded by the bottleneck distance:
$$\left\| \phi_{D_1} - \phi_{D_2} \right\|_\infty \le d_B(D_1, D_2),$$
where $d_B$ denotes the standard bottleneck distance (Chazal et al., 2013, Segovia-Dominguez et al., 2024). This property implies robustness to noise and small perturbations in the data.
- All stability results proved for landscapes (in particular stability with respect to the $q$-Wasserstein distance $W_q$) carry over directly to silhouettes (Chazal et al., 2013, Segovia-Dominguez et al., 2024).
- EMP framework extension: In the Effective Multidimensional Persistence (EMP) extension, one computes families of silhouettes across slices of multidimensional parameter grids. The EMP silhouette inherits the single-parameter silhouette’s stability: the sum of uniform deviations across slices is bounded above by the corresponding sum of Wasserstein distances between diagrams (Segovia-Dominguez et al., 2024).
5. Algorithmic Implementation
The silhouette can be computed as follows:
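- Input: Persistence diagram $D = \{(b_j, d_j)\}_{j=1}^{m}$, weight function $w$, evaluation grid $t_1, \dots, t_N$.
- Feature computation: For each $j$, compute the lifetime $\ell_j = d_j - b_j$ and set $w_j = w(\ell_j)$, for example $w_j = \ell_j^p$.
- Tent evaluation: For each grid point $t_i$ and each $j$, compute $\Lambda_j(t_i) = \max\{0, \min(t_i - b_j, d_j - t_i)\}$.
- Weighted sum: For each $t_i$, form the numerator $A_i = \sum_j w_j \Lambda_j(t_i)$ and the denominator $W = \sum_j w_j$, then output $\phi(t_i) = A_i / W$.
- Complexity: The algorithm requires $O(mN)$ flops for $m$ features and $N$ grid points (Berry et al., 2018).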
In the EMP framework, this computation is repeated across slices, and resulting silhouette vectors are assembled into a matrix or higher-dimensional array (Segovia-Dominguez et al., 2024).
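The single-diagram steps listed above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `silhouette`, the power weight $w_j = (d_j - b_j)^p$, and the choice to return zeros for an empty or zero-weight diagram are assumptions made for the example.

```python
import numpy as np


def silhouette(diagram, grid, p=1.0):
    """Evaluate the power-weighted silhouette of one diagram on a grid.

    diagram: array-like of shape (m, 2) with rows (birth, death),
             assumed to satisfy death >= birth.
    grid:    array-like of shape (N,) with evaluation points t_i.
    p:       weight exponent; weights are lifetime ** p.
    """
    diagram = np.asarray(diagram, dtype=float).reshape(-1, 2)
    grid = np.asarray(grid, dtype=float)
    births, deaths = diagram[:, 0], diagram[:, 1]

    # Feature computation: lifetimes and weights w_j = lifetime_j ** p.
    weights = (deaths - births) ** p

    # Tent evaluation: an (m, N) array of max(0, min(t - b_j, d_j - t)).
    tents = np.maximum(
        0.0,
        np.minimum(grid[None, :] - births[:, None],
                   deaths[:, None] - grid[None, :]),
    )

    # Weighted sum, guarded against an empty or all-zero-weight diagram
    # (an assumption of this sketch, not a convention from the sources).
    total = weights.sum()
    if total <= 0.0:
        return np.zeros_like(grid)
    return weights @ tents / total


diagram = [(0.0, 4.0), (1.0, 2.0)]
grid = np.linspace(0.0, 4.0, 9)
values = silhouette(diagram, grid, p=1.0)
```

With this toy diagram and $p = 1$, the value at the grid point $t = 1.5$ is $(4 \cdot 1.5 + 1 \cdot 0.5)/5 = 1.3$. Broadcasting builds the full $m \times N$ tent matrix, which matches the $O(mN)$ cost of the stepwise description; for very large diagrams a loop over features would trade speed for lower memory use.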
6. Applied Usage and Empirical Results
- Functional inference and classification: Silhouettes have been used as features in $k$-nearest neighbor classification, for example in the analysis of simulated Gleason histology, yielding a test error of 11.75% on a four-class task with 400 held-out regions of interest (Berry et al., 2018).
- Two-sample testing: The silhouette fits directly into permutation-test frameworks for comparing two populations via $L^p$ or uniform ($L^\infty$) distances between sample mean silhouettes (Berry et al., 2018).
- Machine learning on graphs: EMP silhouettes have been evaluated as input features for standard classifiers (Random Forest, SVM, CNN) on benchmark graph classification datasets (e.g., BZR_MD, COX2_MD, DHFR_MD, MUTAG, REDDIT-B), achieving competitive or state-of-the-art accuracy (Segovia-Dominguez et al., 2024). For instance, a combination of two EMP silhouette variants gave 88.1% accuracy on MUTAG and 88.6% on REDDIT-B.
- Statistical rigour: All such applications benefit from the silhouette’s stability and the availability of functional CLTs, uniform confidence bands, and valid asymptotic inference (Chazal et al., 2013, Berry et al., 2018).
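The two-sample testing use case listed above can be sketched as a permutation test on silhouettes sampled on a shared grid. This is a minimal sketch under stated assumptions: the function name `permutation_test` is hypothetical, the test statistic is taken to be the uniform distance between the two sample mean silhouettes, and the add-one correction to the p-value is a common convention rather than something taken from the cited papers.

```python
import numpy as np


def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sample permutation test on silhouettes sampled on a shared grid.

    group_a, group_b: arrays of shape (n_a, N) and (n_b, N).
    Statistic: sup-norm distance between the two sample mean silhouettes.
    Returns (observed_statistic, p_value).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled = np.vstack([a, b])
    n_a = a.shape[0]

    def statistic(x, y):
        return np.abs(x.mean(axis=0) - y.mean(axis=0)).max()

    observed = statistic(a, b)
    exceed = 0
    for _ in range(n_perm):
        # Randomly reassign group labels and recompute the statistic.
        idx = rng.permutation(pooled.shape[0])
        if statistic(pooled[idx[:n_a]], pooled[idx[n_a:]]) >= observed:
            exceed += 1

    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

Swapping the sup-norm in `statistic` for a discretized $L^2$ norm changes only one line; the permutation scheme itself is unaffected.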
7. Limitations and Interpretive Considerations
- Information compression: By averaging over all tent functions, the silhouette summarizes a persistence diagram as a single function, potentially losing multimodal information captured in higher levels ($k \ge 2$) of the persistence landscape. Thus, secondary and tertiary modes of feature persistence may be obscured (Chazal et al., 2013).
- Weight sensitivity: Selection of the weight parameter $p$ (or general weight function $w$) directly impacts the prominence given to features of varying persistence. Empirical tuning or application-specific guidance may be required (Chazal et al., 2013, Berry et al., 2018).
- Implementation: For diagrams with only short-lived features (nearly zero lifetime), the normalizing sum of weights is close to zero and the normalization may be unstable; thresholds on lifetimes or the addition of a small $\varepsilon > 0$ to the denominator may be necessary (Berry et al., 2018).
In summary, the persistence silhouette provides a one-Lipschitz, single-function summary of a persistence diagram, interpolating between uniform averaging of topological features and maximal emphasis on the longest bars. It is theoretically underpinned by stability, stochastic process convergence, and direct applicability to hypothesis testing and machine learning tasks. The silhouette integrates elegantly into frameworks for both single- and multi-parameter persistent homology, supporting a broad spectrum of statistical and computational pipelines (Chazal et al., 2013, Berry et al., 2018, Segovia-Dominguez et al., 2024).