Data Complexity Signatures Overview

Updated 27 January 2026
  • Data Complexity Signatures are quantitative measures that characterize datasets through entropy, mutual information, and resource divergence.
  • They distinguish ergodic from nonergodic processes via scaling behaviors such as logarithmic and power-law growth in information measures.
  • They enable practical model selection and efficient computation by applying dimensionality reduction and recursive feature estimation.

Data complexity signatures provide principled, quantitative descriptions of the structural, statistical, and informational intricacies within datasets and stochastic processes. They manifest as multivariate feature sets, resource divergences, and dimension-reduced representations that capture core aspects of predictability, model selection, and computational load. These signatures operate across domains, appearing as divergent scaling laws in stationary processes, as predictive models in quantum machine learning, and as universal features for dynamical systems and sequential data streams.

1. Fundamental Definitions and Measures

Data complexity signatures are collections of scalar and functional quantities assessing aspects such as entropy, class-separability, non-Gaussianity, feature redundancy, and multivariate correlation.

For stationary stochastic processes, principal measures include:

  • Block Entropy: $H[X] = -\sum_x P(X=x)\log P(X=x)$
  • Mutual Information: $I[X;Y] = H[X] + H[Y] - H[X,Y]$
  • Past–Future Mutual Information ("Excess Entropy"): $I(\ell) = I[X_{-\ell:0}; X_{0:\ell}]$
  • Statistical Complexity: $C_\mu = H[S]$, where $S$ is the causal state (minimal sufficient predictor)
  • Resource Divergence: Any intrinsic quantity $R$ (e.g., $C_\mu$, $I(\ell)$, sample complexity) diverging as $\ell\to\infty$ or data size $T\to\infty$, with scaling forms such as $R(\ell) \sim a\ell^\alpha$, $a\log\ell$, or $bT^\beta$ (Crutchfield et al., 2015)
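
The first three measures admit direct plug-in estimation; for a stationary process the excess entropy can be written $I(\ell) = 2H(\ell) - H(2\ell)$ in terms of the block entropy $H(\ell)$. A minimal numpy sketch (the i.i.d. example process and the choice of bits as units are illustrative):

```python
import numpy as np
from collections import Counter

def block_entropy(seq, ell):
    """Plug-in Shannon entropy (bits) of length-ell blocks of a sequence."""
    blocks = [tuple(seq[i:i + ell]) for i in range(len(seq) - ell + 1)]
    counts = np.array(list(Counter(blocks).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def excess_entropy(seq, ell):
    """Past-future mutual information I(ell) = 2 H(ell) - H(2 ell),
    valid for stationary sequences."""
    return 2.0 * block_entropy(seq, ell) - block_entropy(seq, 2 * ell)

rng = np.random.default_rng(0)
seq = rng.integers(0, 2, size=100_000)  # i.i.d. fair coin: no temporal structure
print(block_entropy(seq, 1))   # close to 1 bit
print(excess_entropy(seq, 4))  # close to 0: past and future are independent
```

For structured processes (e.g., a Markov chain), $I(\ell)$ computed this way grows with $\ell$ before saturating or diverging, which is the scaling behavior the resource-divergence measures track.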

In high-dimensional tabular or clinical datasets, complexity signatures may comprise:

  • Shannon Entropy (class- and feature-wise): $H_\text{class}$, $H_\text{feat}$
  • Fisher Discriminant Ratio (FDR)
  • Standard Deviation of Kurtosis (Std κ)
  • Number of Low-Variance Features
  • Total Correlation (Multi-information) (Rhrissorrakrai et al., 21 Jan 2026)
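
A sketch of assembling such a five-measure vector for a two-class dataset. The specific estimator choices below (single-best-feature FDR, a Gaussian approximation to total correlation, the variance threshold) are our illustrative stand-ins, not the published construction:

```python
import numpy as np

def complexity_signature(X, y, var_eps=1e-3):
    """Five-measure complexity signature z in R^5 for features X (n x d)
    and binary labels y (n,)."""
    # 1. Class entropy (bits)
    _, cnt = np.unique(y, return_counts=True)
    p = cnt / cnt.sum()
    h_class = float(-(p * np.log2(p)).sum())
    # 2. Fisher discriminant ratio of the best single feature
    X0, X1 = X[y == 0], X[y == 1]
    fdr = float(np.max((X0.mean(0) - X1.mean(0)) ** 2
                       / (X0.var(0) + X1.var(0) + 1e-12)))
    # 3. Standard deviation of per-feature excess kurtosis (tail structure)
    zc = (X - X.mean(0)) / (X.std(0) + 1e-12)
    std_kurt = float(np.std((zc ** 4).mean(0) - 3.0))
    # 4. Count of (near-)constant features
    n_lowvar = int((X.var(0) < var_eps).sum())
    # 5. Total correlation under a Gaussian approximation:
    #    TC = -1/2 log det(correlation matrix)
    tc = float(-0.5 * np.log(np.linalg.det(np.corrcoef(X, rowvar=False))))
    return np.array([h_class, fdr, std_kurt, n_lowvar, tc])

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = rng.integers(0, 2, size=300)
X[y == 1, 0] += 2.0            # one strongly class-separating feature
z = complexity_signature(X, y)
```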

For sequential data and dynamical systems:

  • Path Signature: Infinite sequence of iterated integrals, truncated to level $M$, whose ambient dimension grows exponentially in $M$.
  • Log-Signature (Signature Cumulant): Noncommutative logarithm of the expected signature, taking values in a free Lie algebra of much lower dimension (Friz et al., 2024).
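
As a concrete illustration of truncation, the first two signature levels of a piecewise-linear path can be computed in closed form; the $\tfrac12\,\Delta X \otimes \Delta X$ term comes from integrating along each linear segment (this sketch assumes the path is given as sample points):

```python
import numpy as np

def signature_level2(path):
    """Level-1 and level-2 terms of the path signature of a piecewise-linear
    path, given as a (T x d) array of sample points."""
    dX = np.diff(path, axis=0)                      # segment increments
    level1 = dX.sum(axis=0)                         # total increment
    # cumulative position (relative to the start) before each segment
    cum = np.vstack([np.zeros(path.shape[1]), np.cumsum(dX, axis=0)[:-1]])
    # iterated integrals int dX^i dX^j: cross-segment + within-segment parts
    level2 = cum.T @ dX + 0.5 * dX.T @ dX
    return level1, level2

path = np.array([[0., 0.], [1., 0.], [1., 1.]])     # "L"-shaped path in R^2
S1, S2 = signature_level2(path)
# shuffle identity: S2 + S2^T equals the outer product S1 (x) S1
```

For this path $S_1 = (1, 1)$, $S^{12} = 1$, $S^{21} = 0$; the antisymmetric part $(S^{12} - S^{21})/2$ is the Lévy area. The exponential growth of the full truncated signature's coefficient count is what motivates the log-signature reduction.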

2. Structural Divergence: Ergodic, Nonergodic, and Hierarchical Processes

Complexity signatures in stationary processes arise through asymptotic divergences in information-theoretic quantities. Decomposition theorems attribute distinct scaling behaviors to ergodic structure versus nonergodic parameter mixing:

  • Nonergodic Mixtures: A continuous latent parameter $\Theta \in \mathbb{R}^K$ yields informational divergence; excess entropy grows as $I(\ell) \sim (K/2)\log\ell$ (Crutchfield et al., 2015).
  • Ergodic Processes at Criticality: At the onset of chaotic phenomena (e.g., period-doubling bifurcations), $I(\ell) \sim \alpha\log\ell$ with model-dependent slope.
  • Long-Memory Ergodic Processes: Power-law growth $I(\ell) \sim A\ell^\gamma$, often with $\gamma \approx 1/2$ in natural language tasks ("Hilberg's law").

Divergence character is used to distinguish parameter-induced uncertainty from intrinsic long-range correlation. Co-information analysis quantifies the portion of excess entropy attributable to hidden parameters versus temporal dependencies.
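
The regression that reads off the divergence character can be sketched as follows; under logarithmic growth the slope recovers $K/2$ (the synthetic $I(\ell)$ series and the noiseless setting are illustrative):

```python
import numpy as np

def fit_log_growth(ells, I_vals):
    """Least-squares fit I(ell) = c + (K/2) log(ell); returns the latent
    parameter dimension K estimated from the slope."""
    A = np.column_stack([np.ones(len(ells)), np.log(ells)])
    coef, *_ = np.linalg.lstsq(A, I_vals, rcond=None)
    return 2.0 * coef[1]

ells = np.arange(1, 51)
I_vals = 0.3 + 1.5 * np.log(ells)     # synthetic nonergodic mixture, K = 3
print(fit_log_growth(ells, I_vals))   # recovers K = 3
```

Comparing the residuals of this fit against a power-law fit $\log I(\ell) = \log A + \gamma \log\ell$ is one way to separate nonergodic parameter mixing from long-memory ergodic growth.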

3. Multivariate Complexity Signatures for Model Selection

In empirical machine learning, multivariate data complexity signatures predict algorithmic efficacy and guide model assignment.

  • Construction: Five measures (entropy, FDR, Std κ, number of low-variance features, total correlation) are computed per data split, forming a vector $z \in \mathbb{R}^5$ (Rhrissorrakrai et al., 21 Jan 2026).
  • Discrimination Model: Logistic regression with elastic net combines these measures to predict, e.g., quantum-projected learning (QPL) benefit versus classical baselines. Training employs cross-validation (5-fold stratified), regularization grid search, and recursive feature elimination.
  • Empirical Results: In antibiotic-resistance prediction, the five-dimensional signature yielded AUC $= 0.88$ and $p = 0.03$ in distinguishing QPL efficacy (Rhrissorrakrai et al., 21 Jan 2026). Specific combinations of high entropy and correlation, together with pronounced tail and class-separation structure, favored quantum kernel benefit.

Adaptive workflow assignment follows: datasets whose signature exceeds a threshold (e.g., logistic score $\sigma(s(z)) > 0.5$) are sent to the QPL workflow; otherwise, classical models are used.
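
This routing rule amounts to a single logistic score over the signature vector. A minimal sketch, where the weights are illustrative stand-ins rather than the fitted elastic-net coefficients from the paper:

```python
import numpy as np

def route(z, w, b, threshold=0.5):
    """Route a dataset by its complexity signature z in R^5: send it to the
    quantum-projected learning (QPL) workflow when sigma(w.z + b) exceeds
    the threshold, otherwise to a classical model."""
    score = 1.0 / (1.0 + np.exp(-(np.dot(w, z) + b)))
    return ("QPL" if score > threshold else "classical"), score

w, b = np.array([0.8, 0.5, 0.3, -0.4, 0.6]), -1.0   # hypothetical weights
label, score = route(np.array([1.2, 2.0, 0.7, 0.0, 1.1]), w, b)
print(label, round(score, 3))
```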

4. Dimensionality Reduction and Universal Approximation

The ambient dimension of complexity-related features, especially path signatures of sequential data, grows exponentially with truncation order. Dimension-reduction schemes address tractability while retaining the universal approximation property:

  • Random Projection (Johnson–Lindenstrauss): Reduces the $n$-dimensional signature to $r \ll n$ dimensions with controllable distortion, where $r = O(\varepsilon^{-2}\log N)$ for $N$ sample paths.
  • Principal Component Analysis (PCA): Projects onto the top $r$ eigenvectors of the covariance, retaining $\varepsilon$-fractional error with $r$ determined by eigenvalue decay.
  • Balanced Truncation (Reachability/Observability Gramians): Constructs balancing transforms $\mathsf{T}$ so that the reduced signature $\widetilde{X}_t$ lives in a lower-dimensional SDE, with relative error bounded by Hankel singular value thresholds.
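
The Johnson–Lindenstrauss step can be sketched with a scaled Gaussian matrix; the dimensions and the synthetic "signature" vectors below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, N = 5000, 400, 50            # ambient dim, reduced dim, number of paths
S = rng.normal(size=(N, n))        # stand-in for N truncated signature vectors

# A Gaussian matrix scaled by 1/sqrt(r) preserves pairwise distances up to
# a (1 +/- eps) factor with high probability once r = O(eps^-2 log N).
P = rng.normal(size=(r, n)) / np.sqrt(r)
S_red = S @ P.T

d_full = np.linalg.norm(S[0] - S[1])
d_red = np.linalg.norm(S_red[0] - S_red[1])
ratio = d_red / d_full
print(ratio)                       # close to 1: distance nearly preserved
```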

The universal approximation property persists in the reduced-signature regime: both truncation and reduction are continuous linear operators, so any continuous functional of path signatures remains uniformly approximable up to additive error $O(\varepsilon)$ (Bayer et al., 2024).

5. Computational Complexity and Feature Scaling

Signature-based and cumulant-based complexity features admit polynomial and recursive algorithms, reducing computational burden:

  • Signature Truncation: Dimension grows as $n = (d^{M+1}-1)/(d-1)$ for the path signature up to level $M$ in $d$ dimensions.
  • Log-Signature Dimension: The number of free Lie algebra basis elements is $\ell_N \sim O(d^{N+1}/((d-1)N))$, a substantial reduction.
  • Recursive Algorithms: Level-by-level recursions (Magnus expansions, diamond-product series) compute signatures and cumulants efficiently, leveraging precomputed covariances and terminal increments.
  • Empirical Error Bounds: Balanced truncations in financial models (Bergomi, rough Bergomi) demonstrate machine-precision fit at $r/n \sim 1\%$ dimensionality; $L^2$ output errors fall below $10^{-4}$ for $r \geq 11$ or $r \geq 5$, depending on the model (Bayer et al., 2024).
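
The two dimension counts above can be checked exactly: the log-signature dimension is the dimension of the truncated free Lie algebra, which the Witt formula gives in terms of the Möbius function (our choice of computation):

```python
def mobius(n):
    """Mobius function mu(n) by trial division."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0          # squared prime factor
            result = -result
        p += 1
    if n > 1:
        result = -result
    return result

def sig_dim(d, M):
    """Entries in the level-M truncated signature over R^d,
    n = (d^(M+1) - 1)/(d - 1), including the constant level-0 term."""
    return (d ** (M + 1) - 1) // (d - 1)

def logsig_dim(d, M):
    """Dimension of the free Lie algebra on d generators truncated at
    level M (Witt formula), i.e. the log-signature dimension."""
    return sum(
        sum(mobius(k) * d ** (m // k) for k in range(1, m + 1) if m % k == 0) // m
        for m in range(1, M + 1))

print(sig_dim(2, 5), logsig_dim(2, 5))   # 63 vs 14: a substantial reduction
```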

6. Practical Estimation and Application Domains

Estimation recipes vary by context:

  • Finite-State Processes: Direct block-entropy and mutual-information plug-in estimates for small $\ell$; compression-based estimators (Lempel–Ziv) and $k$-nearest-neighbor estimators for high-dimensional entropy.
  • Statistical Complexity: $\epsilon$-machine reconstruction (CSSR algorithm), Bayesian HMM structure inference, and calculation of $H[S]$.
  • Latent-Parameter Dimension: Obtain $K$ by regressing $I(\ell)$ against $\log\ell$ in nonergodic mixtures, ensuring stabilization of the ergodic component.
  • Signature and Log-Signature Estimation: Compute path signatures via iterated integrals or SDE solution, and log-signatures via level-by-level Magnus or diamond product recursion.
  • Cross-Validation and MDL: Model selection via hold-out prediction with minimum description length penalization for detecting resource divergence (Crutchfield et al., 2015).
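
The compression-based route can be sketched with a Lempel–Ziv (1976) phrase count, where $c(n)\log_2 n / n$ gives a rough entropy-rate estimate for long sequences (the parsing variant below is one common choice; others differ in boundary handling):

```python
import math
import random

def lz76_phrases(s):
    """Number of phrases in a Lempel-Ziv (1976) style parse: each new phrase
    is extended while it still occurs somewhere in the preceding text."""
    phrases, i, n = 0, 0, len(s)
    while i < n:
        l = 1
        while i + l < n and s[i:i + l] in s[:i + l - 1]:
            l += 1
        phrases += 1
        i += l
    return phrases

random.seed(0)
periodic = "01" * 500                                  # highly compressible
noisy = "".join(random.choice("01") for _ in range(1000))
for name, s in [("periodic", periodic), ("random", noisy)]:
    c = lz76_phrases(s)
    h_est = c * math.log2(len(s)) / len(s)   # rough entropy-rate estimate
    print(name, c, round(h_est, 3))
```

The periodic sequence parses into a handful of phrases (entropy-rate estimate near zero), while the random one needs on the order of $n/\log_2 n$ phrases.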

Application domains include quantum-classical hybrid machine learning for clinical data (Rhrissorrakrai et al., 21 Jan 2026), stream learning and sequential data regression (Friz et al., 2024), and stochastic process calibration in finance and physics (Bayer et al., 2024).


In summary, data complexity signatures span divergent resource scaling laws, multivariate statistical descriptors, and feature-reducing transformations, arising in theoretical, empirical, and algorithmic analyses. Their construction, estimation, and operational role allow researchers to quantify and leverage intrinsic data complexity for model selection, inference, and computational efficiency across a spectrum of contemporary data-driven scientific problems.
