Binary Intrinsic Dimension (BID)
- Binary Intrinsic Dimension (BID) is a set of mathematical methodologies that quantify the effective number of degrees of freedom in binary data using fractal, graph-based, and geometric approaches.
- It employs estimators based on correlation, random connection, formal concept, and separability models to reveal dependencies, sparsity, and scaling properties with theoretical guarantees.
- Practical implementations of BID enable scalable analysis for high-dimensional, sparse datasets, with applications in clustering, manifold learning, and physical systems.
Binary Intrinsic Dimension (BID) is a mathematically rigorous set of methodologies for quantifying the effective or intrinsic dimension of data residing in binary or, more broadly, discrete metric spaces. BID characterizes, in scalar or interval-valued form, the minimal effective number of degrees of freedom required to describe the statistical behavior of data in spaces such as the binary hypercube $\{0,1\}^K$ or other discrete metric spaces, capturing dependencies, sparsity, and structure in complex discrete datasets. The metric foundations, estimator families, theoretical guarantees, and empirical behaviors of BID have been developed and compared across several lines of research, including normalized fractal dimensions (Tatti et al., 2019, Hanika et al., 2024), random connection models (Serra et al., 2017), formal-concept/geometric frameworks (Hanika et al., 2024), separability notions in learning (Sutton et al., 2023), and scaling analyses in physical systems (Verdel et al., 2 May 2025).
1. Mathematical Foundations and Definitions
The concept of intrinsic dimension in discrete spaces arises from the need to measure the effective number of independent coordinates or degrees of freedom in binary datasets, often much lower than the ambient space's dimension due to dependencies and redundancy.
Correlation-Based and Fractal Approaches:
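For a dataset $D \subseteq \{0,1\}^K$, the correlation (fractal) dimension is defined from the distribution of pairwise Hamming distances. Let $Z_D$ denote the Hamming distance between two points drawn at random from $D$:
$$Z_D = \sum_{i=1}^{K} |x_i - y_i|, \qquad x, y \in D.$$
Letting $r_1$ and $r_2$ be quantile-based radii, with $P(Z_D < r_j) = \alpha_j$ for $j = 1, 2$, the (raw) correlation dimension $\mathrm{cd}(D; r_1, r_2)$ is the slope of the least-squares line fitted to the points
$$\left(\log r,\ \log P(Z_D < r)\right), \qquad r \in [r_1, r_2].$$
To improve interpretability, the normalized correlation dimension is defined as the number $H$ such that a synthetic dataset $\mathrm{ind}(H, s)$ of $H$ independent Bernoulli($s$) coordinates with matched marginals achieves the same raw correlation dimension as $D$:
$$\mathrm{ncd}(D) = H \quad \text{such that} \quad \mathrm{cd}(\mathrm{ind}(H, s)) = \mathrm{cd}(D),$$
where $H$ and the matched marginal $s$ are solved numerically (Tatti et al., 2019, Hanika et al., 2024).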
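The raw correlation dimension described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the pair-sampling scheme, and the default quantiles of 0.25 and 0.75 are assumptions made for the example.

```python
import numpy as np

def sample_hamming_distances(X, n_pairs=20000, seed=0):
    """Hamming distances for randomly sampled pairs of distinct rows of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n - 1, size=n_pairs)
    j = j + (j >= i)  # shift so that j is never equal to i
    return np.count_nonzero(X[i] != X[j], axis=1)

def correlation_dimension(X, alpha1=0.25, alpha2=0.75, n_pairs=20000, seed=0):
    """Slope of log P(Z < r) against log r between two quantile radii.

    alpha1 and alpha2 are illustrative defaults (an assumption).
    """
    z = np.sort(sample_hamming_distances(X, n_pairs, seed))
    r1, r2 = np.quantile(z, [alpha1, alpha2])
    radii = np.arange(max(1, int(np.floor(r1))), int(np.ceil(r2)) + 1)
    # Empirical P(Z < r): number of sampled distances strictly below r.
    cdf = np.searchsorted(z, radii, side="left") / z.size
    keep = cdf > 0
    if keep.sum() < 2:
        raise ValueError("not enough distinct radii to fit a slope")
    slope, _intercept = np.polyfit(np.log(radii[keep]), np.log(cdf[keep]), 1)
    return slope

# Synthetic check: 50 independent fair bits versus 25 fair bits duplicated.
rng = np.random.default_rng(1)
independent = rng.integers(0, 2, size=(500, 50))
half = rng.integers(0, 2, size=(500, 25))
dependent = np.hstack([half, half])  # same ambient dimension, strong dependency

cd_independent = correlation_dimension(independent)
cd_dependent = correlation_dimension(dependent)
print(cd_independent, cd_dependent)
```

In this synthetic check both tables have 50 columns, but the duplicated table has only 25 free coordinates, so its fitted slope should come out lower.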
Given only binary neighborhood graphs (adjacency matrices), the BID estimator is
$$\hat d = \frac{\log(\hat p_2 / \hat p_1)}{\log(r_2 / r_1)},$$
where $\hat p_1$ and $\hat p_2$ are estimates of pairwise connection probabilities at scales $r_1$ and $r_2$ (Serra et al., 2017).
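As a rough illustration of the adjacency-based idea, the sketch below assumes that the connection probability scales like $r^d$ at small radius $r$, so that the log-ratio of two estimated connection probabilities divided by the log-ratio of the scales gives a dimension estimate. That scaling form, the function names, and the pivot-sampling details are assumptions made for this example, not a transcription of the estimator in Serra et al. (2017).

```python
import numpy as np

def connection_probability(adj, pivots):
    """Fraction of the other nodes that are neighbours of the pivot nodes."""
    adj = np.asarray(adj)
    n = adj.shape[0]
    rows = adj[pivots].astype(np.int64)
    # Remove any self-loops recorded on the diagonal.
    degrees = rows.sum(axis=1) - adj[pivots, pivots].astype(np.int64)
    return degrees.sum() / (len(pivots) * (n - 1))

def adjacency_dimension(adj1, adj2, scale_ratio, n_pivots=200, seed=0):
    """Dimension estimate from two adjacency matrices at scales r1 < r2.

    scale_ratio is r2 / r1; only the ratio is needed, not the radii.
    """
    rng = np.random.default_rng(seed)
    n = np.asarray(adj1).shape[0]
    pivots = rng.choice(n, size=min(n_pivots, n), replace=False)
    p1 = connection_probability(adj1, pivots)
    p2 = connection_probability(adj2, pivots)
    if p1 <= 0 or p2 <= 0:
        raise ValueError("no connections observed at one of the scales")
    return np.log(p2 / p1) / np.log(scale_ratio)

# Synthetic check: points in the unit square, so the target value is about 2.
rng = np.random.default_rng(2)
points = rng.random((400, 2))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
adj_small = dist <= 0.05
adj_large = dist <= 0.10
d_hat = adjacency_dimension(adj_small, adj_large, scale_ratio=2.0)
print(d_hat)
```

Only the two adjacency matrices and the ratio of the scales enter the estimate; the distances are used here solely to build the synthetic graphs. Boundary effects in the unit square pull the value slightly below 2.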
Formal Concept-Based Geometric Dimension:
For a binary context $(G, M, I)$ (objects, attributes, incidence), consider the formal concepts $(A, B)$, with extent $A \subseteq G$ and intent $B \subseteq M$, which are maximal rectangles in the $0/1$ table. The geometric intrinsic dimension is
1
with 2 reflecting the maximal size of object extents for intent masses in the interval 3 (Hanika et al., 2024).
Separability-Based (Learning-Theoretic) BID:
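For distributions $\mathcal{D}$, $\mathcal{D}'$ in $\mathbb{R}^d$, the intrinsic dimension $n(\mathcal{D})$ is defined via
$$P\big(x, y \sim \mathcal{D} : (x - y,\, y - c) \ge 0\big) = 2^{-(n(\mathcal{D}) + 1)},$$
and the relative intrinsic dimension $n(\mathcal{D}, \mathcal{D}')$ is
$$P\big(x \sim \mathcal{D},\, y \sim \mathcal{D}' : (x - y,\, y - c) \ge 0\big) = 2^{-(n(\mathcal{D}, \mathcal{D}') + 1)},$$
where $(\cdot, \cdot)$ is the Euclidean inner product and $c$ is a chosen center. This quantity controls bounds on classifier performance and quantifies linear class separability (Sutton et al., 2023).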
Scaling BID in Physical Systems:
For binarized 1-valued configurations (e.g., of interfaces or spin states), the empirical Hamming distance histogram 2 is fitted by the ansatz
3
and BID is defined as 4 (Verdel et al., 2 May 2025).
2. Estimation Algorithms and Computational Aspects
Correlation Dimension Estimation
- Compute empirical distribution of pairwise Hamming distances.
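- Fit $\log P(Z_D < r)$ vs. $\log r$ by least squares, where $Z_D$ is the Hamming distance between two random points, for radii $r$ between the quantiles $(\alpha_1, \alpha_2)$.
- Normalize via binary search in $H$ (dimension of i.i.d. Bernoulli model) and $s$ (matched marginal) (Tatti et al., 2019).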
- Optimized for sampling, 0 or 1 with 2 the number of ones.
Connection-Graph (Adjacency-Based) Estimator
- Form two adjacency matrices at scales $r_1 < r_2$.
- For $m$ randomly chosen pivots, count neighbors, estimate $\hat p_1$, $\hat p_2$.
- Compute $\hat d$ explicitly.
- Overall complexity 8 for sparse graphs; suitable for scalable dimension estimation in large datasets (Serra et al., 2017).
Formal Concept Lattice Approach
- For context 9, enumerate (or threshold-mine) formal concepts with extent support condition 0.
- Tabulate 1 pairs: intent mass vs. extent support.
- Reconstruct 2 via two-pointer scan; compute integral for 3 for bound interval 4.
- Complexity controlled by minimum support 5; efficient for 6–0.2 on massive datasets (Hanika et al., 2024).
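To make the notion of formal concepts as maximal rectangles concrete, here is a small sketch that enumerates all concepts of a toy context and tabulates (intent mass, extent support) pairs. It uses the standard fact that the intents are exactly the intersections of object rows. The names are hypothetical, "intent mass" is taken to mean the fraction of attributes in the intent (an assumption), and the support filter is applied after enumeration, whereas a practical miner would prune during the search.

```python
def popcount(mask):
    """Number of set bits in a non-negative integer bitmask."""
    return bin(mask).count("1")

def formal_concepts(rows, n_attrs, min_support=0.0):
    """Enumerate formal concepts of a small binary context.

    rows: one integer bitmask per object (bit k set = object has attribute k).
    Returns a list of (extent, intent) with extent a tuple of object indices
    and intent a bitmask. Naive closure under intersection, toy sizes only.
    """
    full = (1 << n_attrs) - 1
    intents = {full}
    for row in rows:
        # Every intent is an intersection of object rows (or the full set).
        intents |= {intent & row for intent in intents}
    concepts = []
    for intent in sorted(intents):
        extent = tuple(g for g, row in enumerate(rows) if (row & intent) == intent)
        if len(extent) >= min_support * len(rows):
            concepts.append((extent, intent))
    return concepts

# Toy context with attributes a (bit 0), b (bit 1), c (bit 2).
rows = [0b011, 0b110, 0b111, 0b001]  # g0={a,b}, g1={b,c}, g2={a,b,c}, g3={a}
all_concepts = formal_concepts(rows, n_attrs=3)
frequent = formal_concepts(rows, n_attrs=3, min_support=0.5)

# (intent mass, extent support) pairs, both normalised to [0, 1].
pairs = [(popcount(b) / 3, len(a) / len(rows)) for a, b in frequent]
print(len(all_concepts), len(frequent))
print(pairs)
```

Raising `min_support` shrinks the set of concepts that must be tabulated, which is the lever the steps above use to keep the computation feasible.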
BID via Binarized Physical Configurations
- Binarize real-valued profiles, e.g., by sign relative to mean.
- Form empirical pairwise distance distribution 7.
- Fit the ansatz 8 via Kullback-Leibler divergence minimization in 9.
- Applies to massive bit-strings via sampling (Verdel et al., 2 May 2025).
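The first two steps above (binarization and the empirical distance distribution) can be sketched as follows. The fitting step is left out because the functional form of the ansatz is not reproduced here. The function names, the use of each profile's own spatial mean as the threshold, and the pair-sampling scheme are assumptions made for illustration.

```python
import numpy as np

def binarize_by_mean(profiles):
    """Map each real-valued profile to bits: 1 where it exceeds its own mean."""
    profiles = np.asarray(profiles, dtype=float)
    return (profiles > profiles.mean(axis=1, keepdims=True)).astype(np.uint8)

def hamming_histogram(bits, n_pairs=10000, seed=0):
    """Empirical distribution of Hamming distances over sampled pairs.

    Returns an array of length L + 1 whose entry d is the fraction of
    sampled pairs at distance d, where L is the number of bits per row.
    """
    rng = np.random.default_rng(seed)
    n, length = bits.shape
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n - 1, size=n_pairs)
    j = j + (j >= i)  # distinct rows only
    d = np.count_nonzero(bits[i] != bits[j], axis=1)
    counts = np.bincount(d, minlength=length + 1)
    return counts / counts.sum()

# Synthetic stand-in for interface profiles: random walks of 64 sites.
rng = np.random.default_rng(3)
profiles = np.cumsum(rng.normal(size=(200, 64)), axis=1)
bits = binarize_by_mean(profiles)
hist = hamming_histogram(bits)
print(hist.argmax(), hist.sum())
```

The resulting histogram is the object to which a parametric form would then be fitted, for example by minimizing a Kullback-Leibler divergence over its parameters.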
Separability-Based Estimator
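- For finite samples, select center $c$ (mean or extremal point).
- Draw $N$ random pairs $(x_i, y_i)$ across classes; compute indicator $\mathbb{1}[(x_i - y_i,\, y_i - c) \ge 0]$ for separability condition.
- Estimate relative dimension via $\hat n = -1 - \log_2\big(\tfrac{1}{N} \sum_{i=1}^{N} \mathbb{1}[(x_i - y_i,\, y_i - c) \ge 0]\big)$.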
- Computation is 4 and trivially parallel (Sutton et al., 2023).
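A Monte Carlo version of these steps might look like the sketch below. It assumes that the separability condition is $(x - y, y - c) \ge 0$ and that the dimension is recovered as $-1 - \log_2$ of the estimated probability. Both are stated here as assumptions about the definition in Sutton et al. (2023), and the function name and the default center (mean of the second sample) are hypothetical.

```python
import numpy as np

def relative_intrinsic_dimension(X, Y, n_pairs=200000, center=None, seed=0):
    """Monte Carlo estimate of a separability-based relative dimension.

    Assumed condition: (x - y, y - c) >= 0 for x drawn from X, y from Y.
    Assumed conversion: n = -1 - log2(P[condition]).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    c = Y.mean(axis=0) if center is None else np.asarray(center, dtype=float)
    x = X[rng.integers(0, len(X), size=n_pairs)]
    y = Y[rng.integers(0, len(Y), size=n_pairs)]
    # Row-wise inner product between (x - y) and (y - c).
    indicator = np.einsum("ij,ij->i", x - y, y - c) >= 0
    p = indicator.mean()
    if p == 0:
        return np.inf  # condition never observed in the sample
    return -1.0 - np.log2(p)

# Synthetic check with standard normal samples centered at the origin.
rng = np.random.default_rng(4)
n1 = relative_intrinsic_dimension(
    rng.normal(size=(20000, 1)), rng.normal(size=(20000, 1)), center=np.zeros(1)
)
n10 = relative_intrinsic_dimension(
    rng.normal(size=(20000, 10)), rng.normal(size=(20000, 10)), center=np.zeros(10)
)
print(n1, n10)
```

For two independent standard normal variables in one dimension the condition holds with probability 1/4, so under the assumed conversion the estimate should be close to 1. Each sampled pair is processed independently, which is why the computation parallelizes trivially.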
3. Theoretical Properties and Guarantees
- Monotonicity Under Dependency: For any 5, 6, i.e., positive dependency reduces observed dimension (Tatti et al., 2019).
- Scale-Sensitivity and Hyperparameters: All methods feature intrinsic scale parameters: quantile thresholds 7, support 8, or radius 9. Practical guidelines are available: e.g., set support to 0–0.2 for feasible computation with close bound width (Hanika et al., 2024).
- Asymptotic Normality: Connection-based 1 admits explicit CLTs and rate 2 for moderate 3 (Serra et al., 2017).
- Unbiasedness and Variance Bounds: Under Poisson or uniform hypotheses, estimators such as I3D are unbiased and admit Cramér–Rao bounds for variance (Macocco et al., 2022).
- Explicit Scaling Laws: In physical, binarized growth dynamics, BID reveals dynamical scaling exponents that match those of continuous surface width (Family-Vicsek universality) (Verdel et al., 2 May 2025).
4. Empirical Performance and Comparative Analysis
Benchmarking and Real Data
- BID recovers the true dimension on synthetic lattices, fractals, and mixtures, often outperforming box-counting and continuous-space fractal estimators (Table 1 in Macocco et al., 2022).
- On high-dimensional, sparse real-world binary tables (accidents, retail, text corpora), the normalized correlation dimension is much lower than the ambient dimension, aligning with known data complexities (Tatti et al., 2019).
- On benchmark continuous and discrete datasets, adjacency-based BID matches or exceeds nearest-neighbor and correlation-integral methods, with better runtime scaling (Serra et al., 2017).
- On nonequilibrium interface data, BID captures scaling collapse and dynamic exponents, matching continuous-variable order parameters (Verdel et al., 2 May 2025).
Comparison with PCA and Variance-Based Methods
- For many binary datasets, normalized fractal dimension (ncd) gives estimates that explain a large fraction of PCA variance but can diverge in “hard cases,” indicating the distinct structural information BID encodes (Tatti et al., 2019).
- In clustering analyses, mixtures of clusters exhibit higher BID than their parts—merging increases dimension, consistent with increased heterogeneity (Tatti et al., 2019).
Computational Benchmarks
- Concept-based BID yields practical integral bounds within a factor of 2–3 of the true value for 6–0.2 on datasets with tens of thousands of dimensions and millions of samples (Hanika et al., 2024).
- Connection-based BID operates in 7 on large proximity graphs—orders of magnitude faster than quadratic distance-matrix estimators (Serra et al., 2017).
5. Applications in Data Science, Physics, and Learning Theory
- Binary Table Analysis: For market-basket, text, genomic, survey, and network data, normalized correlation dimension and concept-based BID quantify degrees of freedom beyond marginal sparsity (Tatti et al., 2019, Hanika et al., 2024).
- Manifold Learning: Discrete-metric estimators (I3D, adjacency-BID) sidestep biases of continuous-space fractal methods for binary/categorical datasets (Macocco et al., 2022).
- Classifier Separability: Relative BID predicts few-shot classification performance and linear separability, providing tight theoretical bounds on error rates using intrinsic dimension formulas (Sutton et al., 2023).
- Physical Systems: In nonequilibrium surface growth or statistical mechanics, BID tracks the emergence of spatial correlations, gives access to dynamical exponents, and retains information after severe binarization (Verdel et al., 2 May 2025).
- Neural Networks: BID can be applied to hidden-layer activations or weight spaces to elucidate data compression and capacity transitions.
6. Practical Guidelines and Methodological Considerations
- Choice of Scale/Threshold: Set minimum-support 9 in concept mining to largest affordable value ensuring tight bounds; for correlation dimension, set 0, 1 for stability (Tatti et al., 2019, Hanika et al., 2024).
- Computation and Resource Management: Use sampling for large datasets, sparse matrix acceleration for pairwise distances, and thresholded concept mining for high-cardinality contexts.
- Interpretation Cautions: BID is always lower than or equal to ambient dimension, drops under dependencies, and may not align precisely with variance-based metrics; envelope bounds 2, 3 should be checked for tightness (Hanika et al., 2024).
- Algorithm Availability: Open-source code is available for I3D (Macocco et al., 2022) and for the bound-computation algorithms (Hanika et al., 2024).
7. Theoretical and Empirical Limitations, Extensions, and Open Questions
- Dependence on Assumptions: Some estimators (e.g., I3D, connection-BID) rely on uniformity or Poisson process assumptions for unbiasedness; heterogeneities may necessitate localized estimation (Macocco et al., 2022, Serra et al., 2017).
- Estimator Choice: No single BID definition dominates; correlation, adjacency, and concept-based approaches target different structural aspects and may diverge especially in structured, real-world data (Hanika et al., 2024).
- Interval Output: Formal-concept geometric BID outputs intervals 6 representing intrinsic dimension—the bound width is a practical diagnostic of sufficiency of concept mining (Hanika et al., 2024).
- Extensions: BID has been extended to categorical, sequence, and spin system spaces and can incorporate variable binarizations and different (pseudo-)metrics (Verdel et al., 2 May 2025, Macocco et al., 2022).
- Unified Frameworks: Recent works suggest, but do not yet fully provide, a unified formalism connecting fractal, adjacency, separability-based, and concept-based ID in binary/discrete settings (Tatti et al., 2019, Hanika et al., 2024, Sutton et al., 2023).
Binary Intrinsic Dimension provides a scalable and mathematically transparent toolkit for quantifying the effective dimension of high-dimensional discrete data, with rigorous theoretical underpinnings. It enables principled comparison between binary datasets, complements variance-based approaches, and underpins modern statistical and learning-theoretic analyses in discrete and binary domains (Tatti et al., 2019, Hanika et al., 2024, Serra et al., 2017, Macocco et al., 2022, Sutton et al., 2023, Verdel et al., 2 May 2025).