Binary Intrinsic Dimension (BID)

Updated 28 January 2026

Binary Intrinsic Dimension (BID) is a set of mathematical methodologies that quantify the effective number of degrees of freedom in binary data using fractal, graph-based, and geometric approaches.
It employs estimators based on correlation, random connection, formal concept, and separability models to reveal dependencies, sparsity, and scaling properties with theoretical guarantees.
Practical implementations of BID enable scalable analysis for high-dimensional, sparse datasets, with applications in clustering, manifold learning, and physical systems.

Binary Intrinsic Dimension (BID) is a mathematically rigorous set of methodologies for quantifying the effective, or intrinsic, dimension of data residing in binary or, more broadly, discrete metric spaces. BID characterizes, in scalar or interval-valued form, the minimal effective number of degrees of freedom required to describe the statistical behavior of data in spaces such as $\{0,1\}^K$ or $\{-1,+1\}^N$, capturing dependencies, sparsity, and structure in complex discrete datasets. The metric foundations, estimator families, theoretical guarantees, and empirical behaviors of BID have been developed and compared across several lines of research, including normalized fractal dimensions (Tatti et al., 2019, Hanika et al., 2024), random connection models (Serra et al., 2017), formal-concept/geometric frameworks (Hanika et al., 2024), separability notions in learning (Sutton et al., 2023), and scaling analyses in physical systems (Verdel et al., 2025).

1. Mathematical Foundations and Definitions

The concept of intrinsic dimension in discrete spaces arises from the need to measure the effective number of independent coordinates or degrees of freedom in binary datasets, often much lower than the ambient space's dimension due to dependencies and redundancy.

Correlation-Based and Fractal Approaches:

For $D\subset\{0,1\}^K$ , the correlation (fractal) dimension is defined from the distribution $Z_D$ of pairwise Hamming distances:

$Z_D = \|x-y\|_1, \quad x,y \sim \textrm{uniform}(D)$

Letting $f(r) = P[Z_D < r]$ and taking $r_1, r_2$ to be the radii at which $f$ reaches the quantile levels $\alpha_1, \alpha_2$, the (raw) correlation dimension is:

$\mathrm{cd}_A(D; \alpha_1,\alpha_2) = \frac{ \log f(r_2) - \log f(r_1) }{ \log r_2 - \log r_1 }$
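The two-point slope above can be computed directly on a small dataset. The following is a minimal sketch rather than a reference implementation: the function names are illustrative, the quantile radius is assumed to be the smallest integer $r$ with $f(r) \ge \alpha$, the default quantile levels are placeholders, and only pairs of distinct rows are used.

```python
import math
from itertools import combinations, product

def hamming(x, y):
    """Hamming (L1) distance between two equal-length 0/1 tuples."""
    return sum(a != b for a, b in zip(x, y))

def cumulative_fraction(distances, r):
    """f(r): fraction of pairwise distances strictly below r."""
    return sum(d < r for d in distances) / len(distances)

def quantile_radius(distances, alpha, max_r):
    """Smallest integer radius r with f(r) >= alpha (assumed convention)."""
    for r in range(1, max_r + 2):
        if cumulative_fraction(distances, r) >= alpha:
            return r
    raise ValueError("alpha must lie in (0, 1]")

def correlation_dimension(data, alpha1=0.25, alpha2=0.75):
    """Two-point log-log slope of f between the alpha1 and alpha2 radii."""
    k = len(data[0])
    distances = [hamming(x, y) for x, y in combinations(data, 2)]
    r1 = quantile_radius(distances, alpha1, k)
    r2 = quantile_radius(distances, alpha2, k)
    if r1 == r2:
        raise ValueError("radii coincide; choose wider quantile levels")
    f1 = cumulative_fraction(distances, r1)
    f2 = cumulative_fraction(distances, r2)
    return (math.log(f2) - math.log(f1)) / (math.log(r2) - math.log(r1))

# Toy input: all 8 points of the cube {0,1}^3.
cube = [tuple(p) for p in product((0, 1), repeat=3)]
cd_cube = correlation_dimension(cube)
```

On this toy input the radii are 2 and 3, so the slope reduces to $\log 2 / \log 1.5$. The value falls below the ambient dimension 3 even for independent coordinates, which is the gap the normalization step is meant to address.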

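To improve interpretability, the normalized correlation dimension is defined as the number $H$ such that a synthetic dataset of $H$ independent Bernoulli coordinates with matched marginals achieves the same raw correlation dimension as $D$:

$\mathrm{ncd}_A(D; \alpha_1,\alpha_2) = H \quad \text{such that} \quad \mathrm{cd}_A(\mathrm{ind}(H, s); \alpha_1,\alpha_2) = \mathrm{cd}_A(D; \alpha_1,\alpha_2)$

where $\mathrm{ind}(H, s)$ denotes $H$ independent Bernoulli coordinates with matched marginal probability $s$, and $H$ is solved numerically (Tatti et al., 2019, Hanika et al., 2024).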
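As a worked step under the independence model (a sketch only: the symbols $s$ and $q$ are introduced here for illustration, and the cited papers may organize the computation differently), suppose each of the $H$ synthetic coordinates is an independent Bernoulli variable that equals one with probability $s$. Two independent rows then differ in any given coordinate with probability $q = 2s(1-s)$, so their Hamming distance $Z$ is binomial:

$Z \sim \mathrm{Binomial}(H, q), \qquad f(r) = P[Z < r] = \sum_{k < r} \binom{H}{k} q^k (1-q)^{H-k}$

Under this assumption, the population value of the raw correlation dimension of the synthetic dataset can be evaluated for any candidate $H$ without simulation, which is the quantity a numerical search over $H$ has to compare against the raw correlation dimension of the data.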

Random Connection Model BID:

Given only binary neighborhood graphs (adjacency matrices), the BID estimator is computed from estimates of the pairwise connection probabilities at two scales $r_1$ and $r_2$ (Serra et al., 2017).

Formal Concept-Based Geometric Dimension:

For a binary context $\mathbb{K} = (G, M, I)$ (objects, attributes, incidence), consider the formal concepts $(A, B)$, which are maximal rectangles in the $G \times M$ table. The geometric intrinsic dimension is obtained from an integral over a function that records the maximal size of object extents among concepts whose intent mass lies in a given interval (Hanika et al., 2024).

Separability-Based (Learning-Theoretic) BID:

For two distributions over the data space, the intrinsic dimension of a single distribution is defined from the probability that a random pair of its points satisfies a separability condition relative to a chosen center. The relative intrinsic dimension applies the same construction to pairs drawn across the two distributions; it controls bounds on classifier performance and quantifies linear class separability (Sutton et al., 2023).

Scaling BID in Physical Systems:

For binarized configurations (e.g., of interfaces or spin states), the empirical histogram of pairwise Hamming distances is fitted by a parametric ansatz, and BID is defined from the fitted parameters (Verdel et al., 2025).

2. Estimation Algorithms and Computational Aspects

Correlation Dimension Estimation

Compute the empirical distribution of pairwise Hamming distances.
Fit $\log f(r)$ vs. $\log r$ by least squares within the quantiles $[\alpha_1, \alpha_2]$.
Normalize via binary search over $H$ (the dimension of the i.i.d. Bernoulli model) and the matched marginal parameter (Tatti et al., 2019).
Sampling-based optimizations reduce the cost, which can be expressed in terms of the number of ones in the data.
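The sampling and sparsity ideas in these steps can be sketched as follows. This is an illustration under assumptions, not the cited implementation: each row is stored as the set of positions holding a one, so a Hamming distance costs time proportional to the number of ones involved, and $f(r)$ is estimated from randomly sampled pairs instead of all pairs. The names `sparse_hamming` and `sampled_distances` are hypothetical.

```python
import random

def sparse_hamming(ones_x, ones_y):
    """Hamming distance from sets of one-positions: size of the symmetric difference."""
    return len(ones_x ^ ones_y)

def sampled_distances(rows, n_pairs, seed=0):
    """Hamming distances for n_pairs randomly chosen pairs of distinct rows."""
    rng = random.Random(seed)
    distances = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(rows)), 2)  # two different row indices
        distances.append(sparse_hamming(rows[i], rows[j]))
    return distances

def estimated_f(distances, r):
    """Sample estimate of f(r) = P[Z_D < r]."""
    return sum(d < r for d in distances) / len(distances)

# Toy sparse table: four rows over eight columns, stored as one-positions.
rows = [frozenset({0, 3}), frozenset({3, 7}), frozenset({0, 3, 7}), frozenset()]
sample = sampled_distances(rows, n_pairs=1000)
```

The sampled distances can be passed to the same log-log fit used for the full pairwise distribution; the number of sampled pairs trades accuracy against cost.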

Connection-Graph (Adjacency-Based) Estimator

Form two adjacency matrices, one at each of two scales.
For a set of randomly chosen pivots, count neighbors and estimate the connection probability at each scale.
Compute the dimension estimate explicitly from the two probabilities.
The overall cost remains low for sparse graphs, making the method suitable for scalable dimension estimation in large datasets (Serra et al., 2017).
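A minimal sketch of the pivot-based steps, under a loudly stated assumption: the connection probability is taken to scale as $r^d$, so the dimension is estimated by the log ratio of the two connection probabilities divided by the log ratio of the two scales. This two-scale log-ratio form is an assumption made for illustration and is not claimed to be the exact estimator of Serra et al.; the function names and the ring-graph test case are also hypothetical.

```python
import math
import random

def connection_probability(adjacency, pivots):
    """Fraction of possible neighbors that are connected, averaged over the pivots."""
    n = len(adjacency)
    connected = sum(len(adjacency[p]) for p in pivots)
    return connected / (len(pivots) * (n - 1))

def two_scale_dimension(adj1, adj2, r1, r2, n_pivots, seed=0):
    """Assumed estimator: log-ratio of connection probabilities over log-ratio of scales."""
    rng = random.Random(seed)
    pivots = rng.sample(range(len(adj1)), n_pivots)
    p1 = connection_probability(adj1, pivots)
    p2 = connection_probability(adj2, pivots)
    return (math.log(p2) - math.log(p1)) / (math.log(r2) - math.log(r1))

def ring_graph(n, r):
    """Neighbor sets for n points on a cycle, linked when within r steps."""
    return [{(i + k) % n for k in range(-r, r + 1) if k != 0} for i in range(n)]

# A cycle is one-dimensional, so the estimate should be close to 1.
adj_small, adj_large = ring_graph(200, 2), ring_graph(200, 4)
d_hat = two_scale_dimension(adj_small, adj_large, r1=2, r2=4, n_pivots=50)
```

Only neighbor counts at the sampled pivots are touched, so no pairwise distance matrix is ever formed; the graphs are stored as neighbor sets to keep the cost proportional to the number of edges visited.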

Formal Concept Lattice Approach

For a context $\mathbb{K}$, enumerate (or threshold-mine) the formal concepts whose extent satisfies a minimum-support condition.
Tabulate (intent mass, extent support) pairs.
Reconstruct the maximal-extent function via a two-pointer scan; compute the integral that yields the bound interval on the dimension.
Complexity is controlled by the minimum support; the method is efficient for minimum-support values up to 0.2 on massive datasets (Hanika et al., 2024).
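The first two steps can be illustrated on a tiny context. This sketch enumerates concepts by brute force (closing every attribute subset), which is exponential in the number of attributes and is only meant to show what a formal concept and a minimum-support filter are; it is not the threshold-mining algorithm of the cited work. Measuring intent mass and extent support as plain fractions of attributes and objects is an assumption, and all function names are hypothetical.

```python
from itertools import combinations

def extent(context, attrs):
    """Objects (row indices) that have every attribute in attrs."""
    return frozenset(g for g, row in enumerate(context) if attrs <= row)

def intent(context, objs, all_attrs):
    """Attributes shared by every object in objs."""
    shared = set(all_attrs)
    for g in objs:
        shared &= context[g]
    return frozenset(shared)

def formal_concepts(context, all_attrs):
    """All (extent, intent) pairs, found by closing every attribute subset."""
    concepts = set()
    for size in range(len(all_attrs) + 1):
        for attrs in combinations(sorted(all_attrs), size):
            a = extent(context, frozenset(attrs))
            concepts.add((a, intent(context, a, all_attrs)))
    return concepts

def mass_support_table(context, all_attrs, min_support):
    """Sorted (intent mass, extent support) pairs for concepts above min_support."""
    n, m = len(context), len(all_attrs)
    return sorted(
        (len(b) / m, len(a) / n)
        for a, b in formal_concepts(context, all_attrs)
        if len(a) / n >= min_support
    )

# Three objects over attributes a, b, c; each row lists the attributes it has.
context = [frozenset("ab"), frozenset("bc"), frozenset("abc")]
all_attrs = frozenset("abc")
concepts = formal_concepts(context, all_attrs)
table = mass_support_table(context, all_attrs, min_support=0.5)
```

The toy context has four concepts; with a minimum support of 0.5 the concept whose extent is the single object with all three attributes is dropped, which shows how the threshold prunes small-extent concepts.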

BID via Binarized Physical Configurations

Binarize real-valued profiles, e.g., by sign relative to mean.
Form the empirical pairwise distance distribution.
Fit the ansatz to this distribution by minimizing the Kullback-Leibler divergence over the ansatz parameters.
Applies to massive bit-strings via sampling (Verdel et al., 2025).
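The first two steps, binarization and the empirical distance distribution, can be sketched as below. The ansatz fit itself is left out because its functional form is not given here. The helper names and the toy profiles are illustrative assumptions; values exactly equal to the mean are mapped to 0, which is an arbitrary tie-breaking choice.

```python
from collections import Counter
from itertools import combinations

def binarize(profile):
    """Map a real-valued profile to bits by the sign relative to its mean."""
    mean = sum(profile) / len(profile)
    return tuple(1 if h > mean else 0 for h in profile)

def hamming_histogram(bitstrings):
    """Empirical distribution of pairwise Hamming distances."""
    counts = Counter(
        sum(a != b for a, b in zip(x, y))
        for x, y in combinations(bitstrings, 2)
    )
    total = sum(counts.values())
    return {r: c / total for r, c in sorted(counts.items())}

# Three toy height profiles on four sites.
profiles = [
    [0.1, 0.9, 0.4, 0.8],
    [0.7, 0.2, 0.6, 0.1],
    [0.3, 0.6, 0.9, 0.2],
]
bits = [binarize(p) for p in profiles]
hist = hamming_histogram(bits)
```

For large collections of bit-strings, the same histogram can be built from a random sample of pairs instead of all pairs.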

Separability-Based Estimator

For finite samples, select a center (mean or extremal point).
Draw random pairs across classes; compute the indicator of the separability condition for each pair.
Estimate the relative dimension from the fraction of pairs that satisfy the condition.
The computation is linear in the number of sampled pairs and trivially parallel (Sutton et al., 2023).
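A Monte Carlo sketch of these steps follows. Two ingredients are assumptions made purely for illustration and should be checked against Sutton et al. before use: the separability condition is taken to be $\langle x - y,\, y - c \rangle \ge 0$ for a pair $(x, y)$ and center $c$, and the dimension is taken to be $-1 - \log_2$ of the fraction of pairs satisfying it. The function names are hypothetical.

```python
import math
import random

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def relative_dimension(class_x, class_y, center, n_pairs, seed=0):
    """Assumed estimator: -1 - log2 of the fraction of pairs meeting the condition."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        x = rng.choice(class_x)
        y = rng.choice(class_y)
        diff = [a - b for a, b in zip(x, y)]          # x - y
        offset = [b - c for b, c in zip(y, center)]   # y - c
        hits += dot(diff, offset) >= 0                # assumed condition
    if hits == 0:
        return math.inf  # no sampled pair met the condition
    return -1 - math.log2(hits / n_pairs)

# Degenerate toy cases with one point per class, chosen so the outcome is exact.
always = relative_dimension([(2, 0)], [(1, 0)], center=(0, 0), n_pairs=100)
never = relative_dimension([(0, 0)], [(1, 0)], center=(0, 0), n_pairs=100)
```

Each pair is processed independently, so the loop can be split across workers and the hit counts summed afterwards.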

3. Theoretical Properties and Guarantees

Monotonicity Under Dependency: Positive dependency between coordinates reduces the observed dimension (Tatti et al., 2019).
Scale-Sensitivity and Hyperparameters: All methods feature intrinsic scale parameters: quantile thresholds $\alpha_1, \alpha_2$, minimum support, or radius. Practical guidelines are available: e.g., choose a minimum support of at most 0.2 for feasible computation with close bound width (Hanika et al., 2024).
Asymptotic Normality: The connection-based estimator admits explicit central limit theorems and convergence rates (Serra et al., 2017).
Unbiasedness and Variance Bounds: Under Poisson or uniform hypotheses, estimators such as I3D are unbiased and admit Cramér–Rao bounds for variance (Macocco et al., 2022).
Explicit Scaling Laws: In physical, binarized growth dynamics, BID reveals dynamical scaling exponents that match those of the continuous surface width (Family-Vicsek universality) (Verdel et al., 2025).

4. Empirical Performance and Comparative Analysis

Benchmarking and Real Data

BID recovers the true dimension on synthetic lattices, fractals, and mixtures, often outperforming box-counting and continuous-space fractal estimators (Table 1 in Macocco et al., 2022).
On high-dimensional, sparse real-world binary tables (accidents, retail, text corpora), normalized correlation dimension is much less than ambient, aligning with known data complexities (Tatti et al., 2019).
On benchmark continuous and discrete datasets, adjacency-based BID matches or exceeds nearest-neighbor and correlation-integral methods, with better runtime scaling (Serra et al., 2017).
On nonequilibrium interface data, BID captures scaling collapse and dynamic exponents, matching continuous-variable order parameters (Verdel et al., 2025).

Comparison with PCA and Variance-Based Methods

For many binary datasets, the normalized fractal dimension ($\mathrm{ncd}_A$) gives estimates broadly in line with the number of PCA components needed to explain a large fraction of the variance, but the two can diverge in “hard cases,” indicating the distinct structural information BID encodes (Tatti et al., 2019).
In clustering analyses, mixtures of clusters exhibit higher BID than their parts: merging increases dimension, consistent with increased heterogeneity (Tatti et al., 2019).

Computational Benchmarks

Concept-based BID yields practical integral bounds within a factor of 2–3 of the true value for minimum-support values up to 0.2 on datasets with tens of thousands of dimensions and millions of samples (Hanika et al., 2024).
Connection-based BID avoids computing a full distance matrix on large proximity graphs and runs orders of magnitude faster than quadratic distance-matrix estimators (Serra et al., 2017).

5. Applications in Data Science, Physics, and Learning Theory

Binary Table Analysis: Market-basket, text, genomic, survey, and network data: normalized correlation dimension and concept-based BID quantify degrees of freedom beyond marginal sparsity (Tatti et al., 2019, Hanika et al., 2024).
Manifold Learning: Discrete-metric estimators (I3D, adjacency-BID) sidestep biases of continuous-space fractal methods for binary/categorical datasets (Macocco et al., 2022).
Classifier Separability: Relative BID predicts few-shot classification performance and linear separability, providing tight theoretical bounds on error rates using intrinsic dimension formulas (Sutton et al., 2023).
Physical Systems: In nonequilibrium surface growth or statistical mechanics, BID tracks the emergence of spatial correlations, gives access to dynamical exponents, and retains information after severe binarization (Verdel et al., 2025).
Neural Networks: BID can be applied to hidden-layer activations or weight spaces to elucidate data compression and capacity transitions.

6. Practical Guidelines and Methodological Considerations

Choice of Scale/Threshold: In concept mining, set the minimum support to the largest value that still yields tight bounds; for the correlation dimension, choose the quantile levels $\alpha_1$, $\alpha_2$ for stability (Tatti et al., 2019, Hanika et al., 2024).
Computation and Resource Management: Use sampling for large datasets, sparse matrix acceleration for pairwise distances, and thresholded concept mining for high-cardinality contexts.
Interpretation Cautions: BID is always lower than or equal to the ambient dimension, drops under dependencies, and may not align precisely with variance-based metrics; the lower and upper envelope bounds should be checked for tightness (Hanika et al., 2024).
Algorithm Availability: Open-source code for I3D (Macocco et al., 2022) and bound computation algorithms (Hanika et al., 2024) are available.

7. Theoretical and Empirical Limitations, Extensions, and Open Questions

Dependence on Assumptions: Some estimators (e.g., I3D, connection-BID) rely on uniformity or Poisson process assumptions for unbiasedness; heterogeneities may necessitate localized estimation (Macocco et al., 2022, Serra et al., 2017).
Estimator Choice: No single BID definition dominates; correlation, adjacency, and concept-based approaches target different structural aspects and may diverge, especially in structured, real-world data (Hanika et al., 2024).
Interval Output: Formal-concept geometric BID outputs an interval bounding the intrinsic dimension; the bound width is a practical diagnostic of whether concept mining was sufficient (Hanika et al., 2024).
Extensions: BID has been extended to categorical, sequence, and spin system spaces and can incorporate variable binarizations and different (pseudo-)metrics (Verdel et al., 2025, Macocco et al., 2022).
Unified Frameworks: Recent works suggest, but do not yet fully provide, a unified formalism connecting fractal, adjacency, separability-based, and concept-based ID in binary/discrete settings (Tatti et al., 2019, Hanika et al., 2024, Sutton et al., 2023).

Binary Intrinsic Dimension provides a scalable and mathematically transparent toolkit for quantifying the effective dimension of high-dimensional discrete data, with rigorous theoretical underpinnings. It enables principled comparison between binary datasets, complements variance-based approaches, and underpins statistical and learning-theoretic analyses in discrete and binary domains (Tatti et al., 2019, Hanika et al., 2024, Serra et al., 2017, Macocco et al., 2022, Sutton et al., 2023, Verdel et al., 2025).


References

Tatti et al. (2019). What is the dimension of your binary data?

Hanika et al. (2024). What is the intrinsic dimension of your binary data? -- and how to compute it quickly

Serra et al. (2017). Dimension Estimation Using Random Connection Models

Sutton et al. (2023). Relative intrinsic dimensionality is intrinsic to learning

Verdel et al. (2025). Family-Vicsek universality of the binary intrinsic dimension of nonequilibrium data

Macocco et al. (2022). Intrinsic dimension estimation for discrete metrics
