- The paper introduces a normalized correlation dimension that adapts fractal concepts to measure the effective dimensionality of sparse binary datasets.
- It details efficient computation methods using direct calculation, sparse optimizations, and sampling to estimate pairwise L1 distances.
- The approach enables comparing dataset complexity and serves as a complementary tool to PCA by revealing intrinsic structural dependencies.
This paper (arXiv:1902.01480) addresses the challenge of defining a meaningful "effective dimension" for binary datasets, which are often high-dimensional but sparse and structured. Traditional dimensionality reduction methods like PCA and SVD are designed for real-valued data and are not directly suitable. The authors propose adapting concepts from fractal dimension, specifically the correlation dimension, for binary data, and introduce a normalized correlation dimension to make the measure more interpretable.
The core idea is to analyze the distribution of pairwise distances between points in the binary dataset. For a dataset D with K binary variables, the L1 distance (Manhattan distance) is used between two points x,y∈D. The random variable ZD represents the L1 distance between two randomly chosen points from D. The correlation dimension is based on the probability P(ZD<r), which is the fraction of point pairs with an L1 distance less than r.
The authors define the correlation dimension, denoted cd(D), as the slope of a line fitted to the log-log plot of P(ZD<r) for various radii r. To handle the discrete nature of binary distances, they linearly interpolate P(ZD<r) between integer values of r to create a continuous function. The dimension cd(D; r1, r2) is the slope of the least-squares linear fit to the points (log r, log P(ZD<r)) for r in a range [r1, r2]. A related definition, cd(D; q1, q2), uses radii r1, r2 such that P(ZD<r1) = q1 and P(ZD<r2) = q2, effectively focusing on quantiles of the distance distribution. The paper primarily uses cd(D; 0.25, 0.75), based on the distances between the first and third quartiles of the pairwise distance distribution.
Implementation of Correlation Dimension:
To compute cd(D), you need to calculate P(ZD<r) for integer values of r from 0 to K. This involves calculating the L1 distance between all pairs of points in D.
The L1 distance between two binary vectors x, y ∈ {0,1}^K is the number of positions where they differ: d(x,y) = Σ_i |x_i − y_i|. Since x_i, y_i ∈ {0,1}, |x_i − y_i| is 1 if x_i ≠ y_i and 0 if x_i = y_i.
The number of pairs with distance less than r is Σ_{x,y} I[d(x,y) < r], where I is the indicator function. P(ZD<r) is this count divided by the total number of pairs, N(N−1)/2.
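As a concrete sketch of this direct computation, the empirical CDF P(ZD<r) can be tabulated from all pairwise distances. This is a minimal dense-NumPy illustration, not the paper's code; the function and variable names are my own:

```python
import numpy as np

def distance_cdf(D):
    """Empirical CDF of pairwise L1 distances for a 0/1 matrix D (N x K).

    Returns an array cdf of length K + 2 where cdf[r] = P(Z_D < r), the
    fraction of point pairs with L1 distance strictly below r.
    """
    N, K = D.shape
    counts = np.zeros(K + 1, dtype=np.int64)   # counts[d] = #pairs at distance d
    for i in range(N):
        # L1 (= Hamming) distance from point i to every later point
        d = np.abs(D[i + 1:] - D[i]).sum(axis=1)
        counts += np.bincount(d, minlength=K + 1)
    total = N * (N - 1) // 2
    # cdf[r] = cumulative count of distances <= r - 1, as a fraction of pairs
    return np.concatenate(([0], np.cumsum(counts))) / total
```

For N in the low thousands this is feasible; for larger N the paper's sampling estimators are needed.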
- Direct Computation: Calculating all pairwise L1 distances takes O(N²K) time. For sparse binary data, the L1 distance between vectors x and y is d(x,y) = |x| + |y| − 2|x ∧ y|, where ∧ is element-wise multiplication (AND) and |x| is the number of 1s in x. With a sparse representation, computing |x ∧ y| takes O(|x| + |y|) time, so calculating all pairwise distances takes Σ_{x,y} (|x| + |y|) = (N−1) Σ_x |x| = O(NM) time in total, where M is the total number of 1s in D. This O(NM) bound is much better than O(N²K) when the data is sparse, so an efficient sparse distance calculation pays off.
- Approximation via Sampling: For very large datasets (large N), direct computation is too slow. The paper proposes estimating P(ZD<r) from a random sample. Two estimation methods are given:
- Pick a random subset S ⊆ D of n points and compute all pairwise distances within S.
- Pick n random pairs of points from D and compute the distance for each pair.
The experiments used the first method with a fixed random sample of points. The time complexity is then roughly O(n²K) with dense vectors, or proportional to the number of 1s in the sample with a sparse representation. The experimental section (Table 3) suggests the observed running time is roughly proportional to the total number of 1s in the sampled data, which indicates that an efficient sparse distance calculation is crucial.
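The random-pairs estimator (the second method above) is particularly simple to sketch. The dense representation, sample size, and names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def distance_cdf_sampled(D, n_pairs=10000, rng=None):
    """Estimate cdf[r] = P(Z_D < r) from n_pairs random point pairs.

    Sketch only: samples index pairs with replacement from a dense
    0/1 matrix D and discards accidental self-pairs.
    """
    rng = np.random.default_rng(rng)
    N, K = D.shape
    i = rng.integers(0, N, size=n_pairs)
    j = rng.integers(0, N, size=n_pairs)
    keep = i != j                                   # drop self-pairs
    d = np.abs(D[i[keep]] - D[j[keep]]).sum(axis=1) # L1 per sampled pair
    counts = np.bincount(d, minlength=K + 1)
    return np.concatenate(([0], np.cumsum(counts))) / counts.sum()
```

The estimate sharpens as n_pairs grows; the subset-based method instead reuses each sampled point in many pairs.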
Once P(ZD<r) is estimated for the relevant integer r, find r1 and r2 such that P(ZD<r1) = 0.25 and P(ZD<r2) = 0.75 (interpolating as needed). Then compute the slope of log P(ZD<r) versus log r for r ∈ [r1, r2] using linear regression, fitting the line to the points that fall within [r1, r2].
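The quartile-based slope fit can be sketched as follows, assuming an array cdf with cdf[r] = P(ZD<r) for integer r, as described above. This is a hedged illustration with my own names, not the paper's code:

```python
import numpy as np

def correlation_dimension(cdf, q1=0.25, q2=0.75):
    """Slope of log P(Z_D < r) vs log r between the q1 and q2 quantiles.

    cdf[r] = P(Z_D < r) for integer r; radii whose cumulative probability
    falls in [q1, q2] enter the least-squares fit.
    """
    r = np.arange(1, len(cdf))
    p = cdf[1:]
    mask = (p >= q1) & (p <= q2) & (p > 0)
    if mask.sum() < 2:
        raise ValueError("not enough radii between the chosen quantiles")
    # np.polyfit returns coefficients highest degree first: [slope, intercept]
    slope, _ = np.polyfit(np.log(r[mask]), np.log(p[mask]), 1)
    return slope
```

For a distribution with P(ZD<r) ∝ r^d over the fitted range, the function recovers d, matching the fractal-dimension intuition.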
Normalization: Normalized Correlation Dimension (NCD):
The raw correlation dimension cd(D) can be small and hard to interpret. The NCD aims to provide a more intuitive scale. The NCD of dataset D, ncd(D), is defined as the number of columns K' that a synthetic dataset ind(K', p) with K' independent binary variables (each '1' with probability p) would need in order to have the same correlation dimension as D, i.e., cd(ind(K', p)) = cd(D). The marginal probability p is chosen such that cd(ind(K, p)) = cd(ind(D)), where ind(D) is a dataset with K independent columns having the same marginal probabilities as D.
1. Calculate cd(D) using the method described above.
2. Create a synthetic dataset ind(D) by randomizing each column of D independently (or by generating data with the same marginal probabilities).
3. Calculate cd(ind(D)). This involves estimating the distance distribution by generating random pairs from the independent distribution implied by D's marginals.
4. Find a probability p such that cd(ind(K, p)) = cd(ind(D)) using binary search. cd(ind(K, p)) can be approximated theoretically (Proposition 1) or estimated by generating synthetic data.
5. Find an integer K' such that cd(ind(K', p)) = cd(D) using binary search. This K' is the normalized correlation dimension ncd(D).
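Steps 4 and 5 are both monotone binary searches and can be sketched as below. Here cd_ind is a caller-supplied estimator of cd(ind(k, p)) (via Proposition 1 or simulation); the function names and the monotonicity assumptions are mine, not the paper's:

```python
def ncd(cd_D, cd_ind_D, K, cd_ind):
    """Binary searches for steps 4-5: the marginal p, then the column
    count K' (the normalized correlation dimension).

    cd_ind(k, p) must estimate the correlation dimension of a dataset with
    k independent Bernoulli(p) columns; we assume it increases in p on
    (0, 0.5] and increases in k.
    """
    # Step 4: find p with cd_ind(K, p) = cd(ind(D)).
    lo, hi = 1e-3, 0.5
    for _ in range(40):
        mid = (lo + hi) / 2
        if cd_ind(K, mid) < cd_ind_D:
            lo = mid
        else:
            hi = mid
    p = (lo + hi) / 2
    # Step 5: smallest integer K' with cd_ind(K', p) >= cd(D)
    # (returns K if even K columns cannot reach cd_D).
    lo, hi = 1, K
    while lo < hi:
        mid = (lo + hi) // 2
        if cd_ind(mid, p) < cd_D:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

With a noisy Monte-Carlo cd_ind, the searches should use an averaged estimate or a tolerance rather than exact comparisons.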
- Approximation for NCD: Proposition 2 offers a direct approximation, ncd(D) ≈ K · cd(D) / cd(ind(D)), which avoids the binary searches for p and K'. The empirical results suggest this approximation works well for synthetic data but can be less accurate for sparse real-world data.
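The Proposition 2 shortcut reduces to a single ratio; a trivial helper (my naming), useful as a sanity check against the binary-search value:

```python
def ncd_approx(K, cd_D, cd_ind_D):
    """Approximate normalized correlation dimension via the ratio
    ncd(D) ~ K * cd(D) / cd(ind(D))  (Proposition 2 style shortcut)."""
    return K * cd_D / cd_ind_D
```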
Practical Applications and Interpretation:
- Complexity Measure: NCD provides a single number describing the "effective complexity" or intrinsic dimensionality of a binary dataset. A high NCD relative to the number of variables K suggests less structure/more independence, while a low NCD suggests strong dependencies or sparsity patterns that reduce the effective degrees of freedom.
- Dataset Comparison: NCD allows comparing the structural complexity of different binary datasets, even if they have different numbers of variables or sparsity. For instance, the paper shows NCD varies significantly across real datasets (Table 2). Retail (K=16k) has ncd ~1.8k (11% of K), while Accidents (K=469) has ncd ~220 (47% of K), indicating Retail has more structure per variable than Accidents.
- Alternative to PCA for Binary Data: The paper compares NCD to PCA (the number of components needed for 90% of the variance). The two correlate positively, but there are differences. For the 'Paleo' data, PCA suggests a higher dimension than both NCD and the average correlation, indicating that PCA may overestimate complexity for some binary data structures, especially those with homogeneous margins (which NCD handles better). This suggests NCD can be a complementary or better measure for certain binary datasets.
- Analyzing Subgroups: Studying the NCD of clusters or subgroups can reveal how dimensionality changes locally. The experiments show clusters can have higher dimensions than the combined dataset, implying the structure reducing the overall dimension might be due to the relationships between clusters.
Implementation Considerations:
- Sparsity: Binary data is often sparse. Efficiently computing L1 distances and sums of distances is crucial. Use sparse matrix libraries (e.g., SciPy in Python) and optimize distance calculations for binary data (popcount, bitwise operations where applicable).
- Sampling: For datasets with millions of rows, sampling is essential to make the computation feasible. The choice of sample size n affects both accuracy and runtime.
- Linear Regression: Fitting the line to the points (log r, log P(ZD<r)) is a standard linear regression task. You need to select appropriate r values or quantiles q1, q2; the paper used the quartiles 0.25 and 0.75 and fit over the points between them.
- Binary Search (for NCD without the approximation): Implementing the binary searches for p and K' requires an efficient way to calculate cd(ind(K, p)) or its approximation. Proposition 1 provides an analytical form based on the normal approximation of sums of Bernoulli variables, which can be used.
- Computational Resources: Calculating pairwise distances (even with sampling) can be memory and CPU intensive. Distributed computing frameworks could be beneficial for large datasets.
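The popcount/bitwise suggestion above can be made concrete by bit-packing rows and using the XOR identity d(x,y) = popcount(x XOR y). A sketch with a 256-entry byte lookup table; at scale you would loop over row blocks instead of materializing the full N × N matrix:

```python
import numpy as np

def pairwise_l1_packed(D):
    """All-pairs L1 distance matrix for a 0/1 matrix D (N x K) via
    bit packing and per-byte popcounts of XORed rows."""
    packed = np.packbits(D.astype(np.uint8), axis=1)   # N x ceil(K/8) bytes
    # popcnt[b] = number of set bits in byte value b
    popcnt = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None],
                           axis=1).sum(axis=1)
    N = D.shape[0]
    out = np.zeros((N, N), dtype=np.int64)
    for i in range(N):
        xor = packed ^ packed[i]                       # differing bits vs row i
        out[i] = popcnt[xor].sum(axis=1)               # popcount per row
    return out
```

Zero-padding from packbits XORs to zero, so the padded tail never contributes to the distance.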
The paper highlights that this method, unlike PCA or SVD, does not provide a low-dimensional embedding. Its purpose is to measure intrinsic dimension, not to map data to a lower space for visualization or feature reduction. However, knowing the intrinsic dimension can be useful for model selection, algorithm choice (e.g., which indexing structures or distance metrics might work well), or simply understanding the underlying structure of the data.