
What is the dimension of your binary data?

Published 4 Feb 2019 in cs.LG and stat.ML | (1902.01480v1)

Abstract: Many 0/1 datasets have a very large number of variables; on the other hand, they are sparse and the dependency structure of the variables is simpler than the number of variables would suggest. Defining the effective dimensionality of such a dataset is a nontrivial problem. We consider the problem of defining a robust measure of dimension for 0/1 datasets, and show that the basic idea of fractal dimension can be adapted for binary data. However, as such the fractal dimension is difficult to interpret. Hence we introduce the concept of normalized fractal dimension. For a dataset $D$, its normalized fractal dimension is the number of columns in a dataset $D'$ with independent columns and having the same (unnormalized) fractal dimension as $D$. The normalized fractal dimension measures the degree of dependency structure of the data. We study the properties of the normalized fractal dimension and discuss its computation. We give empirical results on the normalized fractal dimension, comparing it against baseline measures such as PCA. We also study the relationship of the dimension of the whole dataset and the dimensions of subgroups formed by clustering. The results indicate interesting differences between and within datasets.

Citations (53)

Summary

  • The paper introduces a normalized correlation dimension that adapts fractal concepts to measure the effective dimensionality of sparse binary datasets.
  • It details efficient computation methods using direct calculation, sparse optimizations, and sampling to estimate pairwise L1 distances.
  • The approach enables comparing dataset complexity and serves as a complementary tool to PCA by revealing intrinsic structural dependencies.

This paper (1902.01480) addresses the challenge of defining a meaningful "effective dimension" for binary datasets, which are often high-dimensional but sparse and structured. Traditional dimensionality reduction methods like PCA and SVD are designed for real-valued data and are not directly suitable. The authors propose adapting concepts from fractal dimension, specifically the correlation dimension, for binary data and introduce a normalized correlation dimension to make the measure more interpretable.

The core idea is to analyze the distribution of pairwise distances between points in the binary dataset. For a dataset $D$ with $K$ binary variables, the $L_1$ (Manhattan) distance is used between two points $x, y \in D$. The random variable $Z_D$ represents the $L_1$ distance between two randomly chosen points from $D$. The correlation dimension is based on the probability $P(Z_D < r)$, which is the fraction of point pairs with an $L_1$ distance less than $r$.

The authors define the correlation dimension, denoted $\mathrm{cd}(D)$, as the slope of a line fitted to the log-log plot of $P(Z_D < r)$ for various radii $r$. To handle the discrete nature of binary distances, they linearly interpolate $P(Z_D < r)$ between integer values of $r$ to obtain a continuous function. The dimension is then the slope of the least-squares linear fit to the points $(\log r, \log P(Z_D < r))$ for $r$ in a range $[r_1, r_2]$. A related definition uses radii $r_1, r_2$ chosen so that $P(Z_D < r_1) = a$ and $P(Z_D < r_2) = b$ for fixed quantiles $a < b$, effectively focusing on a quantile range of the distance distribution. The paper primarily uses $a = 1/4$ and $b = 3/4$, i.e., the range between the first and third quartiles of the pairwise distance distribution.

Implementation of Correlation Dimension:

To compute $\mathrm{cd}(D)$, you need to estimate $P(Z_D < r)$ for integer values of $r$ from $0$ to $K$. This involves calculating the $L_1$ distance between all pairs of points in $D$. The $L_1$ distance between two binary vectors $x$ and $y$ is the number of positions where they differ: $d(x, y) = \sum_{i=1}^{K} |x_i - y_i|$. Since $x_i, y_i \in \{0, 1\}$, the term $|x_i - y_i|$ is 1 if $x_i \neq y_i$ and 0 otherwise. The number of pairs with distance less than $r$ is $\sum_{x, y} I[d(x, y) < r]$, where $I$ is the indicator function; $P(Z_D < r)$ is this count divided by the total number of pairs.
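As a minimal illustration of this computation (my own sketch, not the authors' code; the function name is hypothetical):

```python
import numpy as np

def pairwise_l1_cdf(D):
    """Empirical P(Z_D < r) for integer radii r = 0..K.

    D: (n, K) array of 0/1 values. Returns cdf where cdf[r] is the
    fraction of point pairs (i < j) with L1 distance strictly below r.
    """
    n, K = D.shape
    # L1 distance between 0/1 vectors = number of differing positions.
    dists = np.array([np.sum(D[i] != D[j])
                      for i in range(n) for j in range(i + 1, n)])
    return np.array([np.mean(dists < r) for r in range(K + 1)])
```

This quadratic-time version is only practical for small $n$; the sparse and sampling optimizations discussed in this section are needed at scale.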

  • Direct Computation: Calculating all pairwise $L_1$ distances naively takes $O(n^2 K)$ time for $n$ points. For sparse binary data, the $L_1$ distance between vectors $x$ and $y$ can be written as $d(x, y) = |x| + |y| - 2|x \wedge y|$, where $\wedge$ is element-wise multiplication (AND) and $|x|$ denotes the number of 1s in $x$. With a sparse representation (e.g., sorted lists of 1-positions), computing $|x \wedge y|$ for one pair costs $O(|x| + |y|)$, so summing over all pairs gives roughly $O(nL)$ total time, where $L$ is the total number of 1s in $D$. When the data is sparse, $L \ll nK$, so this is a substantial improvement over the naive bound.
  • Approximation via Sampling: For very large datasets (large $n$), computing all pairwise distances is too slow. The paper proposes estimating $P(Z_D < r)$ from a random sample. Two estimation methods are given:

    1. Pick a random subset $S \subseteq D$ and compute all pairwise distances within $S$.
    2. Pick random pairs of points $(x, y)$ from $D$ and use only those distances.

The experiments used the first method. With a sample of $m$ points it requires $O(m^2)$ distance computations, each of which is cheap under the sparse representation. Table 3 in the experimental section suggests that, with sampling, the runtime is roughly proportional to the total number of 1s in the data, which again indicates that an efficient sparse distance calculation is crucial.
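A sketch of the sparse identity $d(x, y) = |x| + |y| - 2|x \wedge y|$ combined with the first sampling scheme (illustrative only; the function name and the `sample_size` default are my assumptions, not from the paper):

```python
import numpy as np
from scipy import sparse

def sampled_l1_distances(X, sample_size=1000, rng=None):
    """L1 distances between all pairs within a random row sample of a 0/1 matrix.

    Uses d(x, y) = |x| + |y| - 2|x AND y|; for 0/1 rows, the AND-counts of
    all pairs are exactly the entries of the Gram matrix S @ S.T.
    """
    rng = np.random.default_rng(rng)
    X = sparse.csr_matrix(X)
    m = min(sample_size, X.shape[0])
    S = X[rng.choice(X.shape[0], size=m, replace=False)]
    ones = np.asarray(S.sum(axis=1)).ravel()            # |x| for each sampled row
    inner = (S @ S.T).toarray()                         # |x AND y| for all pairs
    dist = ones[:, None] + ones[None, :] - 2 * inner    # m x m distance matrix
    return dist[np.triu_indices(m, k=1)]                # distinct pairs only
```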

Once $P(Z_D < r)$ is estimated for the relevant integer values of $r$, you find $r_1$ and $r_2$ such that $P(Z_D < r_1) \approx 1/4$ and $P(Z_D < r_2) \approx 3/4$. Then compute the slope of $\log P(Z_D < r)$ versus $\log r$ for $r \in [r_1, r_2]$ using linear regression, fitting over the interpolated points within $[r_1, r_2]$.
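The quartile-range slope fit can be sketched as follows, assuming a precomputed cdf over integer radii (the paper's interpolation between integers is simplified here to restricting the fit to integer radii whose cdf value lies in the quantile band):

```python
import numpy as np

def correlation_dimension(cdf, low=0.25, high=0.75):
    """Slope of log P(Z < r) versus log r over the given quantile band.

    cdf[r] = P(Z < r) for integer r = 0..K. Fits a least-squares line to
    (log r, log cdf[r]) for radii whose cdf value lies in [low, high].
    """
    r = np.arange(1, len(cdf))              # skip r = 0 (log undefined)
    p = np.asarray(cdf)[1:]
    mask = (p >= low) & (p <= high) & (p > 0)
    if mask.sum() < 2:
        raise ValueError("too few radii in the quantile band for a line fit")
    slope, _ = np.polyfit(np.log(r[mask]), np.log(p[mask]), 1)
    return slope
```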

Normalization: Normalized Correlation Dimension (NCD):

The raw correlation dimension $\mathrm{cd}(D)$ can be small and hard to interpret. The NCD aims to provide a more intuitive scale. The NCD of a dataset $D$, denoted $\mathrm{ncd}(D)$, is defined as the number of columns $K'$ that a synthetic dataset of $K'$ independent binary variables (each equal to 1 with probability $p$) would need in order to have the same correlation dimension as $D$. The marginal probability $p$ is chosen so that a $K$-column dataset of independent Bernoulli($p$) variables has the same correlation dimension as $\mathrm{ind}(D)$, where $\mathrm{ind}(D)$ is a dataset with $K$ independent columns having the same marginal probabilities as $D$.

  • Implementation of NCD:

1. Calculate $\mathrm{cd}(D)$ using the method described above.
2. Create a synthetic dataset $\mathrm{ind}(D)$ by randomizing each column of $D$ independently (or by generating data with the same marginal probabilities).
3. Calculate $\mathrm{cd}(\mathrm{ind}(D))$. This involves estimating $P(Z < r)$ by generating random pairs from the independent distribution implied by $D$'s marginals.
4. Find, by binary search, a probability $p$ such that a $K$-column dataset of independent Bernoulli($p$) variables has correlation dimension $\mathrm{cd}(\mathrm{ind}(D))$. The correlation dimension of such independent data can be approximated analytically (Proposition 1) or estimated by generating synthetic data.
5. Find, by binary search, an integer $K'$ such that a $K'$-column dataset of independent Bernoulli($p$) variables has correlation dimension $\mathrm{cd}(D)$. This $K'$ is the normalized correlation dimension $\mathrm{ncd}(D)$.

  • Approximation for NCD: Proposition 2 offers a direct closed-form approximation of $\mathrm{ncd}(D)$, which avoids the binary searches for $p$ and $K'$. The empirical results suggest this approximation works well for synthetic data but can be less accurate for sparse real-world data.
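The five steps above could be wired together roughly as follows. This is a hedged sketch, not the paper's implementation: `cd_of_independent` is a Monte Carlo stand-in for Proposition 1 (it simulates pair distances of independent Bernoulli($p$) columns, which are Binomial($K$, $2p(1-p)$)), and the monotonicity assumptions behind the binary searches are mine:

```python
import numpy as np

def cd_of_independent(K, p, n_pairs=4000, seed=0):
    """Correlation dimension of K independent Bernoulli(p) columns,
    estimated by Monte Carlo (a stand-in for Proposition 1)."""
    rng = np.random.default_rng(seed)
    # Two independent Bernoulli(p) values differ with probability 2p(1-p),
    # so the distance Z between a random pair is Binomial(K, 2p(1-p)).
    z = rng.binomial(K, 2 * p * (1 - p), size=n_pairs)
    r = np.arange(1, K + 1)
    cdf = np.array([np.mean(z < ri) for ri in r])
    band = (cdf >= 0.25) & (cdf <= 0.75)      # quartile range, as in the paper
    slope, _ = np.polyfit(np.log(r[band]), np.log(cdf[band]), 1)
    return slope

def ncd(cd_D, cd_ind_D, K, p_tol=1e-3):
    """Steps 4-5: binary-search the margin p, then the column count K'."""
    lo, hi = p_tol, 0.5                       # assume cd increases with p on (0, 0.5]
    while hi - lo > p_tol:
        mid = (lo + hi) / 2
        if cd_of_independent(K, mid) < cd_ind_D:
            lo = mid
        else:
            hi = mid
    p = (lo + hi) / 2
    lo_k, hi_k = 1, K                         # assume cd increases with column count
    while lo_k < hi_k:
        mid = (lo_k + hi_k) // 2
        if cd_of_independent(mid, p) < cd_D:
            lo_k = mid + 1
        else:
            hi_k = mid
    return lo_k
```

Using Proposition 1's analytical form in place of the simulation would make the searches both faster and less noisy.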

Practical Applications and Interpretation:

  • Complexity Measure: NCD provides a single number describing the "effective complexity" or intrinsic dimensionality of a binary dataset. A high NCD relative to the number of variables $K$ suggests less structure/more independence, while a low NCD suggests strong dependencies or sparsity patterns that reduce the effective degrees of freedom.

  • Dataset Comparison: NCD allows comparing the structural complexity of different binary datasets, even if they have different numbers of variables or sparsity. For instance, the paper shows NCD varies significantly across real datasets (Table 2). Retail (K=16k) has ncd ~1.8k (11% of K), while Accidents (K=469) has ncd ~220 (47% of K), indicating Retail has more structure per variable than Accidents.
  • Alternative to PCA for Binary Data: The paper compares NCD to PCA (number of components for 90% variance). They correlate positively, but there are differences. For 'Paleo' data, PCA suggests higher dimension than NCD and average correlation, indicating PCA might overestimate complexity for some binary data structures, especially those with homogeneous margins (which NCD might handle better). This suggests NCD can be a complementary or better measure for certain binary datasets.
  • Analyzing Subgroups: Studying the NCD of clusters or subgroups can reveal how dimensionality changes locally. The experiments show clusters can have higher dimensions than the combined dataset, implying the structure reducing the overall dimension might be due to the relationships between clusters.

Implementation Considerations:

  • Sparsity: Binary data is often sparse. Efficiently computing $L_1$ distances and sums of distances is crucial. Use sparse matrix libraries (e.g., SciPy in Python) and optimize distance calculations for binary data (popcount, bitwise operations if applicable).
  • Sampling: For datasets with millions of rows, sampling is essential to make the computation feasible. The choice of sample size affects both accuracy and runtime.
  • Linear Regression: Fitting the line to the points $(\log r, \log P(Z_D < r))$ is a standard linear regression task. You need to select an appropriate range of $r$ values or quantiles; the paper fits between the first and third quartiles of the distance distribution.
  • Binary Search (for NCD without approximation): Implementing the binary searches for $p$ and $K'$ requires an efficient way to calculate the correlation dimension of independent binary data, or an approximation of it. Proposition 1 provides an analytical form based on the normal approximation of sums of Bernoulli variables, which can be used here.
  • Computational Resources: Calculating pairwise distances (even with sampling) can be memory and CPU intensive. Distributed computing frameworks could be beneficial for large datasets.
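As one concrete bit-level trick (my own illustration, not from the paper): rows can be bit-packed so that each $L_1$ distance becomes an XOR followed by a popcount.

```python
import numpy as np

def packed_l1(D):
    """All pairwise L1 distances of a 0/1 matrix via bit-packing.

    Each row is packed into bytes; for 0/1 rows, the L1 distance equals
    the popcount of the XOR of the packed representations.
    """
    P = np.packbits(np.asarray(D, dtype=np.uint8), axis=1)   # (n, ceil(K/8))
    n = P.shape[0]
    dists = {}
    for i in range(n):
        for j in range(i + 1, n):
            xor = np.bitwise_xor(P[i], P[j])
            dists[(i, j)] = int(np.unpackbits(xor).sum())    # popcount
    return dists
```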

The paper highlights that this method, unlike PCA or SVD, does not provide a low-dimensional embedding. Its purpose is to measure intrinsic dimension, not to map data to a lower space for visualization or feature reduction. However, knowing the intrinsic dimension can be useful for model selection, algorithm choice (e.g., which indexing structures or distance metrics might work well), or simply understanding the underlying structure of the data.
