Model selection and clustering in stochastic block models with the exact integrated complete data likelihood

Published 12 Mar 2013 in stat.ME | (1303.2962v2)

Abstract: The stochastic block model (SBM) is a mixture model used for the clustering of nodes in networks. It has now been employed for more than a decade to analyze very different types of networks in many scientific fields such as Biology and social sciences. Because of conditional dependency, there is no analytical expression for the posterior distribution over the latent variables, given the data and model parameters. Therefore, approximation strategies, based on variational techniques or sampling, have been proposed for clustering. Moreover, two SBM model selection criteria exist for the estimation of the number K of clusters in networks but, again, both of them rely on some approximations. In this paper, we show how an analytical expression can be derived for the integrated complete data log likelihood. We then propose an inference algorithm to maximize this exact quantity. This strategy enables the clustering of nodes as well as the estimation of the number clusters to be performed at the same time and no model selection criterion has to be computed for various values of K. The algorithm we propose has a better computational cost than existing inference techniques for SBM and can be employed to analyze large networks with ten thousand nodes. Using toy and true data sets, we compare our work with other approaches.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a new inference strategy that maximizes an exact integrated complete data likelihood (ICLₑₓ) to jointly determine node clusters and the optimal number of clusters.
A scalable greedy algorithm is developed, which iteratively reassigns nodes and merges clusters without relying on asymptotic or variational approximations.
Empirical results demonstrate significant improvements in clustering accuracy and model selection, outperforming traditional techniques in large, complex networks.

Model Selection and Clustering in Stochastic Block Models with the Exact Integrated Complete Data Likelihood

Overview of the Problem and Contributions

This paper addresses the inference and model selection problem for the classical stochastic block model (SBM) applied to network clustering. SBMs, widely used for analyzing network data in various fields, posit that a graph's nodes are divided into $K$ clusters, with connection probabilities between nodes determined by the cluster assignments. While extensions allow for mixed and overlapping memberships, the focus here is on the standard SBM, where each node belongs to exactly one cluster and edge probabilities are governed by a $K \times K$ connectivity matrix.

Classic inference procedures for SBMs fall short in two key aspects: scalability and robustness of cluster number estimation. Existing algorithms such as variational EM require repeated runs for different cluster numbers, relying on asymptotic approximations (e.g., ICL with Laplace/Stirling) for model selection, or on slow-mixing Markov chain Monte Carlo methods such as collapsed Gibbs samplers, which tend to overestimate $K$ in large, sparse networks due to poor mixing and low-acceptance moves.

The main contributions of this paper are:

Introduction of a new inference strategy that eschews asymptotic approximations, based on maximizing an exact analytical expression of the marginal complete data log-likelihood, denoted $ICL_{ex}$ .
Development of a scalable, greedy optimization procedure that directly maximizes $ICL_{ex}$ with respect to both cluster assignments ( $Z$ ) and the number of clusters ( $K$ ) simultaneously, starting from an oversegmented solution and iteratively merging or eliminating clusters as warranted.
Demonstration, via extensive synthetic and real-world experiments, that this approach outperforms prior techniques for both clustering accuracy and model selection, especially in large and complex networks.

Exact Integrated Complete Data Likelihood Derivation

The $ICL_{ex}$ criterion builds on the full Bayesian marginalization of the SBM parameters, using non-informative conjugate priors for both the cluster proportions (Dirichlet) and the block-model connectivity probabilities (Beta). This allows analytic integration over the nuisance parameters $\alpha$ (mixing proportions) and $\Pi$ (connectivity matrix). The resulting marginal likelihood is given by:

$ICL_{ex}(Z,K) = \log p(X,Z|K) = \sum_{k,l}^{K} \log\left( \frac{\Gamma(\eta_{kl}^{0}+\zeta_{kl}^{0})\Gamma(\eta_{kl})\Gamma(\zeta_{kl})}{\Gamma(\eta_{kl}+\zeta_{kl})\Gamma(\eta_{kl}^{0})\Gamma(\zeta_{kl}^{0})}\right) + \log\left(\frac{\Gamma(\sum_{k=1}^{K}n_{k}^{0})\prod_{k=1}^{K}\Gamma(n_{k})}{\Gamma(\sum_{k=1}^{K}n_{k})\prod_{k=1}^{K}\Gamma(n_{k}^{0})}\right).$

Here, $n_k$ involves the assignment counts to cluster $k$ , and $(\eta_{kl}, \zeta_{kl})$ serve as pseudo-counts for edges/non-edges between cluster $k$ and $l$ .

Unlike prior approximations (e.g., [Daudin et al., 2008]), $ICL_{ex}$ introduces no asymptotic or variational approximations. The penalization for model complexity is inherently structured by the use of the Gamma function, providing a decisive advantage in model selection without tuning penalty terms.

Greedy Optimization Algorithm

Direct maximization of $ICL_{ex}$ with respect to the combinatorial node assignment matrix $Z$ and $K$ is computationally infeasible for all but the smallest graphs. The authors propose a scalable greedy iterative procedure:

Initialization: Start with an upper-bound on clusters $K_{up}$ , assigning nodes via random or k-means initialization.
Iterative Update: For each node, evaluate the change in $ICL_{ex}$ induced by transferring the node to any other cluster, efficiently updating the objective using local computations leveraging the conjugate structure. Clusters with zero membership are removed, decreasing $K$ dynamically.
Convergence: Repeat until a pass over all nodes yields no further $ICL_{ex}$ improvement.
Hierarchical Postprocessing: Optionally, attempt cluster merges as a final refinement.

The computational complexity is dominated by per-node updates ( $\mathcal{O}(l + K^2)$ per node, where $l$ is the average degree), yielding total cost $\mathcal{O}(NK_{up}^{2} + L)$ , significantly outperforming variational or sampling-based procedures, especially since $K$ is adaptively reduced during the run.

Experimental Evaluation

The paper conducts extensive synthetic and real-data experiments, focusing on clustering quality and accuracy in estimating $K$ . Key findings include:

Synthetic Networks: Across settings varying in node count, cluster structure (including non-community settings with hubs and block structure), and noise, greedy $ICL_{ex}$ matches or outperforms collapsed Gibbs sampling, variational EM, and spectral methods. Notably, in large and complex graphs ( $N=10^4$ nodes, $K=50$ planted clusters), $ICL_{ex}$ achieves normalized mutual information (NMI) of 0.88 versus 0.67 for the closest competitor, demonstrating resilience to overfitting and undersegmentation.
Model Selection: The algorithm robustly infers $K$ , unlike MCMC approaches which oversegment under poor mixing (high $K$ ), and variational methods which often collapse distinct structures due to reliance on lower-bound maximization and asymptotic criteria.
Scalability: The method is empirically validated on networks with tens of thousands of nodes and millions of edges without convergence or memory issues.

Real Network Analysis

On a hyperlink network of 1,360 blogs, the greedy $ICL_{ex}$ algorithm discovers 37 fine-grained clusters, including not only classical communities but also small subgroups and hub clusters corresponding to real-world entities, such as prominent illustrators and school cohorts. This reveals more nuanced structures than community-detection methods based on modularity, which suffer from the well-known resolution limit and produce only 8 broad groups on the same data.

Practical and Theoretical Implications

This work sets a new standard for unsupervised network clustering in SBMs, by:

Demonstrating the benefits of using exact marginal likelihoods for both node assignment and cluster number selection, thereby avoiding the pitfalls of asymptotic and variational approximations.
Providing a strategy that admits scaling to large networks, a crucial requirement as network datasets continue to grow in size and complexity.
Enabling the automatic and integrated selection of $K$ , which is particularly valuable in exploratory analysis where domain knowledge may not be sufficient to constrain the model class a priori.

Looking ahead, the exact $ICL_{ex}$ criterion and corresponding greedy inference could be readily generalized to richer SBM variants (valued, overlapping, or degree-corrected models) by adapting the conjugate prior framework. This opens pathways for robust, scalable community detection algorithms applicable to diverse and complex real-world networks.

Conclusion

The paper presents a theoretically principled and computationally efficient solution to the joint clustering and model selection problem in SBM-based network analysis, by leveraging exact Bayesian integration and local greedy optimization. Empirical evidence confirms superiority or parity with existing approaches across multiple settings and justifies the broader adoption of the $ICL_{ex}$ -based paradigm for large-scale network clustering and blockmodel selection.