Groupby Neighbors Random Walk (GNRW)
- The paper introduces GNRW, a higher-order Markov chain sampling method that leverages transition history to stratify neighbor groups and reduce query counts.
- It preserves the stationary distribution of simple random walks while achieving lower asymptotic variance and faster convergence.
- Empirical results on social networks demonstrate a 30–50% reduction in estimation error compared to baseline methods, validating its efficiency improvements.
Groupby Neighbors Random Walk (GNRW) is a higher-order Markov chain-based sampling method designed to improve the efficiency of random walk-based analytics over large online social networks, where the available query primitives typically only expose node neighbor queries. Unlike the baseline simple random walk (SRW), which suffers from slow mixing and high estimator variance due to its memoryless selection of neighbors, GNRW leverages the walk’s transition history to induce systematic stratification over neighbor groups, thereby reducing the number of queries required to achieve a specified estimation accuracy. The method achieves this efficiency gain without altering the stationary distribution of the walk, providing a statistically valid “drop-in” alternative to SRW for network sampling tasks (Zhou et al., 2015).
1. Formal Construction and Transition Dynamics
Let $G = (V, E)$ represent an undirected graph. The GNRW defined on $G$ is a Markov chain whose current state at step $i$ is $x_i \in V$, with history-sensitive structures assigned to every directed edge $(u, v)$.
Given a fixed grouping function $g$ and node $v$, partition the neighborhood as $g(N(v)) = \{S_1, \ldots, S_m\}$, where the $S_i$ are pairwise disjoint and $\bigcup_{i=1}^{m} S_i = N(v)$. These groups are typically defined by a node attribute (e.g., degree or other attribute values).
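As a minimal illustration of such a grouping function, the sketch below partitions a neighbor list into fixed degree buckets; the bucket boundaries are illustrative assumptions, not values from the paper:

```python
def group_by_degree(neighbors, degree, boundaries=(10, 50, 200)):
    """Partition `neighbors` into disjoint degree buckets
    [0,10), [10,50), [50,200), [200,inf); empty buckets are dropped,
    so the non-empty groups S_1..S_m cover N(v) exactly once."""
    groups = [[] for _ in range(len(boundaries) + 1)]
    for w in neighbors:
        idx = sum(degree[w] >= b for b in boundaries)  # bucket index of w
        groups[idx].append(w)
    return [g for g in groups if g]
```

Any deterministic, disjoint, covering partition of $N(v)$ works in the role of $g$; this one is merely attribute-aligned with degree.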
For each directed pair $(u, v)$, maintain:
- $S(u, v)$: the set of neighbor groups from $g(N(v))$ already chosen following prior transitions $u \to v$.
- For every $S_i \in g(N(v))$, a set $b_i(u, v)$ of neighbors within $S_i$ visited after $(u, v)$ (both structures implement sampling "without replacement").
The transition at step $i$, from $u = x_{i-2}$ and $v = x_{i-1}$, selects $x_i$ by:
- Identifying the remaining eligible groups $CS = \{S_j \in g(N(v)) : S_j \notin S(u, v)\}$; if $CS = \emptyset$, resetting $S(u, v) \leftarrow \emptyset$ so that $CS = \{S_1, \ldots, S_m\}$.
- Selecting $S^* \in CS$ with probability $|S^*| / \sum_{S_j \in CS} |S_j|$.
- Within $S^* = S_k$, letting $U = S_k \setminus b_k(u, v)$. If $U \neq \emptyset$, picking $x_i$ uniformly from $U$; otherwise, resetting $b_k(u, v) \leftarrow \emptyset$ and sampling uniformly from $S_k$. $b_k(u, v)$ and $S(u, v)$ are then updated accordingly.
The resulting two-level stratification, first over groups and then within groups, defines the transition kernel: conditional on the history stored at $(u, v)$, a neighbor $w \in S_k$ with $w \notin b_k(u, v)$ is selected with probability
$$p(w \mid u, v) = \frac{|S_k|}{\sum_{S_j \in CS} |S_j|} \cdot \frac{1}{|U_k|},$$
where $U_k = S_k \setminus b_k(u, v)$ and $CS$ is the set of eligible groups. Both $S(u, v)$ and each $b_k(u, v)$ reset to $\emptyset$ when exhausted.
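The group-selection step (size-proportional choice over not-yet-used groups, with the reset-on-exhaustion rule) can be sketched as:

```python
def group_selection_probs(groups, used):
    """Return {group index: P(S* = S_i)} for the GNRW stratum choice.
    Eligible groups are those whose index is not in `used`; if all are
    used, the memory resets and every group becomes eligible again.
    Probabilities are proportional to full group sizes |S_i|."""
    cs = [i for i in range(len(groups)) if i not in used]
    if not cs:                       # S(u,v) exhausted: reset
        cs = list(range(len(groups)))
    total = sum(len(groups[i]) for i in cs)
    return {i: len(groups[i]) / total for i in cs}
```

Note that the weight is the full size $|S_i|$, not the count of remaining unvisited members; this is what keeps the long-run visit frequencies uniform over $N(v)$.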
2. Implementation: Pseudocode
The GNRW sampler is initialized with arbitrary adjacent starting nodes $x_0, x_1$ and a grouping function $g$:
```
Input: x₀, x₁ (starting nodes); grouping function g; N (sample size)
Data structures:
    For each directed edge (u,v):  S(u,v) ← ∅      # groups used so far
    For each Sᵢ ∈ g(N(v)):         bᵢ(u,v) ← ∅     # neighbors chosen so far from Sᵢ

for i = 2 … N do
    u ← x_{i-2};  v ← x_{i-1}
    {S₁, …, S_m} ← g(N(v))
    CS ← { Sⱼ ∈ {S₁, …, S_m} : Sⱼ ∉ S(u,v) }
    if CS = ∅:
        S(u,v) ← ∅
        CS ← {S₁, …, S_m}
    # pick group S* ∈ CS with probability proportional to its size
    total_size ← Σ_{Sⱼ ∈ CS} |Sⱼ|
    choose S* ∈ CS with probability |S*| / total_size   # say S* = S_k
    # within S*, pick a neighbor without replacement
    U ← S_k \ b_k(u,v)
    if U ≠ ∅:
        w ← Uniform(U)
        b_k(u,v) ← b_k(u,v) ∪ {w}
    else:
        w ← Uniform(S_k)
        b_k(u,v) ← ∅                                    # reset this group's memory
    S(u,v) ← S(u,v) ∪ {S_k}
    x_i ← w
Output: {x₂, …, x_N}
```
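The pseudocode translates directly into Python. The sketch below assumes the graph is available as an adjacency dict; in the OSN setting, `adj[v]` would instead be a "get neighbors" API call. `g` must be deterministic so that group indices stay consistent across revisits of the same edge.

```python
import random
from collections import defaultdict

def gnrw(adj, g, x0, x1, n_samples, seed=0):
    """Sketch of a GNRW sampler. `g(neighbors)` returns a disjoint
    partition of the neighbor list. Per directed edge (u,v) we keep
    S(u,v) (indices of groups used) and b_k(u,v) (neighbors already
    drawn from group k), both reset on exhaustion."""
    rng = random.Random(seed)
    used_groups = defaultdict(set)   # (u,v)    -> group indices in S(u,v)
    used_nodes = defaultdict(set)    # (u,v,k)  -> neighbors in b_k(u,v)
    walk = [x0, x1]
    for _ in range(n_samples):
        u, v = walk[-2], walk[-1]
        groups = g(sorted(adj[v]))   # deterministic partition of N(v)
        cs = [i for i in range(len(groups)) if i not in used_groups[(u, v)]]
        if not cs:                   # all groups used: reset S(u,v)
            used_groups[(u, v)].clear()
            cs = list(range(len(groups)))
        # stratum selection: P(S*) proportional to full group size
        k = rng.choices(cs, weights=[len(groups[i]) for i in cs])[0]
        remaining = [w for w in groups[k] if w not in used_nodes[(u, v, k)]]
        if remaining:                # without-replacement pick within S_k
            w = rng.choice(remaining)
            used_nodes[(u, v, k)].add(w)
        else:                        # group exhausted: reset b_k(u,v)
            used_nodes[(u, v, k)].clear()
            w = rng.choice(groups[k])
        used_groups[(u, v)].add(k)
        walk.append(w)
    return walk[2:]
```

With a singleton grouping `lambda ns: [[x] for x in ns]` this degenerates to CNRW-style behavior, since every group has exactly one member.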
3. Stationarity, Asymptotic Variance, and Statistical Properties
GNRW provably preserves the stationary distribution of the SRW: $\pi(v) = d(v)/2|E|$, where $d(v) = |N(v)|$. This is established via path-block analysis: every return to a directed edge $(u, v)$ subdivides the walk into blocks. Under GNRW, all blocks associated with outgoing neighbors of $v$ are selected in a stratified "without replacement" fashion, both across groups and within each group. Over the long run, every neighbor is visited with frequency $1/|N(v)|$, so the marginal frequency at any node matches that of SRW.
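The degree-proportional stationary distribution can be checked numerically; the toy simulation below runs a plain SRW (the distribution GNRW is claimed to preserve) on a hypothetical four-node graph:

```python
import random
from collections import Counter

# Check that empirical visit frequencies approach pi(v) = d(v) / 2|E|.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(42)
v, counts, n = 0, Counter(), 200_000
for _ in range(n):
    v = rng.choice(adj[v])   # memoryless uniform neighbor choice (SRW)
    counts[v] += 1
two_E = sum(len(ns) for ns in adj.values())       # 2|E| = 8 here
freqs = {u: counts[u] / n for u in adj}
target = {u: len(adj[u]) / two_E for u in adj}    # degree-proportional pi
```

Here node 0 has degree 3 and should absorb $3/8$ of the visits, node 3 only $1/8$.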
For any bounded function $f: V \to \mathbb{R}$, the empirical mean $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} f(x_i)$ under GNRW achieves asymptotic variance no greater than that of SRW, due to stratified block sampling [Neal 2004]. This guarantees that the estimator is at least as efficient as that from SRW, and it strictly improves for attributes aligned with the chosen grouping.
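The variance claim mirrors classical stratified sampling. A hypothetical numeric sketch, with values chosen so the strata are internally homogeneous (the extreme case where stratification helps most):

```python
import random
import statistics

# Variance of a mean estimate: simple random sampling vs stratified
# sampling over a set whose values cluster perfectly by group.
values = [1, 1, 1, 100, 100, 100]          # two homogeneous strata
strata = [values[:3], values[3:]]
rng = random.Random(0)

def srs_mean(k=2):
    """Mean of k draws with replacement, ignoring the strata."""
    return statistics.mean(rng.choices(values, k=k))

def stratified_mean():
    """Mean of one draw per stratum (same total budget of 2 draws)."""
    return statistics.mean(rng.choice(s) for s in strata)

srs = [srs_mean() for _ in range(20_000)]
strat = [stratified_mean() for _ in range(20_000)]
```

With perfectly homogeneous strata the stratified estimator is deterministic (zero variance), while the simple-sampling estimator retains the full between-strata variance; attribute-aligned GNRW grouping exploits the same effect.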
4. Computational and Query-Efficiency Analysis
GNRW issues one "get neighbors" API query per transition, identical to SRW and related methods such as NB-SRW and CNRW. The additional bookkeeping, principally maintaining $S(u, v)$ and the per-directed-edge sets $b_i(u, v)$, can be implemented using two hash maps with amortized $O(1)$ access and update, giving $O(n)$ total space and time overhead after $n$ steps.
Crucially, since GNRW reduces both burn-in and estimator variance, the number of transitions (i.e., queries) required to reach a target estimation error is at most that of SRW. Both the "burn-in" to stationarity and the sample count needed for the desired estimator confidence are reduced in practice by GNRW-induced stratification (Zhou et al., 2015).
5. Empirical Performance and Results
Experimental comparisons on real and synthetic datasets demonstrate the empirical gains of GNRW over SRW, NB-SRW, and CNRW. Relative estimation errors after a fixed number of queries are consistently smallest for GNRW:
| Dataset / Task | SRW | NB‐SRW | CNRW | GNRW |
|---|---|---|---|---|
| Google Plus (avg. degree @ 500 q.) | 0.085 | 0.080 | 0.058 | 0.048 |
| Yelp (avg. degree @ 300 q.) | 0.33 | 0.29 | 0.26 | 0.20 |
| Yelp (avg. review count @ 300 q.) | 0.38 | 0.34 | 0.31 | 0.23 |
| Facebook subgraph (KL div. @ 400 q.) | 0.18 | 0.16 | 0.11 | 0.08 |
For synthetic barbell graphs, estimation error reductions of 30–50% are observed for GNRW versus SRW and NB‐SRW. In all tests:
- Ordering of estimation error: GNRW < CNRW ≤ NB‐SRW < SRW ≪ MHRW (Metropolis–Hastings random walk).
- The maximum improvement is achieved when the grouping attribute matches the estimation target attribute (e.g., degree for average degree estimation).
6. Impact of Grouping Function and Practical Considerations
The grouping function $g$ critically determines stratification depth and estimator efficiency. When $m = |N(v)|$ (i.e., each group is a singleton), GNRW reduces to CNRW. Grouping by a relevant node attribute, such as degree ("GNRW‐by‐Degree") for degree-based analytics or review count for review-based estimation, substantially improves mixing across targeted features and thus accelerates convergence. Randomized groupings (e.g., via hashing) offer limited improvements.
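The two extremes can be written down directly; the median-split function below is a hypothetical attribute-aligned grouping, not one prescribed by the paper:

```python
def singleton_groups(neighbors):
    """m = |N(v)|: each neighbor is its own group, so GNRW
    coincides with CNRW (no across-group stratification left)."""
    return [[w] for w in neighbors]

def median_degree_groups(neighbors, degree):
    """Illustrative two-way attribute-aligned grouping: split the
    neighbor list at the median neighbor degree."""
    med = sorted(degree[w] for w in neighbors)[len(neighbors) // 2]
    lo = [w for w in neighbors if degree[w] < med]
    hi = [w for w in neighbors if degree[w] >= med]
    return [g for g in (lo, hi) if g]   # drop an empty side
```

For average-degree estimation, the second function is the kind of grouping the paper's "GNRW‐by‐Degree" variant favors: strata that are homogeneous in the target attribute.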
Group counts $m$ that are too large (over-stratification) lead to unnecessary overhead; too small, and stratification loses its benefit. Empirically, moderate values of $m$ work well for social networks with typical node degrees in the range $50$–$200$.
A plausible implication is that for practical deployment, aligning groupings with the primary statistics of interest yields the greatest sampling efficiency (Zhou et al., 2015).