Groupby Neighbors Random Walk (GNRW)
- The paper introduces GNRW, a higher-order Markov chain sampling method that leverages transition history to stratify neighbor groups and reduce query counts.
- It preserves the stationary distribution of simple random walks while achieving lower asymptotic variance and faster convergence.
- Empirical results on social networks demonstrate a 30–50% reduction in estimation error compared to baseline methods, validating its efficiency improvements.
Groupby Neighbors Random Walk (GNRW) is a higher-order Markov chain-based sampling method designed to improve the efficiency of random walk-based analytics over large online social networks, where the available query primitives typically only expose node neighbor queries. Unlike the baseline simple random walk (SRW), which suffers from slow mixing and high estimator variance due to its memoryless selection of neighbors, GNRW leverages the walk’s transition history to induce systematic stratification over neighbor groups, thereby reducing the number of queries required to achieve a specified estimation accuracy. The method achieves this efficiency gain without altering the stationary distribution of the walk, providing a statistically valid “drop-in” alternative to SRW for network sampling tasks (Zhou et al., 2015).
1. Formal Construction and Transition Dynamics
Let $G = (V, E)$ represent an undirected graph. The GNRW defined on $G$ is a Markov chain whose current state at step $i$ is $x_i \in V$, with history-sensitive structures assigned to every directed edge $(u, v)$.
Given a fixed grouping function $g$ and node $v$, partition the neighborhood as $g(N(v)) = \{S_1, \ldots, S_m\}$, where the $S_i$ are pairwise disjoint and $\bigcup_{i=1}^{m} S_i = N(v)$. These groups are typically defined by a node attribute (e.g., degree or other attribute values).
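As a minimal illustration of such a grouping function, the sketch below partitions a neighbor list into fixed degree buckets; the bucket boundaries are illustrative assumptions, not values from the paper:

```python
def group_by_degree(neighbors, degree, boundaries=(10, 50, 200)):
    """Partition `neighbors` into disjoint degree buckets
    [0,10), [10,50), [50,200), [200,inf); empty buckets are dropped,
    so the non-empty groups S_1..S_m cover N(v) exactly once."""
    groups = [[] for _ in range(len(boundaries) + 1)]
    for w in neighbors:
        idx = sum(degree[w] >= b for b in boundaries)  # bucket index of w
        groups[idx].append(w)
    return [g for g in groups if g]
```

Any deterministic, disjoint, covering partition of $N(v)$ works in the role of $g$; this one is merely attribute-aligned with degree.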
For each directed pair $(u, v)$, maintain:
- $S(u, v)$: the set of neighbor groups from $g(N(v))$ already chosen following prior transitions $u \to v$.
- For every $S_i \in g(N(v))$, a set $b_i(u, v)$ of neighbors within $S_i$ visited after $(u, v)$ (both structures implement sampling "without replacement").
The transition at step $i$, from $u = x_{i-2}$ and $v = x_{i-1}$, selects $x_i$ by:
- Identifying the remaining eligible groups $CS = \{S_j \in g(N(v)) : S_j \notin S(u, v)\}$; if $CS = \emptyset$, resetting $S(u, v) \leftarrow \emptyset$ so that $CS = \{S_1, \ldots, S_m\}$.
- Selecting $S^* \in CS$ with probability $|S^*| / \sum_{S_j \in CS} |S_j|$.
- Within $S^* = S_k$, letting $U = S_k \setminus b_k(u, v)$. If $U \neq \emptyset$, picking $x_i$ uniformly from $U$; otherwise, resetting $b_k(u, v) \leftarrow \emptyset$ and sampling uniformly from $S_k$. $b_k(u, v)$ and $S(u, v)$ are then updated accordingly.
The resulting two-level stratification, first over groups and then within groups, defines the transition kernel: conditional on the history stored at $(u, v)$, a neighbor $w \in S_k$ with $w \notin b_k(u, v)$ is selected with probability
$$p(w \mid u, v) = \frac{|S_k|}{\sum_{S_j \in CS} |S_j|} \cdot \frac{1}{|U_k|},$$
where $U_k = S_k \setminus b_k(u, v)$ and $CS$ is the set of eligible groups. Both $S(u, v)$ and each $b_k(u, v)$ reset to $\emptyset$ when exhausted.
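The group-selection step (size-proportional choice over not-yet-used groups, with the reset-on-exhaustion rule) can be sketched as:

```python
def group_selection_probs(groups, used):
    """Return {group index: P(S* = S_i)} for the GNRW stratum choice.
    Eligible groups are those whose index is not in `used`; if all are
    used, the memory resets and every group becomes eligible again.
    Probabilities are proportional to full group sizes |S_i|."""
    cs = [i for i in range(len(groups)) if i not in used]
    if not cs:                       # S(u,v) exhausted: reset
        cs = list(range(len(groups)))
    total = sum(len(groups[i]) for i in cs)
    return {i: len(groups[i]) / total for i in cs}
```

Note that the weight is the full size $|S_i|$, not the count of remaining unvisited members; this is what keeps the long-run visit frequencies uniform over $N(v)$.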
2. Implementation: Pseudocode
The GNRW sampler is initialized with arbitrary adjacent starting nodes $x_0, x_1$ and a grouping function $g$:
```
Input: x₀, x₁ (starting nodes); grouping function g; N (sample size)
Data structures:
    For each directed edge (u,v):  S(u,v) ← ∅      # groups used so far
    For each Sᵢ ∈ g(N(v)):         bᵢ(u,v) ← ∅     # neighbors chosen so far from Sᵢ

for i = 2 … N do
    u ← x_{i-2};  v ← x_{i-1}
    {S₁, …, S_m} ← g(N(v))
    CS ← { Sⱼ ∈ {S₁, …, S_m} : Sⱼ ∉ S(u,v) }
    if CS = ∅:
        S(u,v) ← ∅
        CS ← {S₁, …, S_m}
    # pick group S* ∈ CS with probability proportional to its size
    total_size ← Σ_{Sⱼ ∈ CS} |Sⱼ|
    choose S* ∈ CS with probability |S*| / total_size   # say S* = S_k
    # within S*, pick a neighbor without replacement
    U ← S_k \ b_k(u,v)
    if U ≠ ∅:
        w ← Uniform(U)
        b_k(u,v) ← b_k(u,v) ∪ {w}
    else:
        w ← Uniform(S_k)
        b_k(u,v) ← ∅                                    # reset this group's memory
    S(u,v) ← S(u,v) ∪ {S_k}
    x_i ← w
Output: {x₂, …, x_N}
```
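The pseudocode translates directly into Python. The sketch below assumes the graph is available as an adjacency dict; in the OSN setting, `adj[v]` would instead be a "get neighbors" API call. `g` must be deterministic so that group indices stay consistent across revisits of the same edge.

```python
import random
from collections import defaultdict

def gnrw(adj, g, x0, x1, n_samples, seed=0):
    """Sketch of a GNRW sampler. `g(neighbors)` returns a disjoint
    partition of the neighbor list. Per directed edge (u,v) we keep
    S(u,v) (indices of groups used) and b_k(u,v) (neighbors already
    drawn from group k), both reset on exhaustion."""
    rng = random.Random(seed)
    used_groups = defaultdict(set)   # (u,v)    -> group indices in S(u,v)
    used_nodes = defaultdict(set)    # (u,v,k)  -> neighbors in b_k(u,v)
    walk = [x0, x1]
    for _ in range(n_samples):
        u, v = walk[-2], walk[-1]
        groups = g(sorted(adj[v]))   # deterministic partition of N(v)
        cs = [i for i in range(len(groups)) if i not in used_groups[(u, v)]]
        if not cs:                   # all groups used: reset S(u,v)
            used_groups[(u, v)].clear()
            cs = list(range(len(groups)))
        # stratum selection: P(S*) proportional to full group size
        k = rng.choices(cs, weights=[len(groups[i]) for i in cs])[0]
        remaining = [w for w in groups[k] if w not in used_nodes[(u, v, k)]]
        if remaining:                # without-replacement pick within S_k
            w = rng.choice(remaining)
            used_nodes[(u, v, k)].add(w)
        else:                        # group exhausted: reset b_k(u,v)
            used_nodes[(u, v, k)].clear()
            w = rng.choice(groups[k])
        used_groups[(u, v)].add(k)
        walk.append(w)
    return walk[2:]
```

With a singleton grouping `lambda ns: [[x] for x in ns]` this degenerates to CNRW-style behavior, since every group has exactly one member.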
3. Stationarity, Asymptotic Variance, and Statistical Properties
GNRW provably preserves the stationary distribution of the SRW: $\pi(v) = d(v)/2|E|$, where $d(v) = |N(v)|$. This is established via path-block analysis: every return to a directed edge $(u, v)$ subdivides the walk into blocks. Under GNRW, all blocks associated with outgoing neighbors of $v$ are selected in a stratified "without replacement" fashion, both across groups and within each group. Over the long run, every neighbor is visited with frequency $1/|N(v)|$, so the marginal frequency at any node matches that of SRW.
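The degree-proportional stationary distribution can be checked numerically; the toy simulation below runs a plain SRW (the distribution GNRW is claimed to preserve) on a hypothetical four-node graph:

```python
import random
from collections import Counter

# Check that empirical visit frequencies approach pi(v) = d(v) / 2|E|.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(42)
v, counts, n = 0, Counter(), 200_000
for _ in range(n):
    v = rng.choice(adj[v])   # memoryless uniform neighbor choice (SRW)
    counts[v] += 1
two_E = sum(len(ns) for ns in adj.values())       # 2|E| = 8 here
freqs = {u: counts[u] / n for u in adj}
target = {u: len(adj[u]) / two_E for u in adj}    # degree-proportional pi
```

Here node 0 has degree 3 and should absorb $3/8$ of the visits, node 3 only $1/8$.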
For any bounded function $f: V \to \mathbb{R}$, the empirical mean $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} f(x_i)$ under GNRW achieves asymptotic variance no greater than that of SRW, due to stratified block sampling [Neal 2004]. This guarantees that the estimator is at least as efficient as that from SRW, and it strictly improves for attributes aligned with the chosen grouping.
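The variance claim mirrors classical stratified sampling. A hypothetical numeric sketch, with values chosen so the strata are internally homogeneous (the extreme case where stratification helps most):

```python
import random
import statistics

# Variance of a mean estimate: simple random sampling vs stratified
# sampling over a set whose values cluster perfectly by group.
values = [1, 1, 1, 100, 100, 100]          # two homogeneous strata
strata = [values[:3], values[3:]]
rng = random.Random(0)

def srs_mean(k=2):
    """Mean of k draws with replacement, ignoring the strata."""
    return statistics.mean(rng.choices(values, k=k))

def stratified_mean():
    """Mean of one draw per stratum (same total budget of 2 draws)."""
    return statistics.mean(rng.choice(s) for s in strata)

srs = [srs_mean() for _ in range(20_000)]
strat = [stratified_mean() for _ in range(20_000)]
```

With perfectly homogeneous strata the stratified estimator is deterministic (zero variance), while the simple-sampling estimator retains the full between-strata variance; attribute-aligned GNRW grouping exploits the same effect.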
4. Computational and Query-Efficiency Analysis
GNRW issues one "get neighbors" API query per transition, identical to SRW and related methods such as NB-SRW and CNRW. The additional bookkeeping, principally maintaining $S(u, v)$ and the per-directed-edge sets $b_i(u, v)$, can be implemented using two hash maps with amortized $O(1)$ access and update, giving $O(n)$ total space and time overhead after $n$ steps.
Crucially, since GNRW reduces both burn-in and estimator variance, the number of transitions (i.e., queries) required to reach a target estimation error is at most that of SRW. Both the "burn-in" to stationarity and the sample count needed for the desired estimator confidence are reduced in practice by GNRW-induced stratification (Zhou et al., 2015).
5. Empirical Performance and Results
Experimental comparisons on real and synthetic datasets demonstrate the empirical gains of GNRW over SRW, NB-SRW, and CNRW. Relative estimation errors after a fixed number of queries are consistently smallest for GNRW:
| Dataset / Task | SRW | NB‐SRW | CNRW | GNRW |
|---|---|---|---|---|
| Google Plus (avg. degree @ 500 q.) | 0.085 | 0.080 | 0.058 | 0.048 |
| Yelp (avg. degree @ 300 q.) | 0.33 | 0.29 | 0.26 | 0.20 |
| Yelp (avg. review count @ 300 q.) | 0.38 | 0.34 | 0.31 | 0.23 |
| Facebook subgraph (KL div. @ 400 q.) | 0.18 | 0.16 | 0.11 | 0.08 |
For synthetic barbell graphs, estimation error reductions of 30–50% are observed for GNRW versus SRW and NB‐SRW. In all tests:
- Ordering of estimation error: GNRW < CNRW ≤ NB‐SRW < SRW ≪ MHRW (Metropolis–Hastings random walk).
- The maximum improvement is achieved when the grouping attribute matches the estimation target attribute (e.g., degree for average degree estimation).
6. Impact of Grouping Function and Practical Considerations
The grouping function $g$ critically determines stratification depth and estimator efficiency. When $m = |N(v)|$ (i.e., each group is a singleton), GNRW reduces to CNRW. Grouping by a relevant node attribute, such as degree ("GNRW‐by‐Degree") for degree-based analytics or review count for review-based estimation, substantially improves mixing across targeted features and thus accelerates convergence. Randomized groupings (e.g., via hashing) offer limited improvements.
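The two extremes can be written down directly; the median-split function below is a hypothetical attribute-aligned grouping, not one prescribed by the paper:

```python
def singleton_groups(neighbors):
    """m = |N(v)|: each neighbor is its own group, so GNRW
    coincides with CNRW (no across-group stratification left)."""
    return [[w] for w in neighbors]

def median_degree_groups(neighbors, degree):
    """Illustrative two-way attribute-aligned grouping: split the
    neighbor list at the median neighbor degree."""
    med = sorted(degree[w] for w in neighbors)[len(neighbors) // 2]
    lo = [w for w in neighbors if degree[w] < med]
    hi = [w for w in neighbors if degree[w] >= med]
    return [g for g in (lo, hi) if g]   # drop an empty side
```

For average-degree estimation, the second function is the kind of grouping the paper's "GNRW‐by‐Degree" variant favors: strata that are homogeneous in the target attribute.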
Group counts $m$ that are too large (over-stratification) lead to unnecessary overhead; too small, and stratification loses its benefit. Empirically, moderate values of $m$ work well for social networks with typical node degrees in the range $50$–$200$.
A plausible implication is that for practical deployment, aligning groupings with the primary statistics of interest yields the greatest sampling efficiency (Zhou et al., 2015).