BREC Benchmark: Realized GNN Expressiveness
- BREC Benchmark is a standardized evaluation suite that rigorously tests the practical expressiveness of GNNs by assessing their ability to distinguish non-isomorphic graph pairs beyond 1-WL limits.
- It categorizes graph pairs into Basic, Regular, Extension, and CFI groups to provide granular insights into model performance across varying graph-theoretic challenges.
- The protocol employs pair-distinguishing accuracy with advanced statistical tests, guiding improvements in GNN design and practical deployment.
The BREC (Benchmark for Realized GNN Expressiveness) benchmark is a standardized evaluation suite designed to rigorously probe the practical expressive power of Graph Neural Network (GNN) architectures beyond the limitations of the 1-dimensional Weisfeiler–Lehman (1-WL) test. BREC addresses critical issues in previous GNN expressivity benchmarks, including inadequate difficulty, lack of granularity, and limited scale, by synthesizing a fine-grained collection of graph pairs tailored to systematically differentiate the realized—rather than merely theoretical—expressiveness of advanced GNNs. The benchmark is widely used to illuminate the discrimination capacity of next-generation GNNs, unique node identifier (UID) methods, and global relational architectures, with an evaluation methodology informed by recent advances in both theoretical and empirical graph learning (Wang et al., 2023, Bechler-Speicher et al., 2024, Yu et al., 27 Jan 2026).
1. Formal Definition and Structure
BREC is composed entirely of pairs of simple, undirected, unlabelled graphs that are non-isomorphic yet indistinguishable by the 1-WL test. Such pairs are particularly challenging for message-passing GNNs, whose separating power is classically bounded by 1-WL. The formal evaluation protocol for BREC is as follows:
- Pairwise task: For each of the 400 non-isomorphic test pairs (G, H), the goal is to decide whether a GNN model f can produce node or graph embeddings such that f(G) ≠ f(H).
- Expressiveness hierarchy: The distinction power is referenced to the k-dimensional Weisfeiler–Lehman (k-WL) framework, which describes the refinement process by which graph isomorphism can be tested with increasing granularity by operating on k-tuples of nodes.
- Separation metric: For each pair, if the embeddings are not identical (according to a threshold, often via cosine distance or Hotelling's T² statistic), the model "distinguishes" the pair. The main metric is "pair-distinguishing accuracy," i.e., the proportion of the 400 test pairs for which the model successfully distinguishes the graphs.
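The threshold-based separation check can be sketched as follows. This is a minimal illustration, not the benchmark's reference implementation: the function names and the threshold value are hypothetical, and BREC's actual protocol uses the statistical RPC test described below.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def distinguishes(emb_g, emb_h, threshold=1e-4):
    """A pair counts as distinguished when the embeddings differ by
    more than the threshold (the threshold value is illustrative)."""
    return cosine_distance(emb_g, emb_h) > threshold

def pair_accuracy(pairs, threshold=1e-4):
    """Pair-distinguishing accuracy: fraction of (emb_g, emb_h)
    pairs the model separates."""
    hits = sum(distinguishes(g, h, threshold) for g, h in pairs)
    return hits / len(pairs)
```

In the full protocol, the per-pair decision is made on distributions of embeddings over node permutations rather than on a single pair of vectors.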
The dataset is subdivided into four curated categories, representing increasingly challenging graph-theoretic scenarios:
| Category | # Pairs | Description |
|---|---|---|
| Basic | 60 | 10-node graphs, 1-WL indistinguishable |
| Regular | 140 | Four families of regular graphs, 1-WL indistinguishable, some 3-WL hard |
| Extension | 100 | Graphs requiring >1-WL and up to 3-WL to separate |
| CFI | 100 | Cai–Fürer–Immerman constructions, several requiring up to 4-WL |
All pairs are exhaustively checked for non-isomorphism using algorithms such as nauty/Traces (Wang et al., 2023).
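For intuition, a brute-force non-isomorphism check over node permutations looks like the sketch below. This is only illustrative and feasible for very small graphs; as noted above, BREC itself relies on canonical-labeling tools such as nauty/Traces.

```python
from itertools import permutations

def is_isomorphic(n, edges_a, edges_b):
    """Brute-force isomorphism test for two simple undirected graphs
    on n nodes, given as edge lists. Exponential in n; illustrative
    only (BREC uses nauty/Traces for exhaustive verification)."""
    ea = {tuple(sorted(e)) for e in edges_a}
    eb = {tuple(sorted(e)) for e in edges_b}
    if len(ea) != len(eb):
        return False
    for perm in permutations(range(n)):
        mapped = {tuple(sorted((perm[u], perm[v]))) for u, v in ea}
        if mapped == eb:
            return True
    return False
```

For example, two different labelings of the 4-cycle are accepted as isomorphic, while a 4-node path and a 4-node star (same edge count, different degree sequences) are rejected.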
2. Construction and Dataset Properties
Graph pairs in BREC are systematically generated to fill identified gaps in previous benchmarks:
- Basic pairs: Enumerated exhaustively among 10-node graphs, grouped by 1-WL color histograms, then random non-isomorphic pairs are drawn.
- Regular pairs: Drawn from (a) simple regular graphs, (b) strongly regular graphs (srg), (c) graphs satisfying the 4-vertex condition, (d) distance-regular graphs—parameters are sourced from standard catalogs to guarantee 1-WL indistinguishability but increased resistance to higher-order WL.
- Extension pairs: Constructed using the theoretical axes of subgraph counting, neighborhood radius, and node-marking frameworks. 1-WL and higher-order WL tests are applied; pairs are selected where only higher-order WL (e.g., 3-WL) succeeds.
- CFI pairs: Generated using the Cai–Fürer–Immerman graphs across backbone sizes to yield instances indistinguishable by up to 4-WL. 60 pairs require 3-WL, 20 require 4-WL, and 20 are 4-WL-indistinguishable.
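The 1-WL color-histogram grouping used for the Basic pairs can be sketched with a plain color-refinement routine. The function below is a hypothetical minimal implementation; the classic example of a 6-cycle versus two disjoint triangles (both 2-regular, hence non-isomorphic but 1-WL indistinguishable) shows why identical histograms are inconclusive.

```python
def wl_histogram(n, edges, rounds=3):
    """1-WL color refinement on a simple undirected graph.
    Returns the sorted multiset (histogram) of final node colors.
    Different histograms prove non-isomorphism; identical
    histograms are inconclusive."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    colors = {v: 0 for v in range(n)}  # uniform initial coloring
    for _ in range(rounds):
        # refine: own color plus sorted multiset of neighbor colors
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in range(n)}
        relabel = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: relabel[sigs[v]] for v in range(n)}
    return sorted(colors.values())

# C6 (one 6-cycle) vs 2*C3 (two triangles): non-isomorphic, yet both
# are 2-regular, so 1-WL gives every node the same color.
c6 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
two_c3 = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
```

Grouping candidate graphs by equal `wl_histogram` output and sampling non-isomorphic pairs within a group mirrors, in spirit, how the Basic category is populated.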
Graphs range from 10 nodes (Basic, Extension) up to 198 nodes (CFI), with corresponding variation in edge counts and diameters. The dataset contains 800 graphs in total, zero label imbalance, and each pair is a positive example for the “difficult-to-separate” class (Wang et al., 2023, Yu et al., 27 Jan 2026).
3. Experimental Protocol and Evaluation Methodology
The core evaluation task is “separating power” assessment:
- Training setup: For GNNs, models are typically trained in a Siamese fashion on the full dataset, using a contrastive (cosine-margin) loss to force the separation of non-isomorphic pairs.
- Reliable Paired Comparison (RPC): Each evaluated pair is subjected to 32 random node-permutations; the distribution of embedding-difference vectors is tested via Hotelling's T² statistic, with the equality hypothesis rejected at a fixed significance threshold.
- No split: All 23 baselines are both trained and evaluated over the same 400 pairs, using cross-validation or early stopping to avoid overfitting.
- Metric: The only reported metric is "pair-distinguishing accuracy" over the 400 pairs, i.e., the fraction for which f(G) ≠ f(H) according to RPC (Wang et al., 2023).
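The RPC test above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, and the rejection threshold is a placeholder where BREC derives its critical value from the F-distribution at a fixed significance level.

```python
import numpy as np

def hotelling_t2(diffs):
    """Hotelling's T^2 statistic for testing whether the mean of the
    embedding-difference vectors is zero. diffs: (q, d) array with one
    row per random node permutation (q = 32 in the BREC protocol)."""
    q, _ = diffs.shape
    mean = diffs.mean(axis=0)
    cov = np.cov(diffs, rowvar=False)  # (d, d) sample covariance
    # pseudo-inverse guards against singular covariance matrices
    return q * mean @ np.linalg.pinv(cov) @ mean

def rpc_distinguishes(diffs, threshold=10.0):
    """Reject 'the two embeddings are equal' when T^2 exceeds the
    threshold (illustrative value, not the benchmark's critical
    value, which comes from the F-distribution)."""
    return hotelling_t2(diffs) > threshold
```

A difference distribution with a mean far from the origin (relative to its spread) yields a large T² and is counted as distinguished; a symmetric, zero-mean distribution is not.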
The protocol incorporates internal reliability checks by permuting single graphs to control for numerical artifacts, with the procedure repeated over multiple seeds to assess the upper bound of realized expressiveness.
4. Empirical Results and Comparative Analysis
BREC enables granular discrimination between advanced GNNs and theoretical or non-GNN baselines:
- Expressiveness spectrum: Unlike prior tests (e.g., EXP, CSL), which were saturated by all beyond-1-WL models, BREC yields accuracies from 41.5% to 70.2%, providing meaningful granularity.
- Upper bounds:
  - 3-WL test (2-FWL): 67.5%
  - KP-GNN: 68.8%
  - I²-GNN: 70.2%
  - GSN: 63.5%
  - Best non-GNN (SPD-WL): 74.5% (computationally demanding)
- Category difficulty: Most models achieve perfect separation only in Basic and Extension; CFI pairs and strongly-regular Regular pairs present significant challenges, often requiring high-order WL power.
- Failure modes: Deep or large-radius models (e.g., PPGN simulating 3-WL) may degrade on large CFI graphs due to over-smoothing or computational limits. Certain hard CFI pairs remain indiscernible except by models matching or exceeding 4-WL (e.g., #1-FloydNet{4} achieves 99.8%).
- UID-based approaches: Methods like SIRI, which regularize towards UID-invariance only at the final layer, substantially outperform pure random-UID approaches on Regular and CFI pairs (SIRI: 71.4% on Regular, 1.0% on CFI, 56.5% overall) (Bechler-Speicher et al., 2024).
Performance on BREC provides an operational test of the extent to which an architecture’s theoretical expressive power is actually realized in practice, exposing gaps and highlighting strengths not evident from theoretical analysis alone (Wang et al., 2023, Bechler-Speicher et al., 2024, Yu et al., 27 Jan 2026).
5. Theoretical Insights and Model Design Implications
BREC is specifically designed to diagnose the “realized” as opposed to “theoretical” expressiveness of GNNs:
- k-WL expressiveness: BREC's fine stratification allows one to empirically pin down where architectures falter relative to their position in the k-WL hierarchy. For example, base FloydNet matches 3-WL performance exactly (270/400); hybrid models (FloydNet-KP) extend this with local subgraph features to 81.5%, while higher-order generalizations (#1-FloydNet{4}) achieve 99.8% (Yu et al., 27 Jan 2026).
- Random-UID GNNs: Unregularized (fully invariant) UID-GNNs collapse to 1-WL expressiveness, whereas SIRI's last-layer contrastive regularization lets the UID features contribute to genuine separation, but only at the output (Bechler-Speicher et al., 2024).
- Limitations: Models that do not implement global all-pairs reasoning or lack sufficient radius/depth encounter dramatic performance drop-offs on CFI and Regular graph pairs, demonstrating the necessity of genuinely higher-order mechanisms.
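The tension behind the random-UID point can be made concrete with a toy sketch. Everything here is hypothetical and illustrative, not SIRI's actual architecture: one round of message passing where each node's feature is a random unique ID, followed by a permutation-invariant sorted readout.

```python
import random

def embed_with_uids(n, edges, seed):
    """Toy 1-layer message passing with random unique node IDs:
    each node's feature is its UID; one round of neighbor-sum, then
    a sorted readout. Different UID draws give different embeddings
    for the SAME graph, so a model must either learn invariance to
    the UIDs (collapsing back toward 1-WL) or exploit them in a
    regularized way, as SIRI does at its final layer."""
    rng = random.Random(seed)
    uids = [rng.random() for _ in range(n)]
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    msgs = [uids[v] + sum(uids[u] for u in adj[v]) for v in range(n)]
    return sorted(round(m, 6) for m in msgs)

triangle = [(0, 1), (1, 2), (2, 0)]
```

Re-running with a fresh seed changes the embedding of the same graph, which is exactly the instability that full UID-invariance removes, at the cost of expressive power.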
This suggests BREC is well suited for detecting architectural bottlenecks related to global relational reasoning, permutation invariance, and substructure sensitivity (Wang et al., 2023, Bechler-Speicher et al., 2024, Yu et al., 27 Jan 2026).
6. Practical Recommendations and Usage
Key guidelines for employing BREC in empirical GNN research include:
- Statistical rigor: Use RPC with its prescribed permutation count, Hotelling's T² test, and significance threshold, applying both the separation test and the reliability check, to ensure robust separation statistics.
- Training regime: Adopt Siamese architectures with contrastive losses; tune GNN subgraph radii (typically 6–8), k-WL layers (5–6), and monitor over-smoothing, particularly for large or high-diameter graphs.
- Reporting: Preferred accuracy reporting is the best over multiple random seeds, and researchers should fully document model tuning and pairwise decisions for reproducibility.
- Verification: Always validate non-isomorphism using canonical labeling packages.
- Benchmark extensibility: Researchers are encouraged to integrate new GNN designs into the BREC infrastructure for direct and standardized “pair-distinguishing accuracy” comparison over the 400 pairs (Wang et al., 2023).
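The contrastive training regime recommended above can be sketched with a simple cosine-margin loss. This is a minimal illustration, not BREC's reference loss: the function names and the margin value are assumptions. Because every BREC pair is non-isomorphic, each pair acts as a negative example whose embeddings should be pushed apart.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_margin_loss(emb_g, emb_h, margin=0.2):
    """Hinge-style loss for a non-isomorphic (negative) pair: penalize
    cosine similarity above 1 - margin, zero loss once the embeddings
    are separated by at least the margin. (The margin value here is
    illustrative, not the benchmark's prescribed setting.)"""
    return max(0.0, cosine_sim(emb_g, emb_h) - (1.0 - margin))
```

In a Siamese setup, the same GNN encodes both graphs of a pair and this loss (summed over pairs) drives the gradient updates.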
As an expressive diagnostic, BREC has become the benchmark of record for empirical claims of beyond-1-WL expressiveness, serving as the canonical evaluation testbed for both theoretical advancements and practical deployments in advanced GNN research (Wang et al., 2023, Bechler-Speicher et al., 2024, Yu et al., 27 Jan 2026).