POBench-PDE Benchmark Suite
- POBench-PDE is a standardized family of benchmarks designed to test PDE solvers and scientific machine learning algorithms across diverse models.
- It provides reproducible datasets, unified APIs, and detailed evaluation metrics for tasks such as operator learning, Bayesian inversion, mesh quality analysis, and dynamic poroelasticity.
- The framework enables quantitative comparisons with high-fidelity ground-truth solutions and rigorous diagnostic protocols for both forward and inverse problems.
POBench-PDE denotes a family of systematically designed benchmarks for Partial Differential Equations (PDEs) that facilitate evaluation and comparison of algorithms in scientific machine learning, uncertainty quantification, mesh geometry, and numerical PDE solvers. Through its various forms—spanning data-driven operator learning, Bayesian inverse problems, mesh quality analysis for polytopal elements, handling partial observations, and dynamic poroelasticity—it provides reproducible, extensible platforms for standardized algorithmic assessment.
1. Scope and Rationale
The core objective of POBench-PDE is to address the need for rigorous, widely-adopted PDE benchmarks supporting diverse scientific machine learning and computational PDE research. These benchmarks feature:
- Canonical and real-world PDE models (time-dependent, steady, 1D–3D, linear/nonlinear, scalar/vector, forward/inverse)
- Large, off-the-shelf datasets with code for further customizable generation and parameter variation
- Unified APIs for data access, model evaluation, and extension
- Reference solutions and high-fidelity ground-truth statistics for quantitative, like-for-like algorithm comparison
Specific implementations include:
- Data-driven emulation and surrogate learning for dynamical systems (Takamoto et al., 2022)
- Bayesian inversion of spatially variable coefficients via MCMC (Aristoff et al., 2021)
- Benchmarking mesh quality for polytopal element methods (PEM) (Attene et al., 2019)
- Operator learning from partial observations (Hou et al., 22 Jan 2026)
- Benchmarking dynamic poroelasticity solvers (Anselmann et al., 2023)
2. Benchmark Architectures and PDE Catalogs
Data-Driven Operator Learning Benchmarks
PDEBench includes eleven canonical and application-oriented PDEs, varying in spatial dimension (1D–3D), type, and solution complexity (Takamoto et al., 2022):
- 1D: Advection, Burgers’, Diffusion–Reaction, Diffusion–Sorption
- 2D: Diffusion–Reaction (FitzHugh–Nagumo), Darcy flow, Incompressible/Compressible Navier–Stokes, Shallow water
- 3D: Compressible Navier–Stokes
Each PDE is precisely specified by governing equations, initial/boundary conditions, and randomized parameters. For example, the 1D advection equation is
$$\partial_t u(t,x) + \beta\,\partial_x u(t,x) = 0,$$
with periodic boundary conditions and randomly sampled initial conditions constructed as sums of sines.
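A sketch of this setup, assuming sum-of-sines initial conditions with random amplitudes, phases, and integer wavenumbers (PDEBench's exact sampling ranges may differ). With periodic BCs, pure advection has the exact translated solution, which makes it a convenient sanity check:

```python
import numpy as np

def sum_of_sines_ic(x, n_modes=4, rng=None):
    """Random IC as a superposition of sine modes with random amplitudes,
    phases, and wavenumbers (illustrative sampling ranges)."""
    rng = np.random.default_rng() if rng is None else rng
    L = x[-1] - x[0] + (x[1] - x[0])  # periodic domain length (uniform grid)
    u0 = np.zeros_like(x)
    for k in rng.integers(1, 8, size=n_modes):
        amp = rng.uniform(0.0, 1.0)
        phase = rng.uniform(0.0, 2 * np.pi)
        u0 += amp * np.sin(2 * np.pi * k * x / L + phase)
    return u0

def advect_exact(u0_fn, x, t, beta, L):
    """Exact solution of u_t + beta * u_x = 0 with periodic BCs:
    the initial profile translated by beta * t."""
    return u0_fn((x - beta * t) % L)

# usage: periodic interpolation of a sampled IC, advected for half a period
x = np.linspace(0.0, 1.0, 256, endpoint=False)
u0 = sum_of_sines_ic(x, rng=np.random.default_rng(0))
u0_fn = lambda xs: np.interp(xs, x, u0, period=1.0)
u_half = advect_exact(u0_fn, x, t=0.5, beta=1.0, L=1.0)
```

Reference solvers in the benchmark replace the exact translation with numerical schemes; the closed form above is what those schemes are checked against.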
Bayesian Inversion Benchmark (MCMC)
The problem is coefficient identification in the Poisson equation $-\nabla\cdot(\theta\nabla u) = f$ on the unit square, where the diffusion coefficient $\theta$ is piecewise constant, parameterized by one value per cell of an $8\times 8$ grid. Observables are point values of $u$ at $169$ measurement locations, corrupted by i.i.d. Gaussian noise. The task is to reconstruct $\theta$ from these data using MCMC, with precisely specified priors and acceptance ratios (Aristoff et al., 2021).
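The sampling loop can be sketched with a random-walk Metropolis kernel. Here a stand-in linear forward map replaces the benchmark's actual FEM Poisson solve, and the prior and noise parameters are illustrative, not the benchmark's specification:

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_steps=5000, step=0.1, rng=None):
    """Random-walk Metropolis: propose theta' = theta + step * N(0, I),
    accept with probability min(1, exp(log_post(theta') - log_post(theta)))."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    lp = log_post(theta)
    chain, n_accept = [], 0
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
            n_accept += 1
        chain.append(theta.copy())
    return np.array(chain), n_accept / n_steps

# Illustrative posterior: Gaussian likelihood through a stand-in linear
# forward map G (the real benchmark solves the Poisson PDE here).
rng = np.random.default_rng(1)
G = rng.standard_normal((169, 64))           # 169 observations, 64 coefficients
theta_true = rng.uniform(1.0, 10.0, 64)
data = G @ theta_true + 0.05 * rng.standard_normal(169)

def log_post(theta):
    resid = data - G @ theta
    # Gaussian likelihood (noise sd 0.05) + broad Gaussian prior (illustrative)
    return -0.5 * np.sum(resid**2) / 0.05**2 - 0.5 * np.sum((theta - 5.0)**2) / 9.0

chain, acc_rate = metropolis_hastings(log_post, np.full(64, 5.0),
                                      n_steps=2000, step=0.01)
```

The archived reference posterior lets any such sampler be validated against ground-truth moments rather than against another approximate run.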
Mesh Quality and Solver Performance for PEM
POBench-PDE analyzes the effect of polygonal mesh geometry on solver conditioning and approximation:
- Eight parametric polygon families, each systematically degenerating in shape
- Twelve per-cell quality metrics: kernel-area ratio (KAR), minimum angle (MA), circle ratio (CR), edge ratio (ER), perimeter-area ratio (PAR), etc.
- Solver metrics: relative $L^2$, $L^\infty$, and energy-norm errors; condition number; empirical convergence rates
- Statistical Pearson correlations between shape descriptors and solver performance (Attene et al., 2019)
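Two of the named per-cell metrics, edge ratio (ER) and perimeter-area ratio (PAR), can be sketched as follows (the benchmark's exact normalizations may differ), with the Pearson correlation against a solver-performance proxy computed via `np.corrcoef`:

```python
import numpy as np

def edge_lengths(poly):
    """Edge lengths of a polygon given as an (n, 2) array of vertices."""
    d = np.roll(poly, -1, axis=0) - poly
    return np.hypot(d[:, 0], d[:, 1])

def edge_ratio(poly):
    """ER: shortest edge over longest edge (1 for regular polygons;
    normalization conventions vary)."""
    e = edge_lengths(poly)
    return e.min() / e.max()

def perimeter_area_ratio(poly):
    """PAR: perimeter over area, with the area from the shoelace formula."""
    x, y = poly[:, 0], poly[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    return edge_lengths(poly).sum() / area

# Pearson correlation between a quality metric and a (mock) solver error
# across a family of progressively degenerating quadrilaterals.
metrics, errors = [], []
for eps in np.linspace(0.05, 0.9, 10):
    quad = np.array([[0, 0], [1, 0], [1, 1 - eps], [0, 1]], float)  # collapsing corner
    metrics.append(edge_ratio(quad))
    errors.append(1.0 / metrics[-1])  # mock error proxy, not real solver output
pearson_r = np.corrcoef(metrics, errors)[0, 1]
```

In the benchmark itself, the error column comes from actual VEM solves on each degenerating mesh family rather than a proxy.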
Partial-Observation Operator Learning
POBench-PDE (partial observation) assesses neural operator performance under missing data regimes for:
- 2D incompressible Navier–Stokes turbulence
- Reaction–diffusion (Gray–Scott system)
- Real-world climate fields (ERA5)
It covers diverse missingness types (point- and patch-wise) at a range of sparsity levels, under a unified evaluation protocol (Hou et al., 22 Jan 2026).
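The two missingness types and the standard relative $L^2$ metric can be sketched as follows (mask conventions here, 1 = observed and 0 = missing, are illustrative):

```python
import numpy as np

def point_mask(shape, sparsity, rng):
    """Point-wise missingness: each grid point is dropped independently
    with probability `sparsity`."""
    return (rng.uniform(size=shape) >= sparsity).astype(float)

def patch_mask(shape, n_patches, patch, rng):
    """Patch-wise missingness: zero out `n_patches` random square
    patches of side `patch`."""
    mask = np.ones(shape)
    H, W = shape
    for _ in range(n_patches):
        i = rng.integers(0, H - patch)
        j = rng.integers(0, W - patch)
        mask[i:i + patch, j:j + patch] = 0.0
    return mask

def relative_l2(pred, true):
    """Relative L2 error, the standard operator-learning metric."""
    return np.linalg.norm(pred - true) / np.linalg.norm(true)

# usage: a smooth 2D field observed through a 90%-sparse point mask
rng = np.random.default_rng(0)
g = np.linspace(0, 2 * np.pi, 64)
field = np.sin(g)[:, None] * np.cos(g)[None, :]
observed = field * point_mask(field.shape, sparsity=0.9, rng=rng)
```

Operators are then trained and scored on such masked inputs, with `relative_l2` reported per missingness type and sparsity level.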
Dynamic Poroelasticity (Biot Equations)
A coupled hyperbolic–parabolic system on a 2D L-shaped domain,
$$\rho\,\partial_t^2 u - \nabla\cdot\sigma(u) + \alpha\,\nabla p = f,\qquad c_0\,\partial_t p + \alpha\,\nabla\cdot\partial_t u - \nabla\cdot(K\nabla p) = g,$$
with specific Dirichlet/Neumann/slip boundary regimes. Reference goal quantities are line integrals of the displacement $u$ and pressure $p$ on predefined subdomains (Anselmann et al., 2023).
3. Dataset Generation, Formats, and Task Protocols
Datasets are distributed in HDF5 format, structured as arrays of shape $(N_{\text{samples}}, N_t, N_x[, N_y, N_z], N_{\text{fields}})$, where $N_{\text{fields}}$ is the number of physical fields. Sample counts and spatio-temporal resolutions are problem-specific.
Initial conditions and PDE parameters are randomly sampled according to specified ranges or distributions. Metadata (PDE type, parameter values, BC types) is encoded as YAML attributes. Data splits for training/evaluation are user-configurable, with a typical 90/10 train/test division (Takamoto et al., 2022).
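A minimal sketch of this storage layout using `h5py` (the dataset and attribute names here are illustrative, not PDEBench's exact schema):

```python
import numpy as np
import h5py

def write_benchmark_file(path, fields, pde_name, params):
    """Write samples in a PDEBench-style layout: one array of shape
    (n_samples, n_t, n_x, n_fields) with PDE metadata as attributes
    (key names are illustrative)."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("data", data=fields, compression="gzip")
        dset.attrs["pde"] = pde_name
        for key, val in params.items():
            dset.attrs[key] = val

def load_split(path, train_frac=0.9, seed=0):
    """Load the array and split along the sample axis (typical 90/10)."""
    with h5py.File(path, "r") as f:
        data = f["data"][:]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(train_frac * len(data))
    return data[idx[:n_train]], data[idx[n_train:]]

# usage: 100 samples, 21 time steps, 128 grid points, 1 field
samples = np.random.rand(100, 21, 128, 1).astype(np.float32)
write_benchmark_file("advection_demo.h5", samples, "advection", {"beta": 1.0})
train_set, test_set = load_split("advection_demo.h5")
```

Keeping metadata as attributes on the dataset (rather than in separate files) means a single `.h5` file fully specifies the task instance.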
For Bayesian benchmarks, the full posterior is archived based on MCMC samples, ensuring statistical reference for new algorithms (Aristoff et al., 2021).
PEM mesh benchmarks provide C++ and MATLAB code for generating, measuring, and solving on polygonal mesh families, including full reproducibility for all experiments (Attene et al., 2019).
4. APIs, Extension Points, and Reproducibility
PDEBench (and related POBench-PDE variants) exposes unified Python/PyTorch APIs:
- Data loading via specialized Dataset wrappers
- Hydra-configured generation scripts for new cases
- Uniform model evaluation interfaces: `evaluate_model(model, dataset, metrics)`
- Extending to new PDEs: subclass `BasePDE`, implement methods for IC generation, stepping, and BC enforcement, register in the benchmark factory, and supply YAML configs
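A minimal sketch of this extension pattern (the method names on `BasePDE` and the exact `evaluate_model` signature are assumptions; the real API may differ):

```python
import numpy as np

class BasePDE:
    """Stand-in for the benchmark's base class; method names are assumed."""
    def initial_condition(self, x, rng):
        raise NotImplementedError
    def step(self, u, dt):
        raise NotImplementedError
    def apply_bc(self, u):
        raise NotImplementedError

class Advection1D(BasePDE):
    """First-order upwind advection with periodic BCs (illustrative)."""
    def __init__(self, beta=1.0, dx=1.0 / 128):
        self.beta, self.dx = beta, dx
    def initial_condition(self, x, rng):
        return np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal()
    def step(self, u, dt):
        # upwind difference for beta > 0; np.roll encodes periodicity
        return self.apply_bc(u - self.beta * dt / self.dx * (u - np.roll(u, 1)))
    def apply_bc(self, u):
        return u  # periodic BCs are implicit in np.roll

def evaluate_model(model, dataset, metrics):
    """Apply each named metric to (prediction, target) pairs and average."""
    scores = {name: [] for name in metrics}
    for inp, target in dataset:
        pred = model(inp)
        for name, fn in metrics.items():
            scores[name].append(fn(pred, target))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}
```

A registered subclass plus a YAML config is then all a new PDE needs to participate in the shared evaluation loop.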
Reproducibility is ensured by version-controlled configs, random seed logging, and modular APIs for integrating new baseline models, metrics, or task variants (Takamoto et al., 2022; Attene et al., 2019).
5. Baseline Methods and Performance Summaries
PDEBench provides baselines across operator learning paradigms:
- Fourier Neural Operator (FNO): spectral convolutions, MLP updates, resolution invariance (Takamoto et al., 2022)
- U-Net: Multiscale CNN, open/closed-loop training, "pushforward trick" for multi-step stability
- PINN: Fully-connected networks (DeepXDE), physics-based losses, per-sample training
- Gradient-Based Inverse: IC/parameter inference via differentiable surrogates (FNO/UNet)
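The closed-loop rollout and the "pushforward trick" mentioned for U-Net can be sketched as follows (framework-agnostic numpy sketch; in an autograd framework the pushforward step would be wrapped in stop-gradient/detach):

```python
import numpy as np

def rollout(model, u0, n_steps):
    """Closed-loop (autoregressive) rollout: the model consumes its own
    previous prediction at every step."""
    traj, u = [u0], u0
    for _ in range(n_steps):
        u = model(u)
        traj.append(u)
    return np.stack(traj)

def pushforward_inputs(model, u0):
    """Pushforward trick, in sketch form: instead of training on the clean
    state u0, push it one step through the (frozen) model first, so training
    inputs carry the model's own distribution shift without backpropagating
    through the extra step."""
    return model(u0)

# usage with a toy "model": a damped identity map
model = lambda u: 0.99 * u
traj = rollout(model, np.ones(32), n_steps=10)
```

Open-loop training, by contrast, always feeds ground-truth states in; the pushforward variant is what stabilizes long multi-step rollouts.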
Reported performance:
- FNO: lowest nRMSE on most tasks, robust across frequency bands, best conservation and boundary errors
- U-Net: Strong on diffusion, less accurate for shocks/high frequency
- PINN: Competitive at high frequency (costly, small domains)
- Gradient inversion: FNO surrogates outperform U-Net, with errors concentrated at mid/high frequencies
Partial-observation variants benchmark LNO, LANO, and associated frameworks, with LANO achieving a substantial reduction in relative error at moderate sparsity (Hou et al., 22 Jan 2026).
In the PEM benchmark, lowest-order VEM is stress-tested, revealing critical geometric metrics for solver stability and accuracy (notably KAR and MA) (Attene et al., 2019).
6. Evaluation Metrics and Diagnostic Protocols
PDEBench and related benchmarks define multi-faceted evaluation suites (Takamoto et al., 2022), including:
| Metric | Definition | Targeted effect |
|---|---|---|
| nRMSE | Normalized root-mean-square error | Overall relative accuracy |
| cRMSE | RMSE of conserved quantities | Conservation (mass/energy) fidelity |
| bRMSE | RMSE on boundary cells | Boundary condition enforcement |
| fRMSE | Frequency-banded RMSE via discrete Fourier transform | Multi-scale spatial fidelity |
| L2, Linf | Standard $L^2$, $L^\infty$ norms | Error measures for solver analysis (PEM) |
| Condition | Matrix condition number | Solver stability (PEM) |
| ESS | Effective sample size (MCMC) | Posterior sampling efficiency |
Interpretations: cRMSE diagnoses conservation properties, bRMSE tests BC learning, fRMSE quantifies fidelity at various spatial frequencies, and error norms/conditioning expose mesh-related solver breakdowns.
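Hedged sketches of the four RMSE variants (PDEBench's exact normalizations and band edges may differ):

```python
import numpy as np

def nrmse(pred, true):
    """Normalized RMSE: L2 error relative to the target norm."""
    return np.linalg.norm(pred - true) / np.linalg.norm(true)

def crmse(pred, true):
    """Conserved-quantity error: difference of spatial means, a proxy
    for mass conservation (exact definition may differ)."""
    return abs(pred.mean() - true.mean())

def brmse(pred, true):
    """RMSE restricted to the boundary cells of a 2D field."""
    b = np.zeros(pred.shape, dtype=bool)
    b[0, :] = b[-1, :] = b[:, 0] = b[:, -1] = True
    return np.sqrt(np.mean((pred[b] - true[b]) ** 2))

def frmse(pred, true, bands=((0, 4), (4, 16), (16, None))):
    """Frequency-banded RMSE via the 2D DFT: spectral error within low,
    mid, and high radial-wavenumber bands (normalization conventions vary)."""
    err = np.fft.fftshift(np.fft.fft2(pred - true))
    H, W = err.shape
    ky, kx = np.meshgrid(np.arange(H) - H // 2, np.arange(W) - W // 2,
                         indexing="ij")
    k = np.hypot(ky, kx)
    out = []
    for lo, hi in bands:
        sel = (k >= lo) & (k < (np.inf if hi is None else hi))
        out.append(float(np.sqrt(np.mean(np.abs(err[sel]) ** 2))))
    return out
```

Reading the three `frmse` values side by side shows whether a model's error sits in the smooth large-scale structure or in fine-scale detail.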
For Bayesian inverse problems, the reference posterior mean, covariance, convergence diagnostics, and marginal histograms must agree, within stated tolerances, with the archived ground-truth MCMC samples to validate new algorithms (Aristoff et al., 2021).
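Effective sample size, the table's posterior-efficiency metric, can be estimated from a chain's autocorrelation; this sketch uses one common truncation convention (several estimators exist):

```python
import numpy as np

def effective_sample_size(chain):
    """ESS = N / (1 + 2 * sum of autocorrelations), with the sum truncated
    at the first non-positive autocorrelation (one common convention)."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:]  # autocovariance, lags 0..n-1
    rho = acov / acov[0]                            # autocorrelation
    tau = 1.0
    for k in range(1, n):
        if rho[k] <= 0:
            break
        tau += 2.0 * rho[k]
    return n / tau

# usage: an i.i.d. chain mixes well; a random walk mixes poorly
rng = np.random.default_rng(0)
iid = rng.standard_normal(2000)
walk = np.cumsum(iid)
ess_iid, ess_walk = effective_sample_size(iid), effective_sample_size(walk)
```

A chain whose ESS is a small fraction of its length is exploring the posterior slowly, which is exactly what the archived reference runs let new samplers be compared on.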
7. Impact, Extensions, and Best Practices
POBench-PDE provides the first reproducible community standards for many classes of PDE learning and simulation challenges:
- PDEBench enables rapid benchmarking of new neural surrogates, emulators, and inverse solvers, supporting plug-in extension of PDEs, models, and evaluation metrics (Takamoto et al., 2022).
- In PEM, systematic analysis across geometric degeneracies guides relaxed mesh quality criteria, promoting flexible, practical meshing strategies (Attene et al., 2019).
- The Bayesian inversion benchmark supports algorithmic research on high-dimensional inference by providing precise, high-fidelity reference posteriors (Aristoff et al., 2021).
- For partial observation, POBench-PDE sets the standard for quantifying robustness to missing data across neural operator families, crucial for deployment in real-world sensing applications (Hou et al., 22 Jan 2026).
- In dynamic poroelasticity, the benchmarked problem, discretization, and output functionals establish a common testbed for evaluating accuracy, efficiency, and robustness of space–time discretizations and iterative solvers (Anselmann et al., 2023).
Standard practices:
- Explicitly subclass base benchmark modules when adding models/PDEs
- Commit task-parameter configs and random seeds for reproducibility
- Use published diagnostic protocols (e.g., error-vs-cost curves, correlation matrices, standardized evaluation metrics) for transparent, objective comparison
POBench-PDE benchmarks thus form an essential infrastructure for advancing data-driven scientific PDE modeling, uncertainty quantification, and numerical simulation.