Smooth Pooling Functions in Neural Networks
- Smooth pooling functions are differentiable, learnable operators that aggregate local activations with continuous mappings, ensuring robust gradient propagation.
- They interpolate between max and average pooling using parameterized formulations such as gated, softmax, or power-based blends to enhance invariance and convergence.
- Empirical studies demonstrate improved performance in image classification, graph learning, and low-precision tasks with minimal parameter and computational overhead.
A smooth pooling function refers to a class of pooling operators in neural network architectures that provide continuous, differentiable mappings from a set of local activations to a single output, in contrast to hard non-differentiable operators such as max-pooling. Smooth pooling functions are parameterized, learnable, and tunably interpolate between classical paradigms (max and average) while offering gradients everywhere, supporting more efficient end-to-end training with improved invariance, representational flexibility, and task performance. The term “smooth” characterizes both the analytic (mathematical) property of being differentiable almost everywhere and the practical effect of providing nonzero gradients to all activations within the pooling window.
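As a concrete illustration of the "smooth" property, the softmax-weighted form below (a minimal NumPy sketch; the function name `smooth_max` and sharpness parameter `alpha` are illustrative) reduces to the window mean at zero sharpness and approaches the hard maximum as the sharpness grows, while remaining differentiable throughout:

```python
import numpy as np

def smooth_max(x, alpha):
    """Softmax-weighted pooling: mean at alpha = 0, -> hard max as alpha -> inf."""
    w = np.exp(alpha * (x - x.max()))  # shift by max for numerical stability
    w /= w.sum()                       # softmax weights over the window
    return float((w * x).sum())

x = np.array([0.2, 0.9, 0.5, 0.1])
print(smooth_max(x, 0.0))    # arithmetic mean of the window, 0.425
print(smooth_max(x, 50.0))   # close to max(x) = 0.9
```

Unlike hard max-pooling, every entry of `x` receives a nonzero weight for finite `alpha`, so every activation in the window receives gradient during backpropagation.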
1. Formal Definitions of Prominent Smooth Pooling Functions
The most widely studied smooth pooling functions generalize the max and average operations via convex and nonlinear mixtures, parameterized norm-like functions, softmax-weighting, or localized neural mappings. The following table summarizes core representatives and their mathematical forms:
| Pooling Function | Mathematical Definition | Learnable Parameters |
|---|---|---|
| Mixed Max-Average | $f_{\text{mix}}(x) = a \max_i x_i + (1-a)\,\tfrac{1}{k}\sum_i x_i$ | $a \in [0,1]$ (per region/layer) |
| Gated Max-Average | $f_{\text{gate}}(x) = \sigma(\omega^\top x)\max_i x_i + (1-\sigma(\omega^\top x))\,\tfrac{1}{k}\sum_i x_i$, $\sigma(z) = 1/(1+e^{-z})$ | $\omega$ |
| Tree Pooling | Recursively blends learned filters via sigmoid gates in a binary tree (see below) | $v_m$, $\omega_m$ (leaves/nodes) |
| Smooth Maximum Pooling | $f_\alpha(x) = \sum_i x_i \, \frac{e^{\alpha x_i}}{\sum_j e^{\alpha x_j}}$ | $\alpha$ (scalar or per-channel) |
| Power Pooling | $f_n(x) = \frac{\sum_i x_i^{\,n+1}}{\sum_j x_j^{\,n}}$ | $n$ (per class) |
| Ordinal Pooling | $f(x) = \sum_{i=1}^{k} w_i x_{(i)}$, $\sum_i w_i = 1$, $w_i \ge 0$ (sorted $x_{(1)} \ge \dots \ge x_{(k)}$) | $w$ or logits $z$, $w = \mathrm{softmax}(z)$ |
| Regularized Pooling | Max-pooling index is smoothed across neighborhoods via a spatial regularizer | $\lambda$, $w$ (regularization weight, window size) |
| Perceptron/MLP Pooling | $f(x) = \sigma(W x + b)$ (optionally a deeper MLP) | $W$, $b$, hidden units |
| Generalized Norm-based Pool | $f_p(x) = \left(\tfrac{1}{k}\sum_i \lvert x_i\rvert^p\right)^{1/p}$, splits pos/neg branches | $p$ (per pool) |
Tree pooling computes recursively: $f_m(x) = \begin{cases} v_m^\top x & \text{if } m \text{ is a leaf} \\ \sigma(\omega_m^\top x)\,f_{L(m)}(x) + (1 - \sigma(\omega_m^\top x))\,f_{R(m)}(x) & \text{if } m \text{ is an internal node} \end{cases}$ where $L(m)$ and $R(m)$ are the left/right children of node $m$.
The choice of function (mixing coefficient $a$, softmax scale $\alpha$, exponent $n$ or $p$, learned weights $\omega$, etc.) determines the interpolation between rigid averaging (high invariance, low selectivity), extremal selection, and intermediate or data-dependent pooling regimes.
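A depth-1 instance of the tree-pooling recursion (one internal node gating two learned leaf filters; all variable names here are illustrative) can be sketched as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_pool(x, v_left, v_right, omega):
    """Depth-1 tree pooling: a sigmoid gate blends two learned leaf filters."""
    g = sigmoid(omega @ x)                         # data-responsive gate in (0, 1)
    return g * (v_left @ x) + (1 - g) * (v_right @ x)

x = np.array([0.2, 0.9, 0.5, 0.1])
k = len(x)
v_avg = np.full(k, 1.0 / k)          # averaging leaf filter
v_sel = np.array([0., 1., 0., 0.])   # leaf filter selecting one position
omega = np.zeros(k)                  # zero gate -> g = 0.5, equal blend
print(tree_pool(x, v_avg, v_sel, omega))   # 0.5 * 0.425 + 0.5 * 0.9 = 0.6625
```

Deeper trees nest this gating, so the pooled output remains a smooth, input-dependent mixture of the leaf filter responses.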
2. Differentiability, Smoothness, and Backpropagation
All smooth pooling functions are constructed to be continuously differentiable in the pooling window inputs, as well as in their internal parameters. This ensures effective gradient propagation during neural network training, in contrast to max-pooling, where only the maximizing activation receives nonzero gradient.
- Mixed pooling: Differentiable in the parameter $a$, but only piecewise-differentiable in the inputs $x$, as it still depends on the explicit max; non-smooth events occur only when the argmax ties.
- Gated pooling, SMP, Power pooling, Ordinal, Perceptron pooling, GNP: Fully differentiable in both inputs and internal parameters (away from degenerate points such as duplicate entries for ordinal pooling).
- Regularized pooling: Smoothness arises from penalizing spatial discrepancy of selected indices; can be made soft and differentiable if using continuous selection, but typical implementations keep index selection hard and only regularize the direction.
For representative formulas, in the gated max-average case the gradient with respect to the gate parameters is $\frac{\partial f_{\text{gate}}}{\partial \omega} = \sigma'(\omega^\top x)\left(\max_i x_i - \tfrac{1}{k}\sum_i x_i\right) x$. Similarly, for smooth maximum pooling, the gradient with respect to $x_i$ is $\frac{\partial f_\alpha}{\partial x_i} = s_i\left(1 + \alpha\,(x_i - f_\alpha(x))\right)$ with $s_i = e^{\alpha x_i}/\sum_j e^{\alpha x_j}$. And for power pooling: $\frac{\partial f_n}{\partial x_i} = \frac{x_i^{\,n-1}}{\sum_j x_j^{\,n}}\left((n+1)\,x_i - n\,f_n(x)\right)$. These gradients guarantee that all elements within the window are adjusted smoothly during network optimization.
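The smooth-maximum gradient can be checked numerically. Assuming the softmax-weighted form $f_\alpha(x) = \sum_i x_i s_i$ with $s_i = e^{\alpha x_i}/\sum_j e^{\alpha x_j}$, the closed form $s_i\,(1 + \alpha(x_i - f_\alpha(x)))$ is verified below against central finite differences (a sketch, not a library implementation):

```python
import numpy as np

def smp(x, alpha):
    """Smooth maximum pooling; returns (pooled value, softmax weights)."""
    s = np.exp(alpha * (x - x.max()))  # shift-invariant, numerically stable
    s /= s.sum()
    return (s * x).sum(), s

def smp_grad(x, alpha):
    """Closed-form gradient of smooth maximum pooling w.r.t. each input."""
    f, s = smp(x, alpha)
    return s * (1.0 + alpha * (x - f))

x = np.array([0.3, 1.2, -0.5, 0.7])
alpha, eps = 2.0, 1e-6
g = smp_grad(x, alpha)
for i in range(len(x)):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    fd = (smp(xp, alpha)[0] - smp(xm, alpha)[0]) / (2 * eps)
    assert abs(fd - g[i]) < 1e-6   # analytic and numeric gradients agree
```

A useful sanity property visible in the closed form: the per-window gradients always sum to exactly 1, since the softmax weights sum to 1 and the $\alpha$-terms cancel.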
3. Empirical Performance and Task-Specific Trends
Across image classification, generative modeling, and graph-based learning, smooth pooling functions consistently improve over fixed max or average pooling in invariance, learning dynamics, and sometimes final accuracy, though the quantitative magnitude is task- and architecture-dependent.
Notable empirical results:
- CIFAR-10 (CNN, gating/tree/mixed): Replacing two max-pool layers with gated pooling (one gate per layer) reduced test error from 9.10% to 7.90%. Mixed pooling yielded 8.09%, and a combined tree+gated configuration reached 7.62%, establishing a new single-model state of the art with a marginal parameter increase (+15%) (Lee et al., 2015).
- DCASE sound event detection (Power Pooling): Power pooling raised the event F1 from 0.162 (linear softmax) to 0.196 (11.4% relative) (Liu et al., 2020).
- Ordinal Pooling: On quantized ResNets (CIFAR-10 with 1–4b weights), ordinal pooling gave up to +3.5% absolute improvement in low-precision regimes, closing the accuracy gap to full-precision baselines (Deliège et al., 2021).
- Graph tasks (Function-Space Pooling): Outperformed mean/sum/readout on MUTAG/PROTEINS/ENZYMES benchmarks, e.g., 0.83 accuracy on MUTAG (sum: 0.66) (Corcoran, 2019).
- Edge-preserving pooling (LGCA/WADCA): Classified and segmented with greater robustness to noise, shift, and rotation; e.g., Cats-vs-Dogs accuracy gains +4–5% with MobileNetv2 (Sineesh et al., 2021).
- VGG16 ImageNet (Smooth Maximum, Lp, Gated, Ordinal): No method outperformed average/max pooling by more than 0.2%; “fancy” smooth pooling (including SMP/Lp/gated) often regressed to max-like behavior under end-to-end optimization, with test accuracy indistinguishable from standard pooling (Bieder et al., 2021).
This suggests the main advantage of smooth pooling lies in improved invariance, robustness, convergence, and flexibility on lightweight, quantized, or small-to-moderate-scale tasks, without incurring significant computational burden.
4. Implementation, Complexity, and Integration Aspects
Most smooth pooling layers act as drop-in replacements for max- or average pooling. Integration requires only minor architectural adjustment—mainly allocation of additional parameters (scalars or small vectors per region/channel/layer) and, in some cases, augmenting the forward and backward passes to accommodate parameter learning.
- Parameter overhead: Mixed and power pooling introduce 1–2 learnable scalars per pooling region. Gated, tree, or ordinal pooling introduces tens of weights per region at most. Perceptron pooling's parameter count scales with the window size $k$ and MLP depth $d$ per window.
- Computational cost: Typically a 5–15% increase over standard pooling, primarily due to the additional dot products (gated, perceptron) or per-window sorting (ordinal). Softmax or exponentiated operations in SMP/power pooling scale as $O(k)$ per pooling window.
- Initialization and training: Parameters can be initialized to yield average- or max-like regimes (e.g., $a = 0.5$ for mixed, $\alpha = 0$ for SMP, uniform logits for ordinal). For stability, learning rates for pooling parameters may be decoupled from or reduced relative to those of regular weights.
- Granularity selection: Parameters can be global (per-layer), channel-wise, or spatially local, depending on parameter budget and learning capacity requirements.
- Framework compatibility: All variants are compatible with standard deep learning optimization (SGD/Adam, weight decay, dropout, batch norm) and can be combined with auxiliary mechanisms (e.g., squeeze-and-excitation, anti-aliasing).
- Regularization: For ordinal or tree pooling, standard weight decay suffices. For regularized pooling, the weight $\lambda$ of the spatial penalty is tuned by validation.
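As an illustration of the drop-in nature, a single-window gated max-average step can be written in a few lines (NumPy sketch; in a real layer the gate parameter `omega` would be shared per layer, channel, or region according to the chosen granularity):

```python
import numpy as np

def gated_pool(x, omega):
    """Gated max-average pooling for one flattened window."""
    g = 1.0 / (1.0 + np.exp(-(omega @ x)))   # data-responsive gate in (0, 1)
    return g * x.max() + (1.0 - g) * x.mean()

window = np.array([0.2, 0.9, 0.5, 0.1])
omega = np.zeros(4)                # zero init -> g = 0.5, halfway max/mean
print(gated_pool(window, omega))   # 0.5 * 0.9 + 0.5 * 0.425 = 0.6625
```

The zero initialization of `omega` starts training exactly halfway between max and average pooling, matching the average-or-max-like initialization guidance above.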
A representative pseudocode for ordinal pooling is as follows:
```python
import numpy as np

def ordinal_pool(patch, logits):
    """Ordinal pooling: learned weights over the rank-sorted window."""
    w = np.exp(logits - logits.max())
    w /= w.sum()                       # softmax(logits), shape (k,)
    x_sorted = np.sort(patch)[::-1]    # sort descending, shape (k,)
    return float((w * x_sorted).sum())
```
5. Comparative Analysis of Pooling Methodologies
The diverse array of smooth pooling functions can be organized by their underlying principles:
| Family | Representative Methods | Key Principle |
|---|---|---|
| Convex Interpolation | Mixed, Gated, Tree | Weighted or gated sum of max and average |
| Parameterized Norm | $L_p$, Power, GNP | Learnable exponent controlling mean–max transition |
| Softmax-based | SMP, LogSumExp | Expectation under softmax with tunable sharpness |
| Rank/Sort-based | Ordinal | Weighted sum over sorted input (piecewise-smooth) |
| Neural/Perceptron | Perceptron, MLP | Nonlinear mapping from window via neural subnetwork |
| Edge-preserving | LGCA, WADCA | Multi-branch fusion of low/high-frequency content |
| Regularized Max | Regularized Pooling | Hard selection smoothed by spatial penalties |
| Function-space (graph) | Function-Space Pooling, GNP | Mapping to function or higher-dimensional signature |
Specialized variants address domain-specific needs. Function-Space Pooling creates graph-level smooth function signatures by placing Gaussian kernels on sigmoid-transformed per-node embeddings (Corcoran, 2019). GNP enables smooth, trainable $L_p$-norm-like behavior for both positive and negative pooling exponents, furnishing GNNs with robust extrapolation properties (Ko et al., 2021).
A comparative observation is that methods with learnable, data-dependent, or region-adaptive weighting (gated, ordinal, perceptron, GNP) outperform rigid schemes (fixed-exponent $L_p$, log-sum-exp, fixed mixtures) in most real tasks, provided the extra computation and parameters remain modest.
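The mean-to-max transition controlled by a norm exponent can be seen directly. The sketch below uses the simple $L_p$ form on nonnegative inputs (the positive/negative branch split of full GNP is omitted for brevity): $p = 1$ recovers the mean, and large $p$ approaches the max.

```python
import numpy as np

def lp_pool(x, p):
    """Generalized-norm pooling on nonnegative inputs: (mean(x^p))^(1/p)."""
    return float(np.mean(x ** p) ** (1.0 / p))

x = np.array([0.2, 0.9, 0.5, 0.1])
print(lp_pool(x, 1.0))     # mean: 0.425
print(lp_pool(x, 64.0))    # approaches max(x) = 0.9 from below
```

By the power-mean inequality the pooled value increases monotonically in $p$, which is what makes a single learnable exponent sufficient to traverse the whole mean-max spectrum.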
6. Practical Guidelines and Trade-Offs
Empirical evidence supports the following recommendations for deployment of smooth pooling in practice:
- Start with Gated Max-Average: For CNNs, per-layer gated pooling (one gate per layer, only a few extra parameters) is simple, low-overhead, and consistently yields substantial gains (Lee et al., 2015).
- Enhance with Tree or Ordinal Pooling: For applications requiring maximal local expressivity (e.g., small datasets, segmentation), a small tree or ordinal pooling in early pooling layers is effective.
- Edge-Preservation: To preserve high-frequency structure (e.g., in segmentation, autoencoders), use LGCA or WADCA, which fuse anti-aliased (Gaussian/wavelet) and high-frequency branches with channel attention (Sineesh et al., 2021).
- Smooth Norms in GNNs: For graph neural networks on tasks demanding smooth aggregation/readout, GNP or function-space pooling are recommended (Ko et al., 2021, Corcoran, 2019).
- Power Pooling for MIL/SED: Adaptive power pooling is suitable for weakly labeled detection, automatically tuning between average- and max-like regimes per class; learn the exponent $n$ per class and initialize it at $n = 1$ (recovering linear softmax pooling) for stable learning (Liu et al., 2020).
- Large-Scale Image Models: For deep architectures on large-scale natural image data, no smooth pooling design outperformed standard max/average, even with data-dependent temperature/weighting (Bieder et al., 2021). This suggests deployment in resource-sensitive or structure-critical applications is most impactful.
- Quantized/Embedded Models: Ordinal or smooth pooling is particularly effective in lightweight or low-precision CNNs, closing accuracy gaps to larger baselines (Deliège et al., 2021).
- Parameter Budget: All methods introduce minimal parameter overhead per layer (from a handful to a few dozen weights per pool), and the standard cost-benefit trade-off is favorable for most non-huge settings.
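For the MIL/SED recommendation above, a per-class power pooling step behaves as follows (sketch assuming frame-level probabilities in $[0,1]$ and the ratio form $\sum_i x_i^{\,n+1} / \sum_j x_j^{\,n}$; variable names are illustrative):

```python
import numpy as np

def power_pool(x, n):
    """Power pooling over frame probabilities: n = 0 -> mean, large n -> max."""
    xn = x ** n
    return float((x * xn).sum() / xn.sum())

frames = np.array([0.2, 0.9, 0.5, 0.1])   # per-frame event probabilities
print(power_pool(frames, 0.0))    # average pooling: 0.425
print(power_pool(frames, 1.0))    # linear-softmax regime
print(power_pool(frames, 50.0))   # near max: ~0.9
```

Because the exponent enters smoothly, each class can drift from the average-like regime toward max-like selection during training as its label statistics demand.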
7. Domain Adaptations and Extensions
The smooth pooling paradigm generalizes across domains and input types:
- Convolutional networks (CNNs): All variants can be inserted transparently by replacing fixed pooling layers; per-channel or per-region pooling grants finer adaptivity.
- Graph Neural Networks: GNP and function-space pooling are specifically constructed for set/graph-level aggregation, enabling extrapolation and fine-grained invariance control without Laplacian regularization.
- MIL, SED, Quantized Models: Power pooling extends to any MIL problem; ordinal/power pooling enhance accuracy in low-resource or label-imprecise regimes.
- Generative/edge/texture tasks: Edge-preserving pooling is crucial in generative and restoration tasks where faithful reconstruction of fine structure is needed.
In summary, smooth pooling functions create a space of pooling operators that are end-to-end differentiable, parameter-efficient, and flexible enough to interpolate or adapt between canonical paradigms. The diversity of available forms—convex/gated, softmax, power, MLP/neural, sorted—permits tailoring to both the data domain and downstream task, while preserving computational tractability and robust gradient propagation throughout training (Lee et al., 2015, Deliège et al., 2021, Bieder et al., 2021, Liu et al., 2020, Sineesh et al., 2021, Corcoran, 2019, Fuhl et al., 2020, Ko et al., 2021, Otsuzuki et al., 2020).