Randomized Leaky ReLU (RReLU)

Updated 10 February 2026
  • RReLU is a stochastic activation function that randomizes the negative-part slope (scaling negative inputs by $1/a$ with $a \sim \mathcal{U}(l, u)$) to act as an implicit regularizer in convolutional networks.
  • During training, each negative activation uses a freshly sampled $a$; at inference, $a$ is replaced by its expected value $(l+u)/2$ to make outputs deterministic.
  • Experimental results on CIFAR-10/100 and NDSB demonstrate that RReLU reduces test error and combats overfitting better than ReLU, Leaky ReLU, and PReLU.

The Randomized Leaky Rectified Linear Unit (RReLU) is a stochastic activation function for neural networks, designed to regularize convolutional architectures by randomizing the negative-part slope of the rectified nonlinearity. For a pre-activation input $x$, RReLU computes $f(x) = x$ if $x \ge 0$ and $f(x) = x/a$ for $x < 0$, where $a$ is sampled independently for each activation from a uniform distribution $\mathcal{U}(l, u)$ during training. This mechanism yields improved generalization on small and medium-scale datasets compared to deterministic alternatives, and is notably robust to overfitting in convolutional architectures (Xu et al., 2015).

1. Mathematical Formulation

RReLU is defined by the function

$$f(x) = \begin{cases} x, & x \ge 0 \\ x/a, & x < 0, \quad a \sim \mathcal{U}(l, u) \end{cases}$$

where the coefficient $a$ is drawn from a uniform distribution on $[l, u]$, giving an effective negative slope of $1/a$. Sampling is performed independently for each example and each channel (feature map) on every forward pass. During inference (testing), $a$ is replaced by its expected value $\bar{a} = (l+u)/2$, resulting in a deterministic negative slope $1/\bar{a}$. With $l = 3$, $u = 8$ (as pragmatically recommended for image tasks), $\bar{a} = 5.5$, i.e., negative inputs are scaled by $1/5.5 \approx 0.18$ (Xu et al., 2015).
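As a concrete illustration, the piecewise definition above can be sketched in NumPy; the function name `rrelu` and the `training` flag are illustrative conveniences, not part of the paper:

```python
import numpy as np

def rrelu(x, l=3.0, u=8.0, training=True, rng=None):
    """RReLU sketch per the formulation above: negative inputs are
    scaled by 1/a with a ~ U(l, u) during training, and by the fixed
    slope 1/((l + u) / 2) at inference."""
    rng = np.random.default_rng() if rng is None else rng
    if training:
        a = rng.uniform(l, u, size=np.shape(x))  # fresh a per activation
    else:
        a = (l + u) / 2.0                        # expected value, 5.5 here
    return np.where(np.asarray(x) >= 0, x, x / a)

# Inference is deterministic: slope 1/5.5 on the negative side.
print(rrelu(np.array([-5.5, 2.0]), training=False))  # → [-1.  2.]
```

In training mode, a negative input $x$ can land anywhere in $(x/3,\, x/8)$ depending on the sampled $a$.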

2. Randomization Protocol and Implementation

The randomization protocol of RReLU comprises the following:

  • Training phase: for each example $j$ and channel $i$, independently sample $a_{ji} \sim \mathcal{U}(3, 8)$. The output is $y_{ji} = x_{ji}$ if $x_{ji} \ge 0$; otherwise $y_{ji} = x_{ji} / a_{ji}$.
  • Testing phase: all stochasticity is removed; $a_{ji}$ is fixed to the mean of its distribution, i.e., $a_{ji} \equiv 5.5$.

In practical settings, the parameters $l$ and $u$ may be tuned, but preserving stochastic variation in the negative slope is essential to the observed regularization effect. All other network hyperparameters (learning rate, weight decay, dropout) are kept identical to the control configurations to ensure direct comparability.
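The per-example, per-channel sampling described above can be realized by drawing one $a$ per (example, channel) pair and broadcasting it across spatial positions; the shapes and the name `rrelu_fmap` here are illustrative:

```python
import numpy as np

def rrelu_fmap(x, l=3.0, u=8.0, training=True, rng=None):
    """Apply RReLU to a batch of feature maps x of shape (N, C, H, W),
    sampling one a per example and channel, broadcast over H and W."""
    rng = np.random.default_rng() if rng is None else rng
    if training:
        n, c = x.shape[:2]
        a = rng.uniform(l, u, size=(n, c, 1, 1))  # one a per (example, channel)
    else:
        a = (l + u) / 2.0                         # fixed to 5.5 at test time
    return np.where(x >= 0, x, x / a)

x = np.random.default_rng(1).normal(size=(2, 3, 4, 4))
y = rrelu_fmap(x, training=True, rng=np.random.default_rng(2))
# Positive entries pass through; negatives shrink by a factor 1/a in (1/8, 1/3).
```

Within one feature map, all negative activations of an example share the same sampled $a$, which matches the protocol above.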

3. Comparison to ReLU, Leaky ReLU, and PReLU

The table below contrasts RReLU with major rectified activations as evaluated in (Xu et al., 2015):

| Activation | Negative-part output ($x < 0$) | Stochasticity |
|---|---|---|
| ReLU | $0$ | None |
| Leaky ReLU | $x/a$ with fixed $a$ (e.g., $a = 100$, slope $0.01$; or $a = 5.5$) | None |
| PReLU | $x/a$ with $a$ learned by back-propagation | None |
| RReLU | $x/a$ with random $a \sim \mathcal{U}(l, u)$ | Yes |

ReLU achieves high sparsity but relies solely on thresholding for regularization. Leaky ReLU improves gradient flow in the negative regime via a fixed $a$, but, lacking stochasticity, provides no regularization of its own. PReLU adapts $a$ through gradient descent, achieving low training loss but often overfitting, especially on limited data. RReLU, through stochastic variation of the negative slope, prevents co-adaptation and acts as an implicit regularizer, akin to Dropout's role in fully connected layers.
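The negative-regime differences in the table can be made concrete with a toy comparison; the particular values of $a$ below are illustrative, and all variants use the paper's $1/a$ slope parameterization:

```python
import numpy as np

# Negative-half behaviour of the four rectifiers (toy values of a).
def relu(x):              return np.maximum(x, 0.0)
def leaky_relu(x, a=100): return np.where(x >= 0, x, x / a)  # fixed a
def prelu(x, a):          return np.where(x >= 0, x, x / a)  # a learned by backprop
def rrelu_train(x, rng, l=3.0, u=8.0):
    return np.where(x >= 0, x, x / rng.uniform(l, u, np.shape(x)))

x = np.array([-6.0, 3.0])
print(relu(x))               # → [0. 3.]
print(leaky_relu(x, a=100))  # → [-0.06  3.  ]
print(leaky_relu(x, a=5.5))  # the "very leaky" setting, slope ≈ 0.18
```

PReLU's forward pass is identical to Leaky ReLU's; the distinction is only that $a$ is updated by gradient descent rather than fixed.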

4. Experimental Protocol and Benchmarks

Extensive experiments were conducted using the following canonical image datasets and architectures:

  • CIFAR-10, CIFAR-100: 50,000 training / 10,000 test samples, 32×32 RGB images, 10/100 classes. Employed a Network-in-Network architecture (six "mlpconv" layers, global average pooling, two dropout layers at rate 0.5).
  • Inception with BatchNorm: for CIFAR-100, a subset of the Inception architecture (starting at "inception-3a"), using batch normalization.
  • National Data Science Bowl (NDSB): 30,336 grayscale images with 121 classes (spatial pyramid pooling, five inception-like convolutional blocks, two 1024-unit dense layers).

For all experiments: $(l, u) = (3, 8)$, with $a$ sampled per example and per channel during training and fixed to $5.5$ at test time. All models were trained in CXXNET with identical optimization protocols and without data augmentation, model ensembles, or multi-view test strategies.

5. Quantitative Performance

Summary of key quantitative metrics from (Xu et al., 2015):

| Dataset | ReLU | Leaky ReLU ($a = 5.5$) | PReLU | RReLU |
|---|---|---|---|---|
| CIFAR-10 (test error, %) | 12.45 | 11.20 | 11.79 | 11.19 |
| CIFAR-100 (test error, %) | 42.96 | 40.42 | 41.63 | 40.25 |
| NDSB (val. log-loss) | 0.7727 | 0.7391 | 0.7454 | 0.7292 |

RReLU consistently achieves the lowest test error or validation log-loss among the evaluated activations. Notably, on CIFAR-100 with Inception and batch normalization, RReLU obtains 75.68% test accuracy with a single model and no ensembling. PReLU minimizes training error, but its generalization on small datasets is inferior to RReLU's. Leaky ReLU with an aggressive slope ($a = 5.5$) performs comparably, but lacking the stochastic regularization effect, it remains more prone to overfitting.

6. Regularization Mechanisms and Overfitting

The key regularization mechanism of RReLU is the per-sample, per-channel stochasticity of the negative slope, which prevents weight co-adaptation in the negative activation regime. This is analogous, in spirit, to the function of Dropout regularization. The presence of randomness during training encourages robust feature development and reduces test error, particularly when labeled data is limited or overfitting is otherwise problematic. In all reported experiments, RReLU yields the best validation metrics despite converging slightly more slowly than deterministic counterparts—an empirically strong indicator of regularizing efficacy.

7. Practical Recommendations and Use Cases

RReLU is recommended as a direct substitute for ReLU, Leaky ReLU, or PReLU in convolutional networks trained on small or medium-scale data where overfitting is a concern. The principal guidelines are:

  • Sample $a$ from a moderate uniform range (e.g., $a \sim \mathcal{U}(3, 8)$, giving negative slopes $1/a$).
  • Replace the random $a$ with its mean ($5.5$ for this range) at inference to ensure deterministic outputs.
  • Maintain all other hyperparameters and training protocols as with standard activations for fair comparison.

A plausible implication is that the regularization benefits of RReLU may diminish in very large data regimes where overfitting is inherently less severe, but its systematic superiority on modestly sized datasets has been clearly demonstrated (Xu et al., 2015).
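For practitioners, RReLU is available as a built-in in some frameworks. PyTorch's `torch.nn.RReLU`, for instance, parameterizes the negative slope directly (slope $\sim \mathcal{U}(\text{lower}, \text{upper})$), so the paper's $a \sim \mathcal{U}(3, 8)$ with slope $1/a$ corresponds to `lower=1/8`, `upper=1/3` (PyTorch's defaults). One caveat: in eval mode PyTorch uses the mean slope $(\text{lower}+\text{upper})/2$, which differs slightly from the paper's $1/\bar{a} = 1/5.5$:

```python
import torch
import torch.nn as nn

# PyTorch samples the negative slope itself from U(lower, upper);
# lower=1/8, upper=1/3 mirrors a ~ U(3, 8) under the 1/a parameterization.
act = nn.RReLU(lower=1/8, upper=1/3)

act.train()
y_train = act(torch.randn(4, 16, 32, 32))  # stochastic slopes per element

act.eval()                                  # fixed slope (1/8 + 1/3)/2 = 11/48
y_eval = act(torch.tensor([-48.0, 2.0]))    # → tensor([-11.,  2.])
```

Using a module (rather than a bare function) lets the framework's `train()`/`eval()` switch handle the stochastic-to-deterministic transition automatically.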
