Randomized Leaky ReLU (RReLU)
- RReLU is a stochastic activation function that randomizes negative slopes (via U(l, u)) to act as an implicit regularizer in convolutional networks.
- During training, each negative activation uses a random slope, while inference replaces this with the expected value for deterministic outputs.
- Experimental results on CIFAR-10/100 and NDSB demonstrate that RReLU reduces test error and combats overfitting better than ReLU, Leaky ReLU, and PReLU.
The Randomized Leaky Rectified Linear Unit (RReLU) is a stochastic activation function for neural networks, specifically designed to regularize convolutional architectures by introducing randomization in the negative slope of rectified nonlinearities. For a given pre-activation input $x_{ji}$ (example $i$, channel $j$), RReLU outputs $y_{ji} = x_{ji}$ if $x_{ji} \ge 0$, and $y_{ji} = x_{ji} / a_{ji}$ for $x_{ji} < 0$, where $a_{ji}$ is sampled independently for each activation from a uniform distribution $U(l, u)$ during training. This mechanism yields improved generalization across small and medium-scale datasets compared to deterministic alternatives, and is notably robust to overfitting in convolutional architectures (Xu et al., 2015).
1. Mathematical Formulation
RReLU is defined by the function
$$
y_{ji} =
\begin{cases}
x_{ji} & \text{if } x_{ji} \ge 0, \\
\dfrac{x_{ji}}{a_{ji}} & \text{if } x_{ji} < 0,
\end{cases}
\qquad a_{ji} \sim U(l, u),
$$
where each coefficient $a_{ji}$ (the reciprocal of the negative slope) is drawn from a uniform distribution on the interval $[l, u]$ with $l < u$. This sampling is performed independently for each sample in each channel (feature map) and for each forward pass. During inference (testing), all $a_{ji}$ are replaced by their expected value $\frac{l+u}{2}$, resulting in a deterministic negative slope of $\frac{2}{l+u}$. With $l = 3$ and $u = 8$ (as pragmatically recommended for image tasks), the test-time coefficient is $\frac{l+u}{2} = 5.5$ (Xu et al., 2015).
2. Randomization Protocol and Implementation
The randomization protocol of RReLU comprises the following:
- Training phase: For each channel $j$ and example $i$, independently sample $a_{ji} \sim U(l, u)$. The output is $x_{ji}$ if $x_{ji} \ge 0$; otherwise $x_{ji} / a_{ji}$.
- Testing phase: All stochasticity is removed; $a_{ji}$ is fixed to the mean of its distribution, i.e., $a_{ji} = \frac{l+u}{2}$.
In practical settings, the parameters $l$ and $u$ may be tuned, but preserving stochastic variation in the negative slope is essential for the observed regularization effect. All other network hyperparameters (learning rate, weight decay, dropout) are kept identical to the control configurations to ensure direct comparability.
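The training/testing protocol above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's CXXNET implementation; the function name and argument defaults are ours, following the recommended $U(3, 8)$ range:

```python
import numpy as np

def rrelu(x, l=3.0, u=8.0, training=True, rng=None):
    """Sketch of RReLU in the divisor convention of Xu et al. (2015):
    negative inputs are divided by a ~ U(l, u) during training and by
    the fixed mean (l + u) / 2 = 5.5 at test time."""
    x = np.asarray(x, dtype=float)
    if training:
        rng = np.random.default_rng() if rng is None else rng
        a = rng.uniform(l, u, size=x.shape)  # fresh draw each forward pass
    else:
        a = (l + u) / 2.0                    # deterministic: 5.5 for U(3, 8)
    return np.where(x >= 0, x, x / a)
```

At test time the effective negative slope is $2/(l+u) \approx 0.18$ for the recommended range, and positive inputs pass through unchanged in both modes.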
3. Comparison to ReLU, Leaky ReLU, and PReLU
The table below contrasts RReLU with major rectified activations as evaluated in (Xu et al., 2015):
| Activation | Negative Slope ($1/a$) | Stochasticity |
|---|---|---|
| ReLU | $0$ | None |
| Leaky ReLU | Fixed (e.g., $a = 100$, slope $0.01$; $a = 5.5$, slope $\approx 0.18$) | None |
| PReLU | Learnable (by back-propagation) | None |
| RReLU | Random, $a_{ji} \sim U(l, u)$ | Yes |
ReLU achieves high sparsity but relies solely on thresholding for regularization. Leaky ReLU improves gradient flow in the negative regime by using a fixed slope $1/a$, but can overfit when the slope is chosen too aggressively. PReLU adapts its coefficient through gradient descent for each neuron, achieving low training loss but often overfitting, especially on limited data. RReLU, through stochastic variation, prevents co-adaptation and acts as an implicit regularizer, akin to Dropout's role in fully connected layers.
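The contrast can be made concrete by evaluating the negative branch of each activation on a single input. This is a small sketch in the paper's $1/a$ convention; the PReLU coefficient shown is a hypothetical learned value, since PReLU fits it by back-propagation:

```python
import numpy as np

x = -1.0                                # a negative pre-activation
rng = np.random.default_rng(0)

relu_out  = max(x, 0.0)                 # ReLU: negative inputs clamped to 0
leaky_out = x / 5.5                     # Leaky ReLU: fixed a = 5.5, slope ~0.18
prelu_a   = 4.0                         # hypothetical coefficient learned by SGD
prelu_out = x / prelu_a
rrelu_out = x / rng.uniform(3.0, 8.0)   # RReLU: a ~ U(3, 8), resampled each pass
```

Only the RReLU output varies across forward passes; the other three are deterministic once training ends.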
4. Experimental Protocol and Benchmarks
Extensive experiments were conducted using the following canonical image datasets and architectures:
- CIFAR-10, CIFAR-100: $50{,}000$ train / $10{,}000$ test samples, $32 \times 32$ RGB images, $10/100$ classes. Employed a Network-In-Network architecture (six "mlpconv" layers, global average pooling, two dropout layers at 0.5).
- Inception with BatchNorm: For CIFAR-100, a subset of the Inception architecture (starting at "inception-3a"), using batch normalization.
- National Data Science Bowl (NDSB): grayscale images with $121$ classes (spatial pyramid pooling, five inception-like convolutional blocks, two $1024$-unit dense layers).
For all experiments: $a_{ji} \sim U(3, 8)$, sampled per example per channel during training and fixed to $5.5$ at test time; all models trained using CXXNET with identical optimization protocols and without augmentation, ensembles, or multi-view test strategies.
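The per-example, per-channel sampling granularity used in the benchmarks can be sketched for a convolutional feature map of shape (N, C, H, W): one coefficient per (example, channel) pair, broadcast over spatial positions. The function below is our own minimal illustration of that protocol:

```python
import numpy as np

def rrelu_train_conv(x, l=3.0, u=8.0, rng=None):
    """Training-time RReLU for a feature map x of shape (N, C, H, W):
    one coefficient a ~ U(l, u) per example per channel, shared across
    the spatial dimensions as in the experimental protocol above."""
    rng = np.random.default_rng() if rng is None else rng
    n, c = x.shape[:2]
    a = rng.uniform(l, u, size=(n, c, 1, 1))  # broadcast over H and W
    return np.where(x >= 0, x, x / a)
```

Sharing the coefficient across spatial positions keeps the noise structured per feature map rather than fully i.i.d. per activation.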
5. Quantitative Performance
Summary of key quantitative metrics from (Xu et al., 2015):
| Dataset | ReLU | Leaky ReLU ($a = 5.5$) | PReLU | RReLU |
|---|---|---|---|---|
| CIFAR-10 (error, %) | 12.45 | 11.20 | 11.79 | 11.19 |
| CIFAR-100 (error, %) | 42.96 | 40.42 | 41.63 | 40.25 |
| NDSB (val log-loss) | 0.7727 | 0.7391 | 0.7454 | 0.7292 |
RReLU consistently achieves the lowest test error or validation log-loss among all evaluated activations. Notably, on CIFAR-100 with Inception and batch normalization, RReLU obtains 75.68% test accuracy using a single model and no ensemble. PReLU minimizes training error but its generalization on small datasets is inferior to RReLU. Leaky ReLU performs comparably when using an aggressive slope ($a = 5.5$), but without the stochastic regularization effect it remains prone to overfitting.
6. Regularization Mechanisms and Overfitting
The key regularization mechanism of RReLU is the per-sample, per-channel stochasticity of the negative slope, which prevents weight co-adaptation in the negative activation regime. This is analogous, in spirit, to the function of Dropout regularization. The presence of randomness during training encourages robust feature development and reduces test error, particularly when labeled data is limited or overfitting is otherwise problematic. In all reported experiments, RReLU yields the best validation metrics despite converging slightly more slowly than deterministic counterparts—an empirically strong indicator of regularizing efficacy.
7. Practical Recommendations and Use Cases
RReLU is recommended as a direct substitute for ReLU, Leaky ReLU, or PReLU in convolutional networks trained on small or medium-scale data where overfitting is a concern. The principal guidelines are:
- Sample the coefficients from a moderate uniform range (e.g., $a_{ji} \sim U(3, 8)$).
- Replace the random $a_{ji}$ with its mean $\frac{l+u}{2}$ at inference to ensure deterministic outputs.
- Maintain all other hyperparameters and training protocols as with standard activations for fair comparison.
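When porting these guidelines to a framework that samples the negative slope directly rather than its reciprocal (PyTorch's `nn.RReLU`, for instance, whose default range is $[1/8, 1/3]$), the divisor range converts as shown below. Note that averaging in slope space at test time yields a slightly different deterministic slope than averaging the divisor as the paper does:

```python
# Converting the paper's divisor range U(l, u) = U(3, 8) into a direct
# slope range, for frameworks that sample the negative slope itself.
l, u = 3.0, 8.0
slope_lower, slope_upper = 1.0 / u, 1.0 / l         # (0.125, 0.333...)

# Paper's test-time slope: divide by the mean divisor (l + u) / 2 = 5.5.
paper_test_slope = 2.0 / (l + u)                    # ~0.1818
# Alternative: mean of the slope range (averaging in slope space).
slope_space_mean = (slope_lower + slope_upper) / 2  # ~0.2292, not identical
```

The discrepancy is small in practice, but worth knowing when comparing results across implementations.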
A plausible implication is that the regularization benefits of RReLU may diminish in very large data regimes where overfitting is inherently less severe, but its systematic superiority on modestly sized datasets has been clearly demonstrated (Xu et al., 2015).