Randomized Leaky ReLU (RReLU)
- RReLU is a stochastic activation function that randomizes negative slopes (via U(l, u)) to act as an implicit regularizer in convolutional networks.
- During training, each negative activation uses a random slope, while inference replaces this with the expected value for deterministic outputs.
- Experimental results on CIFAR-10/100 and NDSB demonstrate that RReLU reduces test error and combats overfitting better than ReLU, Leaky ReLU, and PReLU.
The Randomized Leaky Rectified Linear Unit (RReLU) is a stochastic activation function for neural networks, specifically designed to regularize convolutional architectures by introducing randomization in the negative slope of rectified nonlinearities. For a given pre-activation input $x_{ji}$ (example $i$, channel $j$), RReLU outputs $y_{ji} = x_{ji}$ if $x_{ji} \ge 0$, and $y_{ji} = x_{ji} / a_{ji}$ for $x_{ji} < 0$, where $a_{ji}$ is sampled independently for each activation from a uniform distribution $U(l, u)$ during training. This mechanism yields improved generalization across small and medium-scale datasets compared to deterministic alternatives, and is notably robust to overfitting in convolutional architectures (Xu et al., 2015).
1. Mathematical Formulation
RReLU is defined by the function
$$
y_{ji} =
\begin{cases}
x_{ji} & \text{if } x_{ji} \ge 0, \\
\dfrac{x_{ji}}{a_{ji}} & \text{if } x_{ji} < 0,
\end{cases}
\qquad a_{ji} \sim U(l, u),
$$
where each coefficient $a_{ji}$ (the reciprocal of the negative slope) is drawn from a uniform distribution on the interval $[l, u]$ with $l < u$. This sampling is performed independently for each sample in each channel (feature map) and for each forward pass. During inference (testing), all $a_{ji}$ are replaced by their expected value $\frac{l+u}{2}$, resulting in a deterministic negative slope of $\frac{2}{l+u}$. With $l = 3$ and $u = 8$ (as pragmatically recommended for image tasks), the test-time coefficient is $\frac{l+u}{2} = 5.5$ (Xu et al., 2015).
2. Randomization Protocol and Implementation
The randomization protocol of RReLU comprises the following:
- Training phase: For each channel $j$ and example $i$, independently sample $a_{ji} \sim U(l, u)$. The output is $x_{ji}$ if $x_{ji} \ge 0$; otherwise $x_{ji} / a_{ji}$.
- Testing phase: All stochasticity is removed; $a_{ji}$ is fixed to the mean of its distribution, i.e., $a_{ji} = \frac{l+u}{2}$.
In practical settings, the parameters $l$ and $u$ may be tuned, but preserving stochastic variation in the negative slope is essential for the observed regularization effect. All other network hyperparameters (learning rate, weight decay, dropout) are kept identical to the control configurations to ensure direct comparability.
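The training/testing protocol above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's CXXNET implementation; the function name and argument defaults are ours, following the recommended $U(3, 8)$ range:

```python
import numpy as np

def rrelu(x, l=3.0, u=8.0, training=True, rng=None):
    """Sketch of RReLU in the divisor convention of Xu et al. (2015):
    negative inputs are divided by a ~ U(l, u) during training and by
    the fixed mean (l + u) / 2 = 5.5 at test time."""
    x = np.asarray(x, dtype=float)
    if training:
        rng = np.random.default_rng() if rng is None else rng
        a = rng.uniform(l, u, size=x.shape)  # fresh draw each forward pass
    else:
        a = (l + u) / 2.0                    # deterministic: 5.5 for U(3, 8)
    return np.where(x >= 0, x, x / a)
```

At test time the effective negative slope is $2/(l+u) \approx 0.18$ for the recommended range, and positive inputs pass through unchanged in both modes.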
3. Comparison to ReLU, Leaky ReLU, and PReLU
The table below contrasts RReLU with major rectified activations as evaluated in (Xu et al., 2015):
| Activation | Negative Slope ($1/a$) | Stochasticity |
|---|---|---|
| ReLU | $0$ | None |
| Leaky ReLU | Fixed (e.g., $a = 100$, slope $0.01$; $a = 5.5$, slope $\approx 0.18$) | None |
| PReLU | Learnable (by back-propagation) | None |
| RReLU | Random, $a_{ji} \sim U(l, u)$ | Yes |
ReLU achieves high sparsity but relies solely on thresholding for regularization. Leaky ReLU improves gradient flow in the negative regime by using a fixed slope $1/a$, but can overfit when the slope is chosen too aggressively. PReLU adapts its coefficient through gradient descent for each neuron, achieving low training loss but often overfitting, especially on limited data. RReLU, through stochastic variation, prevents co-adaptation and acts as an implicit regularizer, akin to Dropout's role in fully connected layers.
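The contrast can be made concrete by evaluating the negative branch of each activation on a single input. This is a small sketch in the paper's $1/a$ convention; the PReLU coefficient shown is a hypothetical learned value, since PReLU fits it by back-propagation:

```python
import numpy as np

x = -1.0                                # a negative pre-activation
rng = np.random.default_rng(0)

relu_out  = max(x, 0.0)                 # ReLU: negative inputs clamped to 0
leaky_out = x / 5.5                     # Leaky ReLU: fixed a = 5.5, slope ~0.18
prelu_a   = 4.0                         # hypothetical coefficient learned by SGD
prelu_out = x / prelu_a
rrelu_out = x / rng.uniform(3.0, 8.0)   # RReLU: a ~ U(3, 8), resampled each pass
```

Only the RReLU output varies across forward passes; the other three are deterministic once training ends.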
4. Experimental Protocol and Benchmarks
Extensive experiments were conducted using the following canonical image datasets and architectures:
- CIFAR-10, CIFAR-100: $50{,}000$ train / $10{,}000$ test samples, $32 \times 32$ RGB images, $10/100$ classes. Employed a Network-In-Network architecture (six "mlpconv" layers, global average pooling, two dropout layers at 0.5).
- Inception with BatchNorm: For CIFAR-100, a subset of the Inception architecture (starting at "inception-3a"), using batch normalization.
- National Data Science Bowl (NDSB): grayscale images with $121$ classes (spatial pyramid pooling, five inception-like convolutional blocks, two $1024$-unit dense layers).
For all experiments: $a_{ji} \sim U(3, 8)$, sampled per example per channel during training and fixed to $5.5$ at test time; all models trained using CXXNET with identical optimization protocols and without augmentation, ensembles, or multi-view test strategies.
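The per-example, per-channel sampling granularity used in the benchmarks can be sketched for a convolutional feature map of shape (N, C, H, W): one coefficient per (example, channel) pair, broadcast over spatial positions. The function below is our own minimal illustration of that protocol:

```python
import numpy as np

def rrelu_train_conv(x, l=3.0, u=8.0, rng=None):
    """Training-time RReLU for a feature map x of shape (N, C, H, W):
    one coefficient a ~ U(l, u) per example per channel, shared across
    the spatial dimensions as in the experimental protocol above."""
    rng = np.random.default_rng() if rng is None else rng
    n, c = x.shape[:2]
    a = rng.uniform(l, u, size=(n, c, 1, 1))  # broadcast over H and W
    return np.where(x >= 0, x, x / a)
```

Sharing the coefficient across spatial positions keeps the noise structured per feature map rather than fully i.i.d. per activation.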
5. Quantitative Performance
Summary of key quantitative metrics from (Xu et al., 2015):
| Dataset | ReLU | Leaky ReLU ($a = 5.5$) | PReLU | RReLU |
|---|---|---|---|---|
| CIFAR-10 (error, %) | 12.45 | 11.20 | 11.79 | 11.19 |
| CIFAR-100 (error, %) | 42.96 | 40.42 | 41.63 | 40.25 |
| NDSB (val log-loss) | 0.7727 | 0.7391 | 0.7454 | 0.7292 |
RReLU consistently achieves the lowest test error or validation log-loss among all evaluated activations. Notably, on CIFAR-100 with Inception and batch normalization, RReLU obtains 75.68% test accuracy using a single model and no ensemble. PReLU minimizes training error but its generalization on small datasets is inferior to RReLU. Leaky ReLU performs comparably when using an aggressive slope ($a = 5.5$), but without the stochastic regularization effect it remains prone to overfitting.
6. Regularization Mechanisms and Overfitting
The key regularization mechanism of RReLU is the per-sample, per-channel stochasticity of the negative slope, which prevents weight co-adaptation in the negative activation regime. This is analogous, in spirit, to the function of Dropout regularization. The presence of randomness during training encourages robust feature development and reduces test error, particularly when labeled data is limited or overfitting is otherwise problematic. In all reported experiments, RReLU yields the best validation metrics despite converging slightly more slowly than deterministic counterparts—an empirically strong indicator of regularizing efficacy.
7. Practical Recommendations and Use Cases
RReLU is recommended as a direct substitute for ReLU, Leaky ReLU, or PReLU in convolutional networks trained on small or medium-scale data where overfitting is a concern. The principal guidelines are:
- Sample the coefficients from a moderate uniform range (e.g., $a_{ji} \sim U(3, 8)$).
- Replace the random $a_{ji}$ with its mean $\frac{l+u}{2}$ at inference to ensure deterministic outputs.
- Maintain all other hyperparameters and training protocols as with standard activations for fair comparison.
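When porting these guidelines to a framework that samples the negative slope directly rather than its reciprocal (PyTorch's `nn.RReLU`, for instance, whose default range is $[1/8, 1/3]$), the divisor range converts as shown below. Note that averaging in slope space at test time yields a slightly different deterministic slope than averaging the divisor as the paper does:

```python
# Converting the paper's divisor range U(l, u) = U(3, 8) into a direct
# slope range, for frameworks that sample the negative slope itself.
l, u = 3.0, 8.0
slope_lower, slope_upper = 1.0 / u, 1.0 / l         # (0.125, 0.333...)

# Paper's test-time slope: divide by the mean divisor (l + u) / 2 = 5.5.
paper_test_slope = 2.0 / (l + u)                    # ~0.1818
# Alternative: mean of the slope range (averaging in slope space).
slope_space_mean = (slope_lower + slope_upper) / 2  # ~0.2292, not identical
```

The discrepancy is small in practice, but worth knowing when comparing results across implementations.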
A plausible implication is that the regularization benefits of RReLU may diminish in very large data regimes where overfitting is inherently less severe, but its systematic superiority on modestly sized datasets has been clearly demonstrated (Xu et al., 2015).