Swoosh Activations: Theory & Applications
- Swoosh activations are smooth, non-monotonic functions combining sigmoid gating with linear branches to enhance gradient propagation and feature learning.
- Variants like canonical Swish and Flatten-T Swish adjust parameters such as beta and threshold to optimize convergence and accuracy across diverse neural network architectures.
- The Swoosh Activation Function (SAF) employs a regularization penalty for heatmap-based landmark detection by tuning coefficients to align with optimal mean squared error targets.
The term "Swoosh activations" encompasses a family of activation functions characterized by a smooth, non-monotonic curve that resembles the "swish" or "swoosh" shape. These functions include canonical Swish (f(x) = x · σ(βx)), its parametric and thresholded variants such as Flatten-T Swish, and, more recently, the Swoosh Activation Function (SAF) designed for regularization in heatmap-based landmark detection. Their mathematical form typically combines a sigmoid gating with linear or modified branches, distinct from piecewise and monotonically-gated rectifiers. Swoosh-type activations are designed to boost expressivity, gradient flow, and trainability, especially in deep architectures or tasks with high-frequency and heterogeneous target structures.
1. Mathematical Definitions and Key Properties
Canonical Swish is defined as

f(x) = x · σ(βx), where σ(z) = 1 / (1 + e^(−z)),

with β controlling the steepness of the gating. The unparameterized form (β = 1) is common in practice (Al-Safwan et al., 2021, Eger et al., 2019, Szandała, 2020, Hayou et al., 2018).
Key properties:
- Smoothness: f is C^∞; derivatives exist everywhere, unlike ReLU.
- Non-monotonicity: for x < 0 the curve dips below zero, facilitating richer feature learning.
- Non-zero derivatives: f′(x) = σ(βx) + βx · σ(βx)(1 − σ(βx)), ensuring gradient propagation across all regions.
- Limiting cases: as β → ∞, Swish approximates ReLU; as β → 0, it approaches the scaled-linear map x/2.
- Range: for β = 1, the minimum is ≈ −0.278 (attained near x ≈ −1.278); unbounded above.
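These definitions and properties can be checked numerically. A minimal sketch using only the standard library (the sampling grid is an illustrative choice, not from the cited papers):

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(x: float, beta: float = 1.0) -> float:
    """Canonical Swish: f(x) = x * sigma(beta * x)."""
    return x * sigmoid(beta * x)

def swish_grad(x: float, beta: float = 1.0) -> float:
    """f'(x) = sigma(beta*x) + beta*x*sigma(beta*x)*(1 - sigma(beta*x))."""
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)

# Non-monotonic dip below zero on the negative axis (beta = 1):
xs = [i / 100.0 for i in range(-500, 1)]
minimum = min(swish(x) for x in xs)   # approx -0.278, near x = -1.278

# Large beta approaches ReLU: swish(2, beta=50) is close to 2.0.
relu_like = swish(2.0, beta=50.0)
```

Sampling the negative axis recovers the ≈ −0.278 minimum quoted above, and a large β makes the gate behave like a hard threshold.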
Flatten-T Swish (FTS) introduces a threshold T:

f(x) = x · σ(x) + T for x ≥ 0, and f(x) = T for x < 0.

T sets a negative floor; T = −0.20 performs best for MNIST deep FNNs, allowing negative value propagation while retaining sparse gradients for x < 0 (Chieng et al., 2018).
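A minimal sketch of FTS, using the T = −0.20 value reported for MNIST:

```python
import math

def fts(x: float, T: float = -0.20) -> float:
    """Flatten-T Swish: Swish-like branch for x >= 0, flat floor T for x < 0."""
    if x >= 0:
        return x / (1.0 + math.exp(-x)) + T
    return T

# The whole negative axis maps to the constant floor T,
# so gradients there are exactly zero (sparse), unlike canonical Swish.
floor_value = fts(-3.0)     # -0.20
boundary    = fts(0.0)      # 0 * sigmoid(0) + T = -0.20 (continuous at 0)
```

Because the negative branch is constant, FTS propagates a fixed negative value while zeroing gradients there, which is exactly the sparsity/negative-signal trade-off described above.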
Swoosh Activation Function (SAF) is a regularization penalty for optimizing heatmap MSEs in landmark detection. Its minimum is aligned with the desired regularization target (the optimal MSE), and its coefficients tune the penalty's profile. SAF is not an activation in the standard forward-pass sense but enforces distributional sharpness/dispersion for outputs in heatmap regression (Zhou et al., 2024).
2. Comparative Role in Deep Networks
Swoosh-family activations remedy limitations of classical nonlinearities:
| Activation | Smoothness | Gradient for x < 0 | Monotonicity | Negative Signal |
|---|---|---|---|---|
| ReLU | Not smooth | Zero | Yes | None |
| Tanh/Sigmoid | Smooth | Small (Saturating) | Yes | Yes (but saturating) |
| Swish/Swoosh | Smooth | Small, nonzero | No | Yes (non-monotonic) |
| Flatten-T Swish | Piecewise smooth | Zero (x < 0), nonzero (x ≥ 0) | Partly | Yes (via T) |
| SAF (Loss reg.) | N/A | N/A | N/A | N/A |
Swish-type activations preserve gradient flow—especially for negative pre-activations—avoiding dead units ("dying ReLU"). Their non-monotonic bumps around zero enable networks to approximate functions with mixed-frequency content and sharp transitions more easily (Al-Safwan et al., 2021, Eger et al., 2019, Chieng et al., 2018).
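The gradient-flow contrast in the table can be checked numerically. A minimal sketch comparing ReLU's zero gradient on negative pre-activations with Swish's small but nonzero one (the sample inputs are illustrative):

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def relu_grad(x: float) -> float:
    """ReLU derivative: exactly zero on the negative axis."""
    return 1.0 if x > 0 else 0.0

def swish_grad(x: float, beta: float = 1.0) -> float:
    """Swish derivative: small but nonzero for negative inputs."""
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)

# A "dead" ReLU unit: every negative pre-activation yields zero gradient,
# while Swish still passes a (small) learning signal.
neg_inputs = [-4.0, -2.0, -0.5]
relu_grads  = [relu_grad(x) for x in neg_inputs]    # all 0.0
swish_grads = [swish_grad(x) for x in neg_inputs]   # all nonzero
```

This is the mechanism behind the "dying ReLU" remedy: a unit pushed into the negative regime can still recover under Swish because its gradient never vanishes identically.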
3. Empirical Performance in Key Domains
Physics-Informed Neural Networks (PINNs), Helmholtz Equation
- Al-Safwan et al. (Al-Safwan et al., 2021) demonstrated that Swish activation markedly improves PINN convergence for 2D Helmholtz problems, particularly in models with sharp heterogeneities.
- In an 8-layer, 20-neuron PINN, Swish accelerated initial loss decay by ~30% compared to tanh/ELU and reduced final error by ~15%.
- Real and imaginary wavefield errors ("two-scatter test") were lower for Swish.
- Swish’s ability to represent high-wavenumber components—by leveraging non-monotonicity—overcomes the spectral bias toward low frequencies seen with tanh and ReLU.
Natural Language Processing
- In "Is it Time to Swish?" (Eger et al., 2019), Swish achieved strong results across 8 NLP tasks (sentence/document classification, sequence tagging):
- Best-case Swish was top-5 among 21 activations; on certain sequence tasks, Swish exceeded ReLU by 1–3 percentage points.
- In deeply stacked networks, Swish outperformed saturated functions (cube, cosid) and matched the performance of penalized-tanh and Maxout variants.
- Training stability and convergence were on par with ReLU; Swish did not exhibit dead neurons.
Image Classification and Shallow CNNs
- On CIFAR-10 (2-conv CNN, 25 epochs), Swish (β=1) yielded 69.9% accuracy versus 71.8% for ReLU and 72.9% for Leaky ReLU (Szandała, 2020), indicating that Swish excels primarily in deeper architectures.
- Flatten-T Swish (T=-0.20) improved MNIST accuracy by up to 1.15% and halved the epochs needed for convergence in 512–1568 neuron deep FNNs (Chieng et al., 2018).
Landmark Detection via Heatmap Regularization
- SAF, as introduced by Zhou et al. (Zhou et al., 2024), enhanced the accuracy of fetal thalamus diameter and head circumference measurements in ultrasonography.
- ICC improved from 0.684 (BiometryNet) to 0.737 with SAF, with similar gains for EfficientNet-based models.
- The approach is architecture-agnostic and requires only a loss-term addition, not modification of forward activations.
4. Theoretical Insights: Initialization and Trainability
Hayou et al. (Hayou et al., 2018) provided a theoretical foundation for Swish activations in deep, randomly initialized networks. Swish, owing to its smoothness and tunable nonlinearity, enables signal and gradient propagation "deeper" than ReLU:
- The edge-of-chaos criterion, the initialization condition preserving correlation and gradient magnitude across layers, is more tractable for Swish than for ReLU:
- Swish admits a continuum of "edges" parameterized by the weight/bias initialization variances, allowing for nonzero bias and thus richer variance propagation.
- On the edge, correlation collapse is polynomial rather than exponential; for ReLU, only one edge is available, and it prohibits nonzero bias.
- Swish’s depth-scale advantage explains its superior performance in very deep nets (often exceeding 100 layers).
5. Guidelines for Deployment and Tuning
- β-tuning: The default β = 1 is effective in most settings. Grid-searching β or making it a trainable scalar can yield marginal performance gains (Al-Safwan et al., 2021, Eger et al., 2019).
- Initialization: Xavier/Glorot normal weights with zero bias are recommended; avoid large initial weights that push activations into sigmoid tails.
- Optimizer regime: Swish’s rapid initial convergence supports early switching to second-order optimizers (e.g., L-BFGS) (Al-Safwan et al., 2021).
- SAF coefficients: For heatmap-based landmark detection, compute the ground-truth MSE for initial calibration and sweep the coefficients over small integer values.
- Depth and architecture: Swish/Swoosh activations are most beneficial in networks of moderate-to-greater depth; monitor gradient norms when stacking deeply.
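The β-tuning guideline above can be sketched as a simple grid search. Here both the candidate grid and the toy fitting target (a ReLU-shaped function) are illustrative assumptions, not values from the cited papers; in practice the objective would be a held-out validation loss:

```python
import math

def swish(x: float, beta: float) -> float:
    """Canonical Swish with explicit beta."""
    return x / (1.0 + math.exp(-beta * x))

def mse(beta: float, samples) -> float:
    """Mean squared error of a single Swish unit against sample targets."""
    return sum((swish(x, beta) - y) ** 2 for x, y in samples) / len(samples)

# Toy targets (illustrative): fit a ReLU-shaped function on [-3, 3].
samples = [(x / 10.0, max(0.0, x / 10.0)) for x in range(-30, 31)]

grid = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]   # assumed candidate betas
best_beta = min(grid, key=lambda b: mse(b, samples))
# Larger beta fits the ReLU-shaped target better,
# consistent with the beta -> infinity limiting case above.
```

The same loop structure applies when β is swept per-layer or replaced by a trainable scalar; only the evaluation objective changes.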
6. Context: Limitations and Implementation Considerations
- Computational overhead: Swish/Swoosh activations incur a per-element sigmoid computation, which increases cost roughly 5–10× relative to ReLU (Szandała, 2020, Eger et al., 2019).
- Plug-and-play viability: For tasks with minimal hyperparameter tuning, simple functions (sin, penalized-tanh) can exhibit greater stability and lower variance (Eger et al., 2019, Szandała, 2020).
- Gradient sparsity: Flatten-T Swish purposely zeroes gradients for x < 0 to maintain sparsity, whereas canonical Swish retains them.
- SAF specialization: SAF is not used as a forward-pass activation but as a scalar penalty for regularized loss in heatmap distribution, applicable to any architecture outputting pairwise heatmaps (Zhou et al., 2024).
- Swoosh family distinctions: Not all "swoosh" functions are activation nonlinearities; careful distinction is necessary between Swish/FTS (forward), SAF (regularization), and other variants.
7. Extensions and Outlook
Adaptive and spatially-varying Swish/Swoosh activations present promising extensions, as suggested in PINN literature (Al-Safwan et al., 2021). Combining locally-learned gates with the smooth, non-monotonic profile may further enhance representational power and the fit to heterogeneous or high-frequency domains. A plausible implication is the broader application of Swoosh-regularization frameworks like SAF beyond medical image analysis to general landmark detection or structured output learning, contingent on future empirical validation.