Friedman & Nemenyi Tests Overview
- The Friedman test is a nonparametric omnibus procedure that ranks and compares multiple treatments across datasets to test for performance equivalence.
- The Nemenyi mean-ranks test serves as a post-hoc method to compare all pairs following Friedman’s test but can yield paradoxical results due to pool-dependence.
- Alternatives like pairwise tests and S-plots offer pool-independent comparisons and controlled Type I error, enhancing inference clarity.
The Friedman and Nemenyi tests are core nonparametric methodologies for analyzing and comparing multiple treatments or algorithms under a randomized complete block design. Their primary application is the statistical comparison of several methods across multiple data sets, with particular relevance in areas such as machine learning, psychology, and medicine. After an initial omnibus hypothesis test (Friedman), post-hoc analyses such as the Nemenyi mean-ranks test are commonly employed to determine sources of significant differences. However, the dependence of Nemenyi post-hoc inferences on the entire set of treatments and accompanying paradoxes have recently prompted scrutiny and recommendations for alternative pairwise procedures.
1. The Friedman Test: Omnibus Nonparametric Comparison
The Friedman test is used to detect differences among $k$ algorithms (treatments) evaluated on $n$ datasets (blocks). For each dataset $i$, the outcomes $(x_{i1}, \dots, x_{ik})$ are ranked, yielding ranks $r_{ij}$; average ranks replace raw performance scores, with ties handled via average-ranking. Each algorithm's sum of ranks is $R_j = \sum_{i=1}^{n} r_{ij}$, with mean rank $\bar{R}_j = R_j / n$. The null hypothesis asserts "all algorithms perform equivalently," that is, $E(\bar{R}_j) = (k+1)/2$ for all $j$.
The Friedman statistic is

$$\chi^2_F = \frac{12n}{k(k+1)} \sum_{j=1}^{k} \left( \bar{R}_j - \frac{k+1}{2} \right)^2.$$

Alternatively, in terms of $R_j$,

$$\chi^2_F = \frac{12}{n\,k(k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1).$$

For large $n$, $\chi^2_F$ is approximately $\chi^2$-distributed with $k-1$ degrees of freedom. In practice, the test serves as a robust, nonparametric alternative to repeated-measures ANOVA for arbitrary, not necessarily normal, data (Benavoli et al., 2015; Elamir, 2022).
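The statistic can be computed directly from the ranks and cross-checked against SciPy's `friedmanchisquare`; a minimal sketch with hypothetical random scores (higher = better):

```python
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

# Hypothetical accuracy scores: n = 8 datasets (rows) x k = 4 algorithms (columns).
rng = np.random.default_rng(0)
scores = rng.random((8, 4))
n, k = scores.shape

# Rank within each dataset (rank 1 = best; ties would receive average ranks).
ranks = rankdata(-scores, axis=1)
R_mean = ranks.mean(axis=0)                      # mean rank of each algorithm

# chi2_F = 12n / (k(k+1)) * sum_j (Rbar_j - (k+1)/2)^2
chi2_F = 12 * n / (k * (k + 1)) * np.sum((R_mean - (k + 1) / 2) ** 2)

# Cross-check against SciPy's implementation (expects one array per treatment).
stat, p = friedmanchisquare(*scores.T)
print(chi2_F, stat, p)
```

With continuous random scores there are no ties, so the hand-computed value matches SciPy's (which additionally applies a tie correction when ties occur).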
2. Nemenyi Mean-Ranks Post-hoc Test
Upon rejection of the Friedman test's omnibus null, the Nemenyi test is traditionally used for all algorithm pairs. For any algorithms $j$, $l$, the absolute mean-rank difference $|\bar{R}_j - \bar{R}_l|$ forms the test statistic, evaluated against a critical difference (CD):

$$\mathrm{CD} = q_{\alpha} \sqrt{\frac{k(k+1)}{6n}},$$

where $q_{\alpha}$ is the upper $\alpha$-quantile of the Studentized range distribution (divided by $\sqrt{2}$) with $k$ treatments and infinite degrees of freedom. Algorithms $j$ and $l$ are declared significantly different at family-wise level $\alpha$ if $|\bar{R}_j - \bar{R}_l| > \mathrm{CD}$. This controls the family-wise error rate across all $k(k-1)/2$ comparisons and is operationally analogous to the Tukey-Kramer procedure for parametric ANOVA (Benavoli et al., 2015).
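The critical difference can be computed with `scipy.stats.studentized_range` (SciPy ≥ 1.7); a sketch in which the infinite degrees of freedom are approximated by df = 10 000:

```python
import numpy as np
from scipy.stats import studentized_range

def nemenyi_cd(k, n, alpha=0.05):
    """Critical difference CD = q_alpha * sqrt(k(k+1) / (6n)).

    q_alpha is the upper-alpha Studentized range quantile divided by sqrt(2);
    infinite degrees of freedom are approximated here by df = 10_000.
    """
    q = studentized_range.ppf(1 - alpha, k, 10_000) / np.sqrt(2)
    return q * np.sqrt(k * (k + 1) / (6 * n))

# Example: k = 5 algorithms over n = 20 datasets.
print(nemenyi_cd(5, 20))
```

For $k = 5$, $n = 20$, $\alpha = 0.05$ this gives CD $\approx 1.36$, matching the tabulated $q_{0.05} \approx 2.728$.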
3. Critique of the Mean-Ranks Test and Its Dependence on Algorithm Pool
A foundational critique, detailed by Benavoli, Corani, and Mangili (Benavoli et al., 2015), is that the Nemenyi mean-ranks test produces a decision for any pair that is contingent on the presence, absence, and relative ordering of all other algorithms in the ranking. This can lead to paradoxical scenarios:
- In an experiment where algorithms $A$ and $B$ each win on half the cases, two-algorithm tests (sign, Wilcoxon signed-rank, paired t-test) yield nonsignificance. However, introducing additional poor-performing algorithms can inflate the mean-rank difference $|\bar{R}_A - \bar{R}_B|$ sufficiently to exceed the critical difference and declare a significant difference.
- On real-world datasets, the decision on a given pair can flip between significant and non-significant solely due to the composition of the algorithm pool. In experiments on UCI data, a fixed pair of classifiers was shown to change significance status depending on which other classifiers were included in the comparison [(Benavoli et al., 2015), Table 4].
This pool-dependence means mean-ranks tests cannot guarantee control of maximum Type I error when equivalent algorithms are present, as also discussed by Fligner & Killeen (1984).
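The paradox is easy to reproduce numerically. The sketch below builds a hypothetical benchmark in which $A$ and $B$ each win on 25 of 50 datasets (so the sign test is maximally non-significant), then adds three weak algorithms that fall between $A$ and $B$ only on the datasets $A$ wins:

```python
import numpy as np
from scipy.stats import rankdata, binomtest

# Hypothetical scores (higher = better): A wins datasets 0..24, B wins 25..49.
n = 50
A = np.r_[np.full(25, 0.90), np.full(25, 0.85)]
B = np.r_[np.full(25, 0.50), np.full(25, 0.90)]

# The sign test on the pair (A, B) sees 25 wins each: p = 1.
print(binomtest(25, 50, 0.5).pvalue)

# Pool 1: only A and B. Their mean ranks are identical.
pool2 = np.column_stack([A, B])
mr2 = rankdata(-pool2, axis=1).mean(axis=0)

# Pool 2: three weak algorithms that land *between* A and B exactly on the
# datasets A wins, pushing B's mean rank down, and below both elsewhere.
C = np.r_[np.tile([0.8, 0.7, 0.6], (25, 1)), np.tile([0.1, 0.2, 0.3], (25, 1))]
pool5 = np.column_stack([A, B, C])
mr5 = rankdata(-pool5, axis=1).mean(axis=0)

print(mr2[:2], mr5[:2])   # A-vs-B mean-rank gap: 0.0 in pool 1, 1.5 in pool 2
```

With $k = 5$ and $n = 50$ the Nemenyi critical difference at $\alpha = 0.05$ is roughly $0.86$, so the gap of $1.5$ would be declared significant even though the head-to-head record of $A$ and $B$ is exactly tied.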
4. Alternative Two-Algorithm Post-hoc Procedures
To address the pool-dependence flaw, tests evaluating only the paired performances of $A$ and $B$ are recommended. These include:

a) Sign Test: For each dataset $i$, set $s_i = +1$ if $x_{iA} > x_{iB}$, $s_i = -1$ if $x_{iA} < x_{iB}$, and $s_i = 0$ if tied. Let $W$ be the number of $+1$s among the $m$ non-ties. Under $H_0$, $W \sim \mathrm{Binomial}(m, 1/2)$. Large $m$ allows the normal approximation

$$z = \frac{W - m/2}{\sqrt{m/4}}$$

to be compared to standard normal quantiles.
b) Wilcoxon Signed-Rank Test: For each dataset $i$, compute $d_i = x_{iA} - x_{iB}$; discard ties ($d_i = 0$). Rank the $|d_i|$ among the $m$ nonzero values, and sum the ranks of the positive $d_i$ to obtain $T^+$. Under $H_0$ (symmetric differences),

$$z = \frac{T^+ - m(m+1)/4}{\sqrt{m(m+1)(2m+1)/24}}$$

is approximately standard normal for large $m$. Both tests require family-wise correction (e.g., Bonferroni, Holm) over all pairs (Benavoli et al., 2015).
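Both pairwise tests are available in SciPy; the sketch below runs them on hypothetical paired scores and applies a Holm step-down correction (the `holm` helper is an illustrative implementation, not a library function):

```python
import numpy as np
from scipy.stats import binomtest, wilcoxon

rng = np.random.default_rng(1)
# Hypothetical paired scores of algorithms A and B on n = 30 datasets.
a = rng.normal(0.80, 0.05, 30)
b = a - rng.normal(0.02, 0.03, 30)          # A is slightly better on average
d = a - b
d = d[d != 0]                               # drop ties before either test

# Sign test: wins of A among the m non-tied datasets vs Binomial(m, 1/2).
p_sign = binomtest(int((d > 0).sum()), len(d), 0.5).pvalue

# Wilcoxon signed-rank test on the same differences.
p_wilc = wilcoxon(d).pvalue

def holm(pvals):
    """Holm step-down adjusted p-values for a family of comparisons."""
    m = len(pvals)
    order = np.argsort(pvals)
    adj = np.empty(m)
    running = 0.0
    for step, i in enumerate(order):
        running = max(running, (m - step) * pvals[i])   # enforce monotonicity
        adj[i] = min(1.0, running)
    return adj

print(p_sign, p_wilc, holm(np.array([p_sign, p_wilc, 0.04])))
```

In a full comparison of $k$ algorithms, `holm` would be applied to the $k(k-1)/2$ pairwise p-values.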
5. Recent Developments: S-Statistics and Graphical Interpretation
Recent work (Elamir, 2022) proposes a graphical "S-plot" approach that simultaneously provides the global Friedman test and local post-hoc indications with drastically fewer comparisons. For $k$ treatments and $n$ blocks, each treatment $j$ has a score

$$S_j = \frac{12}{n\,k(k+1)} \left( R_j - E(R_j) \right)^2, \qquad E(R_j) = \frac{n(k+1)}{2},$$

where $E(R_j)$ is the expected rank sum under $H_0$. The sum $\sum_{j=1}^{k} S_j$ recovers the classical Friedman statistic $\chi^2_F$.
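The decomposition of the Friedman statistic into per-treatment scores can be verified numerically; a sketch with hypothetical scores:

```python
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

rng = np.random.default_rng(2)
scores = rng.random((12, 4))                 # n = 12 blocks, k = 4 treatments
n, k = scores.shape

R = rankdata(-scores, axis=1).sum(axis=0)    # rank sums R_j
S = 12 / (n * k * (k + 1)) * (R - n * (k + 1) / 2) ** 2

chi2_F, _ = friedmanchisquare(*scores.T)
print(S, S.sum(), chi2_F)                    # S.sum() equals the Friedman statistic
```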
The distribution of $S_j$ is well-approximated via gamma moment matching:
- The mean, variance, and third moment of $S_j$ are derived from those of $R_j$; under $H_0$, $E(S_j) = (k-1)/k$, so that $\sum_j E(S_j) = k - 1$, the mean of $\chi^2_{k-1}$.
- A Gamma($a$, $b$) distribution is fitted by matching mean and skewness: a gamma variable has mean $ab$ and skewness $2/\sqrt{a}$, giving shape $a = 4/\gamma_1^2$ and scale $b = E(S_j)/a$, where $\gamma_1$ is the skewness of $S_j$.
- The threshold $c_{\alpha/k}$, the upper $\alpha/k$ quantile of the fitted gamma, provides a Bonferroni-adjusted familywise Type I error.
The S-plot visualizes each $S_j$; treatments with $S_j > c_{\alpha/k}$ are significant contributors to rejection. This reduces testing from $k(k-1)/2$ pairwise comparisons to $k$ per-treatment comparisons with controlled error rates and delivers immediate interpretive insight (Elamir, 2022).
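Elamir derives the gamma moments in closed form; the sketch below instead estimates them by Monte Carlo under $H_0$ (each treatment's rank is uniform on $1, \dots, k$ within each block), so it is an approximation of the published procedure rather than a reimplementation:

```python
import numpy as np
from scipy.stats import gamma, skew

def s_threshold(k, n, alpha=0.05, reps=20_000, seed=0):
    """Bonferroni-adjusted threshold c_{alpha/k} for S_j via a gamma fit.

    The gamma moments are estimated by Monte Carlo under H0 (marginally,
    R_j is a sum of n iid uniform ranks 1..k), not by Elamir's closed forms.
    """
    rng = np.random.default_rng(seed)
    R = rng.integers(1, k + 1, size=(reps, n)).sum(axis=1)       # null rank sums
    S = 12 / (n * k * (k + 1)) * (R - n * (k + 1) / 2) ** 2      # null S_j draws

    mu, g1 = S.mean(), skew(S)
    a = 4 / g1 ** 2                     # gamma skewness = 2 / sqrt(a)
    scale = mu / a                      # gamma mean = a * scale
    return gamma.ppf(1 - alpha / k, a, scale=scale)

print(s_threshold(5, 20))
```

For $k = 5$, $n = 20$ the fitted null of $S_j$ is close to $0.8\,\chi^2_1$, so the Bonferroni threshold at $\alpha = 0.05$ lands near $0.8 \times 6.6 \approx 5.3$.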
6. Empirical Validation and Practical Recommendations
Simulation studies have demonstrated that both the classical Friedman and S-statistic procedures maintain empirical Type I error within Bradley's robustness bounds across a range of $k$ and $n$ for both normal and exponential data, with accuracy improving as $n$ increases. Real-data applications (e.g., class size effects on children's questions, per Gibbons & Chakraborti) confirm that the S-plot precisely identifies the dominant treatments responsible for global rejection, reducing the reliance on multiple pairwise post-hoc tables (Elamir, 2022).
Practical guidelines:
- Apply the Friedman test as the omnibus procedure for multiple-treatment, multiple-block designs.
- Avoid the classical Nemenyi mean-ranks test; its results for a pair may depend irrationally on other treatments present.
- Prefer pairwise comparisons based exclusively on two-algorithm tests (Wilcoxon signed-rank if symmetry plausible, else sign test), with appropriate correction for multiple comparisons (Benavoli et al., 2015).
- Consider global-to-local visualization approaches such as S-plots for succinct interpretability and error control.
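The recommended workflow, an omnibus Friedman test followed by Holm-corrected two-algorithm Wilcoxon tests, can be sketched end to end (hypothetical data; algorithm names are placeholders):

```python
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(3)
scores = rng.random((15, 4))                 # 15 datasets x 4 algorithms
names = ["A", "B", "C", "D"]

# Step 1: omnibus Friedman test.
stat, p = friedmanchisquare(*scores.T)
if p < 0.05:
    # Step 2: two-algorithm Wilcoxon tests with Holm's step-down correction.
    pairs = list(combinations(range(len(names)), 2))
    pvals = [wilcoxon(scores[:, i] - scores[:, j]).pvalue for i, j in pairs]
    m = len(pvals)
    running = 0.0
    for step, idx in enumerate(np.argsort(pvals)):
        i, j = pairs[idx]
        running = max(running, (m - step) * pvals[idx])   # Holm adjustment
        print(f"{names[i]} vs {names[j]}: adjusted p = {min(1.0, running):.3f}")
else:
    print(f"Friedman p = {p:.3f}: no evidence of any difference")
```

Note that the post-hoc decisions here depend only on each pair's own scores, never on the rest of the pool.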
7. Summary Table: Properties and Critique
| Method | Pool Dependence of Pairwise Decisions | Familywise Error Control | Number of Comparisons |
|---|---|---|---|
| Friedman + Nemenyi | Yes (dependent) | Yes (nominal) | $k(k-1)/2$ |
| Friedman + Pairwise (Wilcoxon/Sign) | No (independent) | Yes (Bonferroni/Holm) | $k(k-1)/2$ |
| S-Statistic/S-Plot [Editor's term] | No (per-treatment) | Yes (gamma approx., Bonferroni) | $k$ |
The core limitation of the mean-ranks test is its statistical dependence on the composition of the entire set of algorithms, which undermines its relevance for pairwise inference. Alternative approaches leveraging either pairwise-only tests or S-statistical visualizations achieve more interpretable, pool-independent, and statistically valid post-hoc inference (Benavoli et al., 2015, Elamir, 2022).