Friedman & Nemenyi Tests Overview

Updated 12 February 2026
  • The Friedman test is a nonparametric omnibus procedure that ranks and compares multiple treatments across datasets to test for performance equivalence.
  • The Nemenyi mean-ranks test serves as a post-hoc method to compare all pairs following Friedman’s test but can yield paradoxical results due to pool-dependence.
  • Alternatives like pairwise tests and S-plots offer pool-independent comparisons and controlled Type I error, enhancing inference clarity.

The Friedman and Nemenyi tests are core nonparametric methodologies for analyzing and comparing multiple treatments or algorithms under a randomized complete block design. Their primary application is the statistical comparison of several methods across multiple data sets, with particular relevance in areas such as machine learning, psychology, and medicine. After an initial omnibus hypothesis test (Friedman), post-hoc analyses such as the Nemenyi mean-ranks test are commonly employed to determine sources of significant differences. However, the dependence of Nemenyi post-hoc inferences on the entire set of treatments and accompanying paradoxes have recently prompted scrutiny and recommendations for alternative pairwise procedures.

1. The Friedman Test: Omnibus Nonparametric Comparison

The Friedman test is used to detect differences among $m$ algorithms (treatments) evaluated on $N$ datasets (blocks). For each dataset $j$, the outcomes $X_{ij}$ ($i=1,\dots,m$) are ranked, yielding $R_{ij}$; ranks replace raw performance scores, with ties handled via average ranking. Each algorithm's sum of ranks is $R_i = \sum_{j=1}^N R_{ij}$, with mean rank $\overline{R}_i = R_i/N$. The null hypothesis $H_0$ asserts "all $m$ algorithms perform equivalently," that is, $E[\overline{R}_i] = (m+1)/2$ for all $i = 1,\dots,m$.

The Friedman statistic is

$$\chi_F^2 = \frac{12}{N m (m+1)} \sum_{i=1}^{m} R_i^2 - 3N(m+1).$$

Alternatively, in terms of $\overline{R}_i$,

$$\chi_F^2 = \frac{12N}{m(m+1)} \sum_{i=1}^{m} \left(\overline{R}_i - \frac{m+1}{2}\right)^2.$$

For large $N$, $\chi_F^2$ is approximately $\chi^2$-distributed with $m-1$ degrees of freedom. In practice, the test serves as a robust, nonparametric alternative to repeated-measures ANOVA for arbitrary, not necessarily normal, data (Benavoli et al., 2015; Elamir, 2022).
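The two algebraic forms of the statistic can be checked numerically. The following is a minimal sketch in plain Python, assuming higher scores are better and rank 1 is best; the function names and the toy score matrix are illustrative, and no tie correction is applied to the statistic itself.

```python
def average_ranks(values):
    """Rank values so the largest gets rank 1; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run of tied values starting at pos.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def friedman_statistic(scores):
    """scores[j][i] is the score of algorithm i on dataset j (higher is better)."""
    N, m = len(scores), len(scores[0])
    rank_rows = [average_ranks(row) for row in scores]
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(m)]
    mean_ranks = [r / N for r in rank_sums]
    # Form based on rank sums R_i.
    chi2_sum = 12.0 / (N * m * (m + 1)) * sum(r * r for r in rank_sums) - 3.0 * N * (m + 1)
    # Form based on mean ranks, centered at (m + 1) / 2.
    chi2_mean = 12.0 * N / (m * (m + 1)) * sum((r - (m + 1) / 2) ** 2 for r in mean_ranks)
    return chi2_sum, chi2_mean, mean_ranks


# Toy data: 4 datasets (rows) by 3 algorithms (columns); values are made up.
scores = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.70, 0.75, 0.65],
    [0.95, 0.90, 0.85],
]
chi2_sum, chi2_mean, mean_ranks = friedman_statistic(scores)
print(chi2_sum, chi2_mean, mean_ranks)
```

Both forms return the same value on this input, which is then compared against a $\chi^2$ quantile with $m-1$ degrees of freedom; with only four blocks the asymptotic approximation is rough, so the example illustrates the arithmetic rather than a realistic analysis.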

2. Nemenyi Mean-Ranks Post-hoc Test

Upon rejection of the Friedman test's omnibus null, the Nemenyi test is traditionally used for all $m(m-1)/2$ algorithm pairs. For any algorithms $A$, $B$, the mean-rank difference $|\overline{R}_A - \overline{R}_B|$ forms the test statistic, evaluated versus a critical difference (CD):

$$\mathrm{CD} = \frac{q_{\alpha;m,\infty}}{\sqrt{2}} \sqrt{\frac{m(m+1)}{6N}},$$

where $q_{\alpha;m,\infty}$ is the upper $\alpha$-quantile of the Studentized range distribution with $m$ treatments and infinite degrees of freedom. Algorithms $A$ and $B$ are declared significantly different at family-wise level $\alpha$ if $|\overline{R}_A - \overline{R}_B| > \mathrm{CD}$. This controls the family-wise error rate across all $m(m-1)/2$ comparisons and is operationally analogous to Tukey-Kramer procedures for parametric ANOVA (Benavoli et al., 2015).
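As a rough illustration of the critical-difference rule, the sketch below assumes the form $\mathrm{CD} = (q/\sqrt{2})\sqrt{m(m+1)/(6N)}$, with $q$ the raw Studentized range quantile supplied by the caller from a table. The function name, the mean ranks, and the problem size are hypothetical; 3.314 is the approximate tabulated quantile for three groups at $\alpha = 0.05$ with infinite degrees of freedom.

```python
import math


def nemenyi_cd(q_raw, m, N):
    """Critical difference for mean ranks.

    q_raw: upper-alpha quantile of the Studentized range distribution for
           m treatments and infinite degrees of freedom (looked up in a table).
    m:     number of algorithms; N: number of datasets.
    """
    return (q_raw / math.sqrt(2.0)) * math.sqrt(m * (m + 1) / (6.0 * N))


# Illustrative inputs: m = 3 algorithms, N = 24 datasets, alpha = 0.05.
cd = nemenyi_cd(3.314, m=3, N=24)

# Hypothetical mean ranks for two of the three algorithms.
mean_rank_A, mean_rank_B = 1.40, 2.35
significant = abs(mean_rank_A - mean_rank_B) > cd
print(round(cd, 3), significant)
```

Passing the quantile in as an argument keeps the sketch free of third-party dependencies; a statistics library that exposes the Studentized range distribution could replace the table lookup.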

3. Critique of the Mean-Ranks Test and Its Dependence on Algorithm Pool

A foundational critique, detailed by Benavoli, Corani, and Mangili (2016) (Benavoli et al., 2015), is that Nemenyi's mean-ranks test produces decisions for any pair $(A, B)$ contingent on the presence, absence, and relative ordering of all other algorithms involved in the ranking. This can lead to paradoxical scenarios:

  • In an experiment where $A$ and $B$ each win on half the cases, two-algorithm tests (sign, Wilcoxon, t-test) yield nonsignificance. However, introducing additional poor-performing algorithms can inflate $|\overline{R}_A - \overline{R}_B|$ sufficiently to exceed the CD and declare a significant difference.
  • On real-world datasets, the decision on a given pair can flip between significant and non-significant solely due to the composition of the algorithm pool. For example, on UCI data, the decision for a fixed pair of classifiers was shown to change significance status depending on which other classifiers were included [(Benavoli et al., 2015), Table 4].

This pool-dependence means mean-ranks tests cannot guarantee control of maximum Type I error when equivalent algorithms are present, as also discussed by Fligner & Killeen (1984).
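The first paradox can be reproduced with a small synthetic example. In the sketch below (all scores are made up), $A$ and $B$ each win on exactly half of 20 datasets; the added algorithms fall between $A$ and $B$ when $A$ wins and below both when $B$ wins. The critical difference uses the approximate tabulated Studentized range quantile 3.858 for five groups at $\alpha = 0.05$, which is an assumption of this illustration.

```python
import math


def ranks_desc(row):
    """Rank 1 goes to the highest score; this toy data contains no ties."""
    order = sorted(range(len(row)), key=lambda i: -row[i])
    ranks = [0] * len(row)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


def mean_rank_gap(n_datasets, n_extra):
    """Mean rank of B minus mean rank of A when n_extra algorithms join the pool."""
    rows = []
    for j in range(n_datasets):
        if j % 2 == 0:
            # A wins clearly; the extra algorithms land between A and B.
            extras = [0.5 + 0.01 * k for k in range(n_extra)]
            rows.append([0.9, 0.1] + extras)
        else:
            # B wins narrowly; the extra algorithms are worse than both.
            extras = [0.2 + 0.01 * k for k in range(n_extra)]
            rows.append([0.8, 0.9] + extras)
    rank_rows = [ranks_desc(r) for r in rows]
    mean_a = sum(r[0] for r in rank_rows) / n_datasets
    mean_b = sum(r[1] for r in rank_rows) / n_datasets
    return mean_b - mean_a


N = 20
gap_pair_only = mean_rank_gap(N, n_extra=0)  # pool is just {A, B}
gap_with_pool = mean_rank_gap(N, n_extra=3)  # pool of m = 5 algorithms

# Critical difference for m = 5, N = 20 (3.858: approximate tabulated quantile).
m = 5
cd = (3.858 / math.sqrt(2.0)) * math.sqrt(m * (m + 1) / (6.0 * N))
print(gap_pair_only, gap_with_pool, round(cd, 3), gap_with_pool > cd)
```

The win/loss record between $A$ and $B$ is identical in both pools, yet the mean-rank gap moves from zero to a value above the critical difference once the three extra algorithms are ranked alongside them.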

4. Alternative Two-Algorithm Post-hoc Procedures

To address the pool-dependence flaw, tests evaluating only the paired performances of $A$ and $B$ are recommended. These include:

a) Sign Test: For each dataset $j$, set $s_j = +1$ if $X_{Aj} > X_{Bj}$, $s_j = -1$ if $X_{Aj} < X_{Bj}$, and $s_j = 0$ if tie. Let $n$ be the number of non-tied datasets and $S^+$ the number of $+1$s among them. Under $H_0$, $S^+ \sim \mathrm{Binomial}(n, 1/2)$. Large $n$ allows the normal approximation

$$z = \frac{S^+ - n/2}{\sqrt{n}/2},$$

to compare to standard normal quantiles.

b) Wilcoxon Signed-Rank Test: For each dataset $j$, compute $d_j = X_{Aj} - X_{Bj}$; discard ties ($d_j = 0$). Rank $|d_j|$ among the $n$ nonzero values and sum the ranks $T^+$ for positive $d_j$. Under $H_0$ (symmetric differences),

$$z = \frac{T^+ - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}} \;\overset{\text{approx.}}{\sim}\; \mathcal{N}(0, 1)$$

for large $n$. Both tests require family-wise correction (e.g., Bonferroni, Holm) over all $m(m-1)/2$ pairs (Benavoli et al., 2015).
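Both normal-approximation statistics depend only on the two algorithms being compared, which the following plain-Python sketch makes explicit. The function names and the integer scores are illustrative; tied absolute differences receive average ranks, and with only nine non-tied datasets the approximation is coarse, so the example demonstrates the arithmetic only.

```python
import math


def sign_test_z(a, b):
    """z-statistic of the sign test for paired scores a and b (ties dropped)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    wins = sum(1 for d in diffs if d > 0)  # number of datasets where a beats b
    return (wins - n / 2) / (math.sqrt(n) / 2)


def wilcoxon_z(a, b):
    """z-statistic of the Wilcoxon signed-rank test (zero differences dropped)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Group equal absolute differences so they share an average rank.
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t_plus - mean) / sd


# Hypothetical accuracies (in percent) of two algorithms on ten datasets.
scores_A = [85, 90, 78, 92, 88, 75, 80, 91, 70, 83]
scores_B = [80, 91, 70, 85, 88, 72, 74, 86, 71, 79]
z_sign = sign_test_z(scores_A, scores_B)
z_wilcoxon = wilcoxon_z(scores_A, scores_B)
print(round(z_sign, 3), round(z_wilcoxon, 3))
```

Because neither function receives scores from any third algorithm, the resulting decision for the pair cannot change when the pool of competitors changes; only the multiplicity correction depends on how many pairs are tested.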

5. Recent Developments: S-Statistics and Graphical Interpretation

Recent work (Elamir, 2022) proposes a graphical "S-plot" approach that simultaneously provides the global Friedman test and local post-hoc indications with drastically fewer comparisons. For $m$ treatments and $N$ blocks, each treatment $i$ has a score

$$S_i = \frac{12}{N m (m+1)} \left(R_i - E(R_i)\right)^2,$$

where $E(R_i) = N(m+1)/2$ is the expected rank sum under $H_0$. The sum $\sum_{i=1}^{m} S_i$ recovers the classical Friedman statistic.

The distribution of $S_i$ is well-approximated via gamma moment matching:

  • The mean $E(S_i)$, variance $\mathrm{Var}(S_i)$, and third moment of $S_i$ are derived from those of $R_i$.
  • A Gamma distribution with shape $a$ and scale $b$ is fitted with $a = 4/\gamma_1^2$ and $b = E(S_i)/a$, where $\gamma_1$ is the skewness of $S_i$, matching mean and skewness.
  • The threshold $c_\alpha$, the upper $\alpha/m$ quantile of the fitted Gamma distribution, provides Bonferroni-adjusted familywise Type I error control.

The S-plot visualizes each $S_i$; treatments with $S_i > c_\alpha$ are significant contributors to rejection. This reduces testing from $m(m-1)/2$ to $m$ comparisons with controlled error rates and delivers immediate interpretive insight (Elamir, 2022).
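The decomposition of the Friedman statistic into per-treatment scores can be verified directly. The sketch below assumes the score takes the form $S_i = \frac{12}{Nm(m+1)}\,(R_i - N(m+1)/2)^2$, which is the form consistent with the scores summing to the Friedman statistic; the function name and rank sums are illustrative, and the gamma-based threshold is not computed here.

```python
def s_scores(rank_sums, N):
    """Per-treatment contributions to the Friedman statistic (assumed form)."""
    m = len(rank_sums)
    expected = N * (m + 1) / 2  # expected rank sum under the null hypothesis
    scale = 12.0 / (N * m * (m + 1))
    return [scale * (r - expected) ** 2 for r in rank_sums]


# Illustrative rank sums for m = 3 treatments over N = 4 blocks.
rank_sums = [5, 7, 12]
N = 4
m = len(rank_sums)

per_treatment = s_scores(rank_sums, N)

# Classical Friedman statistic from the same rank sums, for comparison.
friedman = 12.0 / (N * m * (m + 1)) * sum(r * r for r in rank_sums) - 3.0 * N * (m + 1)
print(per_treatment, sum(per_treatment), friedman)
```

In an S-plot, each of these per-treatment values would be drawn against a common threshold, so the treatment with rank sum 12 would stand out as the largest contributor in this toy input.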

6. Empirical Validation and Practical Recommendations

Simulation studies have demonstrated that both the classical Friedman and S-statistic procedures maintain empirical Type I error within Bradley's robustness bounds across a range of $m$ and $N$ for both normal and exponential data, with accuracy improving as $N$ increases. Real-data applications (e.g., class size effects on children's questions, per Gibbons & Chakraborti) confirm that the S-plot precisely identifies the dominant treatments responsible for global rejection, reducing the reliance on multiple pairwise post-hoc tables (Elamir, 2022).

Practical guidelines:

  • Apply the Friedman test as the omnibus procedure for multiple-treatment, multiple-block designs.
  • Avoid the classical Nemenyi mean-ranks test; its results for a pair may depend irrationally on other treatments present.
  • Prefer pairwise comparisons based exclusively on two-algorithm tests (Wilcoxon signed-rank if symmetry plausible, else sign test), with appropriate correction for multiple comparisons (Benavoli et al., 2015).
  • Consider global-to-local visualization approaches such as S-plots for succinct interpretability and error control.
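The multiplicity correction recommended for pairwise tests can be applied with a few lines of code. The following is a minimal sketch of Holm's step-down procedure; the function name and the p-values are hypothetical and stand in for the output of sign or Wilcoxon tests on each pair.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: returns a reject flag for each p-value."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for step, idx in enumerate(order):
        # The smallest p-value is compared with alpha/k, the next with alpha/(k-1), ...
        if p_values[idx] <= alpha / (k - step):
            reject[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are retained
    return reject


# Hypothetical p-values for the three pairs among algorithms A, B, C.
pairs = ["A-B", "A-C", "B-C"]
p_values = [0.004, 0.030, 0.020]

decisions = dict(zip(pairs, holm_reject(p_values)))
bonferroni = {pair: p <= 0.05 / len(pairs) for pair, p in zip(pairs, p_values)}
print(decisions)
print(bonferroni)
```

On these inputs Holm rejects all three hypotheses while plain Bonferroni rejects only the first, which illustrates why the step-down variant is often preferred when the number of pairs grows.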

7. Summary Table: Properties and Critique

| Method | Pool dependence of pairwise test | Familywise error control | Number of comparisons |
|---|---|---|---|
| Friedman + Nemenyi | Yes (dependent) | Yes (nominal) | $m(m-1)/2$ |
| Friedman + Pairwise (Wilcoxon/Sign) | No (independent) | Yes (Bonferroni/Holm) | $m(m-1)/2$ |
| S-Statistic/S-Plot [Editor's term] | No (per-treatment) | Yes (Gamma approx, Bonferroni) | $m$ |

The core limitation of the mean-ranks test is its statistical dependence on the composition of the entire set of algorithms, which undermines its relevance for pairwise inference. Alternative approaches that use either pairwise-only tests or S-statistic visualizations achieve more interpretable, pool-independent, and statistically valid post-hoc inference (Benavoli et al., 2015; Elamir, 2022).
