Friedman & Nemenyi Tests Overview
- The Friedman test is a nonparametric omnibus procedure that ranks and compares multiple treatments across datasets to test for performance equivalence.
- The Nemenyi mean-ranks test serves as a post-hoc method to compare all pairs following Friedman’s test but can yield paradoxical results due to pool-dependence.
- Alternatives like pairwise tests and S-plots offer pool-independent comparisons and controlled Type I error, enhancing inference clarity.
The Friedman and Nemenyi tests are core nonparametric methodologies for analyzing and comparing multiple treatments or algorithms under a randomized complete block design. Their primary application is the statistical comparison of several methods across multiple data sets, with particular relevance in areas such as machine learning, psychology, and medicine. After an initial omnibus hypothesis test (Friedman), post-hoc analyses such as the Nemenyi mean-ranks test are commonly employed to determine sources of significant differences. However, the dependence of Nemenyi post-hoc inferences on the entire set of treatments and accompanying paradoxes have recently prompted scrutiny and recommendations for alternative pairwise procedures.
1. The Friedman Test: Omnibus Nonparametric Comparison
The Friedman test is used to detect differences among $k$ algorithms (treatments) evaluated on $n$ datasets (blocks). For each dataset $i$, the outcomes $(x_{i1}, \dots, x_{ik})$ are ranked, yielding ranks $r_{ij}$; average ranks replace raw performance scores, with ties handled via average-ranking. Each algorithm's sum of ranks is $R_j = \sum_{i=1}^{n} r_{ij}$, with mean rank $\bar{R}_j = R_j / n$. The null hypothesis asserts "all algorithms perform equivalently," that is, $E(\bar{R}_j) = (k+1)/2$ for all $j$.
The Friedman statistic is

$$\chi^2_F = \frac{12n}{k(k+1)} \sum_{j=1}^{k} \left( \bar{R}_j - \frac{k+1}{2} \right)^2.$$

Alternatively, in terms of $R_j$,

$$\chi^2_F = \frac{12}{n\,k(k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1).$$

For large $n$, $\chi^2_F$ is approximately $\chi^2$-distributed with $k-1$ degrees of freedom. In practice, the test serves as a robust, nonparametric alternative to repeated-measures ANOVA for arbitrary, not necessarily normal, data (Benavoli et al., 2015; Elamir, 2022).
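The statistic can be computed directly from the ranks and cross-checked against SciPy's `friedmanchisquare`; a minimal sketch with hypothetical random scores (higher = better):

```python
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

# Hypothetical accuracy scores: n = 8 datasets (rows) x k = 4 algorithms (columns).
rng = np.random.default_rng(0)
scores = rng.random((8, 4))
n, k = scores.shape

# Rank within each dataset (rank 1 = best; ties would receive average ranks).
ranks = rankdata(-scores, axis=1)
R_mean = ranks.mean(axis=0)                      # mean rank of each algorithm

# chi2_F = 12n / (k(k+1)) * sum_j (Rbar_j - (k+1)/2)^2
chi2_F = 12 * n / (k * (k + 1)) * np.sum((R_mean - (k + 1) / 2) ** 2)

# Cross-check against SciPy's implementation (expects one array per treatment).
stat, p = friedmanchisquare(*scores.T)
print(chi2_F, stat, p)
```

With continuous random scores there are no ties, so the hand-computed value matches SciPy's (which additionally applies a tie correction when ties occur).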
2. Nemenyi Mean-Ranks Post-hoc Test
Upon rejection of the Friedman test's omnibus null, the Nemenyi test is traditionally used for all algorithm pairs. For any algorithms $j$, $l$, the absolute mean-rank difference $|\bar{R}_j - \bar{R}_l|$ forms the test statistic, evaluated against a critical difference (CD):

$$\mathrm{CD} = q_{\alpha} \sqrt{\frac{k(k+1)}{6n}},$$

where $q_{\alpha}$ is the upper $\alpha$-quantile of the Studentized range distribution (divided by $\sqrt{2}$) with $k$ treatments and infinite degrees of freedom. Algorithms $j$ and $l$ are declared significantly different at family-wise level $\alpha$ if $|\bar{R}_j - \bar{R}_l| > \mathrm{CD}$. This controls the family-wise error rate across all $k(k-1)/2$ comparisons and is operationally analogous to the Tukey-Kramer procedure for parametric ANOVA (Benavoli et al., 2015).
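The critical difference can be computed with `scipy.stats.studentized_range` (SciPy ≥ 1.7); a sketch in which the infinite degrees of freedom are approximated by df = 10 000:

```python
import numpy as np
from scipy.stats import studentized_range

def nemenyi_cd(k, n, alpha=0.05):
    """Critical difference CD = q_alpha * sqrt(k(k+1) / (6n)).

    q_alpha is the upper-alpha Studentized range quantile divided by sqrt(2);
    infinite degrees of freedom are approximated here by df = 10_000.
    """
    q = studentized_range.ppf(1 - alpha, k, 10_000) / np.sqrt(2)
    return q * np.sqrt(k * (k + 1) / (6 * n))

# Example: k = 5 algorithms over n = 20 datasets.
print(nemenyi_cd(5, 20))
```

For $k = 5$, $n = 20$, $\alpha = 0.05$ this gives CD $\approx 1.36$, matching the tabulated $q_{0.05} \approx 2.728$.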
3. Critique of the Mean-Ranks Test and Its Dependence on Algorithm Pool
A foundational critique, detailed by Benavoli, Corani, and Mangili (Benavoli et al., 2015), is that the Nemenyi mean-ranks test produces a decision for any pair that is contingent on the presence, absence, and relative ordering of all other algorithms in the ranking. This can lead to paradoxical scenarios:
- In an experiment where algorithms $A$ and $B$ each win on half the cases, two-algorithm tests (sign, Wilcoxon signed-rank, paired t-test) yield nonsignificance. However, introducing additional poor-performing algorithms can inflate the mean-rank difference $|\bar{R}_A - \bar{R}_B|$ sufficiently to exceed the critical difference and declare a significant difference.
- On real-world datasets, the decision on a given pair can flip between significant and non-significant solely due to the composition of the algorithm pool. In experiments on UCI data, a fixed pair of classifiers was shown to change significance status depending on which other classifiers were included in the comparison [(Benavoli et al., 2015), Table 4].
This pool-dependence means mean-ranks tests cannot guarantee control of maximum Type I error when equivalent algorithms are present, as also discussed by Fligner & Killeen (1984).
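The paradox is easy to reproduce numerically. The sketch below builds a hypothetical benchmark in which $A$ and $B$ each win on 25 of 50 datasets (so the sign test is maximally non-significant), then adds three weak algorithms that fall between $A$ and $B$ only on the datasets $A$ wins:

```python
import numpy as np
from scipy.stats import rankdata, binomtest

# Hypothetical scores (higher = better): A wins datasets 0..24, B wins 25..49.
n = 50
A = np.r_[np.full(25, 0.90), np.full(25, 0.85)]
B = np.r_[np.full(25, 0.50), np.full(25, 0.90)]

# The sign test on the pair (A, B) sees 25 wins each: p = 1.
print(binomtest(25, 50, 0.5).pvalue)

# Pool 1: only A and B. Their mean ranks are identical.
pool2 = np.column_stack([A, B])
mr2 = rankdata(-pool2, axis=1).mean(axis=0)

# Pool 2: three weak algorithms that land *between* A and B exactly on the
# datasets A wins, pushing B's mean rank down, and below both elsewhere.
C = np.r_[np.tile([0.8, 0.7, 0.6], (25, 1)), np.tile([0.1, 0.2, 0.3], (25, 1))]
pool5 = np.column_stack([A, B, C])
mr5 = rankdata(-pool5, axis=1).mean(axis=0)

print(mr2[:2], mr5[:2])   # A-vs-B mean-rank gap: 0.0 in pool 1, 1.5 in pool 2
```

With $k = 5$ and $n = 50$ the Nemenyi critical difference at $\alpha = 0.05$ is roughly $0.86$, so the gap of $1.5$ would be declared significant even though the head-to-head record of $A$ and $B$ is exactly tied.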
4. Alternative Two-Algorithm Post-hoc Procedures
To address the pool-dependence flaw, tests evaluating only the paired performances of $A$ and $B$ are recommended. These include:

a) Sign Test: For each dataset $i$, set $s_i = +1$ if $x_{iA} > x_{iB}$, $s_i = -1$ if $x_{iA} < x_{iB}$, and $s_i = 0$ if tied. Let $W$ be the number of $+1$s among the $m$ non-ties. Under $H_0$, $W \sim \mathrm{Binomial}(m, 1/2)$. Large $m$ allows the normal approximation

$$z = \frac{W - m/2}{\sqrt{m/4}}$$

to be compared to standard normal quantiles.
b) Wilcoxon Signed-Rank Test: For each dataset $i$, compute $d_i = x_{iA} - x_{iB}$; discard ties ($d_i = 0$). Rank the $|d_i|$ among the $m$ nonzero values, and sum the ranks of the positive $d_i$ to obtain $T^+$. Under $H_0$ (symmetric differences),

$$z = \frac{T^+ - m(m+1)/4}{\sqrt{m(m+1)(2m+1)/24}}$$

is approximately standard normal for large $m$. Both tests require family-wise correction (e.g., Bonferroni, Holm) over all pairs (Benavoli et al., 2015).
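Both pairwise tests are available in SciPy; the sketch below runs them on hypothetical paired scores and applies a Holm step-down correction (the `holm` helper is an illustrative implementation, not a library function):

```python
import numpy as np
from scipy.stats import binomtest, wilcoxon

rng = np.random.default_rng(1)
# Hypothetical paired scores of algorithms A and B on n = 30 datasets.
a = rng.normal(0.80, 0.05, 30)
b = a - rng.normal(0.02, 0.03, 30)          # A is slightly better on average
d = a - b
d = d[d != 0]                               # drop ties before either test

# Sign test: wins of A among the m non-tied datasets vs Binomial(m, 1/2).
p_sign = binomtest(int((d > 0).sum()), len(d), 0.5).pvalue

# Wilcoxon signed-rank test on the same differences.
p_wilc = wilcoxon(d).pvalue

def holm(pvals):
    """Holm step-down adjusted p-values for a family of comparisons."""
    m = len(pvals)
    order = np.argsort(pvals)
    adj = np.empty(m)
    running = 0.0
    for step, i in enumerate(order):
        running = max(running, (m - step) * pvals[i])   # enforce monotonicity
        adj[i] = min(1.0, running)
    return adj

print(p_sign, p_wilc, holm(np.array([p_sign, p_wilc, 0.04])))
```

In a full comparison of $k$ algorithms, `holm` would be applied to the $k(k-1)/2$ pairwise p-values.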
5. Recent Developments: S-Statistics and Graphical Interpretation
Recent work (Elamir, 2022) proposes a graphical "S-plot" approach that simultaneously provides the global Friedman test and local post-hoc indications with drastically fewer comparisons. For $k$ treatments and $n$ blocks, each treatment $j$ has a score

$$S_j = \frac{12}{n\,k(k+1)} \left( R_j - E(R_j) \right)^2, \qquad E(R_j) = \frac{n(k+1)}{2},$$

where $E(R_j)$ is the expected rank sum under $H_0$. The sum $\sum_{j=1}^{k} S_j$ recovers the classical Friedman statistic $\chi^2_F$.
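The decomposition of the Friedman statistic into per-treatment scores can be verified numerically; a sketch with hypothetical scores:

```python
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

rng = np.random.default_rng(2)
scores = rng.random((12, 4))                 # n = 12 blocks, k = 4 treatments
n, k = scores.shape

R = rankdata(-scores, axis=1).sum(axis=0)    # rank sums R_j
S = 12 / (n * k * (k + 1)) * (R - n * (k + 1) / 2) ** 2

chi2_F, _ = friedmanchisquare(*scores.T)
print(S, S.sum(), chi2_F)                    # S.sum() equals the Friedman statistic
```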
The distribution of $S_j$ is well-approximated via gamma moment matching:
- The mean, variance, and third moment of $S_j$ are derived from those of $R_j$; under $H_0$, $E(S_j) = (k-1)/k$, so that $\sum_j E(S_j) = k - 1$, the mean of $\chi^2_{k-1}$.
- A Gamma($a$, $b$) distribution is fitted by matching mean and skewness: a gamma variable has mean $ab$ and skewness $2/\sqrt{a}$, giving shape $a = 4/\gamma_1^2$ and scale $b = E(S_j)/a$, where $\gamma_1$ is the skewness of $S_j$.
- The threshold $c_{\alpha/k}$, the upper $\alpha/k$ quantile of the fitted gamma, provides a Bonferroni-adjusted familywise Type I error.
The S-plot visualizes each $S_j$; treatments with $S_j > c_{\alpha/k}$ are significant contributors to rejection. This reduces testing from $k(k-1)/2$ pairwise comparisons to $k$ per-treatment comparisons with controlled error rates and delivers immediate interpretive insight (Elamir, 2022).
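Elamir derives the gamma moments in closed form; the sketch below instead estimates them by Monte Carlo under $H_0$ (each treatment's rank is uniform on $1, \dots, k$ within each block), so it is an approximation of the published procedure rather than a reimplementation:

```python
import numpy as np
from scipy.stats import gamma, skew

def s_threshold(k, n, alpha=0.05, reps=20_000, seed=0):
    """Bonferroni-adjusted threshold c_{alpha/k} for S_j via a gamma fit.

    The gamma moments are estimated by Monte Carlo under H0 (marginally,
    R_j is a sum of n iid uniform ranks 1..k), not by Elamir's closed forms.
    """
    rng = np.random.default_rng(seed)
    R = rng.integers(1, k + 1, size=(reps, n)).sum(axis=1)       # null rank sums
    S = 12 / (n * k * (k + 1)) * (R - n * (k + 1) / 2) ** 2      # null S_j draws

    mu, g1 = S.mean(), skew(S)
    a = 4 / g1 ** 2                     # gamma skewness = 2 / sqrt(a)
    scale = mu / a                      # gamma mean = a * scale
    return gamma.ppf(1 - alpha / k, a, scale=scale)

print(s_threshold(5, 20))
```

For $k = 5$, $n = 20$ the fitted null of $S_j$ is close to $0.8\,\chi^2_1$, so the Bonferroni threshold at $\alpha = 0.05$ lands near $0.8 \times 6.6 \approx 5.3$.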
6. Empirical Validation and Practical Recommendations
Simulation studies have demonstrated that both the classical Friedman and S-statistic procedures maintain empirical Type I error within Bradley's robustness bounds across a range of $k$ and $n$ for both normal and exponential data, with accuracy improving as $n$ increases. Real-data applications (e.g., class size effects on children's questions, per Gibbons & Chakraborti) confirm that the S-plot precisely identifies the dominant treatments responsible for global rejection, reducing the reliance on multiple pairwise post-hoc tables (Elamir, 2022).
Practical guidelines:
- Apply the Friedman test as the omnibus procedure for multiple-treatment, multiple-block designs.
- Avoid the classical Nemenyi mean-ranks test; its results for a pair may depend irrationally on other treatments present.
- Prefer pairwise comparisons based exclusively on two-algorithm tests (Wilcoxon signed-rank if symmetry plausible, else sign test), with appropriate correction for multiple comparisons (Benavoli et al., 2015).
- Consider global-to-local visualization approaches such as S-plots for succinct interpretability and error control.
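The recommended workflow, an omnibus Friedman test followed by Holm-corrected two-algorithm Wilcoxon tests, can be sketched end to end (hypothetical data; algorithm names are placeholders):

```python
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(3)
scores = rng.random((15, 4))                 # 15 datasets x 4 algorithms
names = ["A", "B", "C", "D"]

# Step 1: omnibus Friedman test.
stat, p = friedmanchisquare(*scores.T)
if p < 0.05:
    # Step 2: two-algorithm Wilcoxon tests with Holm's step-down correction.
    pairs = list(combinations(range(len(names)), 2))
    pvals = [wilcoxon(scores[:, i] - scores[:, j]).pvalue for i, j in pairs]
    m = len(pvals)
    running = 0.0
    for step, idx in enumerate(np.argsort(pvals)):
        i, j = pairs[idx]
        running = max(running, (m - step) * pvals[idx])   # Holm adjustment
        print(f"{names[i]} vs {names[j]}: adjusted p = {min(1.0, running):.3f}")
else:
    print(f"Friedman p = {p:.3f}: no evidence of any difference")
```

Note that the post-hoc decisions here depend only on each pair's own scores, never on the rest of the pool.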
7. Summary Table: Properties and Critique
| Method | Pool Dependence of Pairwise Decisions | Familywise Error Control | Number of Comparisons |
|---|---|---|---|
| Friedman + Nemenyi | Yes (dependent) | Yes (nominal) | $k(k-1)/2$ |
| Friedman + Pairwise (Wilcoxon/Sign) | No (independent) | Yes (Bonferroni/Holm) | $k(k-1)/2$ |
| S-Statistic/S-Plot [Editor's term] | No (per-treatment) | Yes (gamma approx., Bonferroni) | $k$ |
The core limitation of the mean-ranks test is its statistical dependence on the composition of the entire set of algorithms, which undermines its relevance for pairwise inference. Alternative approaches leveraging either pairwise-only tests or S-statistical visualizations achieve more interpretable, pool-independent, and statistically valid post-hoc inference (Benavoli et al., 2015, Elamir, 2022).