
ADReSS 2020 Challenge

Updated 15 January 2026
  • The paper introduces a challenge that benchmarks automated speech-derived features for Alzheimer’s disease detection, MMSE score inference, and cognitive decline prediction.
  • It employs rigorous preprocessing, dual-modality feature extraction, and baseline ML models to ensure reproducible and clinically viable outcomes.
  • Results indicate that fusion and deep learning approaches enhance diagnostic accuracy despite challenges from noisy ASR outputs and acoustic variability.

The ADReSS 2020 Challenge is a benchmark initiative focused on the automatic detection and prognosis of Alzheimer's Dementia (AD) using only spontaneously produced speech. The challenge formalizes three core prediction tasks—binary Alzheimer’s disease recognition, inference of Mini-Mental State Examination (MMSE) scores, and prediction of future cognitive decline—each evaluated under rigorous, pre-defined test conditions and using strictly speech-derived features. The ADReSS framework emphasizes fully automated processing pipelines, specifically prohibiting manual annotation interventions, thus targeting the development of clinically viable machine learning solutions for dementia research (Luz et al., 2021).

1. Formal Statement of Prediction Tasks

ADReSS 2020 operationalizes three supervised learning problems. Input feature vectors $x \in \mathbb{R}^d$ are extracted algorithmically from audio recordings. Output variables vary by task:

  • Task 1 (AD Classification): Given $x$, learn a classifier $f_1 : \mathbb{R}^d \rightarrow \{0,1\}$ with $\hat{y}_{AD} = f_1(x) \approx y_{AD}$, where $y_{AD} = 1$ for AD and $0$ for controls.
  • Task 2 (MMSE Regression): Learn a regression function $f_2 : \mathbb{R}^d \rightarrow \mathbb{R}$ such that $\hat{y}_{MMSE} = f_2(x) \approx y_{MMSE}$, where $y_{MMSE}$ is the ground-truth Mini-Mental State Examination score.
  • Task 3 (Cognitive Decline Prediction): Learn a classifier $f_3 : \mathbb{R}^d \rightarrow \{0,1\}$; for input $x$ at baseline, $\hat{y}_{PROG} = f_3(x) \approx y_{PROG}$, where $y_{PROG} = 1$ if $\Delta\mathrm{MMSE} \geq 5$ over two years, otherwise $0$.

This formalization enables benchmarking of both diagnostic and prognostic capabilities of speech-derived models (Luz et al., 2021).
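The three targets can be made concrete in a few lines of Python. This is an illustrative sketch only; the function names and example MMSE values are hypothetical, not part of the challenge specification:

```python
def ad_label(is_ad: bool) -> int:
    """Task 1 target: 1 for AD, 0 for controls."""
    return int(is_ad)

def mmse_target(mmse_score: float) -> float:
    """Task 2 target: the ground-truth MMSE score (range 0-30)."""
    return float(mmse_score)

def decline_label(mmse_baseline: float, mmse_two_years: float) -> int:
    """Task 3 target: 1 if MMSE dropped by >= 5 points over two years."""
    return int(mmse_baseline - mmse_two_years >= 5)

# Example: a subject going from MMSE 26 to 20 counts as decline.
print(decline_label(26, 20))  # -> 1
print(decline_label(26, 24))  # -> 0
```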

2. Datasets and Preprocessing Protocols

The ADReSS 2020 Challenge leverages two distinct datasets, with rigorous standardization for age/gender balance via propensity-score matching:

  • Diagnosis & MMSE Dataset (Cookie Theft): Includes recordings from 242 participants (half AD, half controls; age 69–92), prompted to describe the "Cookie Theft" picture. Each session is 1–2 minutes, mono, 16 kHz. Data is split into 70% train and 30% test, maintaining demographic parity.
  • Prognosis (Semantic Fluency): Contains data from 105 AD-diagnosed subjects, each completing a 60 s semantic fluency task at baseline, with MMSE tracked over two years. The decline label is defined by an MMSE drop of $\geq 5$ points.

Preprocessing Steps:

  • Stationary noise removal via spectral subtraction.
  • RMS normalization for amplitude leveling.
  • Optional diarization segmentation (not mandatory).

Such preprocessing reduces confounds from recording artifacts and speaker variability, supporting reproducible machine learning pipelines (Luz et al., 2021).
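The two mandatory preprocessing steps can be sketched with NumPy. This is a deliberately simplified illustration, assuming a single noise estimate from the first few frames and non-overlapping rectangular windows; a production pipeline would use overlapping windows and a proper voice-activity detector:

```python
import numpy as np

def rms_normalize(x, target_rms=0.1):
    """Scale the waveform so its RMS amplitude matches target_rms."""
    rms = np.sqrt(np.mean(x**2))
    return x * (target_rms / (rms + 1e-12))

def spectral_subtraction(x, frame_len=512, noise_frames=10):
    """Simplified stationary-noise removal: estimate the noise magnitude
    spectrum from the first few frames (assumed speech-free) and subtract
    it from every frame, flooring negative magnitudes at zero."""
    n = len(x) // frame_len
    frames = x[: n * frame_len].reshape(n, frame_len)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise = mag[:noise_frames].mean(axis=0)   # stationary-noise estimate
    clean = np.maximum(mag - noise, 0.0)      # subtract, floor at 0
    out = np.fft.irfft(clean * np.exp(1j * phase), n=frame_len, axis=1)
    return out.ravel()

# Toy example: one second of 16 kHz noise.
rng = np.random.default_rng(0)
x = rms_normalize(rng.standard_normal(16000))
y = spectral_subtraction(x)
print(round(float(np.sqrt(np.mean(x**2))), 3))  # -> 0.1
```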

3. Feature Extraction Pipeline

The challenge specifies a dual-modality feature extraction workflow:

Acoustic Features

  • Frame slicing: 100 ms windows, no overlap.
  • eGeMAPS (88-dimensional): Extracts pitch, prosody (loudness, spectral flux), MFCCs, formant structure, and voice quality measures.
  • Active Data Representation (ADR): A self-organizing map (SOM) clusters framewise eGeMAPS vectors, yielding a per-recording cluster-occupancy histogram plus second-order (mean, variance) statistics within each cluster, which together form a compact acoustic vector $x_{AC}$.

Tools: openSMILE for eGeMAPS, custom MATLAB/Python code for ADR.
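The ADR idea can be sketched compactly; in this illustration KMeans stands in for the paper's self-organizing map, and random vectors stand in for framewise eGeMAPS features:

```python
import numpy as np
from sklearn.cluster import KMeans

def active_data_representation(frames, n_clusters=16):
    """Sketch of ADR: cluster framewise feature vectors (KMeans here,
    a SOM in the original method), then represent a recording by its
    cluster-occupancy histogram plus per-cluster mean/variance stats."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(frames)
    labels = km.predict(frames)
    hist = np.bincount(labels, minlength=n_clusters) / len(labels)
    stats = []
    for c in range(n_clusters):
        members = frames[labels == c]
        if len(members):
            stats.extend([members.mean(), members.var()])
        else:
            stats.extend([0.0, 0.0])
    return np.concatenate([hist, stats])

rng = np.random.default_rng(0)
frames = rng.standard_normal((500, 88))   # stand-in for eGeMAPS frames
x_ac = active_data_representation(frames)
print(x_ac.shape)  # -> (48,): 16 histogram bins + 16 * (mean, var)
```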

Linguistic Features

  • Automatic Speech Recognition: Google Cloud Speech API generates raw transcripts.
  • CHAT format and CLAN analysis: Morphological tagging (MOR), summary statistics (EVAL: speech rate, MLU, pauses), and lexical diversity via moving-average type-token ratio (MATTR).
  • Resulting linguistic feature vectors $x_{LIN}$, with 40+ dimensions, capture both surface statistics and morpho-lexical markers.

The pipeline is fully automated; no manual transcript curation is allowed (Luz et al., 2021).
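Of the linguistic measures, MATTR is simple enough to sketch directly: average the type-token ratio over every sliding window of fixed length. The toy transcript below is illustrative, not challenge data:

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean TTR over all sliding
    windows of `window` tokens (plain TTR if the text is shorter)."""
    if len(tokens) < window:
        return len(set(tokens)) / max(len(tokens), 1)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

transcript = "the boy is on the stool the stool is falling".split()
print(round(mattr(transcript, window=5), 3))  # -> 0.767
```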

4. Baseline Modeling Approaches

Baseline experiments employ classical ML models, implemented in MATLAB’s Statistics & ML Toolbox:

Classification (Tasks 1 & 3):

  • Linear Discriminant Analysis (LDA)
  • Decision Trees (DT, with leaf-size grid search and CART splits)
  • Support Vector Machine (SVM, linear kernel)
  • Random Forest (TreeBagger, 50 trees)
  • k-Nearest Neighbours (KNN)

Regression (Task 2):

  • Regression counterparts of the classifiers above, including Decision Tree regression and Support Vector Regression (SVR).

Parameter tuning adopts leave-one-subject-out cross-validation (LOSO-CV) on training data. Final evaluation is performed on a 30% held-out test split (Luz et al., 2021). Advanced approaches, e.g. deep learning-based fusion (BiLSTM, highway layers, gating), have improved upon these baselines by integrating acoustic, lexical, disfluency, and pause features in a unified architecture (Rohanian et al., 2021).
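The LOSO-CV protocol maps directly onto scikit-learn's `LeaveOneGroupOut` (here with one recording per subject, so it reduces to leave-one-out); the features and labels below are synthetic stand-ins:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))     # stand-in feature vectors
y = rng.integers(0, 2, 40)           # stand-in AD labels
subjects = np.arange(40)             # one recording per subject

# Each fold holds out every recording of exactly one subject.
logo = LeaveOneGroupOut()
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=logo, groups=subjects)
print(len(scores))  # -> 40 folds, one per held-out subject
```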

5. Evaluation Metrics and Baseline Performance

ADReSS adopts established quantitative metrics:

  • Classification: Accuracy, precision, recall, and macro-average $F_1$.
  • Regression: Root mean squared error (RMSE).
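These metrics map directly onto scikit-learn; the labels and MMSE values below are illustrative, not challenge results:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Classification metrics (Tasks 1 and 3)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))              # fraction correct
print(f1_score(y_true, y_pred, average="macro"))   # macro-averaged F1

# Regression metric (Task 2)
mmse_true = np.array([24.0, 18.0, 29.0])
mmse_pred = np.array([22.0, 20.0, 27.0])
rmse = np.sqrt(mean_squared_error(mmse_true, mmse_pred))
print(round(float(rmse), 2))  # -> 2.0
```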

Baseline Results Summary

| Task | Feature set   | Test Accuracy / RMSE | Best Model/Approach |
|------|---------------|----------------------|---------------------|
| 1    | Fusion (late) | 78.87% accuracy      | SVM, late fusion    |
| 2    | Linguistic    | 5.28 RMSE            | SVR                 |
| 3    | Linguistic    | 66.67% $F_1$         | Decision Tree       |

Fusion consistently provided marginal gains for AD diagnosis but not for decline prediction. Decision tree methods tended to overfit, indicated by higher cross-validation metrics than corresponding test-set results. Linguistic features derived from noisy ASR outputs were consistently predictive, indicating robustness of text-derived speech biomarkers even at high word error rates (Luz et al., 2021).
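Late fusion of this kind can be sketched at the score level: train one classifier per modality and combine their decision scores. The synthetic features and the simple score-averaging rule below are illustrative assumptions, not the challenge's exact fusion scheme:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 60
y = rng.integers(0, 2, n)
X_ac = rng.standard_normal((n, 10)) + y[:, None]   # stand-in acoustic features
X_lin = rng.standard_normal((n, 6)) + y[:, None]   # stand-in linguistic features

# Late fusion: one SVM per modality, decision scores averaged.
clf_ac = SVC(kernel="linear").fit(X_ac, y)
clf_lin = SVC(kernel="linear").fit(X_lin, y)
fused = 0.5 * (clf_ac.decision_function(X_ac) + clf_lin.decision_function(X_lin))
y_hat = (fused > 0).astype(int)
print(y_hat.shape)  # one fused prediction per recording
```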

Notably, BiLSTM-based multimodal fusion using gating and highway layers further increased AD classification accuracy to 84% and reduced MMSE RMSE to 4.26, underscoring the added value of integrating word-probability, disfluency, and pause features alongside acoustic inputs (Rohanian et al., 2021).

6. Robustness, Limitations, and Open Directions

Speech-derived features retain diagnostic signal even in the presence of noisy ASR outputs (WER ~33%) and acoustic degradation, as supported by consistent performance gains from fusion models equipped with gating mechanisms capable of down-weighting unreliable modalities. However, the cognitive decline (prognosis) task remains particularly challenging, with maximum $F_1$ of 67% in held-out evaluations, reflecting the subtlety of early disease trajectories.

Identified directions for advancement include:

  • Augmenting training data through realistic perturbations (noise, speed).
  • Employing domain-adapted ASR or end-to-end encoders (e.g., Wav2Vec2) to mitigate transcript noise.
  • Exploiting sequence models (RNN, Transformer) capable of capturing temporal speech dynamics.
  • Joint multi-task learning to leverage cross-task dependencies between diagnosis and MMSE prediction.
  • Enriching feature sets with prosodic dynamics ($\Delta$-features), deeper syntactic/semantic analysis, and speaker interaction statistics.
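The first direction, augmentation via realistic perturbations, can be sketched with NumPy alone. The noise-injection and linear-interpolation speed perturbation below are minimal illustrations; a real pipeline would typically use recorded noise samples and a proper resampler:

```python
import numpy as np

def add_noise(x, snr_db, rng):
    """Add white noise at a chosen signal-to-noise ratio (in dB)."""
    sig_pow = np.mean(x**2)
    noise_pow = sig_pow / (10 ** (snr_db / 10))
    return x + rng.standard_normal(len(x)) * np.sqrt(noise_pow)

def speed_perturb(x, factor):
    """Resample the waveform by linear interpolation to simulate
    faster (factor > 1) or slower (factor < 1) speech."""
    n_out = int(len(x) / factor)
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)

# Toy example: one second of a 220 Hz tone at 16 kHz.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noisy = add_noise(x, snr_db=10, rng=rng)
fast = speed_perturb(x, factor=1.1)
print(len(fast) < len(x))  # -> True (10% faster clip is shorter)
```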

Such approaches are anticipated to improve generalization, especially for the longitudinal prediction of disease progression (Luz et al., 2021, Rohanian et al., 2021).
