
ADReSS 2020 Challenge

Updated 15 January 2026
  • The paper introduces a challenge that benchmarks automated speech-derived features for Alzheimer’s disease detection, MMSE score inference, and cognitive decline prediction.
  • It employs rigorous preprocessing, dual-modality feature extraction, and baseline ML models to ensure reproducible and clinically viable outcomes.
  • Results indicate that fusion and deep learning approaches enhance diagnostic accuracy despite challenges from noisy ASR outputs and acoustic variability.

The ADReSS 2020 Challenge is a benchmark initiative focused on the automatic detection and prognosis of Alzheimer's Dementia (AD) using only spontaneously produced speech. The challenge formalizes three core prediction tasks—binary Alzheimer’s disease recognition, inference of Mini-Mental State Examination (MMSE) scores, and prediction of future cognitive decline—each evaluated under rigorous, pre-defined test conditions and using strictly speech-derived features. The ADReSS framework emphasizes fully automated processing pipelines, specifically prohibiting manual annotation interventions, thus targeting the development of clinically viable machine learning solutions for dementia research (Luz et al., 2021).

1. Formal Statement of Prediction Tasks

ADReSS 2020 operationalizes three supervised learning problems. Input feature vectors $x \in \mathbb{R}^d$ are extracted algorithmically from audio recordings. Output variables vary by task:

  • Task 1 (AD Classification): Given $x$, learn a classifier $f_1 : \mathbb{R}^d \rightarrow \{0,1\}$ with $\hat{y}_{AD} = f_1(x) \approx y_{AD}$, where $y_{AD} = 1$ for AD and $0$ for controls.
  • Task 2 (MMSE Regression): Learn a regression function $f_2 : \mathbb{R}^d \rightarrow \mathbb{R}$ such that $\hat{y}_{MMSE} = f_2(x) \approx y_{MMSE}$, where $y_{MMSE}$ is the ground-truth Mini-Mental State Examination score.
  • Task 3 (Cognitive Decline Prediction): Learn a classifier $f_3 : \mathbb{R}^d \rightarrow \{0,1\}$; for input $x$ at baseline, $\hat{y}_{PROG} = f_3(x) \approx y_{PROG}$, where $y_{PROG} = 1$ if $\Delta\mathrm{MMSE} \geq 5$ over two years, otherwise $0$.

This formalization enables benchmarking of both diagnostic and prognostic capabilities of speech-derived models (Luz et al., 2021).
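The three targets can be made concrete in a few lines of Python. This is an illustrative sketch only; the function names and example MMSE values are hypothetical, not part of the challenge specification:

```python
def ad_label(is_ad: bool) -> int:
    """Task 1 target: 1 for AD, 0 for controls."""
    return int(is_ad)

def mmse_target(mmse_score: float) -> float:
    """Task 2 target: the ground-truth MMSE score (range 0-30)."""
    return float(mmse_score)

def decline_label(mmse_baseline: float, mmse_two_years: float) -> int:
    """Task 3 target: 1 if MMSE dropped by >= 5 points over two years."""
    return int(mmse_baseline - mmse_two_years >= 5)

# Example: a subject going from MMSE 26 to 20 counts as decline.
print(decline_label(26, 20))  # -> 1
print(decline_label(26, 24))  # -> 0
```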

2. Datasets and Preprocessing Protocols

The ADReSS 2020 Challenge leverages two distinct datasets, with rigorous standardization for age/gender balance via propensity-score matching:

  • Diagnosis & MMSE Dataset (Cookie Theft): Includes recordings from 242 participants (half AD, half controls; age 69–92), prompted to describe the "Cookie Theft" picture. Each session is 1–2 minutes, mono, 16 kHz. Data is split into 70% train and 30% test, maintaining demographic parity.
  • Prognosis (Semantic Fluency): Contains data from 105 AD-diagnosed subjects, each completing a 60 s semantic fluency task at baseline, with MMSE tracked over two years. The decline label is defined by an MMSE drop of $\geq 5$ points.

Preprocessing Steps:

  • Stationary noise removal via spectral subtraction.
  • RMS normalization for amplitude leveling.
  • Optional diarization segmentation (not mandatory).

Such preprocessing reduces confounds from recording artifacts and speaker variability, supporting reproducible machine learning pipelines (Luz et al., 2021).
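The two mandatory preprocessing steps can be sketched with NumPy. This is a deliberately simplified illustration, assuming a single noise estimate from the first few frames and non-overlapping rectangular windows; a production pipeline would use overlapping windows and a proper voice-activity detector:

```python
import numpy as np

def rms_normalize(x, target_rms=0.1):
    """Scale the waveform so its RMS amplitude matches target_rms."""
    rms = np.sqrt(np.mean(x**2))
    return x * (target_rms / (rms + 1e-12))

def spectral_subtraction(x, frame_len=512, noise_frames=10):
    """Simplified stationary-noise removal: estimate the noise magnitude
    spectrum from the first few frames (assumed speech-free) and subtract
    it from every frame, flooring negative magnitudes at zero."""
    n = len(x) // frame_len
    frames = x[: n * frame_len].reshape(n, frame_len)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise = mag[:noise_frames].mean(axis=0)   # stationary-noise estimate
    clean = np.maximum(mag - noise, 0.0)      # subtract, floor at 0
    out = np.fft.irfft(clean * np.exp(1j * phase), n=frame_len, axis=1)
    return out.ravel()

# Toy example: one second of 16 kHz noise.
rng = np.random.default_rng(0)
x = rms_normalize(rng.standard_normal(16000))
y = spectral_subtraction(x)
print(round(float(np.sqrt(np.mean(x**2))), 3))  # -> 0.1
```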

3. Feature Extraction Pipeline

The challenge specifies a dual-modality feature extraction workflow:

Acoustic Features

  • Frame slicing: 100 ms windows, no overlap.
  • eGeMAPS (88-dimensional): Extracts pitch, prosody (loudness, spectral flux), MFCCs, formant structure, and voice quality measures.
  • Active Data Representation (ADR): A self-organizing map (SOM) clusters framewise eGeMAPS vectors, yielding a per-recording cluster-occupancy histogram plus second-order (mean, variance) statistics within each cluster, which together form a compact acoustic vector $x_{AC}$.

Tools: openSMILE for eGeMAPS, custom MATLAB/Python code for ADR.
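The ADR idea can be sketched compactly; in this illustration KMeans stands in for the paper's self-organizing map, and random vectors stand in for framewise eGeMAPS features:

```python
import numpy as np
from sklearn.cluster import KMeans

def active_data_representation(frames, n_clusters=16):
    """Sketch of ADR: cluster framewise feature vectors (KMeans here,
    a SOM in the original method), then represent a recording by its
    cluster-occupancy histogram plus per-cluster mean/variance stats."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(frames)
    labels = km.predict(frames)
    hist = np.bincount(labels, minlength=n_clusters) / len(labels)
    stats = []
    for c in range(n_clusters):
        members = frames[labels == c]
        if len(members):
            stats.extend([members.mean(), members.var()])
        else:
            stats.extend([0.0, 0.0])
    return np.concatenate([hist, stats])

rng = np.random.default_rng(0)
frames = rng.standard_normal((500, 88))   # stand-in for eGeMAPS frames
x_ac = active_data_representation(frames)
print(x_ac.shape)  # -> (48,): 16 histogram bins + 16 * (mean, var)
```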

Linguistic Features

  • Automatic Speech Recognition: Google Cloud Speech API generates raw transcripts.
  • CHAT format and CLAN analysis: Morphological tagging (MOR), summary statistics (EVAL: speech rate, MLU, pauses), and lexical diversity via moving-average type-token ratio (MATTR).
  • Resulting linguistic feature vectors $x_{LIN}$, with 40+ dimensions, capture both surface statistics and morpho-lexical markers.

The pipeline is fully automated; no manual transcript curation is allowed (Luz et al., 2021).
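Of the linguistic measures, MATTR is simple enough to sketch directly: average the type-token ratio over every sliding window of fixed length. The toy transcript below is illustrative, not challenge data:

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean TTR over all sliding
    windows of `window` tokens (plain TTR if the text is shorter)."""
    if len(tokens) < window:
        return len(set(tokens)) / max(len(tokens), 1)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

transcript = "the boy is on the stool the stool is falling".split()
print(round(mattr(transcript, window=5), 3))  # -> 0.767
```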

4. Baseline Modeling Approaches

Baseline experiments employ classical ML models, implemented in MATLAB’s Statistics & ML Toolbox:

Classification (Tasks 1 & 3):

  • Linear Discriminant Analysis (LDA)
  • Decision Trees (DT, with leaf-size grid search and CART splits)
  • Support Vector Machine (SVM, linear kernel)
  • Random Forest (TreeBagger, 50 trees)
  • k-Nearest Neighbours (KNN)

Regression (Task 2):

  • Regression counterparts of the classifiers above, including Decision Tree regression and Support Vector Regression (SVR).

Parameter tuning adopts leave-one-subject-out cross-validation (LOSO-CV) on training data. Final evaluation is performed on a 30% held-out test split (Luz et al., 2021). Advanced approaches, e.g. deep learning-based fusion (BiLSTM, highway layers, gating), have improved upon these baselines by integrating acoustic, lexical, disfluency, and pause features in a unified architecture (Rohanian et al., 2021).
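The LOSO-CV protocol maps directly onto scikit-learn's `LeaveOneGroupOut` (here with one recording per subject, so it reduces to leave-one-out); the features and labels below are synthetic stand-ins:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 8))     # stand-in feature vectors
y = rng.integers(0, 2, 40)           # stand-in AD labels
subjects = np.arange(40)             # one recording per subject

# Each fold holds out every recording of exactly one subject.
logo = LeaveOneGroupOut()
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=logo, groups=subjects)
print(len(scores))  # -> 40 folds, one per held-out subject
```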

5. Evaluation Metrics and Baseline Performance

ADReSS adopts established quantitative metrics:

  • Classification: Accuracy, precision, recall, and macro-average $F_1$.
  • Regression: Root mean squared error (RMSE).
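These metrics map directly onto scikit-learn; the labels and MMSE values below are illustrative, not challenge results:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Classification metrics (Tasks 1 and 3)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))              # fraction correct
print(f1_score(y_true, y_pred, average="macro"))   # macro-averaged F1

# Regression metric (Task 2)
mmse_true = np.array([24.0, 18.0, 29.0])
mmse_pred = np.array([22.0, 20.0, 27.0])
rmse = np.sqrt(mean_squared_error(mmse_true, mmse_pred))
print(round(float(rmse), 2))  # -> 2.0
```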

Baseline Results Summary

| Task | Feature set   | Test Accuracy / RMSE | Best Model/Approach |
|------|---------------|----------------------|---------------------|
| 1    | Fusion (late) | 78.87% accuracy      | SVM, late fusion    |
| 2    | Linguistic    | 5.28 RMSE            | SVR                 |
| 3    | Linguistic    | 66.67% $F_1$         | Decision Tree       |

Fusion consistently provided marginal gains for AD diagnosis but not for decline prediction. Decision tree methods tended to overfit, indicated by higher cross-validation metrics than corresponding test-set results. Linguistic features derived from noisy ASR outputs were consistently predictive, indicating robustness of text-derived speech biomarkers even at high word error rates (Luz et al., 2021).
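Late fusion of this kind can be sketched at the score level: train one classifier per modality and combine their decision scores. The synthetic features and the simple score-averaging rule below are illustrative assumptions, not the challenge's exact fusion scheme:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 60
y = rng.integers(0, 2, n)
X_ac = rng.standard_normal((n, 10)) + y[:, None]   # stand-in acoustic features
X_lin = rng.standard_normal((n, 6)) + y[:, None]   # stand-in linguistic features

# Late fusion: one SVM per modality, decision scores averaged.
clf_ac = SVC(kernel="linear").fit(X_ac, y)
clf_lin = SVC(kernel="linear").fit(X_lin, y)
fused = 0.5 * (clf_ac.decision_function(X_ac) + clf_lin.decision_function(X_lin))
y_hat = (fused > 0).astype(int)
print(y_hat.shape)  # one fused prediction per recording
```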

Notably, BiLSTM-based multimodal fusion using gating and highway layers further increased AD classification accuracy to 84% and reduced MMSE RMSE to 4.26, underscoring the added value of integrating word-probability, disfluency, and pause features alongside acoustic inputs (Rohanian et al., 2021).

6. Robustness, Limitations, and Open Directions

Speech-derived features retain diagnostic signal even in the presence of noisy ASR outputs (WER ~33%) and acoustic degradation, as supported by consistent performance gains from fusion models equipped with gating mechanisms capable of down-weighting unreliable modalities. However, the cognitive decline (prognosis) task remains particularly challenging, with maximum $F_1$ of 67% in held-out evaluations, reflecting the subtlety of early disease trajectories.

Identified directions for advancement include:

  • Augmenting training data through realistic perturbations (noise, speed).
  • Employing domain-adapted ASR or end-to-end encoders (e.g., Wav2Vec2) to mitigate transcript noise.
  • Exploiting sequence models (RNN, Transformer) capable of capturing temporal speech dynamics.
  • Joint multi-task learning to leverage cross-task dependencies between diagnosis and MMSE prediction.
  • Enriching feature sets with prosodic dynamics ($\Delta$-features), deeper syntactic/semantic analysis, and speaker interaction statistics.
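The first direction, augmentation via realistic perturbations, can be sketched with NumPy alone. The noise-injection and linear-interpolation speed perturbation below are minimal illustrations; a real pipeline would typically use recorded noise samples and a proper resampler:

```python
import numpy as np

def add_noise(x, snr_db, rng):
    """Add white noise at a chosen signal-to-noise ratio (in dB)."""
    sig_pow = np.mean(x**2)
    noise_pow = sig_pow / (10 ** (snr_db / 10))
    return x + rng.standard_normal(len(x)) * np.sqrt(noise_pow)

def speed_perturb(x, factor):
    """Resample the waveform by linear interpolation to simulate
    faster (factor > 1) or slower (factor < 1) speech."""
    n_out = int(len(x) / factor)
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)

# Toy example: one second of a 220 Hz tone at 16 kHz.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noisy = add_noise(x, snr_db=10, rng=rng)
fast = speed_perturb(x, factor=1.1)
print(len(fast) < len(x))  # -> True (10% faster clip is shorter)
```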

Such approaches are anticipated to improve generalization, especially for the longitudinal prediction of disease progression (Luz et al., 2021, Rohanian et al., 2021).
