ADReSSo 2021: Alzheimer Speech Analysis Challenge

Updated 15 January 2026
  • The challenge’s main contribution is the evaluation of speech-only models for Alzheimer’s detection under realistic conditions: uncorrected ASR output, no manual annotation, and demographically balanced cohorts.
  • Participating systems integrate acoustic and linguistic features via BiLSTMs, gating mechanisms, and BERT-based fusion to classify AD, estimate cognitive scores, and predict longitudinal decline.
  • Results show that compact, speech-derived models achieve competitive performance, with up to 84% AD classification accuracy and robust MMSE regression, with practical implications for clinical assessment.

The ADReSSo 2021 Challenge is a benchmark-driven shared task in the field of computational paralinguistics and clinical speech analysis, targeting robust, fully automated detection and prediction of Alzheimer’s dementia (AD), cognitive screening scores, and cognitive decline using only spontaneous speech recordings. Originating as an evolution of the ADReSS Challenge (2020), ADReSSo defines rigorous, real-world conditions by providing uncorrected automatic speech recognition (ASR) outputs, strictly prohibiting manual transcriptions or segmentations, and enforcing strong demographic balancing (Rohanian et al., 2021, Luz et al., 2021).

1. Challenge Structure and Objectives

The ADReSSo 2021 Challenge specifies three primary prediction tasks:

  1. Alzheimer’s Dementia (AD) Classification: Binary classification to distinguish individuals with AD from healthy controls, using a single spontaneous Cookie Theft picture-description recording as input.
  2. Mini–Mental State Examination (MMSE) Score Inference: Regression to estimate each participant’s clinical cognitive score ($\hat y \in [0,30]$) from their diagnostic speech sample.
  3. Cognitive Decline Prediction: Binary prediction of significant longitudinal decline (defined as $\Delta \text{MMSE} = \text{MMSE}_\text{base} - \text{MMSE}_\text{year2} \ge 5$) using a baseline semantic category-fluency recording from AD patients.

Each task is formulated as learning a model, $f_{\text{AD}}(x)\in\{0,1\}$, $f_{\text{MMSE}}(x)\approx y$, or $f_{\text{Prog}}(x)\in\{0,1\}$, relying exclusively on automatically processed speech data (Luz et al., 2021).

2. Dataset Design and Preprocessing

The dataset comprises two matched cohorts:

  • Diagnostic set: 242 unique adults (balanced AD/control), each providing a spontaneous Cookie Theft recording.
  • Prognostic set: 105 AD patients, each with two years of longitudinal follow-up for decline assessment.

Organizers implemented propensity-score matching via a probit model, attaining exact balance in age and gender (standardized mean differences $< 0.001$). Each partition is split into training (≈70%) and test (≈30%) sets, yielding 169/73 subjects for the diagnostic set and 73/32 for the prognostic set (Luz et al., 2021).
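The matching step can be illustrated with a minimal sketch. The organizers fitted a probit model to obtain propensity scores; the sketch below assumes those scores are already computed and performs greedy 1:1 nearest-neighbour matching within a caliper. The data, caliper value, and function name are hypothetical.

```python
def propensity_match(cases, controls, caliper=0.1):
    """Greedy 1:1 nearest-neighbour matching on precomputed propensity
    scores (probability of AD given age and gender).
    `cases` and `controls` are lists of (id, score) pairs."""
    matches = []
    available = dict(controls)  # unmatched controls: id -> score
    for case_id, score in cases:
        # pick the unmatched control closest in propensity score
        best = min(available.items(),
                   key=lambda kv: abs(kv[1] - score),
                   default=None)
        if best and abs(best[1] - score) <= caliper:
            matches.append((case_id, best[0]))
            del available[best[0]]
    return matches
```

Exact balance as reported in the challenge would then be verified by recomputing standardized mean differences on the matched pairs.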

No manual transcript edits or hand-corrected segmentations are permitted; all analyses are restricted to automatically diarized voice-activity regions and ASR outputs (Google Cloud STT yields WER ≈ 32.8%) (Rohanian et al., 2021).
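The reported WER is the standard word-level edit distance normalized by reference length; a minimal, self-contained sketch of the metric (not the challenge’s own scoring code):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance (substitutions,
    insertions, deletions) divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)
```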

3. Feature Extraction Modalities

Acoustic Features

  • eGeMAPS: 88 low-level descriptors per 100 ms non-overlapping frame, including F0 semitone, loudness, MFCCs, jitter, shimmer, and formant statistics.
  • COVAREP (in advanced entries): 79 frame-level features at 100 Hz (prosodic, voice-quality, spectral measures), summarized by seven statistics per dimension (mean, max, min, median, std, skewness, kurtosis), normalized to zero mean/unit variance. Redundant features are discarded (Rohanian et al., 2021, Luz et al., 2021).
  • Active Data Representation (ADR): Frames are clustered using a self-organizing map (SOM); cluster transitions (first and second order) are encoded as global summary vectors (Luz et al., 2021).
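A simplified sketch of the ADR encoding step, assuming per-frame cluster assignments are already available (the challenge used a trained self-organizing map for this; SOM training is omitted here, and the function name and normalization are illustrative):

```python
from collections import Counter

def adr_vector(cluster_ids, n_clusters):
    """Active Data Representation sketch: encode normalized first- and
    second-order cluster-transition counts as one global summary vector
    of length 2 * n_clusters**2."""
    first = Counter(zip(cluster_ids, cluster_ids[1:]))
    second = Counter(zip(cluster_ids, cluster_ids[2:]))
    vec = []
    for counts, n_pairs in ((first, max(len(cluster_ids) - 1, 1)),
                            (second, max(len(cluster_ids) - 2, 1))):
        for a in range(n_clusters):
            for b in range(n_clusters):
                vec.append(counts[(a, b)] / n_pairs)
    return vec
```

The resulting fixed-length vector summarizes an arbitrarily long recording, which is what allows classical classifiers to operate on it.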

Linguistic Features

  • ASR Transcript Processing: Outputs passed to the CLAN toolkit for lexical and morphological annotation (MOR), summary measures (EVAL: utterances, type-token ratio, pauses), and MATTR calculation.
  • Augmentations (advanced):
    • 100-dimensional GloVe embeddings per word.
    • ASR language-model word probability, $p(w_t \mid w_{<t})$.
    • Disfluency tags (repair onset, edit term, fluent), generated by a left-to-right multi-task LSTM disfluency detector.
    • Pause duration features, quantifying unfilled pauses as short ([0.5,1.5)s) or long (≥1.5s), derived from ASR timestamps (Rohanian et al., 2021).
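The pause-duration features can be sketched directly from ASR word timestamps. The thresholds below are those stated above; the function name and the (start, end)-pair input format are assumptions.

```python
def pause_features(word_times):
    """Count short ([0.5, 1.5) s) and long (>= 1.5 s) unfilled pauses
    from ASR word timestamps, given as (start, end) pairs in seconds."""
    n_short = n_long = 0
    for (_, prev_end), (next_start, _) in zip(word_times, word_times[1:]):
        gap = next_start - prev_end
        if 0.5 <= gap < 1.5:
            n_short += 1
        elif gap >= 1.5:
            n_long += 1
    return {"short_pauses": n_short, "long_pauses": n_long}
```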

4. Baseline and Advanced Model Architectures

Classical Baselines

Baseline models encompass Decision Trees (DT), k-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Random Forest (RF), and Support Vector Machine/Regression (SVM/SVR). For regression (task 2), Linear Regression (LR) and Gaussian Process Regression (GP) are additionally used (Luz et al., 2021).

Deep Multimodal Fusion Models

  • BiLSTM-Highway-Gating Architecture: Parallel lexical and acoustic branches encode respective features; lexical input passes through two stacked BiLSTM layers (16 units, timestep=10, stride=2), acoustic input through four BiLSTM layers (256 units, timestep=20, stride=1). After per-modality encoding, a gating vector $g=\sigma(W_g [h^{\text{lex}}; h^{\text{ac}}] + b_g)$ adaptively fuses representations, allowing the model to down-weight noisy modalities (particularly under high WER). The fusion is expressed as $h^{\text{fused}} = g\odot h^{\text{lex}} + (1-g)\odot h^{\text{ac}}$.
  • Highway Layers: Three subsequent highway layers gate between transformed and carried features with transform gate $T(x) = \sigma(W_T x + b_T)$ and carry gate $1-T(x)$. Output heads are task-specific (sigmoid for classification, linear for regression).
  • BERT-based Fusion: Replaces the lexical BiLSTM with a fine-tuned “bert-large-uncased” encoder. “[CLS]” token embeddings are concatenated with pooled acoustic features, then passed through the same multimodal gating/highway architecture.
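The gated fusion step can be written out as a small sketch. This is the elementwise formula $h^{\text{fused}} = g\odot h^{\text{lex}} + (1-g)\odot h^{\text{ac}}$ with a learned gate; the toy dimensions and weights below are illustrative, not the published model’s parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(h_lex, h_ac, W_g, b_g):
    """Multimodal gating: g = sigmoid(W_g [h_lex; h_ac] + b_g),
    h_fused = g * h_lex + (1 - g) * h_ac (all elementwise).
    W_g is a (dim x 2*dim) matrix given as nested lists."""
    concat = h_lex + h_ac  # [h_lex; h_ac] concatenation
    g = [sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
         for row, b in zip(W_g, b_g)]
    return [gi * l + (1 - gi) * a
            for gi, l, a in zip(g, h_lex, h_ac)]
```

With zero weights the gate sits at 0.5 and the fusion is a plain average; training pushes $g$ toward whichever modality is more reliable at each dimension.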

All deep models use Adam optimization (learning rate $1\times10^{-4}$ for LSTM systems; $2\times10^{-5}$ for BERT), with hyperparameters selected by grid search. BERT models use batch size 4 over 8 epochs (Rohanian et al., 2021).

5. Evaluation Metrics and Empirical Results

Metrics

  • Classification:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$, $FN$ denote true/false positives/negatives.

  • Regression:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N (\hat y_i - y_i)^2}$$

with $y_i$ the true and $\hat y_i$ the predicted MMSE (Rohanian et al., 2021, Luz et al., 2021).
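Both metrics are straightforward to compute; a minimal sketch:

```python
import math

def accuracy(y_true, y_pred):
    """(TP + TN) / (TP + TN + FP + FN) for binary labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error, used here for MMSE regression."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))
```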

Baseline Results

| Task | Modality | Best Baseline | Accuracy (%) / RMSE |
|------|----------|---------------|---------------------|
| AD Classification | Late fusion | Decision Tree | 78.9 / — |
| MMSE Regression | Linguistic | SVR | — / 5.28 |
| Cognitive Decline | Linguistic | Decision Tree | 66.7 (F1) / — |

Acoustic-only models underperform linguistic and fused models, especially in cross-validation and on small prognostic sets. Even with noisy ASR, late fusion and feature aggregation facilitate performance surpassing simple acoustic pipelines (Luz et al., 2021).

Advanced System Performance

The multimodal BiLSTM+gating model achieves 84% accuracy on AD classification and RMSE=4.26 for MMSE regression. Ablation studies reveal successive gains: text-only BiLSTM yields 76%/5.31; addition of disfluency and pause features increases accuracy to 81% and lowers RMSE to 4.43; adding ASR word-probabilities achieves 77%/4.75; full multimodal gating gives 84%/4.26. Comparable BERT-gating models match RMSE=4.38 but entail a significantly greater parameter count (≈105M vs. 4.9M for BiLSTM-highway) (Rohanian et al., 2021).

| Model | AD Accuracy | MMSE RMSE |
|-------|-------------|-----------|
| BiLSTM+Gating | 84% | 4.26 |
| Text-only BiLSTM | 76% | 5.31 |
| Acoustic-only BiLSTM | 68% | 6.03 |
| BERT+Gating | — | 4.38 |
| Fine-tuned BERT | 80% | 4.49 |

6. Key Insights and Methodological Implications

Multimodal gating mechanisms consistently enhance resilience to noisy inputs, suppressing irrelevant or error-prone features and exploiting complementary information from lexical predictability, disfluency patterns, and pause structure. Inclusion of ASR word-probabilities, disfluency features, and pause durations, even in the face of substantial word error rates, yields robust predictive performance. Compact BiLSTM-highway architectures provide competitive or superior results relative to much larger BERT-based systems while maintaining orders-of-magnitude parameter efficiency (Rohanian et al., 2021).

A plausible implication is that scalable, speech-only AD diagnostic systems should employ learned fusion/gating to dynamically accommodate variable transcription quality and noise.

7. Significance and Future Directions

ADReSSo 2021 constitutes the first shared benchmark requiring strictly “speech-only” processing, disallowing manual annotation and targeting both diagnostic and longitudinal prognostic prediction. The challenge catalyzes methodological rigor by enforcing demographic balance, real-world ASR conditions, and multimodal feature exploitation.

Recommendations for subsequent research include: fusing ASR-based embeddings with word-probabilities instead of “clean” transcripts; systematic extraction of disfluency/pause features to capture AD-specific hesitation; deploying gating-based fusion for time-step-level noise adaptation; and leveraging compact, information-rich architectures suitable for practical applications (Rohanian et al., 2021, Luz et al., 2021).

The challenge data and evolving methodological insights situate ADReSSo as a reference platform for future work on automated, scalable, and generalizable cognitive assessment and monitoring.
