Hierarchical-CTC Pre-training in ASR
- Hierarchical-CTC pre-training is a strategy that applies auxiliary CTC losses at intermediate layers to learn multi-scale linguistic representations.
- It employs progressive supervision using fine-to-coarse subword vocabularies, thereby mitigating the abstraction gap in end-to-end ASR.
- Empirical results show reduced word error rates and improved data efficiency, particularly in limited-resource scenarios.
Hierarchical-CTC pre-training refers to a family of strategies that employ auxiliary Connectionist Temporal Classification (CTC) losses at intermediate layers of a deep neural encoder, with each auxiliary loss supervising the model at a different linguistic granularity. The method is motivated by the abstraction gap in end-to-end automatic speech recognition (ASR), where directly predicting word-level sequences from speech signals poses a significant representational challenge. Hierarchical-CTC pre-training uses losses over progressively coarser subword vocabularies at deeper encoder layers, often with explicit conditioning dependencies, to guide the model toward learning effective multi-scale linguistic representations. Empirical results demonstrate improved downstream recognition performance and data efficiency, especially under limited-resource scenarios (Higuchi et al., 2021, Krishna et al., 2018).
1. Architectural Principles
Hierarchical-CTC models are typically built on deep encoders (e.g., Transformers, Conformers, bidirectional LSTMs), where input speech representations pass through several stacked layers. Key architectural dimensions include:
- Input and Front-End: Pre-processed features frequently involve mel-filterbanks (e.g., 80-dim) and pitch (e.g., 3-dim), optionally augmented with CNN subsampling, SpecAugment, and speed perturbation (Higuchi et al., 2021).
- Encoder: For example, an 18-layer Transformer or Conformer encoder (feed-forward dimension $2048$) can be employed (Higuchi et al., 2021); bidirectional LSTMs (e.g., 5 layers, 320 units/layer/direction) are used in baseline multitask settings (Krishna et al., 2018).
- CTC Branches: Multiple CTC “heads” branch from different encoder layers. Each branch consists of a linear projection followed by a softmax over its specific vocabulary.
A distinctive aspect of hierarchical-CTC as introduced in (Higuchi et al., 2021) is the conditioning mechanism: the prediction at each higher-level branch is explicitly conditioned on posterior distributions from preceding, lower-level branches. Concretely, with encoder output $X^{(l)}$ at an intermediate layer $l$, the features passed to the next encoder block are

$$X^{(l)} \leftarrow X^{(l)} + \mathrm{Linear}\big(Z^{(l)}\big),$$

where $Z^{(l)} = \mathrm{Softmax}\big(\mathrm{Linear}(X^{(l)})\big)$ is the lower-level branch posterior. This residual conditioning enforces a dependency between successive subword granularities.
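As a rough sketch of this conditioning step (shapes and weight names here are hypothetical; real models fold the projections into the encoder blocks and train them jointly):

```python
# Sketch of the residual conditioning between two CTC branches.
# d_model and V_lo are toy sizes; W_out / W_in stand in for the
# branch's output projection and the posterior-to-feature mapping.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conditioned_features(X, W_out, W_in):
    """X: (T, d_model) encoder output at the lower branch.
    W_out: (d_model, V_lo) projection onto the lower-level vocabulary.
    W_in: (V_lo, d_model) maps posteriors back to the feature space.
    Returns the features fed to the next encoder block, conditioned
    on the lower-level CTC posteriors via a residual connection."""
    Z = softmax(X @ W_out)   # frame-level posteriors over V_lo
    return X + Z @ W_in      # residual conditioning

rng = np.random.default_rng(0)
T, d_model, V_lo = 50, 8, 16
X = rng.standard_normal((T, d_model))
W_out = rng.standard_normal((d_model, V_lo))
W_in = rng.standard_normal((V_lo, d_model))
X_cond = conditioned_features(X, W_out, W_in)  # same shape as X
```

The residual form keeps the encoder's feature space intact while injecting the lower-level prediction, which is what lets deeper branches treat fine-grained posteriors as building blocks.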
2. Loss Formulation and Optimization
The hierarchical-CTC loss framework introduces $K$ separate CTC objectives at encoder layers, each predicting a sequence at a different subword granularity. For an $N$-layer encoder, losses are attached at layers $l_k = \lfloor kN/K \rfloor$ for $k = 1, \dots, K$.
The per-head CTC loss

$$\mathcal{L}_k = -\log P_{\mathrm{ctc}}\big(Y_k \mid X^{(l_k)}, Y_{<k}\big)$$

marginalizes over all valid alignments of $Y_k$ given the layer-$l_k$ encoder output $X^{(l_k)}$ and the previously predicted sequences $Y_{<k} = (Y_1, \dots, Y_{k-1})$.
The total hierarchical CTC loss is the average:

$$\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}_k.$$
In the self-conditioned intermediate-CTC (SC-CTC) paradigm, identical vocabularies are used at all CTC heads, with each loss applied individually; "parallel CTC" (ParaCTC) applies all losses at the topmost layer without lower-level conditioning (Higuchi et al., 2021).
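The marginalization over alignments can be made concrete with a minimal NumPy sketch: the classic CTC forward (alpha) recursion for one head, and the averaged multi-head loss on top. Dimensions and targets below are toy values, and real systems would use a framework's batched CTC implementation rather than this loop:

```python
# Minimal CTC negative log-likelihood via the forward (alpha) recursion,
# plus the averaged hierarchical loss over K heads.
import numpy as np

def ctc_nll(log_probs, target, blank=0):
    """NLL of `target` under CTC, summing over all valid alignments.
    log_probs: (T, V) per-frame log-posteriors; target: label id list."""
    ext = [blank]                       # extended target: b y1 b y2 b ...
    for y in target:
        ext += [y, blank]
    S, T = len(ext), log_probs.shape[0]
    NEG = -1e30                         # stand-in for log(0)
    alpha = np.full((T, S), NEG)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                   # stay
            if s > 0:
                cands.append(alpha[t - 1, s - 1])       # advance
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])       # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    total = np.logaddexp(alpha[-1, -1], alpha[-1, -2])  # end on label or blank
    return -total

def hierarchical_ctc_loss(head_log_probs, head_targets):
    """Average of the K per-head CTC losses, one head per granularity."""
    losses = [ctc_nll(lp, y) for lp, y in zip(head_log_probs, head_targets)]
    return sum(losses) / len(losses)

# Toy check: 2 frames, vocab {blank, 1}, uniform posteriors, target [1].
# Alignments collapsing to [1]: (1,1), (1,b), (b,1) -> P = 3 * 0.25 = 0.75.
lp = np.log(np.full((2, 2), 0.5))
loss = ctc_nll(lp, [1])               # -log(0.75)
```

Note that this sketch omits the conditioning on $Y_{<k}$ (which in the actual model enters through the encoder features, not through the loss itself).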
In hierarchical multitask learning for CTC (Krishna et al., 2018), the objective is interpolated between main and auxiliary tasks:

$$\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{main}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{aux}},$$

where $\lambda \in [0, 1]$.
3. Multi-Granular Subword Supervision
Subword unit vocabularies are constructed from training transcripts using algorithms such as SentencePiece BPE. Supervision proceeds from finer-grained tokens at shallower layers to coarser-grained tokens at deeper layers:
- Example (LibriSpeech-100h & TEDLIUM2), with vocabulary size increasing at deeper branches:
  - a small vocabulary at the shallowest branch (character/short n-gram level)
  - an intermediate-size BPE vocabulary at the middle branch
  - a large BPE vocabulary (approaching word level) at the deepest branch
This hierarchy balances sequence length versus target sparsity—coarse units (words) present alignment challenges and sparsity, while fine-grained units retain acoustic proximity but are compositional.
A plausible implication is that the use of multi-granular subwords allows explicit guidance in forming word-level representations, thereby reducing error rates in low-resource setups by mitigating word sparsity (Higuchi et al., 2021).
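The length-versus-sparsity trade-off can be seen on a toy transcript; the units below are purely illustrative, not the actual SentencePiece vocabularies from the cited papers:

```python
# Toy illustration of the trade-off: finer units give longer sequences
# over a tiny vocabulary; coarser units give short sequences over a
# large, sparse vocabulary. The subword split is hand-made, not BPE.
sentence = "the cat sat"
chars = list(sentence.replace(" ", "_"))    # fine: 11 tokens, ~30-symbol vocab
subwords = ["the", "_c", "at", "_s", "at"]  # mid: 5 tokens, few-thousand vocab
words = sentence.split()                    # coarse: 3 tokens, word-size vocab
assert len(chars) > len(subwords) > len(words)
```

Shallow branches thus see dense, acoustically local targets, while deep branches see short targets whose types may each occur only a handful of times in a 100h training set.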
4. Training Procedure and Hyperparameterization
The primary training loop applies the Adam optimizer with learning rate schedules that typically feature warmup, together with checkpoint averaging (Higuchi et al., 2021; Krishna et al., 2018). Data augmentation strategies, such as SpecAugment and speed perturbation, expand the effective dataset.
For pre-training in hierarchical multitask CTC (Krishna et al., 2018), phone-level CTC heads are trained on a partial encoder (e.g., 3 or 4 layers). The resulting parameters initialize the lower part of a full encoder, which is then trained with the full loss (possibly in combination with auxiliary losses and interpolation). In all cases, decoding uses greedy search; no external language model or beam search is used during evaluation in (Higuchi et al., 2021).
Batch sizes, learning rate schedules, and early stopping are managed using standard configurations (e.g., batch sizes 128→32, learning rate halving if WER fails to improve after three checkpoints, up to 100 epochs) (Krishna et al., 2018).
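The pre-train-then-transfer step above can be sketched with plain dictionaries standing in for real framework parameter containers (layer names and counts are illustrative):

```python
# Sketch of initializing a full encoder from a phone-CTC pre-trained
# partial encoder. In a real system these would be framework state
# dicts; here each "layer" is just a parameter dict.
def init_from_pretrained(full_encoder, partial_encoder, n_shared_layers):
    """Copy the first n_shared_layers of the pre-trained partial encoder
    into the lower part of the full encoder; upper layers keep their
    fresh initialization and are trained with the full loss."""
    for i in range(n_shared_layers):
        full_encoder[f"layer_{i}"] = dict(partial_encoder[f"layer_{i}"])
    return full_encoder

partial = {f"layer_{i}": {"w": i} for i in range(4)}   # 4-layer pre-trained stack
full = {f"layer_{i}": {"w": -1} for i in range(6)}     # 6-layer full encoder
full = init_from_pretrained(full, partial, n_shared_layers=4)
```

The key design choice is that only the lower layers, which the phone-level task supervised directly, are transferred; the upper layers start fresh so they can specialize to the coarser main task.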
5. Empirical Results
Hierarchical-CTC pre-training delivers consistent improvements over standard CTC and alternative multitask paradigms. Representative word error rates (WER) without a language model or beam search are:
| Model | LS-100 dev-clean/other | LS-960 dev-clean/other | TED2 dev/test |
|---|---|---|---|
| CTC | 11.5 / 24.8 | 4.2 / 10.0 | 11.8 / 10.7 |
| SC-CTC | 8.9 / 21.0 | 3.2 / 8.2 | 9.4 / 8.6 |
| ParaCTC | 10.4 / 24.0 | 4.6 / 10.3 | 10.9 / 10.2 |
| HC-CTC (ours) | 8.2 / 19.9 | 3.1 / 8.0 | 9.1 / 8.6 |
- On LibriSpeech-100h, HC-CTC reduces dev-clean WER from 11.5% to 8.2% (~29% relative) vs. CTC, and from 10.4% to 8.2% (~21% relative) vs. ParaCTC.
- On LibriSpeech-960h and TEDLIUM2, HC-CTC slightly outperforms SC-CTC and offers faster training/inference.
Ablations reveal that multi-granular subword vocabularies (progressively larger vocabularies at deeper heads) outperform single-granularity settings (e.g., a $16$k vocabulary at every head). A conditioning ablation confirms the importance of explicit dependency; removing conditioning increases WER (8.7% vs. 8.2% on the LS-100 dev set).
In hierarchical multitask CTC (Krishna et al., 2018), combining pre-training with an auxiliary phone-level CTC loss at an intermediate encoder layer and interpolated loss weighting yields the best overall results, with a 3.4% absolute WER reduction on Eval2000.
6. Representation Learning and Downstream Impact
Analysis of model behavior highlights that hierarchical supervision bridges the abstraction gap from acoustic to character, subword, and word-level representations. Conditioning on fine-grained token posteriors provides flexible building blocks for constructing coarse-grained (word) predictions and relaxes the conventional CTC conditional independence assumption (Higuchi et al., 2021).
HC-CTC training reliably sharpens CTC alignments, especially for complex or rare word sequences. Attention visualization shows that lower-level posteriors are leveraged for improved alignment.
A plausible implication is that this lightweight, generic pre-training method can bootstrap any deep encoder structure with improved intermediate representations for word-level recognition, providing better initializations for further fine-tuning (e.g., with attention-based decoders) (Higuchi et al., 2021). In resource-constrained settings, this approach offers significant convergence and stability benefits (Krishna et al., 2018).
7. Extensions and Open Directions
Potential developments include:
- Autoregressive Decoding: Stacking an attention-based decoder atop the HC-CTC encoder for fully end-to-end training.
- Acoustic Units: Replacing text-based subwords with acoustically discovered units (e.g., via self-supervised or articulatory embeddings) at lower branches.
- Dynamic Granularity or Loss Weighting: Adapting the number of hierarchical steps and weighting based on utterance length or domain to improve generalization under domain shift.
- Extremely Low-Resource Regimes: In multitask CTC, the optimal auxiliary loss position shifts; standard MTL (auxiliary at top) is best with least data, while intermediate layers are optimal with more data (Krishna et al., 2018).
Hierarchical-CTC pre-training remains a flexible and effective option for leveraging linguistic structure in end-to-end ASR systems, particularly as model depth and task complexity increase (Higuchi et al., 2021, Krishna et al., 2018).