ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition

Published 17 Feb 2025 in cs.SD, cs.AI, cs.CV, cs.IR, and cs.LG | (2502.11840v1)

Abstract: Chord recognition serves as a critical task in music information retrieval due to the abstract and descriptive nature of chords in music analysis. While audio chord recognition systems have achieved significant accuracy for small vocabularies (e.g., major/minor chords), large-vocabulary chord recognition remains a challenging problem. This complexity also arises from the inherent long-tail distribution of chords, where rare chord types are underrepresented in most datasets, leading to insufficient training samples. Effective chord recognition requires leveraging contextual information from audio sequences, yet existing models, such as combinations of convolutional neural networks, bidirectional long short-term memory networks, and bidirectional transformers, face limitations in capturing long-term dependencies and exhibit suboptimal performance on large-vocabulary chord recognition tasks. This work proposes ChordFormer, a novel conformer-based architecture designed to tackle structural chord recognition (e.g., triads, bass, sevenths) for large vocabularies. ChordFormer leverages conformer blocks that integrate convolutional neural networks with transformers, thus enabling the model to capture both local patterns and global dependencies effectively. By addressing challenges such as class imbalance through a reweighted loss function and structured chord representations, ChordFormer outperforms state-of-the-art models, achieving a 2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy on large-vocabulary chord datasets. Furthermore, ChordFormer excels in handling class imbalance, providing robust and balanced recognition across chord types. This approach bridges the gap between theoretical music knowledge and practical applications, advancing the field of large-vocabulary chord recognition.

Abstract PDF Upgrade to Chat

Summary

The paper introduces ChordFormer, a novel conformer-based architecture that integrates CNNs and self-attention for improved large-vocabulary audio chord recognition.
It employs a reweighted loss function and CRF decoding to mitigate class imbalance and produce coherent chord sequences.
Evaluations demonstrate that ChordFormer outperforms CNN, CNN+BLSTM, and transformer models, achieving up to 6% higher class-wise accuracy on rare chord types.

Summary of "ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition"

Introduction

The paper "ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition" (2502.11840) introduces a novel conformer-based architecture designed to tackle large-vocabulary audio chord recognition. This task is critical for Music Information Retrieval (MIR) as it involves decoding audio into sequences of chord labels. Despite advancements in chord recognition systems for major/minor chords, recognizing chords in a large vocabulary presents challenges due to data imbalance and limited training samples for rare chord types. Previous methods combining convolutional neural networks (CNNs), bidirectional long short-term memory networks (BLSTMs), and transformers demonstrate limitations in capturing long-term dependencies necessary for complex chord structures, such as triads, bass, and sevenths.

Methodology

ChordFormer Architecture Overview

The ChordFormer architecture incorporates three key modules: a preprocessing module for converting audio signals into Constant-Q Transform (CQT) representations, a conformer block to process CQT features using convolutional layers and self-attention mechanisms, and a decoding model to interpret activations and generate chord sequences.

Figure 1: Overview of the ChordFormer Architecture.

Feature Extraction with Conformer Blocks

The core of ChordFormer is the conformer block, which integrates CNNs and self-attention mechanisms to balance local pattern recognition and global dependency modeling. The multi-head self-attention (MHSA) module offers advantages in dynamically adjusting its focus on relevant frames, capturing long-range dependencies more effectively than traditional RNNs or CNNs. A convolution module captures local feature information, and a feedforward block enhances expressive capacity using Swish activation and residual connections for better gradient flow.

Figure 2: Architecture Overview of Key Modules in ChordFormer Model.

Loss Function and Class Imbalance

To tackle class imbalance, ChordFormer employs a reweighted loss function, assigning higher weights to underrepresented chord classes. This approach ensures balanced learning across a wide range of chords, enhancing recognition of rare chord types.

Chord Sequence Decoding

ChordFormer utilizes Conditional Random Field (CRF) decoding to produce coherent chord sequences, penalizing excessive transitions and enabling control over output chord vocabulary. This improves temporal smoothness and accuracy in predictions.

Results

Evaluation Metrics

ChordFormer was evaluated using various metrics, including Weighted Chord Symbol Recall (WCSR), frame-wise accuracy ($acc_{\text{frame}$), and class-wise accuracy ($acc_{\text{class}$). The model achieves significant improvements with a 2% increase in frame-wise accuracy and a 6% increase in class-wise accuracy on large-vocabulary datasets. Notably, ChordFormer excels in recognizing rare chord types and demonstrates superior performance on large-vocabulary tasks.

Comparisons and Class Recall

Experimental results indicate that ChordFormer outperforms baseline models, including CNN, CNN+BLSTM, and traditional transformer models across multiple metrics. Class recall evaluations reveal improved accuracy for rare chord classes under varying re-weighting strategies.

Figure 3: Top: chord class appearance in song dataset; Bottom: ChordFormer recall values for each chord class under varying re-weight factors.

Figure 4: Confusion matrix of ChordFormer with re-weighting factors ($\gamma = 0.5, w_{\text{max}$).

Conclusion

ChordFormer marks a substantial advancement in automatic chord recognition by effectively integrating convolutional layers and self-attention mechanisms within a conformer framework. It successfully addresses large-vocabulary recognition challenges, offering enhanced performance especially for rare chord types through innovative reweighting strategies and structured chord representations. Future research directions may explore adaptive reweighting techniques and self-supervised learning to further optimize performance and expand the applicability of chord recognition systems in real-world MIR applications. Overall, ChordFormer provides a robust foundation for future developments in automatic chord recognition, highlighting its applicability in diverse musical contexts.

Markdown Report Issue