
HeartMuLa: Music & Cardiac Audio AI

Updated 16 January 2026
  • HeartMuLa is a dual-domain framework that integrates hierarchical large language models for controllable music generation and deep multi-label models for detailed heart sound analysis.
  • In music AI, HeartMuLa employs ultra-low-frame-rate tokenization and separate global and local Transformers to achieve scalable, high-fidelity song synthesis with precise section control.
  • In biomedical audio, HeartMuLa leverages multi-branch CNN ensembles and audio-LLM techniques to extract multiple murmur attributes, delivering high accuracy in clinical auscultation analysis.

HeartMuLa is a term used for two distinct advanced systems in contemporary machine learning research: (1) a large-scale, open-source foundation model for controllable music generation using hierarchical LLMs and ultra-low-frame-rate tokenization, and (2) a family of state-of-the-art multi-label learning systems for comprehensive analysis of heart and cardiopulmonary sounds, usually framed as murmur attribute extraction and signal separation/classification. Both domains share a commitment to semantic-level audio representation, robust multi-attribute modeling, and hierarchical control of content. Below is a comprehensive account of the methodologies, core architectures, datasets, evaluation protocols, and benchmarks that underpin HeartMuLa research in automatic auscultation and music AI.

1. HeartMuLa in Music AI: Hierarchical Foundation Models

The HeartMuLa music system (Yang et al., 15 Jan 2026) is a hierarchical, LLM-driven framework for music understanding and synthesis, introducing a scalable approach that rivals commercial systems in both fidelity and controllability.

1.1 Model Architecture

The system comprises four principal components: HeartCLAP (audio-text alignment), HeartTranscriptor (lyric recognition), HeartCodec (an ultra-low-frame-rate RVQ-based music tokenizer), and HeartMuLa proper (the controllable song-generation LLM).

  • HeartCodec encodes 48 kHz stereo music into feature streams via multi-encoder fusion (Whisper, WavLM, MuEncoder), downsamples them with a query-based Transformer to a 12.5 Hz frame rate, and quantizes with residual vector quantization (RVQ) using K=8 codebooks of V=8192 entries each, yielding a 3840× temporal compression.
  • Generation is factorized into a global Transformer, θ_glo, for coarse structure (autoregressively predicting the first-codebook tokens a_{l,0}), and a local Transformer, θ_loc, for intra-frame detail (predicting a_{l,k} for k = 1, ..., K−1).
  • Model sizes include a 3.3B-parameter baseline (3B global + 0.3B local) and a 7.3B-parameter enhanced model, each with ~32–48 layers and 4096 hidden units.
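As a quick sanity check on the figures above, the 3840× compression corresponds to the temporal downsampling from the 48 kHz sample rate to the 12.5 Hz token frame rate, and the per-frame token budget follows directly from K and V. A minimal sketch (variable names are ours, not the paper's):

```python
# Temporal compression of HeartCodec as described above:
# 48 kHz audio samples -> 12.5 Hz RVQ token frames.
import math

sample_rate_hz = 48_000   # input sample rate per channel
frame_rate_hz = 12.5      # HeartCodec token frame rate
K, V = 8, 8192            # RVQ codebooks and entries per codebook

temporal_compression = sample_rate_hz / frame_rate_hz
bits_per_frame = K * math.log2(V)   # 8 codebooks x 13 bits each

print(temporal_compression)  # 3840.0
print(bits_per_frame)        # 104.0 bits per 80 ms frame
```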

1.2 Conditioning and Controllability

  • Section-level control is achieved by interleaving style annotations (tagged by a multimodal LLM) within lyric tokens and using explicit section markers ([intro], [verse], etc.).
  • Short-music mode is deployed via specialized training and inference (denser CFG dropout, shorter context).
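The interleaving of section markers and style annotations can be pictured as simple prompt construction before tokenization; the tag vocabulary and formatting below are illustrative assumptions, not the paper's verbatim format:

```python
# Hypothetical illustration of section-level conditioning: style tags
# interleaved with section markers and lyric lines before tokenization.
sections = [
    ("[intro]",  "dreamy synth pads", []),
    ("[verse]",  "soft male vocal",   ["City lights are fading slow"]),
    ("[chorus]", "anthemic, layered", ["Hold on to the afterglow"]),
]

def build_prompt(sections):
    lines = []
    for marker, style, lyrics in sections:
        lines.append(f"{marker} <style: {style}>")  # section marker + style tag
        lines.extend(lyrics)                        # lyric lines for this section
    return "\n".join(lines)

print(build_prompt(sections))
```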

1.3 Training Protocol

Training proceeds in four stages: warmup (30 s clips with lyric/audio condition), full-dataset pretraining, supervised fine-tuning (high-quality subset, increased loss weighting), and direct preference optimization (DPO) with RL-style feedback from MuQ, AudioBox, SongEval, and PER datasets.
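The DPO stage optimizes the standard preference objective over chosen/rejected generations (with reward signals such as MuQ or SongEval supplying the preference pairs). A minimal log-space sketch of that standard loss:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * margin of log-ratios)."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss shrinks as the policy prefers the chosen sample more strongly
# than the frozen reference model does.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # positive margin -> loss below log 2
```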

2. HeartMuLa in Biomedical Audio: Multi-Attribute Heart Sound Analysis

HeartMuLa is also the term for a family of deep-learning approaches targeting comprehensive heart-sound analysis, especially multi-label murmur attribute prediction, as pioneered in Deep CardioSound (Guo et al., 2022), and audio-LLM transfer (Florea et al., 23 Jan 2025).

2.1 Problem Definition

The core task extends beyond binary murmur detection to automatic labeling of phonocardiogram (PCG) segments with multiple clinically used attributes: timing, grading, pitch, quality, and shape (Guo et al., 2022). The analysis typically focuses on the systolic period; diastolic analysis is less commonly addressed due to data scarcity.
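The multi-label target can be pictured as one attribute vector per segment; the values below are standard auscultation terms used for illustration, not necessarily the exact label set of the cited datasets:

```python
# Hypothetical multi-label annotation for one systolic murmur segment.
# Each attribute is the target of an independent classification head.
murmur_labels = {
    "timing":  "holosystolic",   # early- / mid- / late- / holosystolic
    "grading": "III/VI",         # Levine-scale intensity
    "pitch":   "high",           # low / medium / high
    "quality": "harsh",          # blowing / harsh / musical
    "shape":   "plateau",        # crescendo / decrescendo / diamond / plateau
}

assert set(murmur_labels) == {"timing", "grading", "pitch", "quality", "shape"}
```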

2.2 Datasets and Annotation

  • CirCor DigiScope: 5,272 PCG recordings from 1,577 pediatric patients, annotated at the patient/segment level for up to five murmur groups and detailed cardiac segment boundaries (S1, systole, etc.) (Guo et al., 2022, Florea et al., 23 Jan 2025).
  • Manikin-Recorded Cardiopulmonary Sounds: 210 high-fidelity, filter-enhanced heart and lung audio clips (separated and mixed), annotated for 10 heart and 6 lung conditions, with location metadata (Torabi et al., 2024).

3. Deep Learning Architectures and Audio Representation

3.1 Multi-Branch CNN Ensembles

Deep CardioSound employs a five-branch CNN ensemble architecture: each branch specializes in one murmur attribute group and shares a DenseNet-121 as frontend, operating on "bags" of consecutive, preprocessed systolic segments. Feature fusion and a global consistency branch augment the multi-label structure. Cross-entropy losses are combined across groups, with no explicit focal/imbalance weighting; data augmentation is used to mitigate label skew (Guo et al., 2022).
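The shared-frontend, per-attribute-head structure can be sketched abstractly. In the toy code below, a trivial pooling function and random linear heads stand in for the DenseNet-121 backbone and trained classifiers; shapes and class counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "backbone": one shared feature per bag of systolic segments.
def shared_backbone(bag):                 # bag: (n_segments, n_mels, n_frames)
    pooled = bag.mean(axis=(1, 2))        # global average pool per segment
    return pooled.mean()                  # bag-level scalar summary (toy)

# Five attribute-specific heads share the same backbone features.
attribute_classes = {"timing": 4, "grading": 3, "pitch": 3,
                     "quality": 3, "shape": 4}
heads = {name: rng.normal(size=(1, n)) for name, n in attribute_classes.items()}

def predict(bag):
    feat = np.array([[shared_backbone(bag)]])          # (1, 1) feature
    logits = {name: feat @ W for name, W in heads.items()}
    return {name: int(np.argmax(z)) for name, z in logits.items()}

bag = rng.normal(size=(6, 64, 100))       # a bag of 6 systolic segments
print(predict(bag))                       # one class index per attribute group
```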

3.2 Audio-LLM Approaches

Recent advances involve fine-tuning large-scale audio LLMs such as Qwen2-Audio (Florea et al., 23 Jan 2025). These use Whisper-style audio encoders to produce embeddings, which are then processed by an LLM backbone (e.g., Qwen-7B) for multitask classification (timing, grading, pitch, etc.), with LoRA adapters for parameter-efficient training. Segmentation with SSAMBA, a self-supervised state-space audio model, is used to feed phase-specific representations (S1, S2, etc.) to the model.
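The LoRA update itself can be written down directly: the frozen weight W is augmented by a trainable low-rank product scaled by alpha/r. A minimal numpy sketch (dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, alpha = 64, 32, 4, 8

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01        # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection (init 0)

def lora_forward(x):
    # y = W x + (alpha / r) * B A x  -- only A and B receive gradients
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, the adapter starts as an exact no-op.
assert np.allclose(lora_forward(x), W @ x)
```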

3.3 Preprocessing and Feature Extraction

Approaches utilize denoising (Butterworth filters), precise segmentation (via HSMM or self-supervised models), normalization, and time-frequency representations (MFCC, STFT, Wavelet Scattering Transform). The blending of fine-grained signal processing with hierarchical deep learning distinguishes the state of the art (Patwa et al., 2023, Torabi et al., 2024).
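The denoising step mentioned above is typically a zero-phase Butterworth band-pass over the heart-sound band. A sketch using scipy, with an assumed 4 kHz sample rate and a synthetic signal standing in for a PCG recording:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 4000                       # assumed PCG sample rate (Hz)
t = np.arange(0, 2.0, 1 / fs)

# Toy PCG stand-in: a 100 Hz "heart sound" plus DC offset and baseline drift.
x = np.sin(2 * np.pi * 100 * t) + 0.5 + 0.3 * np.sin(2 * np.pi * 0.5 * t)

# 4th-order Butterworth band-pass over a typical heart-sound band.
b, a = butter(4, [25, 400], btype="bandpass", fs=fs)
y = filtfilt(b, a, x)           # zero-phase filtering preserves S1/S2 timing

print(abs(y.mean()) < 0.05)     # DC offset and drift removed
print(y.std() > 0.5)            # 100 Hz component retained
```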

4. Evaluation Protocols and Benchmarks

4.1 Music Model Evaluation

  • Objective metrics: SongEval, AudioBox, Tag-Sim, PER (lyric intelligibility).
  • Subjective metrics: Mean Opinion Scores (MOS) for musicality, structure, text alignment.
  • Performance: HeartMuLa-7B achieves competitive or superior scores on PER and subjective musicality relative to leading commercial models such as Suno-v5 and MiniMax-2.0. Batch inference with optimized attention reduces end-to-end song generation times by 5.4× (Yang et al., 15 Jan 2026).

4.2 Biomedical Audio Evaluation

  • Segment-level: Sensitivity, specificity, F1, precision.
  • Patient-level: Label group mode aggregation, overall accuracy (up to 96.9%) (Guo et al., 2022).
  • Feature-specific: Multi-class accuracy and weighted-accuracy (W.acc) for 11 murmur tasks (timing, grading, pitch, quality, shape—systolic and diastolic), with audio-LLM methods achieving 100% on most, except "grading" where text encoder limitations are observed (Florea et al., 23 Jan 2025).
  • Comparisons: Deep CardioSound and audio-LLM models surpass prior DNNs in multi-label F1 and classification of long-tail attributes.
Model / Task              Sys. Timing   Sys. Shape   Sys. Grading   W.acc (murmur)
Deep CardioSound          96.6%         96.3%        96.6%          —
Qwen2-Audio (with seg.)   100%          100%         33.4%          75.6%
M2D+AST                   —             —            —              83.2%
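The patient-level "label group mode aggregation" above reduces to a per-attribute majority vote over segment predictions; a minimal sketch:

```python
from collections import Counter

# Hypothetical per-segment predictions for one patient, per attribute group.
segment_preds = [
    {"timing": "holosystolic", "pitch": "high"},
    {"timing": "holosystolic", "pitch": "medium"},
    {"timing": "midsystolic",  "pitch": "high"},
]

def aggregate_patient(preds):
    """Majority vote (mode) across segments for each attribute group."""
    out = {}
    for key in preds[0]:
        votes = Counter(p[key] for p in preds)
        out[key] = votes.most_common(1)[0][0]
    return out

print(aggregate_patient(segment_preds))
# {'timing': 'holosystolic', 'pitch': 'high'}
```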

5. Signal Processing, Separation, and Physiology Modeling

5.1 Heart-Lung Source Separation

Using the manikin-mixed dataset, techniques such as affine non-negative matrix factorization (ANMF), deep U-Nets, and multimodal self-supervised encoders enable monaural separation of overlapped heart and lung sounds, with SNR gains of 6–10 dB reported for ANMF (Torabi et al., 2024). The dataset is designed to benchmark "HeartMuLa"-style front-ends.
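The factorization at the core of ANMF-style separation can be sketched with the standard multiplicative updates for non-negative matrix factorization (this is plain NMF on a toy magnitude spectrogram, not the affine variant):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "magnitude spectrogram": mixture of two non-negative rank-1 sources.
V = (np.outer(rng.random(20), rng.random(50))
     + np.outer(rng.random(20), rng.random(50)))

def nmf(V, rank=2, iters=200, eps=1e-9):
    """Standard multiplicative-update NMF: V ≈ W H with all entries >= 0."""
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update spectral bases
    return W, H

W, H = nmf(V)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err < 0.05)   # the rank-2 mixture is recovered almost exactly
```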

5.2 Multiscale Hemodynamics

The term "HeartMuLa" also appears as the label for a geometric multiscale simulation platform: 3D Navier–Stokes in an ALE framework with RIIS valve models, one-way kinematic coupling from 3D electromechanics, and a 0D closed-loop circuit for the rest of the circulation (Zingaro et al., 2021). Physiological outputs (LV volume, ejection fraction, pressures, and flows) are validated against gold-standard MRI and echocardiography data.
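The simplest element of such a 0D circuit is a two-element Windkessel, which obeys C dP/dt = Q_in(t) − P/R. A toy forward-Euler integration (parameters and the inflow waveform are illustrative, not from the cited work):

```python
import math

# Two-element Windkessel: C * dP/dt = Q_in(t) - P / R
R, C = 1.0, 1.5          # peripheral resistance, arterial compliance (toy units)
dt, T = 1e-3, 10.0       # time step and total simulated time (s)

def q_in(t, period=0.8):
    """Pulsatile inflow: half-sine during 'systole', zero in 'diastole'."""
    phase = t % period
    systole = 0.3 * period
    return math.sin(math.pi * phase / systole) if phase < systole else 0.0

P = 0.0
for step in range(int(T / dt)):
    t = step * dt
    P += dt * (q_in(t) - P / R) / C   # forward Euler step
print(round(P, 3))  # pressure settles into a periodic steady state
```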

6. Limitations, Challenges, and Future Directions

  • Segment-label reliance: Most biomedical HeartMuLa models require high-quality S1–S2 segmentation or prior knowledge of class distributions; errors in segmentation or rare label groups (especially diastolic murmurs and "grading") propagate directly into final outputs (Guo et al., 2022, Florea et al., 23 Jan 2025).
  • Domain adaptation: Transfer learning across populations and device modalities remains unresolved for heart sound models (Patwa et al., 2023).
  • Generalization: While music HeartMuLa scales well with model size and data, open questions remain in multi-lingual lyric alignment and style transfer (Yang et al., 15 Jan 2026).
  • Data limitations: Simulator-only datasets lack patient variability and natural artifacts, calling for augmentation with real-world recordings (Torabi et al., 2024).
  • Next steps: Joint modeling of heart sounds with ECG/imaging, multimodal LLMs, uncertainty quantification, improved text encoders, and end-to-end learning from waveform to structured diagnosis are identified as priorities for future research (Florea et al., 23 Jan 2025).

7. Impact and Practical Implications

HeartMuLa systems—both musical and biomedical—have significantly advanced the capacity for large-scale, controllable generation and multi-label analysis of complex audio. In clinical contexts, HeartMuLa-based models enable structured, explainable murmur assessments spanning multiple attributes, supporting differential diagnosis and potentially automating auscultation and triage. In music AI, HeartMuLa demonstrates that academic-scale resources can now approach or exceed commercial-grade song synthesis in flexibility, fidelity, and interpretability. The convergence of semantic-token audio representations, hierarchical generative modeling, and multi-condition control represents a paradigm shift in both AI-driven audio understanding and production (Guo et al., 2022, Yang et al., 15 Jan 2026, Florea et al., 23 Jan 2025).
