Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables

Published 29 Jun 2023 in eess.AS and cs.SD | (2306.17252v3)

Abstract: This paper introduces GlOttal-flow LPC Filter (GOLF), a novel method for singing voice synthesis (SVS) that exploits the physical characteristics of the human voice using differentiable digital signal processing. GOLF employs a glottal model as the harmonic source and IIR filters to simulate the vocal tract, resulting in an interpretable and efficient approach. We show it is competitive with state-of-the-art singing voice vocoders, requiring fewer synthesis parameters and less memory to train, and runs an order of magnitude faster for inference. Additionally, we demonstrate that GOLF can model the phase components of the human voice, which has immense potential for rendering and analysing singing voices in a differentiable manner. Our results highlight the effectiveness of incorporating the physical properties of the human voice mechanism into SVS and underscore the advantages of signal-processing-based approaches, which offer greater interpretability and efficiency in synthesis.

Summary

The paper introduces GOLF, a novel SVS module that pairs a glottal flow source model with differentiable LPC filters that simulate the human vocal tract.
It performs competitively with state-of-the-art vocoders while using approximately 35% of the memory the other models require in training and running about ten times faster at inference on CPU.
The approach enhances phase reconstruction fidelity, paving the way for more interpretable and high-quality audio synthesis applications.

Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables: A Comprehensive Overview

The paper "Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables" presents GlOttal-flow LPC Filter (GOLF), a novel approach to singing voice synthesis (SVS). It exploits the physical characteristics of the human voice through differentiable digital signal processing (DDSP). GOLF uses a glottal model as the harmonic source and IIR filters to simulate the vocal tract, which makes the synthesis both efficient and interpretable. The authors show that GOLF is competitive with state-of-the-art vocoders while requiring fewer synthesis parameters, less memory to train, and an order of magnitude less time for inference.

Methodological Innovations

The study introduces GOLF as an SVS module that builds on the harmonic-plus-noise architecture of DDSP and on subtractive synthesis in the manner of SawSing. A glottal flow model replaces the conventional harmonic source, and a differentiable IIR filter implementation in PyTorch improves training efficiency. The module is used as a neural vocoder: an encoder maps the input features, here mel-spectrograms, to the synthesis parameters that the decoder needs to generate the signal.
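The vocal-tract stage described above is an all-pole (LPC) IIR filter driven by a source signal. The following is a minimal sketch of that recursion in plain Python, not the paper's differentiable PyTorch implementation. The function name lpc_synthesis, the frame_size argument, and the choice to hold coefficients constant within each frame are assumptions made for illustration.

```python
def lpc_synthesis(excitation, frame_coeffs, frame_size):
    """All-pole filter: y[n] = x[n] - sum_k a[k] * y[n-k].

    excitation   : list of source samples x[n]
    frame_coeffs : one list of LPC coefficients [a1, ..., ap] per frame
    frame_size   : samples per frame (hypothetical; coefficients are
                   simply held constant inside a frame in this sketch)
    """
    order = len(frame_coeffs[0])
    y = []
    for n, x in enumerate(excitation):
        # Pick the coefficient set for the frame this sample falls in.
        frame = min(n // frame_size, len(frame_coeffs) - 1)
        a = frame_coeffs[frame]
        acc = x
        # Feedback from previously generated output samples.
        for k in range(1, order + 1):
            if n - k >= 0:
                acc -= a[k - 1] * y[n - k]
        y.append(acc)
    return y


# Impulse response of a one-pole filter with a1 = -0.5.
print(lpc_synthesis([1.0, 0.0, 0.0, 0.0], [[-0.5]], frame_size=4))
# prints [1.0, 0.5, 0.25, 0.125]
```

Because each output sample depends on earlier outputs, the loop cannot be trivially parallelised over time, which is one reason an efficient differentiable IIR implementation matters for training.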

The model uses the transformed Liljencrants-Fant (LF) model to generate glottal pulses, sampled across a range of parameter values believed to correspond well with perceived vocal effort. The pulses are precomputed and stored as fixed wavetables, and the synthesizer reads entries from these tables, which gives a compact representation of the harmonic components alongside the random (noise) elements.
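The wavetable idea can be sketched as follows: pulse shapes are computed once for a grid of shape-parameter values, and an oscillator then reads one table at the desired pitch. This is an illustration under stated assumptions, not the paper's code. In particular, the raised-cosine pulse and its open_quotient knob are hypothetical stand-ins for the transformed LF model, and all function names are invented for this example.

```python
import math


def make_pulse_table(open_quotient, table_size=64):
    """One period of a raised-cosine pulse (a stand-in, NOT the LF model).

    open_quotient in (0, 1] is a hypothetical shape parameter: the
    fraction of the period during which the pulse is non-zero.
    """
    open_len = max(1, int(round(open_quotient * table_size)))
    table = []
    for i in range(table_size):
        if i < open_len:
            table.append(0.5 * (1.0 - math.cos(2.0 * math.pi * i / open_len)))
        else:
            table.append(0.0)
    return table


def build_wavetables(num_tables=4, table_size=64):
    """Sample the shape parameter on a fixed grid, one table per value."""
    return [
        make_pulse_table(0.3 + 0.6 * j / (num_tables - 1), table_size)
        for j in range(num_tables)
    ]


def render(tables, table_index, f0, sample_rate, num_samples):
    """Read one table with a phase accumulator and linear interpolation."""
    table = tables[table_index]
    size = len(table)
    phase = 0.0  # position within the period, in cycles [0, 1)
    out = []
    for _ in range(num_samples):
        pos = phase * size
        i0 = int(pos) % size
        i1 = (i0 + 1) % size
        frac = pos - int(pos)
        out.append((1.0 - frac) * table[i0] + frac * table[i1])
        phase = (phase + f0 / sample_rate) % 1.0
    return out


tables = build_wavetables()
source = render(tables, table_index=1, f0=220.0, sample_rate=24000.0, num_samples=240)
```

The tables are fixed after construction, so at synthesis time only the table choice and the pitch need to be supplied, rather than the full set of pulse-model parameters.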

Comparative Analysis

The evaluation compares GOLF against three DDSP-based vocoders: DDSP itself, SawSing, and Pulse-train LPC Filter (PULF). GOLF uses approximately 35% of the memory required by the other models, and its inference on CPU is about ten times faster as measured by the real-time factor. Moreover, GOLF's predicted waveforms closely align with the ground truth, which suggests better phase reconstruction than models that rely on zero-phase filtering, such as DDSP and SawSing.
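For readers unfamiliar with the metric, the real-time factor is conventionally the processing time divided by the duration of the audio produced, so a lower value means faster synthesis and a value below 1 means faster than real time. The helper below is a minimal sketch of that definition. The function name, the dummy synthesizer, and the sample rate are hypothetical, and the paper's exact measurement protocol is not reproduced here.

```python
import time


def real_time_factor(synthesize, num_samples, sample_rate):
    """Return processing time divided by the duration of the generated audio.

    synthesize : callable taking a sample count and returning the samples
                 (a hypothetical stand-in for a vocoder's inference call)
    """
    start = time.perf_counter()
    audio = synthesize(num_samples)
    elapsed = time.perf_counter() - start
    audio_seconds = len(audio) / sample_rate
    return elapsed / audio_seconds


# Dummy synthesizer that returns one second of silence at 24 kHz.
rtf = real_time_factor(lambda n: [0.0] * n, num_samples=24000, sample_rate=24000)
print(f"RTF: {rtf:.6f}")
```

Under this definition, a model described as tenfold faster has a real-time factor roughly one tenth of the model it is compared with, measured on the same hardware.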

Implications and Future Directions

The research has both theoretical and practical implications. Theoretically, combining a glottal flow model with LPC filtering in an SVS system points to the potential of signal-processing-based techniques for making machine learning models more interpretable and efficient. Practically, GOLF's ability to capture the phase components of the human voice holds promise for applications such as voice matching and audio synthesis, where phase accuracy matters.

Future work could explore more versatile glottal source models and add further filters to account for the complex-phase components of voice signals and of the acoustic environment. The phase-matching results also suggest that GOLF could be extended towards time-domain vocal decomposition and synthesis.

Concluding Thoughts

GOLF is a notable step forward for SVS, combining interpretability and efficiency with the prospect of higher synthesis fidelity. Further work could broaden its scope and applicability and inform future SVS systems that capture more of the fine detail of human singing.