Magnitude and system dependence of embedding shifts between originals and clones across accents

Determine the magnitude of shifts in deep speaker-embedding space between original utterances and their voice-cloned counterparts. Ascertain whether these shifts differ between heavily accented and socially standard Mandarin speech, and whether they depend on the choice of voice-cloning system.

Background

The paper investigates how accent-related variation is preserved or reshaped by commercial voice-cloning systems and how such changes are reflected both in deep speaker-embedding spaces and in human perception. A central concern is whether voice cloning alters accent-specific cues in a way that changes the relationship between original and cloned utterances in embedding space.

Speaker embeddings (e.g., ECAPA-TDNN) are widely used to capture speaker-discriminative information. If voice cloning systematically modifies accent cues, this could manifest as a shift in distances between original and cloned utterances in embedding space. The authors explicitly note that the size of these shifts, their dependence on accent (accented vs. standard Mandarin), and their dependence on the particular cloning system are not clear and require quantification.
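
One way such a quantification could be set up is sketched below. This is a minimal illustration, not the authors' method. It assumes that embeddings for each original utterance and its clone have already been extracted (for example with an ECAPA-TDNN model) and that cosine distance is the measure of shift; the source does not state which metric is used. The function names, the labels `system_A` and `system_B`, the embedding dimension, and the synthetic data are all hypothetical placeholders.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between paired embeddings.

    Accepts single vectors or arrays of shape (n, dim); pairs are matched row by row.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return 1.0 - num / den

def mean_shift_by_group(orig, clone, accent, system):
    """Mean original-to-clone distance for each (accent, system) cell."""
    distances = cosine_distance(orig, clone)
    cells = {}
    for dist, acc, sys_name in zip(distances, accent, system):
        cells.setdefault((acc, sys_name), []).append(dist)
    return {key: float(np.mean(vals)) for key, vals in cells.items()}

# Synthetic stand-in data (assumption): 8 utterance pairs, arbitrary dimension 192.
# Real use would replace these arrays with embeddings extracted from audio.
rng = np.random.default_rng(0)
orig = rng.normal(size=(8, 192))
clone = orig + 0.1 * rng.normal(size=(8, 192))  # clone = perturbed original
accent = ["accented", "standard"] * 4
system = ["system_A"] * 4 + ["system_B"] * 4    # hypothetical system labels

shifts = mean_shift_by_group(orig, clone, accent, system)
for (acc, sys_name), value in sorted(shifts.items()):
    print(f"{acc:9s} {sys_name}: mean cosine distance = {value:.4f}")
```

Comparing the cell means across the accent and system factors corresponds to the three open questions: overall size of the shift, its dependence on accent, and its dependence on the cloning system. Any real analysis would also need a significance test suited to the study design, which this sketch leaves out.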

References

However, it is not clear how large these shifts are, whether they differ for accented versus standard speech, or whether they depend on the voice-cloning system.

Acoustic and perceptual differences between standard and accented Chinese speech and their voice clones (2604.01562 - Yang et al., 2 Apr 2026) in Introduction (Section 1)