SC-CNN-demo: "Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems"

Authors: Hyungchan Yoon, Chanhwan Kim, Seyun Um, Hyun-Wook Yoon, Hong-Goo Kang

Code

Abstract

This letter proposes an effective speaker-conditioning method that is applicable to zero-shot multi-speaker text-to-speech (ZSM-TTS) systems. Based on the inductive bias in the speech generation task, in which local context information in text/phoneme sequences heavily affects the speaker characteristics of the output speech, we propose a Speaker-Conditional Convolutional Neural Network (SC-CNN) for the ZSM-TTS task. SC-CNN first predicts convolutional kernels from each learned speaker embedding, then applies 1-D convolutions to phoneme sequences with the predicted kernels. It utilizes the aforementioned inductive bias and effectively models the characteristics of speech by providing speaker-specific local context in the phonetic domain. We also build both FastSpeech2-based and VITS-based ZSM-TTS systems to verify its superiority over conventional speaker-conditioning methods. The results confirm that the models with SC-CNN outperform recent ZSM-TTS models in terms of both subjective and objective measurements.
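The two steps described in the abstract (predict kernels from a speaker embedding, then convolve the phoneme sequence with them) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear kernel predictor, the depthwise (one kernel per channel) layout, the zero padding, the odd kernel size, and all names and dimensions (`predict_kernels`, `speaker_conditional_conv1d`, `W_k`, `b_k`) are assumptions made only for illustration.

```python
import numpy as np

def predict_kernels(spk_emb, W_k, b_k, channels, kernel_size):
    """Map a speaker embedding to one 1-D kernel per channel.

    A single linear projection and a depthwise layout are assumptions here.
    """
    flat = W_k @ spk_emb + b_k                # (channels * kernel_size,)
    return flat.reshape(channels, kernel_size)

def speaker_conditional_conv1d(phonemes, kernels):
    """Convolve phoneme features with speaker-specific kernels.

    phonemes: (channels, time) hidden phoneme features
    kernels:  (channels, kernel_size), kernel_size assumed odd
    Returns an array of shape (channels, time); zero padding keeps the length.
    """
    channels, time = phonemes.shape
    k = kernels.shape[1]
    pad = k // 2
    padded = np.pad(phonemes, ((0, 0), (pad, pad)))
    out = np.zeros_like(phonemes)
    for t in range(time):
        window = padded[:, t:t + k]           # local context around position t
        out[:, t] = np.sum(window * kernels, axis=1)
    return out

# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_spk, channels, kernel_size, time = 8, 4, 3, 10
W_k = rng.normal(size=(channels * kernel_size, d_spk))
b_k = np.zeros(channels * kernel_size)

phonemes = rng.normal(size=(channels, time))  # same phoneme sequence for both speakers
spk_a = rng.normal(size=d_spk)
spk_b = rng.normal(size=d_spk)

out_a = speaker_conditional_conv1d(
    phonemes, predict_kernels(spk_a, W_k, b_k, channels, kernel_size))
out_b = speaker_conditional_conv1d(
    phonemes, predict_kernels(spk_b, W_k, b_k, channels, kernel_size))
```

In this sketch the same phoneme features yield different outputs for `spk_a` and `spk_b` because each speaker embedding produces its own kernels, which is the sense in which the local phonetic context becomes speaker-specific.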

VCTK zero-shot inference

Section 1. FastSpeech2 system

[16000Hz]


GT (16kHz)
Synthesized text	We want to get to the final, anyway.	I feel really good.	He wasn't ready to cope with the pressures.	When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.
ZS-FastSpeech 2
StyleSpeech
SC-StyleSpeech
w/o DS
w/o WN

Section 2. VITS system

[22050Hz]


GT (22.05kHz)
Synthesized text	People were determined to change the Government in the last election.	In fact, he is not even in the squad for the game.	What kind of man does that, Mr Dick?	Many complicated ideas about the rainbow have been formed.
TransferTTS
SC-TransferTTS