Recent zero-shot style-transfer speech synthesis methods have shown promising results in adapting to unseen speaking styles. While most state-of-the-art methods generalize to new speakers and styles by relying on very large models or corpora, achieving similar generalization with a smaller model remains an open challenge. We propose a zero-shot method that combines the small GenerSpeech backbone with a fine-grained style encoder. To disentangle the speaker, global style, fine-grained style, and content embeddings, we introduce a mutual-information minimization loss. To further disentangle style from speaker and to increase the diversity of the style embeddings, we introduce a maximum-mean-discrepancy-guided cycle-consistency loss. Experiments show that our method outperforms baseline zero-shot style-transfer methods, achieving a 58% average style preference and a prosody PMOS of 3.64 on VCTK.
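For readers unfamiliar with maximum mean discrepancy (MMD), the following is a minimal NumPy sketch of the standard biased empirical estimate of squared MMD with a Gaussian kernel. It is only an illustration of the general quantity, not the loss used in the paper: the function names `rbf_kernel` and `mmd2`, the fixed `bandwidth` parameter, and the use of random vectors in place of style embeddings are all assumptions made here, and how MMD guides the cycle-consistency loss is not specified on this page.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    # Pairwise squared Euclidean distances via the expansion |a|^2 + |b|^2 - 2ab.
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2(x, y, bandwidth=1.0):
    """Biased empirical estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical stand-ins for two sets of embeddings (not real model outputs).
rng = np.random.default_rng(0)
same_a = rng.normal(size=(64, 8))
same_b = rng.normal(size=(64, 8))
shifted = rng.normal(loc=2.0, size=(64, 8))

mmd_same = mmd2(same_a, same_b, bandwidth=4.0)
mmd_shifted = mmd2(same_a, shifted, bandwidth=4.0)
```

In this sketch, `mmd_shifted` should come out larger than `mmd_same`, since the estimate grows as the two sample distributions move apart. The bandwidth is fixed by hand here; in practice it is often chosen with a heuristic such as the median pairwise distance.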
Training set: VCTK | Test set: VCTK
Target Text: And we will go meet her Wednesday at the train station.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
Target Text: His friends say he's looking for the pot of gold at the end of the rainbow.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
Target Text: But refraction by the raindrops which causes the rainbows.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
Target Text: And the width of the colored band increases as the size of the drops increases.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
Target Text: The result is to give a bow with an abnormally wide yellow band, since red and green light when mixed form yellow.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
Target Text: And maybe a snack for her brother Bob.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
Training set: LibriTTS | Test set: VCTK
Target Text: And maybe a snack for her brother Bob.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
Target Text: And its two ends apparently beyond the horizon.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
Target Text: His friends say he is looking for the pot of gold at the end of the rainbow.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
Target Text: And we will go meet her Wednesday at the train station.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
Target Text: But refraction by the raindrops which causes the rainbows.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
Target Text: She would never resort to such devices.
| Ours | GenerSpeech | YourTTS | VALL-E-X |
|---|---|---|---|
If you find this work useful, please cite our paper:
@article{Eren2026improving,
author = {Eren, Eray and Liu, Qingju and Alwan, Abeer and Bharaj, Gaurav},
title = {Improving zero-shot style transfer text-to-speech by disentangled fine-grained style modeling},
journal = {JASA Express Letters},
volume = {6},
number = {3},
pages = {034802},
year = {2026},
doi = {10.1121/10.0042974},
url = {https://pubs.aip.org/asa/jel/article/6/3/034802/3383063/Improving-zero-shot-style-transfer-text-to-speech}
}