Improving Zero-Shot Style Transfer Text-to-Speech by Disentangled Fine-Grained Style Modeling

Eray Eren¹, Qingju Liu², Abeer Alwan¹, and Gaurav Bharaj²
¹Department of Electrical and Computer Engineering, University of California, Los Angeles, California 90095, USA
²Flawless AI, Los Angeles, California 90401, USA
Published in JASA Express Letters, 2026

Abstract

Recent zero-shot style-transfer speech synthesis methods have shown promising results in adapting to unseen speaking styles. However, most state-of-the-art methods generalize to new speakers and styles by relying on very large models or corpora; achieving similar generalization with a smaller model remains an open challenge. We propose a zero-shot method that augments the small GenerSpeech backbone with a fine-grained style encoder. To disentangle the speaker, global style, fine-grained style, and content embeddings, we introduce a mutual-information minimization loss. To further disentangle style from speaker and increase the diversity of style embeddings, we introduce a maximum-mean-discrepancy-guided cycle-consistency loss. Experimental results on VCTK show that our method outperforms baseline zero-shot style-transfer methods, with a 58% average style preference and a prosody PMOS of 3.64.
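For readers unfamiliar with maximum mean discrepancy (MMD), the sketch below shows how a squared MMD between two sets of embeddings can be estimated. It is only an illustration of the general statistic, not the loss used in the paper: the Gaussian (RBF) kernel, the biased estimator, the `bandwidth` value, and the function names `rbf_kernel` and `mmd2` are all assumptions made here, and the random arrays merely stand in for style embeddings.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against small negative round-off
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical stand-ins for two batches of 8-dimensional style embeddings.
rng = np.random.default_rng(0)
same_a = rng.normal(size=(64, 8))
same_b = rng.normal(size=(64, 8))
shifted = rng.normal(loc=2.0, size=(64, 8))

print("same distribution:   ", mmd2(same_a, same_b, bandwidth=4.0))
print("shifted distribution:", mmd2(same_a, shifted, bandwidth=4.0))
```

The estimate is close to zero when both batches come from the same distribution and grows as the distributions move apart, which is the property that makes MMD usable as a guiding signal in a training objective.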

Model Architecture

Style Encoder

Test Samples for Zero-Shot Style Transfer on VCTK Dataset

In-Domain Samples

Training set: VCTK  |  Test set: VCTK

Target Text: And we will go meet her Wednesday at the train station.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Target Text: His friends say he's looking for the pot of gold at the end of the rainbow.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Target Text: But refraction by the raindrops which causes the rainbows.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Target Text: And the width of the colored band increases as the size of the drops increases.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Target Text: The result is to give a bow with an abnormally wide yellow band, since red and green light when mixed form yellow.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Target Text: And maybe a snack for her brother Bob.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Out-of-Domain Samples

Training set: LibriTTS  |  Test set: VCTK

Target Text: And maybe a snack for her brother Bob.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Target Text: And its two ends apparently beyond the horizon.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Target Text: His friends say he is looking for the pot of gold at the end of the rainbow.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Target Text: And we will go meet her Wednesday at the train station.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Target Text: But refraction by the raindrops which causes the rainbows.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Target Text: She would never resort to such devices.

Reference Audio

Ground Truth

Ours GenerSpeech YourTTS VALL-E-X

Citation

If you find this work useful, please cite our paper:

@article{Eren2026improving,
  author  = {Eren, Eray and Liu, Qingju and Alwan, Abeer and Bharaj, Gaurav},
  title   = {Improving zero-shot style transfer text-to-speech by disentangled fine-grained style modeling},
  journal = {JASA Express Letters},
  volume  = {6},
  number  = {3},
  pages   = {034802},
  year    = {2026},
  doi     = {10.1121/10.0042974},
  url     = {https://pubs.aip.org/asa/jel/article/6/3/034802/3383063/Improving-zero-shot-style-transfer-text-to-speech}
}