Improving Zero-Shot Style Transfer TTS | JASA Express Letters 2026

Published in JASA Express Letters, 2026

Abstract

Recent zero-shot style-transfer speech synthesis methods have shown promising results in adapting to unseen speaking styles. However, most state-of-the-art methods generalize to new speakers and styles by relying on very large models or corpora, and achieving similar generalization with a smaller model remains an open challenge. We propose a zero-shot method that combines the small GenerSpeech backbone with a fine-grained style encoder. To disentangle the speaker, global style, fine-grained style, and content embeddings, we introduce a mutual-information minimization loss. To further disentangle style from speaker and to increase the diversity of the style embeddings, we introduce a maximum-mean-discrepancy-guided cycle-consistency loss. Experimental results show that our method outperforms baseline zero-shot style-transfer methods, with a 58% average style preference and a prosody PMOS of 3.64 on VCTK.
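The abstract names a maximum-mean-discrepancy (MMD) guided loss but does not spell out its form. As a rough illustration of the statistic itself, and not of the paper's actual loss, the sketch below computes a biased estimate of squared MMD between two small sets of embeddings with a Gaussian kernel. The function names, the bandwidth value, and the toy 2-D "style embeddings" are all assumptions made for this example.

```python
import math


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs, y in ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of embeddings."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


# Toy 2-D embeddings with made-up values, for illustration only.
style_a = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
style_b = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1]]

same = mmd_squared(style_a, style_a)       # identical sets
different = mmd_squared(style_a, style_b)  # well-separated sets

print(f"MMD^2 (same sets):      {same:.4f}")
print(f"MMD^2 (different sets): {different:.4f}")
```

The estimate is zero when the two sets coincide and grows as the two sets of embeddings move apart, which is why a quantity of this kind can serve as a signal about how similar or how diverse two groups of embeddings are. How the paper combines it with cycle consistency is described in the paper itself, not here.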

Read the Paper (DOI: 10.1121/10.0042974)

Model Architecture

Style Encoder

Test Samples for Zero-Shot Style Transfer on VCTK Dataset

In-Domain Samples

Training set: VCTK | Test set: VCTK

Target Text: And we will go meet her Wednesday at the train station.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Target Text: His friends say he's looking for the pot of gold at the end of the rainbow.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Target Text: But refraction by the raindrops which causes the rainbows.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Target Text: And the width of the colored band increases as the size of the drops increases.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Target Text: The result is to give a bow with an abnormally wide yellow band, since red and green light when mixed form yellow.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Target Text: And maybe a snack for her brother Bob.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Out-of-Domain Samples

Training set: LibriTTS | Test set: VCTK

Target Text: And maybe a snack for her brother Bob.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Target Text: And its two ends apparently beyond the horizon.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Target Text: His friends say he is looking for the pot of gold at the end of the rainbow.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Target Text: And we will go meet her Wednesday at the train station.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Target Text: But refraction by the raindrops which causes the rainbows.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Target Text: She would never resort to such devices.

Reference Audio

Ground Truth

Ours	GenerSpeech	YourTTS	VALL-E-X

Citation

If you find this work useful, please cite our paper:

@article{Eren2026improving,
  author  = {Eren, Eray and Liu, Qingju and Alwan, Abeer and Bharaj, Gaurav},
  title   = {Improving zero-shot style transfer text-to-speech by disentangled fine-grained style modeling},
  journal = {JASA Express Letters},
  volume  = {6},
  number  = {3},
  pages   = {034802},
  year    = {2026},
  doi     = {10.1121/10.0042974},
  url     = {https://pubs.aip.org/asa/jel/article/6/3/034802/3383063/Improving-zero-shot-style-transfer-text-to-speech}
}