PolyVoice: Language Models for Speech to Speech Translation
- Contributions
- [Overview of PolyVoice](#Overview of PolyVoice)
LM-based method in S2ST
Contributions
- A decoder-only model for speech-to-speech translation (S2ST).
- A unit-based audio LM predicts the SoundStream codec units.
Overview of PolyVoice
PolyVoice has two LM-based components: an S2UT (speech-to-unit translation) front-end for translation and a U2S (unit-to-speech) back-end for synthesis.
An extra language model handles duration prediction.
- Semantic units are extracted with mHuBERT.
- Acoustic units are SoundStream codec units (from a residual vector quantizer), predicted by an autoregressive model and a non-autoregressive model.
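
To make the "residual vector quantizer" behind the acoustic units concrete, here is a minimal NumPy sketch of residual vector quantization in general. It is not SoundStream's actual implementation: the function names `rvq_encode` and `rvq_decode`, the toy 2-D codebooks, and the nearest-neighbour search are all assumptions made for illustration. Each quantizer level encodes whatever the previous levels failed to capture, so one frame becomes a short stack of code indices.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Return one code index per quantizer level for vector x (illustrative)."""
    residual = np.asarray(x, dtype=np.float64)
    codes = []
    for codebook in codebooks:
        # Pick the codeword closest (Euclidean distance) to the current residual.
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        # The next level only sees what this level could not represent.
        residual = residual - codebook[index]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected codeword of each level."""
    return sum(codebook[index] for codebook, index in zip(codebooks, codes))

# Toy example: two levels, two codewords each (values are made up).
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),   # coarse level
    np.array([[0.0, 0.0], [0.5, 0.0]]),   # fine level, refines the residual
]
codes = rvq_encode(np.array([1.4, 1.0]), codebooks)
reconstruction = rvq_decode(codes, codebooks)
```

In this toy case the coarse level selects `[1.0, 1.0]` and the fine level selects `[0.5, 0.0]`, so the reconstruction is `[1.5, 1.0]`. With several levels per frame, a model has to predict a stack of codes rather than a single token, which is one reason for splitting the prediction across more than one model.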