Model Details

Domain:

Speech

Task:

Text-to-speech (TTS)

Speech synthesis

Model Access:

Open weights (unrestricted)

Introduction

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with, or even surpassing, top TTS providers. Our model generates highly natural speech from text prompts when given a speaker embedding or audio prefix, and can accurately clone a voice from a reference clip only a few seconds long. The conditioning setup also allows fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44 kHz.

Training

Training Code Accessibility

Apache 2.0 https://huggingface.co/Zyphra/Zonos-v0.1-transformer

Size Notes: "The Zonos-v0.1 models are trained on approximately 200,000 hours of speech data, encompassing both neutral-toned speech (like audiobook narration) and highly expressive speech. The majority of our data is English, although there are substantial amounts of Chinese, Japanese, French, Spanish, and German."

Parameters

1600000000

Notes: 1.6B
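
As a rough illustration of what the 1.6B parameter count means in practice, the sketch below estimates the memory needed just to hold the weights at a few common numeric precisions. The precisions listed are assumptions for illustration; this entry does not state which dtype the released checkpoint uses, and the estimate ignores activations, caches, and any separate audio codec.

```python
# Back-of-the-envelope weight-memory estimate from a parameter count.
# The dtypes below are illustrative assumptions, not a statement about
# how the Zonos-v0.1 checkpoint is actually stored.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

NUM_PARAMS = 1_600_000_000  # parameter count listed above (1.6B)


def weight_memory_gib(num_params: int, dtype: str = "bfloat16") -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in ("float32", "bfloat16", "int8"):
        print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.2f} GiB")
```

Under these assumptions the weights alone take roughly 6 GiB in 32-bit floats and about half that in 16-bit formats. Actual runtime memory will be higher.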
