We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, hundreds of times larger than existing systems. Vall-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that Vall-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See this https URL for demos of our work.
Notes: "The models are trained using 16 NVIDIA TESLA V100 32GB GPUs with a batch size of 6k acoustic tokens per GPU for 800k steps" 353M * 800k * 6k * 6 = 1.01e19 16 V100s is 2080 teraFLOP or 2e15 FLOP so 1e19 would take 1.5 hours at 100% utilization or ~5 hours at 30%. Is that plausible?
Size Notes: 60k hours of speech at ~13,680 words/hour: 13,680 × 60,000 ≈ 820.8M words. https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.3pbt0hfgv7pq
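A one-line check of the word count (the ~13,680 words/hour rate is taken from the linked doc and is an assumption, not a number from the paper):

```python
# Dataset-size sanity check: words of speech in 60k hours,
# assuming ~13,680 spoken words per hour (figure from the linked doc).
words_per_hour = 13_680
hours = 60_000
print(f"~{words_per_hour * hours / 1e6:.0f}M words")  # ~821M
```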
Notes: "Both the AR model and the NAR model have the same transformer architecture with 12 layers, 16 attention heads, an embedding dimension of 1024, a feed-forward layer dimension of 4096, and a dropout of 0.1" Ben's script says that's 353M parameters, using n_block 12, d_model 1024, d_ff 4096, encoder only False https://github.com/bencottier/ml-parameter-count/blob/main/parameter_count.py