We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, hundreds of times larger than existing systems. Vall-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that Vall-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See this https URL for demos of our work.
Notes: "The models are trained using 16 NVIDIA TESLA V100 32GB GPUs with a batch size of 6k acoustic tokens per GPU for 800k steps" 353M * 800k * 6k * 6 = 1.01e19 16 V100s is 2080 teraFLOP or 2e15 FLOP so 1e19 would take 1.5 hours at 100% utilization or ~5 hours at 30%. Is that plausible?
Size Notes: 60k hours of speech at ~13,680 words/hour: 13,680 × 60,000 ≈ 820.8M words. https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.3pbt0hfgv7pq
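A one-line check of the word count (the ~13,680 words/hour rate is taken from the linked doc and is an assumption, not a number from the paper):

```python
# Dataset-size sanity check: words of speech in 60k hours,
# assuming ~13,680 spoken words per hour (figure from the linked doc).
words_per_hour = 13_680
hours = 60_000
print(f"~{words_per_hour * hours / 1e6:.0f}M words")  # ~821M
```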
Notes: "Both the AR model and the NAR model have the same transformer architecture with 12 layers, 16 attention heads, an embedding dimension of 1024, a feed-forward layer dimension of 4096, and a dropout of 0.1" Ben's script says that's 353M parameters, using n_block 12, d_model 1024, d_ff 4096, encoder only False https://github.com/bencottier/ml-parameter-count/blob/main/parameter_count.py