Fish Speech V1.5 is a leading text-to-speech (TTS) model trained on more than 1 million hours of audio data in multiple languages.
Notes: The previous model, Fish-Speech 1.4, used 1.9151355e+21 FLOP for training (see model card) and was trained on 720000 hours of audio. This model is similar but trained on 1M hours of audio, so, assuming training compute scales linearly with audio hours, its training compute can be estimated as 1.9151355e+21 * 10^6 / 720000 = 2.6599104e+21 FLOP.
Size Notes: [tokens] The previous model was trained on 720000 hours of audio data, equal to 5*10^11 tokens -> 1M hours ~ 5*10^11 * 10^6 / 720000 ~ 7*10^11 tokens.
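
The two estimates above use the same proportional scaling, which can be checked with a short Python sketch. It assumes that both training compute and token count grow linearly with hours of audio; the constant and function names are illustrative and do not come from the model card.

```python
# Figures quoted in the notes for the previous model (Fish-Speech 1.4).
PREV_HOURS = 720_000        # hours of training audio
PREV_FLOP = 1.9151355e21    # training compute, FLOP
PREV_TOKENS = 5e11          # training tokens

# Audio hours reported for Fish Speech V1.5 ("more than 1 million").
NEW_HOURS = 1_000_000


def scale_linearly(prev_value: float, prev_hours: float, new_hours: float) -> float:
    """Scale a quantity in proportion to hours of audio (assumed linear)."""
    return prev_value * new_hours / prev_hours


est_flop = scale_linearly(PREV_FLOP, PREV_HOURS, NEW_HOURS)
est_tokens = scale_linearly(PREV_TOKENS, PREV_HOURS, NEW_HOURS)

print(f"Estimated training compute: {est_flop:.4e} FLOP")
print(f"Estimated training tokens:  {est_tokens:.2e}")
```

Because the model is described as trained on more than 1 million hours, both results are better read as approximate lower-end estimates than as exact values.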