Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features, handling polyphonic expressions, and producing natural-sounding multilingual speech, capabilities that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence generation tasks. This architecture improves codebook processing efficiency while maintaining high-fidelity outputs, making it particularly effective for AI interactions and voice cloning. Fish-Speech leverages Large Language Models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion and thereby streamlining the synthesis pipeline and enhancing multilingual support. Additionally, we developed FF-GAN through GFSQ to achieve superior compression ratios and near 100% codebook utilization. Our approach addresses key limitations of current TTS systems while providing a foundation for more sophisticated, context-aware speech synthesis. Experimental results show that Fish-Speech significantly outperforms baseline models in handling complex linguistic scenarios and voice cloning tasks, demonstrating its potential to advance TTS technology in AI applications. The implementation is open source at this https URL.
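For intuition on the GFSQ scheme named in the abstract, a minimal sketch follows. It assumes GFSQ behaves like standard finite scalar quantization applied per channel group (each channel rounded to a small fixed set of levels); the level counts and group size are illustrative, not values from the paper. Because every combination of per-channel codes is a valid entry in the implicit product codebook, this kind of quantizer can plausibly reach the near-100% codebook utilization the abstract claims.

```python
import numpy as np

def gfsq_quantize(z, levels=(7, 5, 5, 5), num_groups=2):
    """Hypothetical grouped finite-scalar quantizer (illustrative values).

    Each latent channel is squashed into (-1, 1) and rounded onto a small
    grid of `levels[i]` points; channels are split into `num_groups` groups.
    Real implementations also use a straight-through estimator for gradients,
    which is omitted here.
    """
    levels = np.asarray(levels, dtype=np.float64)
    *lead, dim = z.shape
    assert dim == num_groups * len(levels)
    z = z.reshape(*lead, num_groups, len(levels))
    bounded = np.tanh(z)                 # squash each channel into (-1, 1)
    half = (levels - 1) / 2.0
    codes = np.round(bounded * half)     # integer code per channel
    quantized = codes / half             # snap back onto the [-1, 1] grid
    return quantized.reshape(*lead, dim), codes.astype(np.int64)

# Every per-channel code combination is reachable, so the implicit codebook
# (7*5*5*5 = 875 codes per group here) can be fully used.
z = np.random.randn(4, 8)                # 4 frames, 2 groups of 4 channels
quantized, codes = gfsq_quantize(z)
print(quantized.shape, int(codes.min()), int(codes.max()))
```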
Notes: "The AR training utilized an 8*H100 80G GPUs for one week, while the vocoder training employed an 8*4090 GPUs for an additional week. Note that these timelines exclude the DPO stage." (989400000000000 FLOP / GPU / sec [H100, bf16 assumed] + 330000000000000 FLOP / GPU / sec [4090, bf16 assumed]) * 8 GPUs * 168 hours * 3600 sec / hour * 0.3 [assumed utilization] = 1.9151355e+21 FLOP
Size Notes: [tokens] "The dataset contains about 720,000 hours of speech across different languages, with 300,000 hours each of English and Mandarin Chinese as the main components. We also included 20,000 hours each of other language families: Germanic (German), Romance (French, Italian), East Asian (Japanese, Korean), and Semitic (Arabic)." • Batch size: 1M tokens • Training steps: 500K • Tokens seen: 10^6 tokens/step * 500,000 steps = 5*10^11 tokens
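A short sketch checking the size-note arithmetic. It reads "20,000 hours each" as per listed language (which reconciles with the quoted ~720,000-hour total) and computes training tokens as batch size times optimizer steps; both figures are taken directly from the notes above.

```python
# Dataset hours per language, assuming 20,000 hours applies to each listed
# language (this interpretation matches the stated ~720,000-hour total).
hours = {
    "English": 300_000, "Mandarin": 300_000,
    "German": 20_000, "French": 20_000, "Italian": 20_000,
    "Japanese": 20_000, "Korean": 20_000, "Arabic": 20_000,
}
print(sum(hours.values()))  # 720000 hours

BATCH_TOKENS = 10**6        # 1M tokens per step
TRAIN_STEPS = 500_000       # 500K steps
print(f"{BATCH_TOKENS * TRAIN_STEPS:.1e} tokens")  # 5.0e+11 tokens
```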