We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges, we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. When compared to the evaluated baselines, AudioGen outperforms them on both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuations, both conditionally and unconditionally. Samples: this https URL
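A minimal sketch of the kind of mixing augmentation described above, assuming two equal-length waveforms are summed at a random signal-to-noise ratio and their captions are merged. The function name, the SNR range, and the caption-joining rule are illustrative assumptions, not the paper's exact recipe:

    import numpy as np

    def mix_samples(wav_a, wav_b, caption_a, caption_b, rng, snr_db_range=(-5.0, 5.0)):
        """Mix two equal-length 1-D waveforms at a random SNR (in dB) and merge captions."""
        snr_db = rng.uniform(*snr_db_range)
        # Scale wav_b so the power ratio of wav_a to the scaled wav_b matches the drawn SNR.
        power_a = np.mean(wav_a ** 2) + 1e-12
        power_b = np.mean(wav_b ** 2) + 1e-12
        scale = np.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
        mixed = wav_a + scale * wav_b
        # Renormalize only if the mixture clips outside [-1, 1].
        peak = np.max(np.abs(mixed))
        if peak > 1.0:
            mixed = mixed / peak
        return mixed, f"{caption_a} and {caption_b}"  # naive caption merge (assumption)

    # Example: mixed, cap = mix_samples(a, b, "a dog barks", "rain falls", np.random.default_rng(0))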
Notes: "the large model was trained on 128 A100 GPUs for 200k steps (∼1 week)". An A100 peaks at 312 teraFLOP/s, so: 128 * 312e12 FLOP/s * 7 * 24 * 3600 s * 0.3 (utilization assumption) ≈ 7.2e21 FLOP. Text encoding uses T5-Large, whose pre-training used 2.3e21 FLOP per the Flan paper: https://arxiv.org/abs/2210.11416
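A quick back-of-envelope script reproducing the estimate above; 312 teraFLOP/s is the A100's peak throughput, and the 30% utilization figure is an assumption:

    gpus = 128
    peak_flops = 312e12            # A100 peak, FLOP/s
    utilization = 0.3              # assumed hardware utilization
    train_seconds = 7 * 24 * 3600  # ~1 week of training
    total_flop = gpus * peak_flops * utilization * train_seconds
    print(f"{total_flop:.1e} FLOP")  # -> 7.2e+21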
Size Notes: "Overall we are left with ∼4k hours for training data" (a mix of speech and other sounds). Training the audio autoencoder uses a reconstruction loss on sequences of raw audio samples. Audio files are sampled at 16 kHz, so 16k * 4k * 3600 = 230.4B samples. Audio language modelling operates on discrete tokens; "each second of audio is represented by 500 tokens", so 4k * 3600 * 500 = 7.2B tokens.
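The same arithmetic as a short script, with the figures taken directly from the quotes above:

    hours = 4_000                  # "~4k hours" of training data
    seconds = hours * 3600
    sample_rate = 16_000           # audio is sampled at 16 kHz
    tokens_per_sec = 500           # "each second of audio is represented by 500 tokens"
    raw_samples = seconds * sample_rate   # raw-audio samples seen by the autoencoder
    lm_tokens = seconds * tokens_per_sec  # discrete tokens seen by the audio language model
    print(f"{raw_samples:.4g} samples, {lm_tokens:.2g} tokens")  # -> 2.304e+11 samples, 7.2e+09 tokens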
Notes: "We trained two sets of ALMs, one with 285M parameters (base) and the other with 1B parameters (large)."