We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or via upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples while being conditioned on a textual description or melodic features, allowing better control over the generated output. We conduct an extensive empirical evaluation, considering both automatic and human studies, showing that the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen. Music samples, code, and models are available at this https URL.
Notes: We train the 300M, 1.5B, and 3.3B parameter models using 32, 64, and 96 GPUs respectively, with mixed precision. It is unclear how many epochs were used, so a FLOP calculation is not feasible.
Size Notes: "We train on 30-second audio crops sampled at random from the full track... We use 20K hours of licensed music." 20,000 hours * 60 min/hour * 2 crops/min = 2,400,000 input sequences. EnCodec runs at 32 kHz but, after convolutions, has a frame rate of 50 Hz, suggesting 2,400,000 * 30 s * 50 tokens/s = 3,600,000,000 audio tokens per epoch (counting one token per frame, i.e., a single codebook stream). Not confident enough in this calculation to add to the database.
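The arithmetic above, made explicit; a minimal sketch, assuming each 30-second crop is seen once per epoch and counting only a single EnCodec codebook stream. The constants are the figures quoted above; the epoch count needed for a standard 6·N·D FLOP estimate is left as an unknown placeholder, since the paper does not state it.

```python
# Back-of-the-envelope token count from the quoted figures.
HOURS_OF_MUSIC = 20_000          # "20K hours of licensed music"
CROP_SECONDS = 30                # 30-second training crops
ENCODEC_FRAME_RATE_HZ = 50       # EnCodec at 32 kHz -> 50 frames per second

crops_per_epoch = HOURS_OF_MUSIC * 3600 // CROP_SECONDS   # 2,400,000 crops
tokens_per_crop = CROP_SECONDS * ENCODEC_FRAME_RATE_HZ    # 1,500 tokens
dataset_tokens = crops_per_epoch * tokens_per_crop        # 3,600,000,000 tokens

print(f"{crops_per_epoch:,} crops, {dataset_tokens:,} tokens per epoch")

# A FLOP estimate would follow the usual 6 * N * D rule of thumb, but D
# (total training tokens) requires the number of epochs, which the paper
# does not report; EPOCHS below is a placeholder, not a reported value.
PARAMS = 3.3e9                   # largest MusicGen model
EPOCHS = None                    # unknown
if EPOCHS is not None:
    flops = 6 * PARAMS * dataset_tokens * EPOCHS
    print(f"~{flops:.2e} training FLOP")
```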
Notes: "We train autoregressive transformer models at different sizes: 300M, 1.5B, 3.3B parameters" Uses EnCodec 32kHz (HF version has 59M params) for audio tokenization.