We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or via upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples while being conditioned on a textual description or melodic features, allowing better control over the generated output. We conduct an extensive empirical evaluation, considering both automatic and human studies, showing that the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen. Music samples, code, and models are available at this https URL.
Notes: We train the 300M, 1.5B, and 3.3B parameter models using 32, 64, and 96 GPUs respectively, with mixed precision. It is unclear how many epochs were used, so a FLOP calculation is not feasible.
Size Notes: "We train on 30-second audio crops sampled at random from the full track... We use 20K hours of licensed music." 20,000 hours * 60 min/hour * 2 crops/min = 2,400,000 input sequences. EnCodec runs at 32 kHz but, after convolutions, has a frame rate of 50 Hz, suggesting 2,400,000 * 30 s * 50 tokens/s = 3,600,000,000 audio tokens per epoch (counting one token per frame, i.e., a single codebook stream). Not confident enough in this calculation to add to the database.
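The arithmetic above, made explicit; a minimal sketch, assuming each 30-second crop is seen once per epoch and counting only a single EnCodec codebook stream. The constants are the figures quoted above; the epoch count needed for a standard 6·N·D FLOP estimate is left as an unknown placeholder, since the paper does not state it.

```python
# Back-of-the-envelope token count from the quoted figures.
HOURS_OF_MUSIC = 20_000          # "20K hours of licensed music"
CROP_SECONDS = 30                # 30-second training crops
ENCODEC_FRAME_RATE_HZ = 50       # EnCodec at 32 kHz -> 50 frames per second

crops_per_epoch = HOURS_OF_MUSIC * 3600 // CROP_SECONDS   # 2,400,000 crops
tokens_per_crop = CROP_SECONDS * ENCODEC_FRAME_RATE_HZ    # 1,500 tokens
dataset_tokens = crops_per_epoch * tokens_per_crop        # 3,600,000,000 tokens

print(f"{crops_per_epoch:,} crops, {dataset_tokens:,} tokens per epoch")

# A FLOP estimate would follow the usual 6 * N * D rule of thumb, but D
# (total training tokens) requires the number of epochs, which the paper
# does not report; EPOCHS below is a placeholder, not a reported value.
PARAMS = 3.3e9                   # largest MusicGen model
EPOCHS = None                    # unknown
if EPOCHS is not None:
    flops = 6 * PARAMS * dataset_tokens * EPOCHS
    print(f"~{flops:.2e} training FLOP")
```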
Notes: "We train autoregressive transformer models at different sizes: 300M, 1.5B, 3.3B parameters" Uses EnCodec 32kHz (HF version has 59M params) for audio tokenization.