Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
Notes: Section 5: "We pretrain our model using 1024 V100 GPUs for approximately one day." This is the base pretraining run comparable to BERT (100K steps); the paper then trains longer: "increasing the number of pretraining steps from 100K to 300K, and then further to 500K". So scale the 1024-V100, 1-day figure by 5x for the 500K-step run.
Hardware estimate: V100 mixed-precision tensor cores peak at 1.25e14 FLOP/s; at 30% utilization, 1024 * 1.25e14 * 5 * 24 * 3600 * 0.3 = 1.65888e22 FLOP.
6ND estimate: batches are 8K sequences of 512 tokens, so 500K updates means the model saw 500000 * 8000 * 512 = 2.048e12 tokens; 6 * 2.048e12 * 355M = 4.36224e21 FLOP.
Geometric mean of the two estimates: sqrt(1.65888e22 * 4.36224e21) = 8.5067e21 FLOP.
The authors of "AI and Memory Wall" estimated the model's training compute as 4,300,000 PFLOP = 4.3e21 FLOP (https://github.com/amirgholami/ai_and_memory_wall).
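The arithmetic above can be reproduced with a short script; a minimal sketch, assuming the values stated in the note (1024 V100s, 1 day for 100K steps scaled 5x, 1.25e14 FLOP/s peak, 30% utilization, 355M parameters):

```python
import math

# Hardware-time estimate (assumptions from the note above).
gpus = 1024
peak_flops = 1.25e14            # V100 mixed-precision tensor-core peak, FLOP/s
days = 1 * 5                    # ~1 day for 100K steps, scaled to 500K steps
utilization = 0.3               # assumed utilization factor
hardware_est = gpus * peak_flops * days * 24 * 3600 * utilization
print(f"hardware estimate: {hardware_est:.4e} FLOP")   # ~1.6589e+22

# 6ND estimate.
params = 355e6                  # RoBERTa-large parameter count
tokens = 500_000 * 8_000 * 512  # updates * batch size * sequence length
six_nd_est = 6 * params * tokens
print(f"6ND estimate:      {six_nd_est:.4e} FLOP")     # ~4.3622e+21

# Geometric mean of the two estimates.
geo_mean = math.sqrt(hardware_est * six_nd_est)
print(f"geometric mean:    {geo_mean:.4e} FLOP")       # ~8.5067e+21
```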
Size Notes: 160 GB * 200M words/GB = 3.2e10 words; at 4 tokens per 3 words, roughly 4.3e10 tokens.
Tokens seen during pretraining: max steps 500K, batch size 8K, and "We pretrain with sequences of at most T = 512 tokens", so 500000 * 8000 * 512 = 2.048e12 tokens.
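A sketch of the same arithmetic, assuming the rough conversion factors used in the note (200M words per GB, 4 tokens per 3 words):

```python
# Dataset size vs. tokens actually seen during pretraining.
dataset_words = 160 * 200e6              # 160 GB of text -> ~3.2e10 words (assumed density)
dataset_tokens = dataset_words * 4 / 3   # ~4.3e10 tokens (assumed tokens/word ratio)

# Tokens seen: 500K updates * 8K sequences/batch * 512 tokens/sequence.
tokens_seen = 500_000 * 8_000 * 512      # 2.048e12 tokens

print(f"dataset tokens: {dataset_tokens:.3e}")
print(f"tokens seen:    {tokens_seen:.3e}")
print(f"implied passes over the data: {tokens_seen / dataset_tokens:.1f}")  # ~48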
Notes: 355M parameters (RoBERTa-large), per https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/README.md