Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
Notes: Section 5: "We pretrain our model using 1024 V100 GPUs for approximately one day." This is the base pretraining run comparable to BERT (100K steps); the paper then trains longer: "increasing the number of pretraining steps from 100K to 300K, and then further to 500K". So scale the 1024-V100, 1-day figure by 5x for the 500K-step run.
Hardware estimate: V100 mixed-precision tensor cores peak at 1.25e14 FLOP/s; at 30% utilization, 1024 * 1.25e14 * 5 * 24 * 3600 * 0.3 = 1.65888e22 FLOP.
6ND estimate: batches are 8K sequences of 512 tokens, so 500K updates means the model saw 500000 * 8000 * 512 = 2.048e12 tokens; 6 * 2.048e12 * 355M = 4.36224e21 FLOP.
Geometric mean of the two estimates: sqrt(1.65888e22 * 4.36224e21) = 8.5067e21 FLOP.
The authors of "AI and Memory Wall" estimated the model's training compute as 4,300,000 PFLOP = 4.3e21 FLOP (https://github.com/amirgholami/ai_and_memory_wall).
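The arithmetic above can be reproduced with a short script; a minimal sketch, assuming the values stated in the note (1024 V100s, 1 day for 100K steps scaled 5x, 1.25e14 FLOP/s peak, 30% utilization, 355M parameters):

```python
import math

# Hardware-time estimate (assumptions from the note above).
gpus = 1024
peak_flops = 1.25e14            # V100 mixed-precision tensor-core peak, FLOP/s
days = 1 * 5                    # ~1 day for 100K steps, scaled to 500K steps
utilization = 0.3               # assumed utilization factor
hardware_est = gpus * peak_flops * days * 24 * 3600 * utilization
print(f"hardware estimate: {hardware_est:.4e} FLOP")   # ~1.6589e+22

# 6ND estimate.
params = 355e6                  # RoBERTa-large parameter count
tokens = 500_000 * 8_000 * 512  # updates * batch size * sequence length
six_nd_est = 6 * params * tokens
print(f"6ND estimate:      {six_nd_est:.4e} FLOP")     # ~4.3622e+21

# Geometric mean of the two estimates.
geo_mean = math.sqrt(hardware_est * six_nd_est)
print(f"geometric mean:    {geo_mean:.4e} FLOP")       # ~8.5067e+21
```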
Size Notes: 160 GB * 200M words/GB = 3.2e10 words; at 4 tokens per 3 words, roughly 4.3e10 tokens.
Tokens seen during pretraining: max steps 500K, batch size 8K, and "We pretrain with sequences of at most T = 512 tokens", so 500000 * 8000 * 512 = 2.048e12 tokens.
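A sketch of the same arithmetic, assuming the rough conversion factors used in the note (200M words per GB, 4 tokens per 3 words):

```python
# Dataset size vs. tokens actually seen during pretraining.
dataset_words = 160 * 200e6              # 160 GB of text -> ~3.2e10 words (assumed density)
dataset_tokens = dataset_words * 4 / 3   # ~4.3e10 tokens (assumed tokens/word ratio)

# Tokens seen: 500K updates * 8K sequences/batch * 512 tokens/sequence.
tokens_seen = 500_000 * 8_000 * 512      # 2.048e12 tokens

print(f"dataset tokens: {dataset_tokens:.3e}")
print(f"tokens seen:    {tokens_seen:.3e}")
print(f"implied passes over the data: {tokens_seen / dataset_tokens:.1f}")  # ~48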
Notes: 355M parameters (RoBERTa-large), per https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/README.md