Notes: "Model was trained using Deepspeed and Megatron libraries, on 300B tokens dataset for 3 epochs, around 45 days on 512 V100. After that model was finetuned 1 epoch with sequence length 2048 around 20 days on 200 GPU A100 on additional data" 512 GPUs * 125000000000000 FLOPs/s [peak] * 45 days * 24 hours * 3600 s * 0.3 + 200 GPUs * 312000000000000 FLOPs/s [peak for fp16] * 20 days * 24 hours * 3600 s * 0.3 = 1.0699776e+23 they probably used fp16 as in their similar project: https://habr.com/ru/companies/sberdevices/articles/780334/ 6ND = 6*13B*300B*3 = 70200*10^18 = 7*10^24
Notes: 13B parameters