Transformer language models (LMs) are fundamental to NLP research methodologies and applications in various languages. However, developing such models specifically for the Russian language has received little attention. This paper introduces a collection of 13 Russian Transformer LMs, which spans encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) architectures. We provide a report on the model architecture design and pretraining, and the results of evaluating their generalization abilities on Russian language understanding and generation datasets and benchmarks. By pretraining and releasing these specialized Transformer LMs, we aim to broaden the scope of NLP research directions and enable the development of industrial solutions for the Russian language.
Notes: Two compute estimates:
(1) Token-based: 6 FLOP/token/parameter × 1.5e12 tokens × 1.74e9 parameters = 1.566e22 FLOP.
(2) Hardware-based: 3.12e14 FLOP/s/GPU × 112 GPUs × 45 days × 24 h × 3,600 s × 0.3 (assumed utilization) = 4.076e22 FLOP.
Geometric mean: 2.5e22 FLOP.
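A minimal sketch of the arithmetic behind these two estimates; all inputs (the 6 FLOP/token/parameter rule of thumb, 312 TFLOP/s per GPU, 112 GPUs, 45 days, and the assumed 0.3 utilization) are copied from the note above, and the final figure is their geometric mean.

```python
from math import sqrt

# Estimate 1: ~6 FLOP per token per parameter rule of thumb.
tokens = 1.5e12           # ~1.5 trillion training tokens
params = 1.74e9           # 1.74B parameters
flop_from_tokens = 6 * tokens * params                              # ≈ 1.566e22 FLOP

# Estimate 2: peak throughput × GPU count × wall-clock time × utilization.
flops_per_gpu = 312e12    # 312 TFLOP/s peak per GPU
gpus = 112
seconds = 45 * 24 * 3600  # 45 days of training
utilization = 0.3         # assumed utilization factor from the note
flop_from_hardware = flops_per_gpu * gpus * seconds * utilization   # ≈ 4.076e22 FLOP

# Final figure: geometric mean of the two estimates.
flop_estimate = sqrt(flop_from_tokens * flop_from_hardware)         # ≈ 2.5e22 FLOP
print(f"{flop_from_tokens:.3e}  {flop_from_hardware:.3e}  {flop_estimate:.2e}")
```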
Size Notes: 450 GB × 200M words per GB = 90,000,000,000 (9e10) words. The resulting pretraining corpora of the LMs include subcorpora of different domains and sizes, ranging from 30 GB (ruBERT) to 450 GB (ruGPT-3). "Collectively, the model saw about 1.5 trillion tokens." (from https://habr.com/ru/companies/sberdevices/articles/730088/)
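A short sketch of the word-count estimate; the 200M words per GB conversion factor and the 450 GB corpus size are the assumptions taken from the note above.

```python
# Dataset-size estimate for the largest (450 GB) pretraining corpus.
corpus_gb = 450           # ruGPT-3 corpus size from the note
words_per_gb = 200e6      # assumed conversion factor: ~200M words per GB of text
words = corpus_gb * words_per_gb  # = 9.0e10, i.e. ~90 billion words
print(f"~{words:.1e} words")
```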
Notes: 1.74B parameters (the 1,740,000,000 figure used in the compute estimate above).