We present two multilingual LLMs, Teuken 7B-base and Teuken 7B-instruct, designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, namely data composition, tokenizer optimization, and training methodologies. The models demonstrate strong multilingual capabilities, as evidenced by their results on European versions of ARC, HellaSwag, and TruthfulQA.
Notes: Two estimates of training compute:
(1) Token-based: 6 FLOP/parameter/token * 7e9 parameters * 4e12 tokens = 1.68e23 FLOP
(2) Hardware-based: 3.12e14 FLOP/GPU/sec * 812,321 GPU-hours * 3,600 sec/hour * 0.3 (assumed utilization) ≈ 2.74e23 FLOP
Final estimate (geometric mean): sqrt(1.68e23 * 2.74e23) ≈ 2.14e23 FLOP
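A minimal sketch of the two compute estimates and their geometric mean, using only the values stated in the notes above (the 30% utilization figure is an assumption, as noted):

```python
import math

# Estimate 1: token-based (6 FLOP per parameter per token)
params = 7e9            # 7B parameters
tokens = 4e12           # 4T pre-training tokens
flop_tokens = 6 * params * tokens                   # ≈ 1.68e23 FLOP

# Estimate 2: hardware-based
peak_flops = 312e12     # peak FLOP per GPU per second (value from the notes)
gpu_hours = 812_321     # reported GPU-hours
utilization = 0.3       # assumed utilization
flop_hardware = peak_flops * gpu_hours * 3600 * utilization   # ≈ 2.74e23 FLOP

# Final estimate: geometric mean of the two
flop_estimate = math.sqrt(flop_tokens * flop_hardware)        # ≈ 2.14e23 FLOP
print(f"{flop_tokens:.3g} / {flop_hardware:.3g} -> {flop_estimate:.3g} FLOP")
```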
Size Notes: pre-trained on 4T tokens
Notes: 7B