We are excited to introduce the gte-modernbert series of models, built upon the latest ModernBERT pre-trained encoder-only foundation models. The gte-modernbert series includes both text embedding models and reranking models. The gte-modernbert models demonstrate competitive performance on several text embedding and text retrieval evaluations, such as MTEB, LoCO, and COIR, when compared to similar-scale models from the current open-source community.

Model Overview
Developed by: Tongyi Lab, Alibaba Group
Model Type: Text Embedding
Primary Language: English
Model Size: 149M
Max Input Length: 8192 tokens
Output Dimension: 768
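A minimal usage sketch with the sentence-transformers library is shown below. The model identifier "Alibaba-NLP/gte-modernbert-base" is an assumption based on the naming above; verify the exact ID on the model hub before use.

```python
# Minimal sketch: embed two sentences and compare them with cosine similarity.
# The model ID below is an assumption; check the model hub for the published name.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")  # 149M params, 768-dim output, 8192-token context

sentences = [
    "What is the capital of France?",
    "Paris is the capital and largest city of France.",
]

# normalize_embeddings=True gives unit-length vectors, so the dot product is cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)  # shape: (2, 768)
print(embeddings @ embeddings.T)
```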
Notes: 6 FLOP/token/parameter * 149e6 parameters * 1.028e12 tokens * 4 epochs [see dataset size notes] = 3.676128e+21 FLOP
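The arithmetic can be reproduced directly; this is the standard ~6 FLOP/token/parameter training-compute approximation, with the epoch count taken from the dataset size notes below.

```python
# Reproduce the training-compute estimate (6 FLOP per token per parameter).
flop_per_token_per_param = 6
params = 149e6               # 149M parameters (model overview)
tokens_per_epoch = 1.028e12  # "a total of 1,028B tokens" (GTE training data)
epochs = 4                   # see dataset size notes

total_flop = flop_per_token_per_param * params * tokens_per_epoch * epochs
print(f"{total_flop:.6e}")   # 3.676128e+21 FLOP
```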
Size Notes: "The gte-modernbert series of models follows the training scheme of the previous GTE models, with the only difference being that the pre-training language model base has been replaced from GTE-MLM to ModernBert." Assuming the same training dataset as the previous GTE models (https://aclanthology.org/2024.emnlp-industry.103/): "a total of 1,028B tokens". Table 8 of that paper gives batch size 8192, max steps 250000, and sequence length 2048, so training processed 8192 * 250000 * 2048 = 4.194304e+12 tokens; 4.194304e+12 / 1.028e+12 ≈ 4.08, i.e. ~4 epochs.
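A quick check of the epoch estimate, assuming the Table 8 batch size is counted in sequences rather than tokens:

```python
# Estimate training epochs from the pre-training hyperparameters
# (Table 8, https://aclanthology.org/2024.emnlp-industry.103/),
# assuming batch size is measured in sequences.
batch_size = 8192         # sequences per step
max_steps = 250_000
seq_length = 2048         # tokens per sequence

tokens_seen = batch_size * max_steps * seq_length  # 4.194304e+12 tokens processed
dataset_tokens = 1.028e12                          # "a total of 1,028B tokens"

epochs = tokens_seen / dataset_tokens
print(f"tokens seen: {tokens_seen:.6e}, epochs: {epochs:.2f}")  # ~4.08 -> ~4 epochs
```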
Notes: 149M parameters, as stated in the model overview.