This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5
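As a reference for the released checkpoints, a minimal usage sketch (assuming the sentence-transformers package is installed; the example strings are illustrative, and the "query: " / "passage: " prefixes follow the Hugging Face model card):

from sentence_transformers import SentenceTransformer

# Load one of the released multilingual E5 checkpoints (small / base / large).
model = SentenceTransformer("intfloat/multilingual-e5-small")

# Multilingual E5 expects "query: " and "passage: " prefixes on its inputs.
queries = ["query: how much protein should a female eat"]
passages = ["passage: As a general guideline, the average protein requirement for adult women is about 46 grams per day."]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# Cosine similarity (embeddings are L2-normalized above).
print(q_emb @ p_emb.T)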
Notes: Training compute via 6ND = 6 * 560,000,000 params * (1,000,000,000 pre-training pairs + 1,600,000 fine-tuning pairs * 2 epochs) = 3.370752e+18 FLOP. Confidence 'likely' because the number of pre-training epochs is unknown (assumed 1). See the sketch after these notes.
Size Notes: Pre-training: Table 1, around 1B text pairs across languages. Fine-tuning: 1.6M labeled pairs. Total: 1,000,000,000 + 1,600,000 = 1,001,600,000 text pairs.
Notes: 560M parameters from https://huggingface.co/intfloat/multilingual-e5-large
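For transparency, a minimal sketch reproducing the 6ND arithmetic above (assuming, as in these notes, that D counts text pairs rather than tokens, one pre-training epoch, and two fine-tuning epochs; the variable names are illustrative):

# Hypothetical helper reproducing the 6ND compute estimate in the notes above.
N_PARAMS = 560_000_000          # multilingual-e5-large parameter count (per the HF model card)
PRETRAIN_PAIRS = 1_000_000_000  # ~1B contrastive pre-training text pairs (Table 1)
FINETUNE_PAIRS = 1_600_000      # ~1.6M labeled fine-tuning pairs
FINETUNE_EPOCHS = 2             # epochs assumed in the note; pre-training epochs unknown (assumed 1)

D = PRETRAIN_PAIRS + FINETUNE_PAIRS * FINETUNE_EPOCHS
training_flop = 6 * N_PARAMS * D
print(f"{training_flop:.6e}")   # 3.370752e+18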