This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing mask language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating the "tug-of-war" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an example, the DeBERTaV3 Large model achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art (SOTA) among the models with a similar structure. Furthermore, we have pre-trained a multi-lingual model mDeBERTa and observed a larger improvement over strong baselines compared to English models. For example, the mDeBERTa Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI and a 3.6% improvement over XLM-R Base, creating a new SOTA on this benchmark. We have made our pre-trained models and inference code publicly available at this https URL.
Notes: 6ND = 6 FLOP / parameter / token * 42666666667 tokens * 418000000 parameters = 1.07008e+20 FLOP
Size Notes: "We train those models with 160GB data" from Table 8 Wiki+Book 16GB OpenWebText 38GB Stories 31GB CC-News 76GB "All the models are trained for 500,000 steps with a batch size of 8192 and warming up steps of 10,000." 160GB* 200M words per GB * 4/3 tokens per English word ~ 42666666667 Batch Size 8k Max Steps 500k sequence length unknown
Notes: from HF "The DeBERTa V3 large model comes with 24 layers and a hidden size of 1024 . Its total parameter number is 418M since we use a vocabulary containing 128K tokens which introduce 131M parameters in the Embedding layer. This model was trained using the 160GB data as DeBERTa V2."