Sporting 1.75 trillion parameters, Wu Dao 2.0 is roughly ten times the size of OpenAI's GPT-3.
Notes: It's a mixture-of-experts model, so most likely not all 1.75 trillion parameters were active on each training token. "The parameter scale of Enlightenment 2.0 reached a record-breaking 1.75 trillion. According to reports, the new-generation FastMoE technology is the key to realizing the 'trillion-parameter model' cornerstone of Enlightenment 2.0." (Machine-translated; "Enlightenment 2.0" refers to Wu Dao 2.0.) Speculatively assuming 3% of parameters are active per forward pass and the model was trained for one epoch: 6 FLOP / token / parameter * 1.75 * 10^12 parameters * 0.03 * 4.9 * 10^12 tokens [see dataset size notes] = 1.54e+24 FLOP (a sketch of this calculation is given below).
Size Notes: [assuming the 4.9T figure refers to tokens] "Bilingual (Cn and En) data: 4.9T text and images" https://keg.cs.tsinghua.edu.cn/jietang/publications/wudao-3.0-meta-en.pdf
Notes: "It's been trained on 1.75 trillion parameters" MoE architecture, "tens of thousands of experts" https://keg.cs.tsinghua.edu.cn/jietang/publications/wudao-3.0-meta-en.pdf