Notes: 6ND approximation: 6 * 8.54e9 parameters * 6e12 tokens = 3.07e23 FLOP.
"Gemma 2B and 7B are trained on 2T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code."
As confirmation: "We estimate the carbon emissions from pretraining the Gemma models to be ~131 tCO2eq."
The U.S. average CO2 emitted per kWh is ~0.87 lb, so: 131 tCO2 * 2000 lb/t * (1 kWh / 0.87 lb) = 3.01e5 kWh.
Per SemiAnalysis, the TPU v5e uses ~5x less power than an H100, giving ~140 W TDP: 3.01e5 kWh * 1000 W/kW * (1 TPU v5e / 140 W) = 2.15e6 TPU v5e-hours.
In bf16 precision, the TPU v5e has a peak performance of 197 TFLOP/s, so, assuming 30% hardware utilization: 2.15e6 hours * 3600 s/hour * 197e12 FLOP/s * 0.3 = 4.57e23 FLOP.
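A minimal Python sketch reproducing both estimates above. The 140 W TPU v5e draw and the 0.3 utilization are assumptions carried over from the notes, not figures reported in the Gemma paper:

```python
# --- Estimate 1: 6ND approximation ---
N = 8.54e9   # Gemma 7B parameters (Table 2, embedding + non-embedding)
D = 6e12     # training tokens
flop_6nd = 6 * N * D
print(f"6ND estimate: {flop_6nd:.2e} FLOP")  # ~3.07e+23

# --- Estimate 2: back out FLOP from reported carbon emissions ---
tco2 = 131             # reported pretraining emissions, tCO2eq (all Gemma models)
lb_per_ton = 2000
lb_co2_per_kwh = 0.87  # approximate U.S. grid average
kwh = tco2 * lb_per_ton / lb_co2_per_kwh
print(f"Energy: {kwh:.2e} kWh")  # ~3.01e+05

tpu_watts = 140        # assumed TPU v5e draw (~5x below an H100, per SemiAnalysis)
tpu_hours = kwh * 1000 / tpu_watts
print(f"TPU v5e-hours: {tpu_hours:.2e}")  # ~2.15e+06

peak_flops = 197e12    # TPU v5e peak bf16 throughput, FLOP/s
utilization = 0.3      # assumed hardware utilization
flop_energy = tpu_hours * 3600 * peak_flops * utilization
print(f"Energy-based estimate: {flop_energy:.2e} FLOP")
# prints ~4.58e+23; the 4.57e23 in the notes reflects intermediate rounding
```

The two estimates landing within ~1.5x of each other (3.07e23 vs. ~4.6e23) is the confirmation the notes rely on.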
Size Notes: "Gemma 2B and 7B are trained on 2T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code." The paper does not explicitly state that this excludes multiple epochs, but I expect it does, i.e. 6T is the total number of tokens seen during training.
Notes: Table 2, sum of embedding and non-embedding parameters (0.79B + 7.75B = 8.54B for the 7B model).
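A one-line check of that sum, using the Table 2 figures as I transcribed them (verify the exact counts against the paper):

```python
# Gemma 7B parameter counts, transcribed from Table 2 of the Gemma report
embedding_params = 0.79e9      # ~786.8M
non_embedding_params = 7.75e9  # ~7,751.2M
total = embedding_params + non_embedding_params
print(f"{total / 1e9:.2f}B parameters")  # 8.54B, the N used in the 6ND estimate
```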