We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process. (from technical report: https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf )
Notes: 9 trillion tokens for training. 6 * 340B * 9T = 1.8E25 FLOP.

Alternatively, we can do a hardware-based estimate with a few extra steps. According to the technical report, Nemotron-4 340B was trained using up to 6144 H100 GPUs. Helpfully, they also report the model FLOP utilization (MFU), which was 41-42% (Table 2). This is the ratio of the GPUs' actual output, in FLOP used for training, to their theoretical max of 989 teraFLOP/s per GPU.

Unfortunately, the report omits the last ingredient, the duration of the training run, but Table 2 contains enough data to infer it. Nemotron-4 was trained in several stages; the largest stage used all 6144 GPUs with a batch size of 2304 and an iteration time (time per batch) of 8.0 seconds. This stage covered 7.6T tokens, so it makes up the majority of training. A batch size of 2304 means each batch consists of 2304 sequences, and the reported sequence length is 4096 tokens, so each batch contained 4096 * 2304 = 9,437,184 tokens. In other words, it took 8 seconds to train the model on roughly 9.4M tokens. Extrapolating to the entire 9T-token dataset, the training run would have taken about 7,659,574 seconds, or 89 days. (It actually took longer, because they didn't use all their GPUs for the whole run.)

Multiplying 7,659,574 seconds by 41% MFU, 989 peak teraFLOP/s per H100, and 6144 H100s gives ~1.9e25 FLOP, very close to the first estimate. Both calculations are sketched in the code below.
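A minimal Python sketch of both estimates, using only the figures quoted above; the day count it prints differs slightly from the 89 days above because the prose rounds tokens-per-batch to 9.4M before extrapolating.

```python
# Sketch of the two compute estimates for Nemotron-4 340B.
# All inputs (parameter count, token count, GPU count, MFU, peak throughput,
# batch size, sequence length, iteration time) come from the notes above.

# Estimate 1: FLOP ~= 6 * parameters * tokens
params = 340e9                 # 340B parameters
tokens = 9e12                  # 9T training tokens
flop_6nd = 6 * params * tokens
print(f"6*N*D estimate:    {flop_6nd:.2e} FLOP")        # ~1.8e25

# Estimate 2: hardware-based (GPU-seconds * peak throughput * MFU)
num_gpus = 6144                # up to 6144 H100s
peak_flop_per_s = 989e12       # theoretical max per GPU, FLOP/s
mfu = 0.41                     # lower end of the reported 41-42% MFU

batch_size = 2304              # sequences per batch
seq_len = 4096                 # tokens per sequence
iter_time = 8.0                # seconds per batch

tokens_per_second = batch_size * seq_len / iter_time    # 9,437,184 tokens / 8 s
train_seconds = tokens / tokens_per_second              # ~7.63e6 s (~88 days)
flop_hw = train_seconds * num_gpus * peak_flop_per_s * mfu

print(f"Inferred duration: {train_seconds / 86400:.0f} days")
print(f"Hardware estimate: {flop_hw:.2e} FLOP")         # ~1.9e25
```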
Size Notes: 9T training tokens. They first train on an 8T-token dataset and then on an additional 1T tokens; it is slightly unclear whether that is new data or a partial second epoch. At 1 token = 0.75 words, 9T tokens is about 6.75T words (see the sketch below).
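A quick check on the word count, assuming the 1 token = 0.75 words rule of thumb quoted above:

```python
# Approximate word count of the 9T-token training set,
# assuming the 1 token = 0.75 words rule of thumb from the notes.
tokens = 8e12 + 1e12          # 8T-token dataset plus an additional 1T tokens
words = tokens * 0.75
print(f"{words:.2e} words")   # 6.75e+12, i.e. ~6.75T words
```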
Notes: 340B