We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process. (from technical report: https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf )
Notes: 9 trillion tokens for training. 6 * 340B * 9T = 1.8E25 FLOP.

Alternatively, we can do a hardware-based estimate with a few extra steps. According to the technical report, Nemotron-4 340B was trained using up to 6144 H100 GPUs. Helpfully, they also report the model FLOP utilization (MFU), which was 41-42% (Table 2). This is the ratio of the GPUs' actual output, in FLOP used for training, to their theoretical max of 989 teraFLOP/s per GPU.

Unfortunately, the report omits the last ingredient, the duration of the training run, but Table 2 contains enough data to infer it. Nemotron-4 was trained in several stages; the largest stage used all 6144 GPUs with a batch size of 2304 and an iteration time (time per batch) of 8.0 seconds. This stage covered 7.6T tokens, so it makes up the majority of training. A batch size of 2304 means each batch consists of 2304 sequences, and the reported sequence length is 4096 tokens, so each batch contained 4096 * 2304 = 9,437,184 tokens. In other words, it took 8 seconds to train the model on roughly 9.4M tokens. Extrapolating to the entire 9T-token dataset, the training run would have taken about 7,659,574 seconds, or 89 days. (It actually took longer, because they didn't use all their GPUs for the whole run.)

Multiplying 7,659,574 seconds by 41% MFU, 989 peak teraFLOP/s per H100, and 6144 H100s gives ~1.9e25 FLOP, very close to the first estimate. Both calculations are sketched in the code below.
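A minimal Python sketch of both estimates, using only the figures quoted above; the day count it prints differs slightly from the 89 days above because the prose rounds tokens-per-batch to 9.4M before extrapolating.

```python
# Sketch of the two compute estimates for Nemotron-4 340B.
# All inputs (parameter count, token count, GPU count, MFU, peak throughput,
# batch size, sequence length, iteration time) come from the notes above.

# Estimate 1: FLOP ~= 6 * parameters * tokens
params = 340e9                 # 340B parameters
tokens = 9e12                  # 9T training tokens
flop_6nd = 6 * params * tokens
print(f"6*N*D estimate:    {flop_6nd:.2e} FLOP")        # ~1.8e25

# Estimate 2: hardware-based (GPU-seconds * peak throughput * MFU)
num_gpus = 6144                # up to 6144 H100s
peak_flop_per_s = 989e12       # theoretical max per GPU, FLOP/s
mfu = 0.41                     # lower end of the reported 41-42% MFU

batch_size = 2304              # sequences per batch
seq_len = 4096                 # tokens per sequence
iter_time = 8.0                # seconds per batch

tokens_per_second = batch_size * seq_len / iter_time    # 9,437,184 tokens / 8 s
train_seconds = tokens / tokens_per_second              # ~7.63e6 s (~88 days)
flop_hw = train_seconds * num_gpus * peak_flop_per_s * mfu

print(f"Inferred duration: {train_seconds / 86400:.0f} days")
print(f"Hardware estimate: {flop_hw:.2e} FLOP")         # ~1.9e25
```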
Size Notes: 9T training tokens. They first train on an 8T-token dataset and then on an additional 1T tokens; it is slightly unclear whether that is new data or a partial second epoch. At 1 token = 0.75 words, 9T tokens is about 6.75T words (see the sketch below).
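A quick check on the word count, assuming the 1 token = 0.75 words rule of thumb quoted above:

```python
# Approximate word count of the 9T-token training set,
# assuming the 1 token = 0.75 words rule of thumb from the notes.
tokens = 8e12 + 1e12          # 8T-token dataset plus an additional 1T tokens
words = tokens * 0.75
print(f"{words:.2e} words")   # 6.75e+12, i.e. ~6.75T words
```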
Notes: 340B