We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
Notes: 1.4T tokens * 32.5B params * 6 FLOP/token/param = 2.73e+23 FLOP
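A minimal Python sketch of the compute figure above, assuming the standard 6*N*D rule of thumb for training FLOP; the 32.5B parameter count appears to correspond to the LLaMA-33B model.

```python
# Sketch of the 6*N*D training-compute estimate from the note above.
# Assumes ~6 FLOP per parameter per training token (forward + backward);
# 32.5B params matches the LLaMA-33B variant.

PARAMS = 32.5e9            # N: model parameters
TOKENS = 1.4e12            # D: training tokens
FLOP_PER_PARAM_TOKEN = 6   # rule-of-thumb FLOP per param per token

training_flop = FLOP_PER_PARAM_TOKEN * PARAMS * TOKENS
print(f"Training compute: {training_flop:.2e} FLOP")  # ~2.73e+23 FLOP
```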
Size Notes: Table 1 indicates that the 1.4T training tokens involved sampling the sub-datasets at more or less than one epoch each. Correcting for this:
(1.10 epochs * 3.3 TB) + (1.06 epochs * 0.783 TB) + ... = 1.4T tokens
5.24 epoch-TB = 1.4T tokens
5.24 epoch-TB * 1000 GB/TB * 200M tokens/GB = 1.4T tokens
1.05T epoch-tokens = 1.4T tokens
1 epoch = 1.34T tokens
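A short Python sketch of the epoch-correction arithmetic above, using only the note's own figures (the 5.24 epoch-TB total and the rough 200M tokens/GB conversion); the per-dataset Table 1 terms are left elided as in the note, and the final step follows the note's reading of ~1.05T epoch-tokens as ~1.05 effective epochs.

```python
# Sketch of the Size Notes arithmetic above. The 5.24 epoch-TB figure is
# the epochs-weighted sum over the Table 1 sub-datasets (terms elided as
# in the note); 200M tokens/GB is the note's rough conversion factor.

EPOCH_TB_TOTAL = 5.24        # sum_i epochs_i * size_i in TB, from Table 1
TOKENS_PER_GB = 200e6        # rough tokens-per-GB conversion
TRAINING_TOKENS = 1.4e12     # total training tokens reported by the paper

# Epoch-weighted token count under the rough conversion (~1.05e12).
epoch_tokens = EPOCH_TB_TOTAL * 1000 * TOKENS_PER_GB

# Following the note, read ~1.05T epoch-tokens against the 1.4T actual
# training tokens as ~1.05 effective epochs, so one full pass over the
# dataset comes out to roughly 1.4T / 1.05 tokens.
effective_epochs = epoch_tokens / 1e12
tokens_per_epoch = TRAINING_TOKENS / effective_epochs

print(f"epoch-tokens (rough): {epoch_tokens:.3g}")       # ~1.05e+12
print(f"tokens per one epoch: {tokens_per_epoch:.3g}")   # ~1.34e+12
```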
Notes: See Table 2 in the paper (model sizes, architectures, and optimization hyperparameters).