We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
FLOPs: 5.5e+23
Notes: 1.4e12 tokens * 6.52e10 parameters * 6 FLOP/token/parameter = 5.5e23 FLOP
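The note's estimate uses the standard C ≈ 6ND approximation for training compute. A minimal sketch reproducing the arithmetic, using the token and parameter counts from this entry:

```python
# Training-compute estimate via C ≈ 6 * N * D
# (6 FLOP per parameter per token, covering forward + backward pass).
tokens = 1.4e12   # D: training tokens, from this entry
params = 6.52e10  # N: LLaMA-65B parameter count, from this entry
flop_per_token_param = 6

total_flops = tokens * params * flop_per_token_param
print(f"{total_flops:.2e}")  # → 5.48e+23, which rounds to the 5.5e23 quoted above
```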
Training Code Accessibility: "we are releasing our model under a noncommercial license focused on research use cases" https://ai.meta.com/blog/large-language-model-llama-meta-ai/
Hardware: NVIDIA A100
Hardware Quantity: 2048
Size Notes: Table 1 indicates that the 1.4T training tokens involved sampling sub-datasets at more or less than one epoch. Correcting for this: (1.1 epochs * 3.3 TB) + (1.06 epochs * 0.783 TB) + ... = 5.24 epoch-TB, corresponding to 1.4T tokens. 5.24 epoch-TB * 1000 GB/TB * 200M tokens/GB ≈ 1.05T epoch-tokens = 1.4T tokens, so 1 epoch ≈ 1.34T tokens.
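The chain of unit conversions above can be replicated in a few lines. This is a sketch under the note's own assumptions: a token density of 200M tokens per GB of raw text, and a weighted epoch-size sum of 5.24 epoch-TB over the sub-datasets (only the first two terms of that sum are quoted in the note).

```python
# Replicating the epoch-size correction in the note above.
epoch_tb = 5.24                          # sum over sub-datasets of epochs_i * size_i (TB)
tokens_per_gb = 200e6                    # assumed token density (from the note)
epoch_tokens = epoch_tb * 1000 * tokens_per_gb   # ≈ 1.05e12 "epoch-tokens"

trained_tokens = 1.4e12                  # total tokens seen during training

# The note equates ~1.05T epoch-tokens with the 1.4T tokens trained,
# i.e. training ran for ~1.05 effective epochs, so one full epoch is:
effective_epochs = epoch_tokens / 1e12
tokens_per_epoch = trained_tokens / effective_epochs
print(f"{tokens_per_epoch:.2e}")  # → 1.34e+12 tokens per epoch
```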
Parameters: 65,200,000,000
Notes: Model card, Table 1: https://github.com/facebookresearch/llama/blob/53011c3d7946dadb8274a4c5c7586ab54edf792d/MODEL_CARD.md