We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.
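For reference, pass@1 here is functional-correctness accuracy: a problem counts as solved if a generated completion passes the benchmark's unit tests. The sketch below shows the standard unbiased pass@k estimator from the HumanEval/Codex evaluation protocol; the function name and the sample counts in the example are illustrative, not figures from the phi-1 evaluation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations (c of which are correct) passes.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: with 200 samples per problem and 101 correct,
# the estimated pass@1 for that problem is 0.505.
print(pass_at_k(n=200, c=101, k=1))  # 0.505
```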
Notes: Training compute estimate.
6ND method: 6 * 1.3e9 parameters * 51e9 tokens = 3.978e+20 FLOP
Hardware method: 3.12e+14 FLOP/s (A100 peak) * 8 GPUs * 103 hours * 3600 sec/hour * 0.3 (assumed utilization) = 2.7765504e+20 FLOP
Geometric mean: sqrt(2.7765504e+20 * 3.978e+20) = 3.3234195e+20 FLOP
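The arithmetic above can be reproduced with a short script; the utilization factor (0.3) and the 103 wall-clock hours are the note's own assumptions, not independently verified figures.

```python
# Reproduces the training-compute estimate above (all inputs are the
# note's own assumptions: 0.3 utilization, 103 wall-clock hours).

params = 1.3e9          # phi-1 parameter count
tokens = 51e9           # training tokens up to the phi-1-base checkpoint
flops_param_method = 6 * params * tokens          # 6ND approximation
# ≈ 3.978e20 FLOP

a100_flops = 312e12     # A100 peak FP16/BF16 throughput, FLOP/s
gpus = 8
hours = 103
utilization = 0.3       # assumed
flops_hardware_method = a100_flops * gpus * hours * 3600 * utilization
# ≈ 2.777e20 FLOP

geometric_mean = (flops_param_method * flops_hardware_method) ** 0.5
print(f"{flops_param_method:.4e}, {flops_hardware_method:.4e}, {geometric_mean:.4e}")
# ≈ 3.978e20, 2.777e20, 3.323e20
```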
Size Notes: Training data:
• A filtered code-language dataset, which is a subset of The Stack and StackOverflow, obtained by using a language model-based classifier (about 6B tokens).
• A synthetic textbook dataset consisting of <1B tokens of GPT-3.5-generated Python textbooks.
• A small synthetic exercises dataset consisting of ~180M tokens of Python exercises and solutions.
For the 1.3B models, phi-1 and phi-1-base are checkpoints after training on 51B tokens (770 GPU hours).
Training tokens: 54B tokens (7B unique tokens).
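As a rough consistency check, the three components sum to about 7B unique tokens, so 54B training tokens corresponds to on the order of 8 passes over the data. The sketch below just restates that arithmetic; the epoch count is an inference from the note's own numbers, not a separately reported value.

```python
# Rough token-budget arithmetic implied by the note above.
filtered_code = 6e9        # filtered subset of The Stack / StackOverflow
synthetic_textbooks = 1e9  # "<1B" GPT-3.5-generated Python textbooks (upper bound)
synthetic_exercises = 180e6

unique_tokens = filtered_code + synthetic_textbooks + synthetic_exercises
training_tokens = 54e9

print(f"unique ≈ {unique_tokens/1e9:.1f}B tokens")      # ≈ 7.2B (note rounds to 7B)
print(f"passes ≈ {training_tokens/unique_tokens:.1f}")  # ≈ 7.5, i.e. roughly 8 epochs
```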
Notes: 1.3B parameters. The architecture of the 1.3B-parameter phi-1 model consists of 24 layers, a hidden dimension of 2048, an MLP inner dimension of 8192, and 32 attention heads of dimension 64 each.
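A back-of-the-envelope parameter count from this architecture roughly recovers the 1.3B figure. The vocabulary size below (~51,200 tokens), the tied input/output embeddings, and the omission of biases and layer norms are assumptions for illustration, not values stated in the note.

```python
# Back-of-the-envelope parameter count for the stated phi-1 architecture.
# Assumptions (not stated in the note): ~51,200-token vocabulary, tied
# input/output embeddings, biases and layer norms ignored.

n_layers = 24
d_model = 2048
d_ff = 8192            # MLP inner dimension
n_heads = 32           # 32 heads x 64 dims = 2048, consistent with d_model
vocab_size = 51_200    # assumed

attn_params = 4 * d_model * d_model          # Q, K, V, output projections
mlp_params = 2 * d_model * d_ff              # up- and down-projection
per_layer = attn_params + mlp_params         # ≈ 50.3M per layer
embedding = vocab_size * d_model             # ≈ 0.10B

total = n_layers * per_layer + embedding
print(f"total ≈ {total/1e9:.2f}B parameters")  # ≈ 1.31B
```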