We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.
Notes:
6ND method: 6 * 14*10^9 parameters * 10*10^12 tokens = 8.4e+23 FLOP
Hardware-time method: 989.5e+12 FLOP/sec [assumed bf16 precision] * 1920 GPUs * 504 hours * 3600 sec/hour * 0.3 [assumed utilization] = 1.0341209e+24 FLOP
Geometric mean: sqrt(8.4e+23 * 1.0341209e+24) = 9.3202015e+23 FLOP
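A minimal Python sketch of the same arithmetic, combining the parameter-count (6ND) method with the hardware-time method and taking their geometric mean. All constants are copied from the note above, including its assumed bf16 peak throughput and 0.3 utilization; nothing here is an independently reported figure.

# Compute estimate sketch; values taken from the note above.
N_PARAMS = 14e9    # phi-4 parameter count
N_TOKENS = 10e12   # ~10T pretraining tokens

# Method 1: 6ND approximation
flop_6nd = 6 * N_PARAMS * N_TOKENS                     # 8.4e+23 FLOP

# Method 2: hardware-time estimate
PEAK_FLOP_PER_SEC = 989.5e12   # assumed bf16 peak throughput per GPU
N_GPUS = 1920
HOURS = 504
UTILIZATION = 0.3              # assumed utilization
flop_hw = PEAK_FLOP_PER_SEC * N_GPUS * HOURS * 3600 * UTILIZATION  # ~1.03e+24 FLOP

# Final estimate: geometric mean of the two methods
flop_estimate = (flop_6nd * flop_hw) ** 0.5            # ~9.32e+23 FLOP

print(f"6ND:            {flop_6nd:.3e} FLOP")
print(f"Hardware-time:  {flop_hw:.3e} FLOP")
print(f"Geometric mean: {flop_estimate:.3e} FLOP")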
Size Notes: "The model was pretrained for approximately 10T tokens using linear warm-up and decay schedules with peak learning rate of 0.0003, constant weight decay of 0.1, and global batch size of 5760."
Table 5 (data mixture):
  Data source        Fraction of training   Unique tokens   Epochs
  Web                15%                    1.3T            1.2
  Web rewrites       15%                    290B            5.2
  Synthetic          40%                    290B            13.8
  Code data          20%                    820B            2.4
  Acquired sources   10%                    580B            1.7
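A small cross-check of the Table 5 mixture, assuming the figures above: multiplying each source's unique tokens by its epoch count should roughly recover that source's share of the ~10T-token run, and the shares should sum to about 10T.

# Data-mixture cross-check; all figures copied from Table 5 above.
TOTAL_TOKENS = 10e12

# (source, fraction of training, unique tokens, epochs)
mixture = [
    ("Web",              0.15, 1.3e12,  1.2),
    ("Web rewrites",     0.15, 290e9,   5.2),
    ("Synthetic",        0.40, 290e9,  13.8),
    ("Code data",        0.20, 820e9,   2.4),
    ("Acquired sources", 0.10, 580e9,   1.7),
]

for name, frac, unique, epochs in mixture:
    seen = unique * epochs  # tokens actually trained on from this source
    implied = frac * TOTAL_TOKENS
    print(f"{name:<17} {seen / 1e12:5.2f}T seen vs {implied / 1e12:4.1f}T implied by its {frac:.0%} share")

total_seen = sum(unique * epochs for _, _, unique, epochs in mixture)
print(f"Total: {total_seen / 1e12:.2f}T tokens (~{TOTAL_TOKENS / 1e12:.0f}T reported)")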
Notes: 14B parameters, dense decoder-only Transformer model