We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
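The predictable-scaling claim in the abstract rests on fitting loss-versus-compute curves from much smaller training runs and extrapolating. A minimal sketch of that kind of extrapolation is below; the power-law-with-irreducible-loss form L(C) = a*C^b + c follows the GPT-4 technical report, while the data points and fitted values are invented purely for illustration.

```python
# Hedged sketch: extrapolating final loss vs. training compute from small runs.
# The functional form L(C) = a * C**b + c mirrors the scaling fit described in
# the GPT-4 technical report; the data points below are invented for illustration.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, c):
    """Power law in compute with an irreducible-loss term."""
    return a * compute**b + c

# Hypothetical (compute in FLOP, final loss) pairs from small training runs.
compute = np.array([1e19, 1e20, 1e21, 1e22])
loss = np.array([2.72, 2.51, 2.34, 2.21])

params, _ = curve_fit(scaling_law, compute, loss, p0=(80.0, -0.1, 1.7), maxfev=20000)
a, b, c = params

# Extrapolate roughly three orders of magnitude beyond the largest run, as the
# report describes doing for GPT-4 (models trained with <= 1/1,000th its compute).
predicted = scaling_law(2.2e25, a, b, c)
print(f"fit: a={a:.3g}, b={b:.3g}, c={c:.3g}; predicted loss at 2.2e25 FLOP ~ {predicted:.2f}")
```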
Notes: 90% CI for training compute: 8.2E+24 to 4.4E+25 FLOP. This is a rough estimate based on public information, with much less information available than for most other systems in the database. Calculation and confidence intervals here: https://colab.research.google.com/drive/1O99z9b1I5O66bT78r9ScslE_nOj5irN9?usp=sharing
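The linked Colab is the authoritative calculation. As a rough cross-check, a central estimate of the same order can be reproduced with the standard C ≈ 6·N·D approximation, using the rumored activated-parameter count and token count from the notes below (both unverified):

```python
# Rough cross-check of the training-compute estimate using C ~ 6 * N * D,
# where N is parameters activated per token and D is tokens seen during training.
# Both inputs are rumored, unverified figures from the notes below.
N_ACTIVE = 280e9   # rumored parameters active per forward pass
TOKENS   = 13e12   # rumored total tokens seen during training

compute_flop = 6 * N_ACTIVE * TOKENS
print(f"~{compute_flop:.2e} FLOP")   # ~2.2e25, inside the 90% CI of 8.2e24 to 4.4e25
```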
Size Notes: Speculative. Reported secondhand by online sources such as SemiAnalysis, but not verified by OpenAI. If the total number of tokens seen was 13T, the text was repeated for 2 epochs, and text made up the majority of tokens, then the dataset size is roughly 13T*0.75/2 = 4.9T words (using ~0.75 words per token). Note that this covers only the text dataset, since GPT-4 was first and foremost a language model; the vision component had its own vision dataset, which we believe accounted for a much smaller part of the compute budget.
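Spelled out, the back-of-the-envelope word-count arithmetic above is:

```python
# Back-of-the-envelope dataset size from the rumored figures above.
tokens_seen     = 13e12   # rumored total tokens seen during training
words_per_token = 0.75    # rough English words-per-token conversion
epochs          = 2       # text assumed repeated for ~2 epochs

unique_words = tokens_seen * words_per_token / epochs
print(f"~{unique_words:.2e} words")   # ~4.9e12, i.e. roughly 4.9T words of text
```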
Notes: Rumored to be a 1.8T-parameter mixture-of-experts (MoE) model with 280B parameters activated on the forward pass, per https://www.semianalysis.com/p/gpt-4-architecture-infrastructure. Other sources estimate 1.76T total with 220B per forward pass: https://web.archive.org/web/20230712123915/https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/
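For intuition, the rumored total and activated counts are mutually consistent under the expert layout circulated in the SemiAnalysis leak (16 experts with 2 routed per token); the split below follows that rumor and is unverified, not an OpenAI-confirmed architecture:

```python
# Illustrative MoE parameter accounting for the rumored GPT-4 architecture.
# All figures follow the circulated SemiAnalysis leak and are unverified.
n_experts       = 16       # rumored number of experts
experts_per_tok = 2        # rumored experts routed per forward pass
expert_params   = 111e9    # rumored parameters per expert (MLP)
shared_params   = 55e9     # rumored shared (attention) parameters

total_params     = shared_params + n_experts * expert_params        # ~1.8e12
activated_params = shared_params + experts_per_tok * expert_params  # ~2.8e11

print(f"total ~ {total_params:.2e}, activated per forward pass ~ {activated_params:.2e}")
```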