Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.
Notes: "Both our 2.7B and 9.4B parameter models were trained with batches of approximately 500k label BPE tokens per batch [...] The 9.4B parameter model was trained [...] for a total of 200k SGD steps." Also note that the full dataset contains 56.8B label BPE tokens and 88.8B context tokens, so for each batch of 500k label tokens, there are likely 500k * 88.8B / 56.8B = 780k context tokens. 6 * 9.4318B * 200k * (500k + 780k) = 1.449e22
Size Notes: Section 6. Pre-training is done on pushshift.io Reddit: "Our final dataset contains 1.50B comments totaling 56.8B label BPE tokens and 88.8B context tokens." None of the fine-tuning datasets add significantly to the total dataset size. Epochs: 200k steps at 500k label tokens per batch = 100B label tokens seen; the pre-training dataset has 56.8B label tokens, so roughly 100B / 56.8B ≈ 1.76 epochs
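A quick check of the epoch count above, using the same figures quoted from the paper:

```python
# Epochs over the pre-training label tokens: tokens seen / tokens in dataset.
steps = 200_000
label_tokens_per_batch = 500_000
dataset_label_tokens = 56.8e9

label_tokens_seen = steps * label_tokens_per_batch   # 1e11 = 100B
epochs = label_tokens_seen / dataset_label_tokens    # ~1.76
print(f"label tokens seen: {label_tokens_seen:.3g}, epochs: {epochs:.2f}")
```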
Notes: The largest model is a transformer with 9.4B parameters (Table 2)