This blog post introduces SmolLM, a family of state-of-the-art small models with 135M, 360M, and 1.7B parameters, trained on a new high-quality dataset. It covers data curation, model evaluation, and usage.
Notes: "Therefore, we decided to train the 1.7B model on 1 trillion tokens" (https://huggingface.co/blog/smollm#experiments). The model doesn't use any CNNs or RNNs (https://huggingface.co/blog/smollm#hyperparameters-choice), so I will approximate it as a dense transformer. Assuming the model saw 1 trillion tokens during training, the 6ND approximation yields: Training compute = # of active parameters * # of tokens * 6 FLOP/parameter/token = 1.71e9 parameters * 1e12 tokens * 6 FLOP/parameter/token ~= 1.03e22 FLOP. "We also instruction tuned the models using publicly available permissive instruction datasets. We trained all three models for one epoch on the permissive subset of the WebInstructSub dataset, combined with StarCoder2-Self-OSS-Instruct. Following this, we performed DPO (Direct Preference Optimization) for one epoch: using HelpSteer for the 135M and 1.7B models... We followed the training parameters from the Zephyr-Gemma recipe in the alignment handbook, but adjusted the SFT (Supervised Fine-Tuning) learning rate to 3e-4" (https://huggingface.co/blog/smollm#evaluation).
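The 6ND estimate above can be reproduced with a short sketch. The parameter count and token count come from the notes; the factor of 6 FLOP per parameter per token is the standard heuristic for dense transformers (forward plus backward pass).

```python
# 6ND training-compute estimate for SmolLM-1.7B.
N = 1.71e9   # active parameters per forward pass (from the blog's hyperparameter table)
D = 1e12     # pretraining tokens (1T, per the blog post)

training_compute = 6 * N * D  # FLOP
print(f"{training_compute:.2e} FLOP")  # 1.03e+22 FLOP
```

Note the result is an estimate of pretraining compute only; the SFT and DPO stages described above add a comparatively negligible amount on top.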
Size Notes: Pretraining tokens: 1T
Notes: The image here, https://huggingface.co/blog/smollm#hyperparameters-choice, shows that SmolLM-1.7B has 1.71B parameters.