The "Tootsie Roll" Process A core premise of the Marin 8B run was that we didn't fully know the best recipe— so we just started training with what we had, and planned to adapt along the way. Internally, we referred to this as the "Tootsie" process, a reference to Tootsie Rolls, which use a "graining" process where each day's batch contains a bit of the previous day's, seeding crystallization or something. (We are not food scientists.) This is admittedly a bit of a strained metaphor, but the idea was that we'd keep folding in new data, training techniques, and whatever else as the training process went on. (As it would turn out, dear reader, we would often change more than the data...) Model Basics Model Size We decided to build a roughly 7-8 billion parameter model mostly out of pragmatism: we initially only had reserved capacity to train a model of that size for long enough. Architecture We settled on the Llama architecture for the usual reasons: it has been shown to work well, easier to plug into existing inference stacks, no one ever got fired for buying IBM, etc. We used the same settings as Llama 3.1 8B.
Compute: 6.12e+23 FLOP (6 FLOP / parameter / token × 8×10^9 parameters × 12.75×10^12 tokens)
Tokens: 12.75T
Parameters: 8B
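To spell out the arithmetic, here is a minimal sketch in plain Python of the standard 6ND compute approximation (roughly 6 FLOP per parameter per token for a combined forward and backward pass) applied to the numbers above:

```python
# Rough training-compute estimate via the standard 6 * N * D approximation:
# ~6 FLOP per parameter per token (forward + backward pass).
n_params = 8e9        # ~8B parameters
n_tokens = 12.75e12   # 12.75T training tokens

flops = 6 * n_params * n_tokens
print(f"{flops:.3g} FLOP")  # ~6.12e+23 FLOP
```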
Architecture details (Llama 3 8B):
Hidden size: 4096
Feedforward size: 14336
Number of layers: 32
Number of attention heads: 32
Number of KV heads: 8
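For concreteness, here is a hedged sketch of those hyperparameters written as a Hugging Face transformers.LlamaConfig. This is purely illustrative rather than our actual training configuration, and fields not listed above (vocabulary size, RoPE parameters, context length, and so on) are left at the library defaults here.

```python
from transformers import LlamaConfig

# The hyperparameters listed above, expressed as a Hugging Face LlamaConfig.
# Anything not listed in this section is left at the library's defaults and
# may differ from the settings actually used for training.
config = LlamaConfig(
    hidden_size=4096,          # model / embedding dimension
    intermediate_size=14336,   # feedforward (MLP) dimension
    num_hidden_layers=32,      # transformer blocks
    num_attention_heads=32,    # query heads
    num_key_value_heads=8,     # KV heads (grouped-query attention)
)
print(config)
```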