Model Summary: Granite-3.1-2B-Base extends the context length of Granite-3.0-2B-Base from 4K to 128K using a progressive training strategy: the supported context length is increased in increments while RoPE theta is adjusted, until the model has adapted to the target length of 128K. This long-context pre-training stage used approximately 500B tokens.
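For intuition, below is a minimal sketch of this kind of staged context extension with RoPE theta scaling. The stage lengths and theta values are illustrative assumptions, not the actual Granite-3.1 training schedule.

```python
# Sketch of progressive context-length extension with RoPE theta scaling.
# Stage lengths and theta values are illustrative assumptions only.
import torch

def rope_frequencies(head_dim: int, theta: float) -> torch.Tensor:
    """Inverse frequencies used by rotary position embeddings."""
    return 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))

# Hypothetical schedule: each stage lengthens the context window and raises
# theta so rotary angles stay well separated at the longer range.
stages = [
    (4_096, 10_000.0),
    (16_384, 250_000.0),
    (65_536, 2_000_000.0),
    (131_072, 10_000_000.0),
]

for context_len, theta in stages:
    inv_freq = rope_frequencies(head_dim=64, theta=theta)
    positions = torch.arange(context_len).float()
    angles = torch.outer(positions, inv_freq)  # (context_len, head_dim // 2)
    # ... continue pre-training at this context length with these RoPE angles ...
    print(f"stage: context={context_len}, theta={theta:.0e}, angles={tuple(angles.shape)}")
```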
Notes: 6 FLOP/token/parameter * 2.5*10^9 parameters * 12*10^12 tokens = 1.8*10^23 FLOP
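The same estimate, spelled out with the standard 6ND approximation (6 FLOP per parameter per token for a combined forward and backward pass; the exact accounting is an assumption):

```python
# Training-compute estimate via the 6 * N * D approximation.
flop_per_param_per_token = 6
params = 2.5e9   # 2.5B parameters
tokens = 12e12   # 12T training tokens

total_flop = flop_per_param_per_token * params * tokens
print(f"{total_flop:.1e} FLOP")  # 1.8e+23 FLOP
```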
Size Notes: 12T tokens
Notes: 2.5B parameters
Model Architecture: Granite-3.1-2B-Base is based on a decoder-only dense transformer architecture. Core components of this architecture are GQA and RoPE, an MLP with SwiGLU, RMSNorm, and shared input/output embeddings.
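As an illustration of how these components fit together, here is a minimal PyTorch sketch of one such decoder block. The dimensions, head counts, and hidden size are placeholder assumptions, not the published Granite-3.1-2B configuration; shared input/output embeddings would correspond, at the model level, to tying the output head to the token embedding matrix.

```python
# Sketch of a decoder block with GQA + RoPE attention, SwiGLU MLP, and RMSNorm.
# All sizes below are illustrative assumptions, not the Granite-3.1-2B config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by root-mean-square instead of mean/variance (no bias).
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

def apply_rope(x, theta: float = 10_000.0):
    # x: (batch, heads, seq, head_dim); rotate pairs of channels by position-dependent angles.
    b, h, s, d = x.shape
    inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = torch.outer(torch.arange(s, device=x.device).float(), inv_freq)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

class DecoderBlock(nn.Module):
    def __init__(self, dim=2048, n_heads=32, n_kv_heads=8, hidden=5632):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.attn_norm = RMSNorm(dim)
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        self.mlp_norm = RMSNorm(dim)
        # SwiGLU MLP: SiLU-gated up projection followed by a down projection.
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        b, s, _ = x.shape
        h = self.attn_norm(x)
        q = self.q_proj(h).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(h).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(h).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = apply_rope(q), apply_rope(k)
        # GQA: each group of query heads shares one key/value head.
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o_proj(attn.transpose(1, 2).reshape(b, s, -1))
        h = self.mlp_norm(x)
        return x + self.down_proj(F.silu(self.gate_proj(h)) * self.up_proj(h))
```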