Model Summary: Granite-3.1-8B-Base extends the context length of Granite-3.0-8B-Base from 4K to 128K using a progressive training strategy, increasing the supported context length in increments while adjusting the RoPE theta until the model had successfully adapted to the desired context length of 128K. This long-context pre-training stage was performed using approximately 500B tokens.
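A minimal sketch of how such a progressive RoPE adjustment typically works, assuming the standard RoPE frequency formulation; the stage lengths and theta values below are illustrative assumptions, not figures from the model card.

```python
import torch

def rope_inv_freq(head_dim: int, theta: float) -> torch.Tensor:
    """Standard RoPE inverse frequencies for a given base theta."""
    return 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))

# Hypothetical progressive schedule: the context length grows in increments
# while the RoPE base theta is raised so that long positions remain resolvable.
# (Illustrative values only; the actual schedule is not published here.)
stages = [(4_096, 10_000.0), (32_768, 1_000_000.0), (131_072, 10_000_000.0)]

for ctx_len, theta in stages:
    inv_freq = rope_inv_freq(head_dim=128, theta=theta)
    positions = torch.arange(ctx_len).float()
    # Outer product gives the rotation angles applied to query/key channel pairs.
    angles = torch.outer(positions, inv_freq)
    print(f"ctx={ctx_len:>7}  theta={theta:.0e}  max_angle={angles.max():.1f}")
```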
Notes: Training compute via 6ND = 6 FLOP/parameter/token × 8.1×10^9 parameters × 12×10^12 tokens ≈ 5.832×10^23 FLOP
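The arithmetic can be reproduced directly with the standard C ≈ 6·N·D approximation, using the parameter and token counts stated above:

```python
# Approximate training compute: C ≈ 6 FLOP per parameter per token.
N = 8.1e9   # parameters
D = 12e12   # training tokens
C = 6 * N * D
print(f"{C:.3e} FLOP")  # 5.832e+23 FLOP
```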
Size Notes: 12T training tokens
Notes: 8.1B parameters
Model Architecture: Granite-3.1-8B-Base is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.
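A minimal sketch of two of the listed components, RMSNorm and a SwiGLU MLP, assuming their standard formulations; the dimensions used are illustrative and are not the model's actual hidden sizes.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by RMS and a learned gain, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUMLP(nn.Module):
    """Gated MLP: SiLU(x W_gate) * (x W_up), projected back to the model dimension."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 512)                  # (batch, sequence, illustrative dim)
y = SwiGLUMLP(512, 1408)(RMSNorm(512)(x))    # normalize, then gated MLP
print(y.shape)                               # torch.Size([2, 16, 512])
```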