We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80% and 89% of corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at this https URL.
Notes: The 9B model was trained on 64 A100-80GB GPUs in bf16 precision. The length of training is not stated, but it may be possible to estimate training compute using the operations-counting method; a rough sketch follows.
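As an illustration of that method, here is a minimal sketch assuming the standard C ≈ 6ND operations-counting approximation, with N taken as the full 9B parameter count and D as the text-token total derived in the Size Notes below. Since much of the model is frozen during training and vision-encoder FLOPs are not counted, this is an order-of-magnitude figure at best:

```python
# Hedged order-of-magnitude estimate of training compute using the common
# operations-counting approximation C ≈ 6 * N * D FLOP.
# Assumptions (not stated in the report): N counts all 9B parameters even
# though many are frozen, and D counts only text tokens (vision-encoder
# FLOPs on the ~240M images are ignored).
N = 9e9       # parameters (9B model)
D = 17.4e9    # text tokens seen during training (from the Size Notes below)
C = 6 * N * D
print(f"~{C:.2g} FLOP")  # -> ~9.4e+20 FLOP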
Size Notes: "OpenFlamingo models were trained for 60M interleaved (MMC4) examples and 120M LAION-2B examples." Table 2 and Figure 4 provide details on the length of each example: MMC4 has a median of 2 images and 256 text tokens per example, while LAION-2B has medians of 1 image and 17 text tokens. This suggests a total of around 240M images and 17.4B text tokens (see the arithmetic below). The prediction task is next-token prediction over text; images are used only for conditioning.
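The image and token totals follow from simple arithmetic over the per-example medians, assuming the medians are a reasonable stand-in for means:

```python
# Totals implied by the per-example medians above (assumption: medians
# approximate means, so these are rough).
mmc4_examples, laion_examples = 60e6, 120e6
mmc4_images, mmc4_tokens = 2, 256    # median images / text tokens per MMC4 example
laion_images, laion_tokens = 1, 17   # median images / text tokens per LAION-2B example

total_images = mmc4_examples * mmc4_images + laion_examples * laion_images
total_tokens = mmc4_examples * mmc4_tokens + laion_examples * laion_tokens
print(f"{total_images:.3g} images, {total_tokens:.3g} text tokens")
# -> 2.4e+08 images, 1.74e+10 text tokens (~240M images, ~17.4B tokens)
```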
Notes: "We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters."