Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
Notes:
Based on: https://docs.google.com/spreadsheets/d/1enan21dFx03TkwufHgOwTVNBtuYlqNY9uurjIK6YS-8/edit#gid=0
Number of steps: 4.5e5; batch size (tokens): 6.1e7; parameters: 1e8
Batch size: 512 MSAs; sequence length per MSA: 100 * 1192 tokens
Total tokens: 4.5e5 steps * 6.1e7 tokens/step ≈ 2.75e13
Calculation = 4e8 FLOP/token (backward pass) * 2.75e13 tokens + 2e8 FLOP/token (forward pass) * 2.75e13 tokens ≈ 1.65e22 FLOP
All models are trained on 32 V100 GPUs for 100k updates. The four models with the best contact precision are then further trained to 150k updates. Finally, the best model at 150k updates is trained to 450k updates.
Check: 450k * 512 * 100 * 1192 * 100M * 6 ≈ 1.65e22
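A minimal Python sketch of the arithmetic above, assuming the standard ~6 FLOP per parameter per token approximation (2 FLOP/param/token forward, 4 FLOP/param/token backward) and using the figures listed in these notes; the variable names are illustrative, not from the paper or the spreadsheet.

```python
# Back-of-envelope training-compute estimate for the MSA Transformer.

params = 1e8                   # 100M parameters
steps = 4.5e5                  # 450k updates (final training run)
batch_msas = 512               # MSAs per batch
tokens_per_msa = 100 * 1192    # tokens per MSA input, per the notes' "100 * 1192" figure

tokens_per_step = batch_msas * tokens_per_msa   # ~6.1e7 tokens per update
total_tokens = steps * tokens_per_step          # ~2.75e13 tokens seen during training

forward_flop = 2 * params * total_tokens        # ~5.5e21 FLOP
backward_flop = 4 * params * total_tokens       # ~1.1e22 FLOP
total_flop = forward_flop + backward_flop       # ~1.65e22 FLOP

print(f"total tokens: {total_tokens:.3g}")
print(f"total training FLOP: {total_flop:.3g}")
```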
Size Notes:
"We train an MSA Transformer model with 100M parameters on a large dataset (4.3 TB) of 26 million MSAs, with an average of 1192 sequences per MSA."
Average sequence is ~300 amino acids/tokens long.
26 million MSAs * 1192 sequences/MSA * 300 tokens/sequence ≈ 9.3T tokens
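A similarly minimal sketch of the dataset-size estimate; the 26M MSAs and 1192 sequences/MSA figures are quoted from the paper, while the ~300 tokens per sequence average is the assumption stated above.

```python
# Rough dataset-size estimate from the figures quoted in the size notes.

num_msas = 26e6          # 26 million MSAs
seqs_per_msa = 1192      # average sequences per MSA (from the paper)
tokens_per_seq = 300     # assumed average sequence length in residues/tokens

total_tokens = num_msas * seqs_per_msa * tokens_per_seq
print(f"~{total_tokens:.2g} tokens")   # ~9.3e12, i.e. ~9.3T tokens
```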
Notes: "We train an MSA Transformer model with 100M parameters..."