In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
Notes: Information:
128 NVIDIA V100 GPUs [Pre-training details]
906k steps [See Table S2: Hyperparameters]
131,072 tokens per batch ["We trained with 131,072 tokens per batch (128 gpus x 1024 tokens)." - Pre-training details]
Estimate: 906e3 updates * 3 * 131072 tokens/update * 2 * 669.2e6 parameters = 4.8e20 FLOP
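The compute estimate above can be reproduced with a few lines of arithmetic. The sketch below is a minimal check, not the authors' own accounting. It assumes the common approximation in which the factor 2 counts one multiply and one add per parameter per token, and the factor 3 covers the forward pass plus a backward pass costing about twice as much. The variable names are illustrative only.

```python
# Figures taken from the notes above.
steps = 906_000                 # optimizer updates (Table S2)
tokens_per_batch = 128 * 1024   # 128 GPUs x 1024 tokens = 131,072
parameters = 669.2e6            # model size used in the estimate

# Assumed approximation: 2 FLOP per parameter per token (multiply + add),
# times 3 for forward plus backward pass.
flop_per_param_per_token = 2 * 3

tokens_seen = steps * tokens_per_batch
training_flop = tokens_seen * flop_per_param_per_token * parameters

print(f"tokens seen:   {tokens_seen:.3e}")
print(f"training FLOP: {training_flop:.2e}")
```

The result is roughly 4.77e20 FLOP, which rounds to the 4.8e20 FLOP quoted in the notes.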
Size Notes: "(iii) UR50/D, 124.9M UniRef50 cluster members sampled evenly by cluster"
As a rule of thumb, assume 200 amino acids per protein, so 124.9M * 200 = 24.98 billion amino acids (one token per amino acid).
Epochs: 906k steps * 131,072 tokens/batch / 24.98B tokens = 4.75 epochs
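The dataset size and epoch count follow from the same figures. The sketch below is a rough check under the stated rule of thumb of 200 amino acids per protein, which is an assumption of these notes rather than a number reported by the paper. It also assumes one token per amino acid.

```python
# Figures taken from the notes above.
sequences = 124.9e6             # UR50/D cluster members
avg_length = 200                # assumed amino acids per protein (rule of thumb)
steps = 906_000                 # optimizer updates
tokens_per_batch = 131_072      # tokens per update

# Assumed: one token per amino acid.
dataset_tokens = sequences * avg_length
tokens_seen = steps * tokens_per_batch
epochs = tokens_seen / dataset_tokens

print(f"dataset tokens: {dataset_tokens:.4e}")
print(f"epochs:         {epochs:.2f}")
```

Because the epoch count scales inversely with the assumed average length, a different average length would change the figure proportionally.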
Notes: See Table 1