Rather than scaling up protein language models (PLMs), we seek to improve performance via protein-specific optimization. Although the proportionality between a language model’s size and the richness of its learned representations is well established, we prioritize accessibility and pursue a path of data-efficient, cost-reduced, and knowledge-guided optimization. Through more than twenty experiments spanning masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that optimally interprets the language of life. We present Ankh, the first general-purpose PLM trained on Google’s TPU-v4, which surpasses state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% of the embedding dimension). We provide a representative range of structure and function benchmarks on which Ankh excels. We further provide a protein variant generation analysis on High-N and One-N input data scales, where Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics. We dedicate our work to promoting accessibility to research innovation via attainable resources.
Notes: Table 9, from here: https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1.full.pdf. The figure can also be estimated manually from the details in Table 11 and Section 4.6.1, Exp. 4: 14B residues * 68 epochs = 952B tokens seen in forward passes. However, only 20% of tokens are masked as individual targets; the remaining tokens in consecutive masked spans are collapsed into single-token targets to reduce computation. At a 20% masking rate, the average sequence therefore has about 36% as many target tokens as input tokens under this strategy, and this is the relevant token count for the backward passes: (2 * 952B * 1.9B) + (4 * 952B * 0.36 * 1.9B) ≈ 6.22e21 FLOP. The 36% figure is verified here: https://colab.research.google.com/drive/1ETsmp_KRMK8kIRA5kdfcO9QiPK28cBQ6?usp=sharing
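A minimal Python sketch of the arithmetic above, assuming the 1.9B-parameter count from the parameter notes below and the usual 2 FLOP/parameter/token forward, 4 FLOP/parameter/token backward approximation:

# Sketch of the training-compute estimate above.
# Assumptions: 1.9B total parameters, 36% target fraction at a 20% masking rate,
# 2 FLOP/param/token forward and 4 FLOP/param/token backward.

residues = 14e9          # amino acids in the UniRef50 training set (Table 2)
epochs = 68              # pre-training epochs (Table 11 / Exp. 4)
params = 1.9e9           # total parameter count (see parameter notes below)
target_fraction = 0.36   # targets per input token under the span-collapsing strategy

tokens = residues * epochs                              # ~9.52e11 tokens seen in forward passes
forward_flop = 2 * tokens * params                      # forward passes over all input tokens
backward_flop = 4 * tokens * target_fraction * params   # backward passes over target tokens
total_flop = forward_flop + backward_flop

print(f"total training compute ~ {total_flop:.2e} FLOP")  # ~6.22e+21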
Size Notes: Pre-trained on UniRef50: 45M proteins and 14B amino acids, per Table 2. 952B tokens from Table 9 at https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1 (this is the total number of tokens seen over multiple epochs).
Notes: Figure 1 indicates 1.15B parameters, but both the Hugging Face model and a replication (https://huggingface.co/ElnaggarLab/ankh-large and https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1.full.pdf) indicate 1.9B parameters. Notebook for counting parameters: https://colab.research.google.com/drive/1EGI5_vDl4pOBUukJexMHQR16BFKJe4a5?usp=sharing
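A minimal sketch of the parameter count, assuming the public checkpoint loads as a standard T5-style encoder-decoder through Hugging Face transformers (the linked notebook is the authoritative count; the exact class used there may differ):

# Count parameters of the public Ankh-large checkpoint.
# Assumption: the checkpoint is loadable via AutoModel as an encoder-decoder model.
from transformers import AutoModel

model = AutoModel.from_pretrained("ElnaggarLab/ankh-large")
total = sum(p.numel() for p in model.parameters())
encoder = sum(p.numel() for p in model.get_encoder().parameters())

print(f"total parameters:   {total / 1e9:.2f}B")    # expected ~1.9B per the notes above
print(f"encoder parameters: {encoder / 1e9:.2f}B")  # encoder-only count, for comparison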