Rather than scaling up protein language models (PLMs), we seek to improve performance via protein-specific optimization. Although the proportionality between a language model’s size and the richness of its learned representations is well established, we prioritize accessibility and pursue a path of data-efficient, cost-reduced, and knowledge-guided optimization. Through more than twenty experiments spanning masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that optimally interprets the language of life. We present Ankh, the first general-purpose PLM trained on Google’s TPU-v4, which surpasses state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% of the embedding dimension). We provide a representative range of structure and function benchmarks on which Ankh excels. We further provide a protein variant generation analysis on High-N and One-N input data scales, where Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics. We dedicate our work to promoting accessibility to research innovation via attainable resources.
Notes: Table 9, from here: https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1.full.pdf. The figure can also be estimated manually from the details in Table 11 and Section 4.6.1, Exp. 4: 14B residues * 68 epochs = 952B tokens seen in forward passes. However, only 20% of tokens are masked as individual targets; the remaining tokens in consecutive masked spans are collapsed into single-token targets to reduce computation. At a 20% masking rate, the average sequence therefore has about 36% as many target tokens as input tokens under this strategy, and this is the relevant token count for the backward passes: (2 * 952B * 1.9B) + (4 * 952B * 0.36 * 1.9B) ≈ 6.22e21 FLOP. The 36% figure is verified here: https://colab.research.google.com/drive/1ETsmp_KRMK8kIRA5kdfcO9QiPK28cBQ6?usp=sharing
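A minimal Python sketch of the arithmetic above, assuming the 1.9B-parameter count from the parameter notes below and the usual 2 FLOP/parameter/token forward, 4 FLOP/parameter/token backward approximation:

# Sketch of the training-compute estimate above.
# Assumptions: 1.9B total parameters, 36% target fraction at a 20% masking rate,
# 2 FLOP/param/token forward and 4 FLOP/param/token backward.

residues = 14e9          # amino acids in the UniRef50 training set (Table 2)
epochs = 68              # pre-training epochs (Table 11 / Exp. 4)
params = 1.9e9           # total parameter count (see parameter notes below)
target_fraction = 0.36   # targets per input token under the span-collapsing strategy

tokens = residues * epochs                              # ~9.52e11 tokens seen in forward passes
forward_flop = 2 * tokens * params                      # forward passes over all input tokens
backward_flop = 4 * tokens * target_fraction * params   # backward passes over target tokens
total_flop = forward_flop + backward_flop

print(f"total training compute ~ {total_flop:.2e} FLOP")  # ~6.22e+21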
Size Notes: Pre-trained on UniRef50: 45M proteins and 14B amino acids, per Table 2. 952B tokens from Table 9 at https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1 (this is the total number of tokens seen over multiple epochs).
Notes: Figure 1 indicates 1.15B parameters, but both the Hugging Face model and a replication (https://huggingface.co/ElnaggarLab/ankh-large and https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1.full.pdf) indicate 1.9B parameters. Notebook for counting parameters: https://colab.research.google.com/drive/1EGI5_vDl4pOBUukJexMHQR16BFKJe4a5?usp=sharing
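A minimal sketch of the parameter count, assuming the public checkpoint loads as a standard T5-style encoder-decoder through Hugging Face transformers (the linked notebook is the authoritative count; the exact class used there may differ):

# Count parameters of the public Ankh-large checkpoint.
# Assumption: the checkpoint is loadable via AutoModel as an encoder-decoder model.
from transformers import AutoModel

model = AutoModel.from_pretrained("ElnaggarLab/ankh-large")
total = sum(p.numel() for p in model.parameters())
encoder = sum(p.numel() for p in model.get_encoder().parameters())

print(f"total parameters:   {total / 1e9:.2f}B")    # expected ~1.9B per the notes above
print(f"encoder parameters: {encoder / 1e9:.2f}B")  # encoder-only count, for comparison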