Rather than scaling up protein language models (PLMs), we seek to improve performance via protein-specific optimization. Although the proportionality between language model size and the richness of its learned representations has been validated, we prioritize accessibility and pursue data-efficient, cost-reduced, and knowledge-guided optimization. Through more than twenty experiments spanning masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that optimally interprets the language of life. We present Ankh, the first general-purpose PLM trained on Google's TPU-v4, which surpasses state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% of the embedding dimension). We provide a representative range of structure and function benchmarks on which Ankh excels. We further provide a protein variant generation analysis on High-N and One-N input data scales, where Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics. We dedicate our work to promoting accessibility to research innovation via attainable resources.
Notes: Table 9 from here: https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1.full.pdf
Size Notes: Pre-trained on UniRef50 (45M proteins and 14B amino acids, per Table 2); 952B tokens per Table 9 at https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1 (this is the total token count over multiple epochs).
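A rough back-of-the-envelope check of the epoch count implied by these figures, under the assumption (mine, not stated in the notes) that one pass over UniRef50 corresponds to roughly its 14B amino acids, ignoring special tokens and span-corruption effects:

```python
# Back-of-the-envelope: implied number of passes over the pre-training set,
# assuming ~1 token per amino acid per epoch (an assumption; tokenization and
# span corruption can change the per-epoch token count).
tokens_total = 952e9      # total pre-training tokens (Table 9)
tokens_per_epoch = 14e9   # ~14B amino acids in UniRef50 (Table 2)

implied_epochs = tokens_total / tokens_per_epoch
print(f"Implied passes over UniRef50: ~{implied_epochs:.0f}")  # ~68
```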
Notes: Figure 1 indicates 450M parameters, but the model on Hugging Face (https://huggingface.co/ElnaggarLab/ankh-base) and Table 9 of https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1.full.pdf both indicate 740M parameters. Notebook for counting parameters: https://colab.research.google.com/drive/1EGI5_vDl4pOBUukJexMHQR16BFKJe4a5?usp=sharing
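The parameter count can be re-checked directly from the public checkpoint. A minimal sketch, assuming the ElnaggarLab/ankh-base checkpoint loads through the standard Hugging Face transformers AutoModel path (Ankh is T5-based); the linked Colab notebook remains the original reference. Printing the encoder-only count alongside the full encoder-decoder count may help reconcile differing reported sizes, though that interpretation is an assumption on my part:

```python
# Count parameters of the public Ankh-base checkpoint.
# AutoModel usage is an assumption; the linked Colab notebook is the
# original reference for the 740M figure.
from transformers import AutoModel

model = AutoModel.from_pretrained("ElnaggarLab/ankh-base")

total = sum(p.numel() for p in model.parameters())
encoder_only = sum(p.numel() for p in model.encoder.parameters())

print(f"Full encoder-decoder parameters: {total/1e6:.0f}M")
print(f"Encoder-only parameters:         {encoder_only/1e6:.0f}M")
```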