The accurate computational annotation of protein sequences with enzymatic function, especially those that are part of the functional and taxonomic dark matter, remains a fundamental challenge in bioinformatics. Here, we present HiFi-NN (Hierarchically-Finetuned Nearest Neighbor search), which annotates protein sequences to the fourth level of the EC (Enzyme Commission) number with greater precision and recall than all existing deep learning methods. HiFi-NN is a hierarchically-finetuned deep learning method based on a combination of semi-supervised representation learning and a nearest-neighbour classifier. Furthermore, we show that this method can correctly identify the EC number of sequences at identities below 40%, where the current state-of-the-art annotation tool, BLASTp, cannot. We then improve the learned representations by increasing the diversity of the training set, not just in sequence space but also in terms of the environments the sequences were sampled from. Finally, we use HiFi-NN to annotate a portion of the microbial dark matter sequences in the MGnify database.

This tool serves as a method by which query sequence(s) can be compared to a set of protein sequence embeddings to find those most similar to each query. It is assumed that distances encoded in the space represented by the reference embeddings are meaningful representations of protein similarity. To this end, we provide a model which has been trained using contrastive learning to map ESM-2 embeddings to a new space which accurately reflects the distances between proteins that share similar annotations.
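As an illustration of the lookup this implies, the following minimal sketch builds an exact FAISS index over a set of reference embeddings and retrieves the nearest neighbours of each query. The file names and the 1280-dimension assumption (ESM-2 650M) are placeholders, not the repository's actual layout; the real pipeline is annotate.py, described below.

import faiss
import numpy as np

d = 1280  # assumed embedding dimension (ESM-2 650M)
reference = np.load("reference_embeddings.npy").astype("float32")  # (N, d), placeholder file
queries = np.load("query_embeddings.npy").astype("float32")        # (M, d), placeholder file

index = faiss.IndexFlatL2(d)  # exact L2 nearest-neighbour search
index.add(reference)
distances, neighbour_ids = index.search(queries, 5)  # each of shape (M, 5)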
Clone this repository:
git clone https://github.com/Basecamp-Research/HiFi-NN.git
Then install the requirements from the requirements.yaml file:
conda env create -f requirements.yaml
If a FASTA file of sequences is used as the queries, then ESM also needs to be installed into this directory, as it is used to generate embeddings of these sequences.
git clone https://github.com/facebookresearch/esm.git
cd esm
pip install .
cd ../
The files necessary to run HiFi-NN can be downloaded from the following URL:
wget https://zenodo.org/records/15013616/files/ModelData.zip
There are three possible modes for inference.
In each mode, the query (for example, a FASTA file of sequences or a folder of precomputed ESM-2 embeddings) is specified as the input in the annotate.yaml config. Then run:
python annotate.py
The default settings for the above command will transfer the annotations of the k nearest neighbours to the query protein(s), along with an associated confidence score and minimum distance to each EC. There are three alternative options for the format in which the k nearest neighbours can be used to annotate a particular protein sequence (a sketch of one possible aggregation scheme follows this list):
- return distance = True, return confidence = False: transfer the annotations along with the minimum distance to each EC only.
- return distance = False, return confidence = True: transfer the annotations along with a confidence score for each EC only.
- return distance = False, return confidence = False: transfer the annotations of the k nearest neighbours without distances or confidence scores.
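As a rough sketch of how such an aggregation could work (this mirrors the description above but is not the repository's actual scoring code; the confidence definition below, the fraction of the k neighbours voting for an EC, is an assumption):

from collections import defaultdict

def transfer_annotations(neighbour_ecs, neighbour_distances, k):
    # neighbour_ecs: EC numbers of the k nearest neighbours of one query
    # neighbour_distances: the matching embedding-space distances
    votes = defaultdict(int)
    min_dist = {}
    for ec, dist in zip(neighbour_ecs, neighbour_distances):
        votes[ec] += 1
        min_dist[ec] = min(dist, min_dist.get(ec, float("inf")))
    # confidence: fraction of neighbours supporting each EC (assumed definition)
    return {ec: {"confidence": votes[ec] / k, "min_distance": min_dist[ec]} for ec in votes}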
First, build the Docker image:
docker build -t hifinn .
Before running the Docker container, you need to download the ModelData locally:
wget https://zenodo.org/records/15013616/files/ModelData.zip
unzip ModelData.zip
Then run the container with mounted volumes:
docker run -v $(pwd)/ModelData:/app/ModelData -v $(pwd):/app/output --shm-size '2gb' hifinn
PyTorch shares data between worker processes using shared memory, so if multiprocessing is used to load data, the shared memory segment size used by the container may not be enough. In the above example we increase the default from 64 MB to 2 GB. This will:
- mount the local ModelData folder at /app/ModelData in the container; the model's predictions will also be saved here
- make the reference annotations, cluster30_annos.json, in ModelData available to the container

We can construct a FAISS index from a folder of ESM embeddings or a FASTA file by simply running the following script. The index created can then later be used for annotation, as outlined above.
python make_db.py
If you only wish to index a subset of a folder of embeddings, you can specify the specific ids you wish to index; this is the last argument in the above command and is entirely optional. There are three accepted filetypes for these ids:
- FASTA file of sequences, where each record id indicates the filename of the embeddings you wish to index.
- JSON file with a list of the ids you wish to index.
- TXT file with a single id per line.
One possible way of parsing these id files is sketched below.
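For illustration, a hypothetical helper (not the repository's actual loader) that reads each of these filetypes into a list of ids might look like:

import json

def load_ids(path):
    if path.endswith((".fasta", ".fa")):
        # FASTA: take the record id from each header line
        with open(path) as f:
            return [line[1:].split()[0] for line in f if line.startswith(">")]
    if path.endswith(".json"):
        # JSON: a list of ids
        with open(path) as f:
            return json.load(f)
    # TXT: one id per line
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]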
To embed your training set using ESM-2 you should run the following command (source: https://github.com/facebookresearch/esm/blob/main/README.md):
python scripts/extract.py esm2_t33_650M_UR50D examples/data/some_proteins.fasta \
examples/data/some_proteins_emb_esm2 --repr_layers 32 --include per_tok
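The extract script writes one .pt file per sequence. With --include per_tok as above, a fixed-length per-sequence embedding can be obtained by mean-pooling over tokens; a minimal sketch, assuming the layer-32 output requested above and a placeholder filename:

import torch

data = torch.load("examples/data/some_proteins_emb_esm2/some_protein_id.pt")  # placeholder id
per_token = data["representations"][32]     # (sequence_length, 1280)
sequence_embedding = per_token.mean(dim=0)  # (1280,)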
To train a model, simply run python train_overlap_loss.py. The config file in configs/overlap_loss_config.yaml should be adjusted accordingly to reflect the paths on your own machine.
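The exact objective lives in train_overlap_loss.py and is not reproduced here; as a rough illustration of the general idea, the name suggests a supervised-contrastive-style loss in which positive pairs are weighted by how much their annotations overlap. A minimal sketch under that assumption (all names and the overlap weighting are hypothetical):

import torch
import torch.nn.functional as F

def overlap_contrastive_loss(z, overlap, temperature=0.07):
    # z: (B, d) projected embeddings; overlap: (B, B) pairwise annotation
    # overlap in [0, 1] (assumed inputs, not the repo's actual interface)
    z = F.normalize(z, dim=1)
    logits = (z @ z.T) / temperature
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, -1e9)  # exclude self-pairs
    log_p = F.log_softmax(logits, dim=1)
    weights = overlap.masked_fill(self_mask, 0.0)
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return -(weights * log_p).sum(dim=1).mean()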
Size notes: the model was retrained with 3M selected, environmentally diverse sequences from Basecamp Research's BaseGraph.
Notes: the model has over 3M parameters.