The design of custom-tailored proteins has the potential to provide novel and groundbreaking solutions in many fields, including molecular medicine and environmental sciences. Among protein classes, enzymes are particularly attractive because their complex active sites can accelerate chemical transformations by several orders of magnitude. Since enzymes are biodegradable nanoscopic materials, they hold unmatched promise as sustainable, large-scale industrial catalysts. Motivated by the enormous success of language models in designing novel yet nature-like proteins, we hypothesised that an enzyme-specific language model could provide new opportunities to design purpose-built artificial enzymes. Here, we describe ZymCTRL, a conditional language model trained on the BRENDA database of enzymes, which generates enzymes of a specified enzymatic class upon a user prompt. ZymCTRL generates artificial enzymes that are distant from natural ones, while their intended functionality matches predictions from orthogonal methods. We release the model to the community.
Compute Notes: "We trained for 179,000 steps on 48 NVIDIA A100s 80GB for about 15,000 GPU hours". 15,000 GPU hours × 3,600 s/h × 312 teraFLOP/s (A100 peak) × 0.3 (utilization assumption) ≈ 5.05e21 FLOP
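The hardware-based compute estimate above can be sketched as follows. The 312 teraFLOP/s figure is the A100's dense BF16/TF32 tensor-core peak, and the 30% utilization factor is an assumption, not a measured value:

```python
# Hardware-based training-compute estimate: GPU-seconds x peak throughput x utilization.
gpu_hours = 15_000           # "about 15,000 GPU hours" from the source
seconds_per_hour = 3_600
peak_flop_per_s = 312e12     # A100 80GB dense BF16/TF32 tensor-core peak
utilization = 0.3            # assumed utilization factor

total_flop = gpu_hours * seconds_per_hour * peak_flop_per_s * utilization
print(f"{total_flop:.3g} FLOP")  # ≈ 5.05e+21 FLOP
```

Note that the 179,000-step count is not needed for this estimate; it would only matter for a parameter-and-token based (6ND-style) cross-check, which requires the batch size.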
Size Notes: 36,276,604 sequences after filtering, and training uses 90% of these. From figure 6, the average sequence is 399.2 amino acids long. 36,276,604 × 0.9 × 399.2 ≈ 13.0B amino acids
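The dataset-size estimate above is a simple product. The sequence count and 90% split are from the source, the average length is read off figure 6, and treating one amino acid as one token is an assumption:

```python
# Training-set size estimate: sequences x train split x average sequence length.
n_sequences = 36_276_604     # sequences after filtering, from the source
train_fraction = 0.9         # 90% of sequences used for training
avg_length = 399.2           # average amino acids per sequence (figure 6)

train_amino_acids = n_sequences * train_fraction * avg_length
print(f"{train_amino_acids / 1e9:.1f}B amino acids")  # ≈ 13.0B amino acids
```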
Notes: "ZymCTRL contains 36 layers totalling 738M parameters"