Version 0.5 (beta: please contact jonfu@dtu.dk to report bugs). If you use ProteusAI, please cite our paper.
ProteusAI, a user-friendly and open-source ML platform, streamlines protein engineering and design tasks. ProteusAI offers modules to support researchers in various stages of the design-build-test-learn (DBTL) cycle, including protein discovery, structure-based design, zero-shot predictions, and ML-guided directed evolution (MLDE). Our benchmarking results demonstrate ProteusAI’s efficiency in improving proteins and enzymes within a few DBTL-cycle iterations. ProteusAI democratizes access to ML-guided protein engineering and is freely available for academic and commercial use. You can upload different data types to get started with ProteusAI. Click on the other module tabs to learn about their functionality and the expected data types.
Upload experimental data (a CSV or Excel file) or a single protein sequence (a FASTA file).
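To make the expected inputs concrete, here is a minimal sketch of the two data types. The column names ("sequence", "activity") and the file contents are illustrative assumptions, not ProteusAI's required schema:

```python
import csv
import io

# Hypothetical experimental-data CSV: the "sequence" and "activity"
# column names are illustrative, not ProteusAI's required schema.
csv_text = """sequence,activity
MKTAYIAKQR,0.82
MKTAYIVKQR,0.67
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))

# A single-protein FASTA record, as used by the structure-based and
# zero-shot workflows.
fasta_text = ">my_protein\nMKTAYIAKQRQISFVK\n"
header, seq = fasta_text.strip().split("\n", 1)

print(len(rows), header, seq)
```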
The Protein Design module is a structure-based approach that predicts novel sequences using 'inverse folding' algorithms. The designed sequences are likely to preserve the fold of the protein while improving its thermal stability and solubility. To preserve important functions of the protein, we recommend fixing residues at protein-protein and protein-ligand interfaces as well as evolutionarily conserved sites; these can be entered manually. The sampling temperature influences the diversity of the designs. We recommend generating at least 1,000 sequences and filtering them rigorously before ordering variants for validation. For example: sort the sequences from lowest to highest score, predict the structures of the lowest-scoring variants, and proceed with the designs that preserve the geometry of the active site (in the case of an enzyme). Experiment with a small sample size to find temperature values that yield the desired level of diversification before generating large numbers of sequences.
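The sort-and-shortlist step described above can be sketched as follows. The sequences and scores are made up for illustration; in practice the scores come from the inverse-folding model, and the shortlist fraction is your own choice:

```python
# Toy filtering step: sort candidate designs by inverse-folding score
# (lower = better here) and keep the best fraction for structure
# prediction. Sequences and scores are invented for illustration.
designs = [
    ("MKTAYIAKQR", -1.32),
    ("MKTAYIVKQR", -0.85),
    ("MKTVYIAKQR", -1.10),
    ("MKSAYIAKQR", -0.40),
]

ranked = sorted(designs, key=lambda d: d[1])                # lowest score first
shortlist = [seq for seq, _ in ranked[: len(ranked) // 2]]  # keep the best half

print(shortlist)
```

The shortlisted sequences would then go to structure prediction, and only designs that preserve the active-site geometry would be ordered.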
The Representations module offers methods to compute and visualize vector representations of proteins. These are primarily used by the MLDE and Discovery modules to make training more data-efficient. The representations are generated by classical algorithms such as BLOSUM62 encoding or by large protein language models, which infuse helpful inductive biases into the sequence representations. In some cases, the representations can be used to cluster proteins by function or to predict protein properties. The module offers several visualization techniques to explore the representations and to understand the underlying structure of the protein data. Advanced analysis and predictions can be made by using the MLDE or Discovery modules in combination with the Representations module.
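The idea of turning sequences into vectors and projecting them for visualization can be sketched with a simple one-hot encoding (a stand-in for BLOSUM62 encodings or language-model embeddings, which this sketch does not implement) and a 2-D PCA computed via SVD:

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq: str) -> np.ndarray:
    # One-hot encoding as a simple stand-in for BLOSUM62 or
    # language-model representations: each residue becomes a
    # 20-dimensional indicator vector.
    m = np.zeros((len(seq), len(AAS)))
    for i, aa in enumerate(seq):
        m[i, AAS.index(aa)] = 1.0
    return m.flatten()

# Invented example sequences (all the same length for simplicity).
seqs = ["MKTAYIAK", "MKTAYIVK", "MKTVYIAK", "GGSGGSGG"]
X = np.stack([one_hot(s) for s in seqs])

# 2-D projection with PCA (via SVD on the centered matrix), the kind of
# reduction used to visualize representation space.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T  # shape: (n_sequences, 2)

print(coords.shape)
```

Plotting `coords` would show the glycine-rich outlier separating from the three near-identical sequences.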
The Machine Learning-Guided Directed Evolution (MLDE) module offers a structured approach to improving protein function through iterative mutagenesis, inspired by directed evolution. Machine learning models are trained on previously generated experimental results. The 'Search' algorithm then proposes novel sequences, which the trained model evaluates and ranks to identify mutants that are likely to improve function. The Bayesian optimization used to search for novel mutants relies on models trained on protein representations, which can be generated either by large language models (currently very slow) or by classical algorithms such as BLOSUM62 encoding. For now, we recommend BLOSUM62 representations combined with Random Forest models for the best trade-off between speed and quality. However, we encourage experimentation with other models and representations.
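A minimal sketch of the train-then-rank step, assuming a one-hot encoding in place of BLOSUM62 representations and invented variant data. Note that this ranks candidates by predicted mean only; the module's Bayesian optimization additionally balances exploration against exploitation, which this sketch omits:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

AAS = "ACDEFGHIKLMNPQRSTVWY"

def encode(seq: str) -> np.ndarray:
    # One-hot encoding as a stand-in for the BLOSUM62 representations
    # recommended in the text.
    x = np.zeros(len(seq) * len(AAS))
    for i, aa in enumerate(seq):
        x[i * len(AAS) + AAS.index(aa)] = 1.0
    return x

# Invented training data: variant sequences with measured activities.
train = {"MKTA": 0.5, "MKSA": 0.7, "MKTV": 0.4, "MRTA": 0.9}
X = np.stack([encode(s) for s in train])
y = np.array(list(train.values()))

# Random Forest surrogate model, as recommended for speed/quality.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Score a batch of proposed mutants and rank them by predicted activity.
candidates = ["MRSA", "MKTG", "MRTV"]
preds = model.predict(np.stack([encode(s) for s in candidates]))
ranked = [c for _, c in sorted(zip(preds, candidates), reverse=True)]

print(ranked)
```

The top-ranked mutants would then be synthesized and measured, and the new results fed back into the next DBTL iteration.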
The Discovery module offers a structured approach to identifying proteins even with little or no experimental data to start with. The goal of the module is to identify proteins with similar functions and to propose novel sequences that are likely to share those functions. The module relies on representations generated by large protein language models, which transform protein sequences into meaningful vectors; these vector representations have been shown to cluster by function. Use clustering if few or none of your sequences are annotated, and classification if some or all of them are. To find out whether you have enough sequences for classification, we recommend inspecting the model statistics on the validation set, which the module generates automatically after training.
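The classification path and its validation statistics can be sketched as follows. The "embeddings" here are synthetic clusters standing in for language-model representations, and the two functional classes are invented; the point is that a held-out validation accuracy is the statistic to inspect before trusting predictions on unannotated sequences:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in "language-model embeddings": two synthetic clusters of
# 16-dimensional vectors, one per hypothetical functional class.
emb_a = rng.normal(loc=0.0, scale=0.5, size=(30, 16))
emb_b = rng.normal(loc=2.0, scale=0.5, size=(30, 16))
X = np.vstack([emb_a, emb_b])
y = np.array([0] * 30 + [1] * 30)

# Hold out a validation set, train a classifier on the annotated part.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = LogisticRegression().fit(X_tr, y_tr)

# Validation accuracy: the kind of model statistic to check before
# applying the classifier to unannotated sequences.
acc = accuracy_score(y_val, clf.predict(X_val))
print(round(acc, 2))
```

If validation metrics like this are poor, the annotated set is likely too small for classification and clustering is the safer choice.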