ReLSO
Improved Fitness Optimization Landscapes for Sequence Design
Description
In recent years, deep learning approaches for determining protein sequence-fitness relationships have gained traction. Advances in high-throughput mutagenesis, directed evolution, and next-generation sequencing have allowed large amounts of labelled fitness data to accumulate and, consequently, have attracted the application of various deep learning methods. Although these methods learn an implicit fitness landscape, there is little work on using the latent encoding directly for protein sequence optimization. Here we show that this latent space representation of a fitness landscape can be made amenable to latent space optimization through a joint-training process. We also show that this encoding strategy improves generalization over more traditional training strategies. We apply our approach to several biological contexts and show that latent space optimization in a smooth learned folding landscape allows for more accurate and efficient optimization of protein sequences.
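To make the idea of latent space optimization concrete, here is a minimal toy sketch, not the ReLSO implementation. One common way to optimize in a latent space is gradient ascent on a differentiable fitness predictor. In the sketch, a hand-written quadratic stands in for a learned fitness predictor, and the names `PEAK`, `fitness`, `fitness_grad`, and `gradient_ascent` are hypothetical and do not come from this repository.

```python
# Toy illustration only: a hand-written quadratic stands in for a learned
# fitness predictor over latent coordinates. Nothing here is taken from the
# ReLSO code base; all names and values are assumptions made for the example.

PEAK = (1.0, -2.0)  # hypothetical location of the fitness maximum in latent space


def fitness(z):
    """Smooth surrogate fitness: highest at PEAK, decreasing with distance."""
    return -sum((zi - pi) ** 2 for zi, pi in zip(z, PEAK))


def fitness_grad(z):
    """Analytic gradient of the surrogate fitness with respect to z."""
    return [-2.0 * (zi - pi) for zi, pi in zip(z, PEAK)]


def gradient_ascent(z0, step_size=0.1, n_steps=50):
    """Move a latent point uphill on the surrogate fitness landscape."""
    z = list(z0)
    for _ in range(n_steps):
        grad = fitness_grad(z)
        z = [zi + step_size * gi for zi, gi in zip(z, grad)]
    return z


if __name__ == "__main__":
    start = [3.0, 3.0]
    end = gradient_ascent(start)
    print("start fitness:", fitness(start))
    print("end fitness:  ", fitness(end))
```

In a full pipeline, the starting point would come from encoding a sequence and the optimized latent point would be decoded back into a sequence. The smoother the learned landscape, the more reliably simple updates of this kind reach high-fitness regions.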
Citation
This repo accompanies the following publication:
Egbert Castro, Abhinav Godavarthi, Julien Rubinfien, Smita Krishnaswamy. Guided Generative Protein Design using Regularized Transformers. Nature Machine Intelligence, in review (2021).
How to run
First, install the dependencies:
# clone project
git clone https://github.com/KrishnaswamyLab/ReLSO-Guided-Generative-Protein-Design-using-Regularized-Transformers.git
# install project
cd ReLSO-Guided-Generative-Protein-Design-using-Regularized-Transformers
pip install -e .
pip install -r requirements.txt
Usage
Training models
# run training script
python train_relso.py --data gifford
*Note: arguments that are not relevant to the selected model are ignored. See the init method of each model for the arguments it uses.
available dataset args:
gifford, GB1_WU, GFP, TAPE
available auxnetwork args:
base_reg
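As a convenience, the training command shown above can be generated for each of the listed datasets. The sketch below is an assumption-laden helper, not part of the repository: `DATASETS` and `build_train_command` are hypothetical names, and only the `--data` flag and the dataset names that appear in this README are used.

```python
import shlex

# Dataset names copied from the "available dataset args" list in this README.
DATASETS = ("gifford", "GB1_WU", "GFP", "TAPE")


def build_train_command(dataset):
    """Return the training command for one dataset as an argument list.

    Hypothetical helper: it only assembles the command documented above
    (python train_relso.py --data <dataset>) and rejects unknown names.
    """
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")
    return ["python", "train_relso.py", "--data", dataset]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for name in DATASETS:
        print(shlex.join(build_train_command(name)))
```

To launch a run rather than print it, the returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, assuming the installation steps above have been completed.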