Structural basis for solubility in protein expression systems

ProteinQure

Last update: Aug 18, 2022

Related tags

Miscellaneous hackathon protein-structure protein drug-discovery solubility-prediction

Overview

Structural basis for solubility in protein expression systems

Large-scale protein production for biotechnology and biopharmaceutical applications rely on high protein solubility in expression systems. Solubility has been measured for a significant fraction of E. coli and S. cerevisiae proteomes and these datasets are routinely used to train predictors of protein solubility in different organisms. Thanks to continued advances in experimental structure-determination and modelling, many of these solubility measurements can now be paired with accurate structural models.

The challenge is mentored by Christopher Ing and Mark Fingerhuth.

Aim of the challenge

It is the objective of this project to use our provided dataset of protein structure and solubility value pairs in order to produce a solubility predictor with comparable accuracy to sequence-based predictors reported in the literature. The provided dataset to be used in this project is created by following the dataset curation procedure described in the SOLart paper, and this hackathon project has a similar aim to this manuscript.

The dataset

The process of generating the dataset is described in the SOLArt manuscript. At a high level, all experimentally tested E. coli and S. cerevisiae proteins were matched through Uniprot IDs to known crystallographic structures or high sequence similarity homology models. After balancing the fold types using CATH, a dataset containing a balanced spread of solubility values was produced. The resulting proteins for the training and testing of these models were prepared and disclosed in the supplemental material of this paper as a list of (Uniprot,PDB,Chain,Solubility) pairs. The PDB files were not included in this work so we had to re-extract them from SWISS-MODEL. Whenever a crystallographic structure was present, it was used, assuming high coverage over the Uniprot sequence. In some cases, the original PDB templates used within the original SOLArt paper had been superceded by improved templates, and we opted to take the highest resolution, highest sequence identity, models in our updated dataset. We stripped away all irrelevant chains and heteroatoms.

If issues are identified with individual structures, please refer to the Uniprot ID and manually investigate the best template. In some cases, we needed to improve structure correctness by modelling missing atoms/residues inside the Chemical Computing Group software MOE on a case-by-case basis.

The dataset can be found in the data/ subdirectory - it is already divided into training/ and test/ data. The training/ data comes with solubility_values.csv and solublity_values.yaml (same content just different format) which both contain the solubility target values for all the PDB files provided in that directory. Note that each PDB file is named after the Uniprot identifier of the respective protein and the protein column in the solubility_values.csv also contains the Uniprot identifiers.

The test/ dataset consists of three different subdirectories (protein structures derived from different organisms and with different approaches) and you should NOT use them for any training. Only the yeast_crystal_structs/ directory contains solubility_values.csv and solublity_values.yaml (same content just different format) files which you can use for some local testing & validation. In order to find out your performance on the entire test dataset you need to use the automated benchmarking system (see below).

Example output

Your code should output a file called predictions.csv in the following format:

protein,solubility
P69829,83
P31133,62

whereby the protein column contains the Uniprot ID (corresponds to the filename of the PDB files) and the solubility column contains the predicted solubility value (can be int or float).

Note, that there are three (!) test subsets but you are expected to submit all the predictions in one file (not three) for the benchmarking system to work.

Automated benchmarking system

The continuous integration script in .github/workflows/ci.yml will automatically build the Dockerfile on every commit to the main branch. This docker image will be published as your hackathon submission to https://biolib.com//. For this to work, make sure you set the BIOLIB_TOKEN and BIOLIB_PROJECT_URI accordingly as repository secrets.

To read more about the benchmarking system click here.

Say thanks

Give this repo a star:

Star the ProteinQure org on Github:

Using Python to parse through email logs received through several backup systems.

outlook-automated-backup-control Backup monitoring on a mailbox: In this mailbox there will be backup logs. The identification will based on the follo

2 Sep 28, 2022

Demo code for "Logs in distributed systems" webinar

Hexlet Logs Demo Пререквизиты docker-compose python3 Учетка в DataDog Базовое понимание, что такое логи (можно почитать гайд

1 Dec 1, 2021

Cylc: a workflow engine for cycling systems

Cylc: a workflow engine for cycling systems. Repository master branch: core meta-scheduler component of cylc-8 (in development); Repository 7.8.x branch: full cylc-7 system.

205 Dec 20, 2022

Unzip Japanese Shift-JIS zip archives on non-Japanese systems.

Unzip JP GUI Unzip Japanese Shift-JIS zip archives on non-Japanese systems. This script unzips the file while converting the file names from Shift-JIS

9 Dec 7, 2022

A micro-service that can be extended to help in monitoring systems

A micro-service that can be extended to help in monitoring systems. Be extensible to be incorporated in any of the systems to facilitate timely interventions.

1 Feb 6, 2022

A graph neural network (GNN) model to predict protein-protein interactions (PPI) with no sample features

2 Jul 25, 2022

PROTEIN EXPRESSION ANALYSIS FOR DOWN SYNDROME

PROTEIN-EXPRESSION-ANALYSIS-FOR-DOWN-SYNDROME Down syndrome (DS) is a chromosomal disorder where organisms have an extra chromosome 21, sometimes know

1 Jan 20, 2022

Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

NeX: Real-time View Synthesis with Neural Basis Expansion Project Page | Video | Paper | COLAB | Shiny Dataset We present NeX, a new approach to novel

537 Jan 5, 2023

Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

NeX: Real-time View Synthesis with Neural Basis Expansion Project Page | Video | Paper | COLAB | Shiny Dataset We present NeX, a new approach to novel

536 Dec 20, 2022

Code for Mesh Convolution Using a Learned Kernel Basis

Mesh Convolution This repository contains the implementation (in PyTorch) of the paper FULLY CONVOLUTIONAL MESH AUTOENCODER USING EFFICIENT SPATIALLY

35 Jan 3, 2023

NBEATSx: Neural basis expansion analysis with exogenous variables

NBEATSx: Neural basis expansion analysis with exogenous variables We extend the NBEATS model to incorporate exogenous factors. The resulting method, c

100 Dec 31, 2022

An advanced real time threat intelligence framework to identify threats and malicious web traffic on the basis of IP reputation and historical data.

ARTIF is a new advanced real time threat intelligence framework built that adds another abstraction layer on the top of MISP to identify threats and malicious web traffic on the basis of IP reputation and historical data. It also performs automatic enrichment and threat scoring by collecting, processing and correlating observables based on different factors.

225 Dec 31, 2022

Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

NeX: Real-time View Synthesis with Neural Basis Expansion Project Page | Video | Paper | COLAB | Shiny Dataset We present NeX, a new approach to novel

538 Jan 9, 2023

MaD GUI is a basis for graphical annotation and computational analysis of time series data.

MaD GUI Machine Learning and Data Analytics Graphical User Interface MaD GUI is a basis for graphical annotation and computational analysis of time se

Machine Learning and Data Analytics Lab FAU

10 Dec 19, 2022

Used to record WKU's utility bills on a regular basis.

WKU水电费小助手一个用于定期记录WKU水电费的脚本 Looking for English Readme? 背景由于WKU校园内的水电账单系统时常存在扣费延迟的现象，而补扣的费用缺乏令人信服的证明。不少学生为费用摸不着头脑，但也没有申诉的依据。为了更好地掌握水电费使用情况，留下一手证据，我开源

2 Jul 21, 2022

Ansible is a radically simple IT automation platform that makes your applications and systems easier to deploy and maintain. Automate everything from code deployment to network configuration to cloud management, in a language that approaches plain English, using SSH, with no agents to install on remote systems. https://docs.ansible.com.

Ansible Ansible is a radically simple IT automation system. It handles configuration management, application deployment, cloud provisioning, ad-hoc ta

55.9k Jan 4, 2023

An advanced multi-threaded, multi-client python reverse shell for hacking linux systems. There's still more work to do so feel free to help out with the development. Disclaimer: This reverse shell should only be used in the lawful, remote administration of authorized systems. Accessing a computer network without authorization or permission is illegal.

PwnLnX An advanced multi-threaded, multi-client python reverse shell for hacking linux systems. There's still more work to do so feel free to help out

212 Dec 24, 2022

Python toolkit for defining+simulating+visualizing+analyzing attractors, dynamical systems, iterated function systems, roulette curves, and more

Attractors A small module that provides functions and classes for very efficient simulation and rendering of iterated function systems; dynamical syst

1 Aug 4, 2021

generate HPC scheduler systems jobs input scripts and submit these scripts to HPC systems and poke until they finish

DPDispatcher DPDispatcher is a python package used to generate HPC(High Performance Computing) scheduler systems (Slurm/PBS/LSF/dpcloudserver) jobs in

23 Nov 30, 2022

Structural basis for solubility in protein expression systems

Related tags

Overview

Structural basis for solubility in protein expression systems

Aim of the challenge

The dataset

Example output

Automated benchmarking system

Say thanks

You might also like...

Using Python to parse through email logs received through several backup systems.

Demo code for "Logs in distributed systems" webinar

Cylc: a workflow engine for cycling systems

Unzip Japanese Shift-JIS zip archives on non-Japanese systems.

A micro-service that can be extended to help in monitoring systems

A graph neural network (GNN) model to predict protein-protein interactions (PPI) with no sample features

PROTEIN EXPRESSION ANALYSIS FOR DOWN SYNDROME

Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

Code for Mesh Convolution Using a Learned Kernel Basis

NBEATSx: Neural basis expansion analysis with exogenous variables

An advanced real time threat intelligence framework to identify threats and malicious web traffic on the basis of IP reputation and historical data.

Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

MaD GUI is a basis for graphical annotation and computational analysis of time series data.

Used to record WKU's utility bills on a regular basis.

Python toolkit for defining+simulating+visualizing+analyzing attractors, dynamical systems, iterated function systems, roulette curves, and more

generate HPC scheduler systems jobs input scripts and submit these scripts to HPC systems and poke until they finish

Owner

ProteinQure

These are the scripts used for the project of ‘Assembly of a pan-genome for global cattle reveals missing sequence and novel structural variation, providing new insights into their diversity and evolution history’

RFDesign - Protein hallucination and inpainting with RoseTTAFold

emoji-math computes the given python expression and returns either the value or the nearest 5 emojis as measured by cosine similarity.

Expression interpreter written in Python

LSO, also known as Linux Swap Operator, is a software with both GUI and terminal versions that you can manage the Swap area for Linux operating systems.

ASVspoof 2021 Baseline Systems

Control System Packer is a lightweight, low-level program to transform energy equations into the compact libraries for control systems.

Test reproducibility of leiden/umap on different systems

EasyBuild is a software build and installation framework that allows you to manage (scientific) software on High Performance Computing (HPC) systems in an efficient way.

Intelligent Systems Project In Python