Aims to be a next-generation, deep-learning-based (DL) predictor of phenotype from genome mutations.

Overview
Sequence -----------+--> 3D_structure --> 3D_module --+                                      +--> ?
|                   |                                 |                                      +--> ?
|                   |                                 +--> Joint_module --> Hierarchical_CLF +--> ?
|                   |                                 |                                      +--> ?
+-> NLP_embeddings -+-------> Embedding_module -------+                                      +--> ?

ClynMut: Predicting the Clynical Relevance of Genome Mutations (wip)

ClynMut aims to be a next-generation, DL-based predictor of phenotype from genome mutations. It will use state-of-the-art NLP and structural techniques.

Planned modules will likely be:

  • 3D learning module
  • NLP embeddings
  • Joint module + Hierarchical classification

The main idea is for the model to learn the prediction in an end-to-end fashion.
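As a rough illustration of what "end-to-end" could mean for the joint module, here is a minimal PyTorch sketch. It is not the package's actual implementation: the class name `JointFusion`, the hidden size, and the number of output classes are all assumptions made for this example. It concatenates a sequence embedding with a structure embedding, scores the classes, and checks that gradients reach every parameter.

```python
import torch
import torch.nn as nn


class JointFusion(nn.Module):
    """Hypothetical joint module: fuses a sequence and a structure embedding."""

    def __init__(self, seq_dim=512, struct_dim=256, hidden_dim=128, n_classes=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(seq_dim + struct_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, seq_emb, struct_emb):
        # Concatenate the per-protein embeddings along the feature axis.
        joint = torch.cat([seq_emb, struct_emb], dim=-1)
        return self.mlp(joint)  # unnormalized class scores (logits)


torch.manual_seed(0)
fusion = JointFusion()
seq_emb = torch.randn(2, 512)     # stand-in for the NLP embeddings of 2 sequences
struct_emb = torch.randn(2, 256)  # stand-in for the 3D module outputs
logits = fusion(seq_emb, struct_emb)
logits.sum().backward()  # one backward pass reaches every parameter
```

In a real pipeline the random tensors would be replaced by the outputs of the embedding and 3D modules, so the same backward pass would also update them.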

Install

$ pip install clynmut

Example Usage:

import torch
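from clynmut import MutPredict

# Class hierarchy as a nested dict: every node has a "class" name and a list of "children".
hier_graph = {"class": "all",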
              "children": [
                {"class": "effect_1", "children": [
                  {"class": "effect_12", "children": []},
                  {"class": "effect_13", "children": []}
                ]},
                {"class": "effect_2", "children": []},
                {"class": "effect_3", "children": []},
              ]}

model = MutPredict(
    seq_embedd_dim = 512,
    struct_embedd_dim = 256, 
    seq_reason_dim = 512, 
    struct_reason_dim = 256,
    hier_graph = hier_graph,
    dropout = 0.0,
    use_msa = False,
    device = None)

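# Amino acid sequences to score, one string per protein.
seqs = ["AFTQRWHDLKEIMNIDALTWER",
        "GHITSMNWILWVYGFLE"]

# pred_format="dict" requests the predictions as dictionaries.
pred_dicts = model(seqs, pred_format="dict")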

Important topics:

3D structure learning

Several architectures could be used here. I've been working on two of them, which are the most likely candidates.

Hierarchical classification

  • A simple custom helper class has been developed for it.
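One common way to score a class hierarchy like `hier_graph` is a conditional softmax: at each node the children's scores are normalized, then multiplied by the probability of reaching the parent. The sketch below illustrates that idea in plain Python. It is an assumption about how such a helper might work, not the package's helper class; `hierarchical_probs` and the example logits are hypothetical.

```python
import math

# Same hierarchy format as in the usage example.
hier_graph = {"class": "all",
              "children": [
                {"class": "effect_1", "children": [
                  {"class": "effect_12", "children": []},
                  {"class": "effect_13", "children": []}
                ]},
                {"class": "effect_2", "children": []},
                {"class": "effect_3", "children": []},
              ]}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_probs(node, logits, parent_prob=1.0, out=None):
    """Return {class_name: probability} for every class below `node`.

    The children of each node compete in a softmax, and the result is
    scaled by the probability of having reached their parent.
    """
    if out is None:
        out = {}
    children = node["children"]
    if children:
        cond = softmax([logits[child["class"]] for child in children])
        for child, p in zip(children, cond):
            out[child["class"]] = parent_prob * p
            hierarchical_probs(child, logits, out[child["class"]], out)
    return out


# Hypothetical raw scores, one per non-root class.
logits = {"effect_1": 2.0, "effect_2": 0.5, "effect_3": 0.5,
          "effect_12": 1.0, "effect_13": 1.0}
probs = hierarchical_probs(hier_graph, logits)
```

With this scheme the probabilities of sibling classes always sum to the probability of their parent, so the predictions stay consistent with the hierarchy.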

Testing

$ python setup.py test

Datasets:

This package will use SidechainNet, the awesome work by Jonathan King, available at https://github.com/jonathanking/sidechainnet.

To install:

$ pip install git+https://github.com/jonathanking/sidechainnet.git

Or

$ git clone https://github.com/jonathanking/sidechainnet.git
$ cd sidechainnet && pip install -e .

Citations:

@article{pejaver_urresti_lugo-martinez_pagel_lin_nam_mort_cooper_sebat_iakoucheva_et_al_2020,
    title={Inferring the molecular and phenotypic impact of amino acid variants with MutPred2},
    volume={11},
    DOI={10.1038/s41467-020-19669-x},
    number={1},
    journal={Nature Communications},
    author={Pejaver, Vikas and Urresti, Jorge and Lugo-Martinez, Jose and Pagel, Kymberleigh A. and Lin, Guan Ning and Nam, Hyun-Jun and Mort, Matthew and Cooper, David N. and Sebat, Jonathan and Iakoucheva, Lilia M. and others},
    year={2020}
}
@article{rehmat_farooq_kumar_ul_hussain_naveed_2020,
    title={Predicting the pathogenicity of protein coding mutations using Natural Language Processing},
    DOI={10.1109/embc44109.2020.9175781},
    journal={2020 42nd Annual International Conference of the IEEE Engineering in Medicine \& Biology Society (EMBC)},
    author={Rehmat, Naeem and Farooq, Hammad and Kumar, Sanjay and ul Hussain, Sibt and Naveed, Hammad},
    year={2020}
}
@article{pagel_antaki_lian_mort_cooper_sebat_iakoucheva_mooney_radivojac_2019,
    title={Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome},
    volume={15},
    DOI={10.1371/journal.pcbi.1007112},
    number={6},
    journal={PLOS Computational Biology},
    author={Pagel, Kymberleigh A. and Antaki, Danny and Lian, AoJie and Mort, Matthew and Cooper, David N. and Sebat, Jonathan and Iakoucheva, Lilia M. and Mooney, Sean D. and Radivojac, Predrag},
    year={2019},
    pages={e1007112}
}

To do list

    • [x] Add embeddings functionality
    • [ ] Add 3d structure module (likely-to-be GVP/... based)
    • [x] Add classifier
    • [x] Hierarchical classification helper based on differentiability
    • [x] End-to-end code
    • [ ] data collection
    • [ ] data formatting
    • [ ] Run featurization for all data points (esm1b + af2 structs)
    • [ ] Perform a sample training
    • [ ] Perform sample evaluation
    • [ ] Iterate - improve
    • [ ] ...
    • [ ] idk, will see as we go
Releases(0.0.2)
Owner
Eric Alcaide