Retrieve annotated intron sequences and classify them as minor (U12-type) or major (U2-type)

Overview


(intron Interrogator and Classifier)

intronIC is a program that classifies intron sequences as minor (U12-type) or major (U2-type), using either a genome and annotation or the intron sequences themselves. Alternatively, intronIC can simply extract all intron sequences without classification (using -s).

Installation

via pip

If you have (or can get) pip, running it against this repository is the easiest way to install the most recent version of intronIC. If you have multiple versions of Python installed, be sure to use the appropriate Python 3 executable (e.g. python3) in the following commands:

python3 -m pip install git+https://github.com/glarue/intronIC

Alternatively, you can get the latest stable version published to PyPI:

python3 -m pip install intronIC

If successful, intronIC should now be callable from the command-line.

To upgrade to the latest version from a previous one, include --upgrade in either of the previous pip commands, e.g.

python3 -m pip install git+https://github.com/glarue/intronIC --upgrade

via git clone

Otherwise, you can simply clone this repository to your local machine using git:

git clone https://github.com/glarue/intronIC.git
cd intronIC/intronIC

If you clone the repo, you may also wish to add intronIC/intronIC to your system PATH (how best to do this depends on your platform).

See the wiki for more detailed information about configuration/run options.

Dependencies

To install dependencies separately using pip, do

python3 -m pip install numpy scipy matplotlib 'scikit-learn>=0.22' biogl

intronIC was built and tested on Linux, but should run on Windows or Mac OSes without too much trouble (I say that now...).

Useful arguments

The required arguments for any classification run include a name (-n; see note below), along with either of the following:

  • Genome (-g) and annotation/BED (-a, -b) files

    β€”ORβ€”

  • Intron sequences file (-q) (see Training-data-and-PWMs for formatting information, which matches the reference sequence format)

By default, intronIC includes non-canonical introns, and considers only the longest isoform of each gene. Helpful arguments may include:

  • -p parallel processes, which can significantly reduce runtime

  • -f cds use only CDS features to identify introns (by default, uses both CDS and exon features)

  • --no_nc exclude introns with non-canonical (non-GT-AG/GC-AG/AT-AC) boundaries

  • -i include introns from multiple isoforms of the same gene (default: longest isoform only)

Running on test data

  • If you have installed via pip, first download the chromosome 19 FASTA and GFF3 sample files into a directory of your choice.

  • If you have cloned the repo, first change to the /intronIC/intronIC/test_data subdirectory, which contains Ensembl annotations and sequence for chromosome 19 of the human genome. Replace intronIC with ../intronIC.py in the following examples.

Classify annotated introns

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens

The various output files contain different information about each intron; entries can be cross-referenced using the intron label (usually the first column of each file). By default, U12-type introns are those with probability scores >90%, or equivalently (depending on the output file) relative scores >0. For example, here is a U12-type AT-AC intron from the meta.iic file:

HomSap-gene:ENSG00000141837@transcript:ENST00000614285-intron_1(47);[c:-1]      10.0    AT-AC   GCC|ATATCCTTTT...TTTTCCTTAATT...AATAC|TCC       CACCTCCAACACCCTTCTTTTCTTTGAACAAGAT[TTTTCCTTAATT]CCCCAATAC       50719   transcript:ENST00000614285      gene:ENSG00000141837    1       47      3.9     2       u12     cds

To retrieve all U12-type introns from this file, one can filter based on the relative score (2nd column; U12-type introns have relative scores >0), e.g.

awk '($2!="." && $2>0)' homo_sapiens.meta.iic
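The same filter can be written in Python. This is a minimal sketch, assuming only what the text above states: the relative score is the second tab-separated column, and unscored introns are marked with a "." there. The helper name u12_labels is illustrative, not part of intronIC.

```python
import csv

def u12_labels(meta_path):
    """Yield labels of introns whose relative score (column 2) is > 0."""
    with open(meta_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2 or row[1] == ".":
                continue  # "." marks introns excluded from scoring
            if float(row[1]) > 0:  # relative score > 0 => U12-type
                yield row[0]
```

For example, `list(u12_labels("homo_sapiens.meta.iic"))` would return the labels of all U12-type introns from the run above.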

Extract all annotated intron sequences

If you just want to retrieve all annotated intron sequences (without classification), add the -s flag:

intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens -s

See the rest of the wiki for more details about output files, etc.

A note on the -n (name) argument

By default, intronIC expects names in binomial (genus, species) form separated by a non-alphanumeric character, e.g. 'homo_sapiens', 'homo.sapiens', etc. intronIC then formats that name internally into a tag that it uses to label all output intron IDs, ignoring anything past the second non-alphanumeric character.

Output files, on the other hand, are named using the full name supplied via -n. If you'd prefer intronIC to leave the -n argument unmodified, use the --na flag.

If you are running multiple versions of the same species and would like to keep the same species abbreviations in the output intron data, simply add a tag to the end of the name, e.g. "homo_sapiens.v2"; the tags within files will be consistent ("HomSap"), but the file names across runs will be distinct.
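The tag derivation described above can be approximated as follows. This is an illustrative sketch of the documented behavior (capitalized three-letter abbreviations of genus and species, ignoring anything past the second non-alphanumeric separator), not intronIC's actual implementation.

```python
import re

def species_tag(name):
    """Approximate intronIC's internal tag: 'homo_sapiens' -> 'HomSap'.

    Splits on non-alphanumeric characters, keeps only the first two
    fields (genus, species), and joins the capitalized first three
    letters of each.
    """
    genus, species = re.split(r"[^A-Za-z0-9]", name)[:2]
    return genus[:3].capitalize() + species[:3].capitalize()
```

Under this sketch, both "homo_sapiens" and "homo.sapiens.v2" yield the same tag, consistent with the versioned-run behavior described above.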

Resource usage

For genomes with a large number of annotated introns, memory usage can reach several gigabytes, though this should rarely be a problem on modern personal computers. For reference, the Ensembl 95 release of the human genome requires ~5 GB of memory.

For many non-model genomes, intronIC should run fairly quickly (tens of minutes). For human and other very well-annotated genomes, runtime may be longer (the human Ensembl 95 release takes ~20-35 minutes in testing); runtime scales roughly linearly with the total number of annotated introns, and can be reduced by using parallel processes via -p.

See the rest of the wiki for more detailed instructions.

Cite

If you find this tool useful, please cite:

Devlin C Moyer, Graham E Larue, Courtney E Hershberger, Scott W Roy, Richard A Padgett, Comprehensive database and evolutionary dynamics of U12-type introns, Nucleic Acids Research, Volume 48, Issue 13, 27 July 2020, Pages 7066–7078, https://doi.org/10.1093/nar/gkaa464

About

intronIC was written to provide a customizable, open-source method for identifying minor (U12-type) spliceosomal introns from annotated intron sequences. Minor introns usually represent ~0.5% (at most) of a given genome's introns, and contain distinct splicing motifs which make them amenable to bioinformatic identification.

Earlier minor intron resources (U12DB, SpliceRack, ERISdb, etc.), while important contributions to the field, are static by design. As such, these databases fail to reflect the dramatic increase in available genome sequences and annotation quality of the last decade.

In addition, other published identification methods employ a certain amount of heuristic fuzziness in defining the classification criteria of their U12-type scoring systems (i.e., how "U12-like" does an intron need to look before being called U12-type?). intronIC delegates this decision to the well-established support-vector machine (SVM) classification method, which produces an easy-to-interpret "probability of being U12-type" score for each intron.
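The classification idea can be sketched directly with scikit-learn. The one-dimensional "scores" and their separation below are synthetic stand-ins, not intronIC's actual score vectors; the point is only that an SVM with probability calibration yields a per-intron P(U12-type).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic 1-D scores: U2-like introns cluster near 0, U12-like near 5
u2_scores = rng.normal(0.0, 1.0, size=(200, 1))
u12_scores = rng.normal(5.0, 1.0, size=(50, 1))
X = np.vstack([u2_scores, u12_scores])
y = np.array([0] * 200 + [1] * 50)  # 1 = U12-type

# probability=True enables Platt scaling, giving a calibrated
# "probability of being U12-type" for each intron
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
p_u12 = clf.predict_proba([[5.2], [0.1]])[:, 1]
```

A score like p_u12 > 0.9 corresponds to the >90% probability threshold intronIC uses by default to call an intron U12-type.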

Furthermore, intronIC provides researchers the opportunity to tailor the underlying training data/position-weight matrices, should they have species-specific data to take advantage of.

Finally, intronIC performs a fair amount of bookkeeping during the intron collection process, resulting in (potentially) useful metadata about each intron, including parent gene/transcript, ordinal index and phase; information which (as far as I'm aware) is otherwise somewhat non-trivial to acquire.


Comments
  • [BUG] intronIC not working for example data and own data

    I am trying to run intronIC with the example data and my own data, and it is not working.

    COMMAND:

    intronIC -g Homo_sapiens.Chr19.Ensembl_91.fa.gz -a Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens (same as wiki)

    FEEDBACK:

    [#] Starting intronIC [v1.1.1] run on [homo_sapiens (HomSap)]
    [#] Run command: [/home/rocesv/anaconda3/envs/Seidr/bin/intronIC -g /mnt/e/Gymnosperms_Comparative/Gymnosperms_ComparativeGenomics/Introns_U2vsU12/Homo_sapiens.Chr19.Ensembl_91.fa.gz -a /mnt/e/Gymnosperms_Comparative/Gymnosperms_ComparativeGenomics/Introns_U2vsU12/Homo_sapiens.Chr19.Ensembl_91.gff3.gz -n homo_sapiens]
    [#] Using [cds,exon] features to define introns
    [#] [58933] introns found in [Homo_sapiens.Chr19.Ensembl_91.gff3.gz]
    [#] [38681] introns with redundant coordinates excluded
    [#] [8178] introns omitted from scoring based on the following criteria:
    [#] * short (<30 nt): 66
    [#] * ambiguous nucleotides in scoring regions: 0
    [#] * non-canonical boundaries: 0
    [#] * overlapping coordinates: 0
    [#] * not in longest isoform: 8112
    [#] Most common non-canonical splice sites:
    [#] * AT-AG (16/328, 4.88%)
    [#] * GT-TG (12/328, 3.66%)
    [#] * GG-AG (12/328, 3.66%)
    [#] * GA-AG (11/328, 3.35%)
    [#] * AG-AG (10/328, 3.05%)
    [#] [24] ([15] unique, [9] redundant) putatively misannotated U12-type introns corrected in [homo_sapiens.annotation.iic]
    [#] [12074] introns included in scoring analysis
    [#] Scoring introns using the following regions: [five, bp]
    [#] Raw scores calculated for [20690] U2 and [387] U12 reference introns
    [#] Raw scores calculated for [12074] experimental introns
    [#] Training set score vectors constructed: [20690] U2, [387] U12
    [#] Training SVM using reference data
    Starting optimization round 1/5
    Traceback (most recent call last):
      File "/home/rocesv/anaconda3/envs/Seidr/bin/intronIC", line 8, in <module>
        sys.exit(main())
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5216, in main
        finalized_introns, model, u12_count, atac_count, demoted_swaps = apply_scores(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 3804, in apply_scores
        model, model_performance = optimize_svm(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5512, in optimize_svm
        search_model, performance = train_svm(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/intronIC/intronIC.py", line 5431, in train_svm
        model = GridSearchCV(
      File "/home/rocesv/anaconda3/envs/Seidr/lib/python3.9/site-packages/sklearn/utils/validation.py", line 63, in inner_f
        return f(*args, **kwargs)
    TypeError: __init__() got an unexpected keyword argument 'iid'


    In both cases I have the same problem and the log is similar. Any ideas? I am very interested in using this amazing tool.

    Thank you in advance :)

    PS: Running in a conda env with Python 3.9 (WSL 2, Ubuntu 20.04, Windows 10 Pro)

    opened by RocesV 2
Releases(v1.3.7)
  • v1.3.7(Jun 10, 2022)

  • v1.3.6(Jun 10, 2022)

    • Deal with edge-case issue where a gene feature has children exon/CDS features in a direct parent-child relationship. Previously, this would bypass the recursive search for introns used by get_introns() due to an early exit, resulting in preferential inclusion of introns whose Parent attribute was the gene itself rather than a child transcript.
    • Remove old code/fix whitespace
    • Update __version__ paradigm
    • Remove Physarum-specific branch-point PWM code
  • v1.3.2(Oct 20, 2021)

    Misc. minor changes not affecting functionality.

    Switch to limiting master (soon to be main) to point releases, with development code confined to dev.

  • v1.3.0(Jul 23, 2021)

    • Changes default scoring behavior to include all three (5', BPS and 3') regions, instead of the previous default of just 5' and BPS. The 3' region typically shows less differentiation between U2- and U12-type introns, but may help reduce false-positive and false-negative classifier calls in edge cases. It could, of course, also introduce FPs and/or FNs, although in my experience using all three regions tends to be more conservative.
    • Misc. minor internal changes.
  • v1.2.0(Feb 8, 2021)

    intronIC v1.2.0

    • Fix GridSearchCV regression with newer versions of scikit-learn (>v0.22) (see issue #1)
    • Due to scikit-learn's inversion of a default flag in GridSearchCV, intronIC must now require scikit-learn to be at least v0.22
    • This fix breaks compatibility with scikit-learn versions <v0.22
  • v1.1.1(Dec 5, 2020)

    intronIC v1.1.1

    • Replace parent-child hierarchical clustering of annotation features with simpler, directed graph-based approach
    • Fix occasional issues where parent genes of CDS/exon features weren't correctly identified
  • v1.1.0(Oct 31, 2020)

    This release includes a number of changes to the underlying data: the default PWMs have been changed to a slightly less-stringent set, which should leave most results relatively unchanged and addresses some edge cases where the original PWMs were overly penalizing certain base positions due to being built from low-N samples. Other changes include:

    • Default 3'SS region shortened to [-6, 4]
    • By default, the human U2-type BPS PWM is used instead of the on-the-fly version. A per-run PWM can be generated using --generate_u2_bps_pwm
    • z-scores in the output have been adjusted to correspond to the entire dataset (previously, they were based on the training set only)
    • Non-canonical introns by default now use whatever PWM is closest to their terminal dinucleotides if one is obvious (e.g. for AT-TC introns, this would be the AT-AC PWM; for AT-AG introns, GT-AG and AT-AC are equally close in terms of edit distance). Otherwise, the terminal dinucleotides will be ignored and the best PWM will be selected based on the geometric mean of the component scores from each PWM. This can be reverted to the old behavior using --no_ignore_nc_dnts
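    The dinucleotide-matching rule in the last bullet can be sketched as follows. The PWM names and tie-breaking behavior here are illustrative assumptions based on the release notes above (using Hamming distance between terminal dinucleotides), not intronIC's actual code.

    ```python
    def dnt_distance(a, b):
        """Hamming distance between two equal-length dinucleotide strings."""
        return sum(x != y for x, y in zip(a, b))

    # Canonical terminal dinucleotides (5'SS, 3'SS) for each PWM set
    PWM_BOUNDARIES = {"GT-AG": ("GT", "AG"), "AT-AC": ("AT", "AC")}

    def closest_pwm(five, three):
        """Return the PWM whose boundaries are uniquely closest, else None.

        For an AT-TC intron this picks AT-AC; for an AT-AG intron,
        GT-AG and AT-AC are equidistant, so no single PWM is "obvious"
        and None is returned (the fallback would then compare the
        geometric means of the component scores from each PWM).
        """
        dists = {name: dnt_distance(five, f) + dnt_distance(three, t)
                 for name, (f, t) in PWM_BOUNDARIES.items()}
        best = min(dists.values())
        winners = [name for name, d in dists.items() if d == best]
        return winners[0] if len(winners) == 1 else None
    ```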
  • v1.0.14(Oct 30, 2020)

    intronIC v1.0.14

    • Uses human U2-type BPS PWM (data from Pineda 2018) by default. To restore the previous paradigm wherein U2-type BPS PWMs are generated on-the-fly using the best match to U12-type BPS motifs in likely U2-type introns, pass --generate_u2_bps_pwm.
  • v1.0.13(Sep 6, 2020)

    intronIC v1.0.13

    • Add best U2-type BPS to meta.iic output file. Previously, only the best U12-type BPS sequence was reported. In certain cases, it may be useful to know which U2-type sequence was used in determining the BPS log-ratio score.
    • Reduce formatting stringency for custom PWMs This should reduce headaches if folks are adding their own PWMs by ignoring case, etc.
    • Add clause to terminate multiprocessing pool processes on forced exit There were cases I'd noticed in my own usage when force-exiting (e.g. via ctrl-c) where zombie processes would persist. Wrapping the whole thing in a try/except/finally seems to eliminate the issue (limited testing).
  • v1.0.12(Aug 21, 2020)

Owner
Graham Larue
PhD candidate in bioinformatics and molecular evolution at UC Merced