Incubator for useful bioinformatics code, primarily in Python and R

Brad Chapman

Last update: Jan 3, 2023

Overview

Collection of useful code related to biological analysis. Much of this is discussed with examples at Blue collar bioinformatics.

All code, images and documents in this repository are freely available for all uses. Code is available under the MIT license, and images, documentation and talks under the Creative Commons No Rights Reserved (CC0) license.

Some projects which may be especially interesting:

CloudBioLinux -- An automated environment to install useful biological software and libraries. This is used to bootstrap blank machines, such as those you'd find on cloud providers like Amazon, into ready-to-go analysis workstations. See the CloudBioLinux effort for more details. This project has moved to its own repository: https://github.com/chapmanb/cloudbiolinux
gff -- A GFF parsing library in Python, aimed at inclusion in Biopython.
nextgen -- A Python toolkit providing best-practice pipelines for fully automated high-throughput sequencing analysis. This project has moved to its own repository: https://github.com/chapmanb/bcbio-nextgen
distblast -- A distributed BLAST analysis for identifying best hits in a wide variety of organisms for downstream phylogenetic analyses. The code is generalized to run on local multi-processor machines and distributed Hadoop clusters.

Comments

biopython->numpy interactive (y/n) while deploying pipeline

Even with numpy >=1.6.1 listed before biopython in setup.py's install_requires, the following message pops up:

Numerical Python (NumPy) is not installed.

This package is required for many Biopython features.  Please install
it before you install Biopython. You can install Biopython anyway, but
anything dependent on NumPy will not work. If you do this, and later
install NumPy, you should then re-install Biopython.

You can find NumPy at http://numpy.scipy.org

Do you want to continue this installation? (y/N):

Apparently install_requires packages are not installed in order, so no dependency order can be defined that way... are you aware of any "pre_install_requires" or similar in setuptools ? Couldn't find it after quickly checking docs :-/
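One way to sidestep the ordering question is to install the requirements one at a time, in an explicit order, before the package itself is set up. The sketch below is only an illustration of that idea and is not part of this repository: the `ORDERED_REQUIREMENTS` list and both function names are hypothetical, and it assumes `pip` is available to the running interpreter.

```python
import subprocess
import sys

# Hypothetical ordered list: NumPy first, so it is present when Biopython installs.
ORDERED_REQUIREMENTS = ["numpy>=1.6.1", "biopython"]


def pip_install_commands(requirements):
    """Return one `pip install` command per requirement, preserving order."""
    return [[sys.executable, "-m", "pip", "install", req] for req in requirements]


def install_in_order(requirements, dry_run=True):
    """Run the commands one at a time, stopping at the first failure.

    With dry_run=True the commands are only printed, not executed.
    """
    commands = pip_install_commands(requirements)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)
    return commands


if __name__ == "__main__":
    install_in_order(ORDERED_REQUIREMENTS, dry_run=True)
```

Because each requirement is a separate `pip` invocation, the second one only starts after the first has finished, which is the ordering guarantee that a single `install_requires` list does not appear to give.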

opened by brainstorm 14

barcode_sort_trim.py

I'm not sure whether this is a problem in the latest version of barcode_sort_trim.py.

After updating to the pipeline with FastQC, I get an extra base 'A' at the 3' end of read 1.

opened by tanglingfung 13

FastQC vs SolexaQA

Brad, It's not really an issue. But I want to know, from your experience, how much time you would save from switching to FastQC from SolexaQA?

Thanks, Paul

opened by tanglingfung 13

doing bcl->qseq->fastq->analysis->galaxy in one machine

Hi,

We have a different setup here: the drive with the bcl files is mounted on the analysis machine, and we would do everything there. Do you recommend we keep the messaging system in the pipeline? Just looking for some advice.

Thanks, Paul

opened by tanglingfung 13

picard_sam_to_bam.py

Hi Brad,

It seems that it keeps looking for CreateSequenceDictionary in /usr/share/java/picard even though I have specified another path in my config file. I have tried running the setup again after modifying the config files, but it still doesn't use the path I specified.

Also, I don't seem to have specified the path to hg19.fa for GATK?

Thanks, Paul

opened by tanglingfung 12

Convert GFF file to Sequin TBL file

Submitting to GenBank requires converting a GFF file to a Sequin TBL file, which is then converted to ASN.1 using tbl2asn. I have searched, and I have not found a good (or any, really) converter from GFF to Sequin TBL. Would you be interested in adding such a tool? Here's the hacky script that I cobbled together for this purpose: gff3-to-tbl. It's not general purpose, but could be a useful starting point.

opened by sjackman 11

merging of demuxed fastq files and project-based analyses

Hi Brad,

more of a question than an issue. I noticed you've added code (bcbio.pipeline.sample.merge_sample) to merge samples across lanes. I've been using save_diskspace=true in order to remove sam files, but this I noticed also removes the demultiplexed files, right? I just want to make sure because it affects our data delivery routines, as outlined below.

In our setup, we have situations when we run several projects on one lane, which we distinguish with an extra "description" tag in run_info, so in principle each barcode could have a description with a different project name. We then partition fastq files in a lane based on the description tag when delivering data to customers.

On a similar note, when I do analyses for customers, I've been doing it on a project-by-project basis (it makes more sense to me), and therefore written helper scripts (project_*, see EDIT: https://github.com/percyfal/bcbb/tree/develop/nextgen/scripts) for this purpose. project_analysis_pipeline.sh is almost a copy of automated_initial_analysis.py, but starts off with demultiplexed files. Have you had this functionality in mind (or is it even already there)?

Cheers,

Per

opened by percyfal 11

Trailing Illumina 'A' and demultiplexing

Hi Brad,

We are seeing some issues with unexpectedly many reads ending up in the 'unmatched' category after demultiplexing. After digging around a little, we think that this may be related to the trailing 'A' that the Illumina machines add after the barcode.

More specifically, we allow one mismatch and no indels for the demuxing. It seems that the reads that are unexpectedly classified as unmatched have one mismatch in the actual 6-nucleotide barcode and, in addition, have the trailing 'A' nucleotide miscalled.

Reading the code, it does indeed seem that for Illumina reads, the last 7 nucleotides, including the trailing 'A', of each read are matched when demultiplexing. Can you confirm that this is the case?

Our preference is to match just the 6-mer index sequence, excluding the last nucleotide in the read and it would be nice to have this done by default for Illumina reads, or at least be able to influence this behavior with a configuration option. What do you think?
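The behavior requested above can be illustrated with a small, self-contained sketch. This is not the pipeline's demultiplexing code: the function names, the `trailing_bases` option and the barcode dictionary layout are all hypothetical, and the sketch only assumes that the index sits at the 3' end of the read, followed by the trailing 'A'.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def match_barcode(read_seq, barcodes, max_mismatches=1, trailing_bases=1):
    """Return the name of the single matching barcode, or None.

    barcodes maps a sample name to its index sequence (hypothetical layout).
    trailing_bases is the number of bases after the index to ignore,
    e.g. 1 for the trailing 'A'. Indels are not considered.
    """
    end = len(read_seq) - trailing_bases
    hits = []
    for name, index in barcodes.items():
        start = end - len(index)
        if start < 0:
            continue  # read too short to contain this index
        if hamming(read_seq[start:end], index) <= max_mismatches:
            hits.append(name)
    # Ambiguous reads (several barcodes within the threshold) stay unmatched.
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    read = "GGGGACGTTCC"  # one mismatch in the 6-mer, trailing 'A' miscalled as 'C'
    print(match_barcode(read, {"s1": "ACGTAC", "s2": "TTGGCC"}))
    # Matching the 7-mer (index plus 'A') costs a second mismatch:
    print(match_barcode(read, {"s1": "ACGTACA", "s2": "TTGGCCA"}, trailing_bases=0))
```

With the trailing base excluded, the example read is assigned to `s1`; when the 7-mer is matched instead, the miscalled 'A' uses up the mismatch budget and the read ends up unmatched.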

Thanks /Pontus

opened by b97pla 11

GFFExaminer() displaying empty dict for UCSC GTF

I tried following http://biopython.org/wiki/GFF_Parsing to parse a UCSC-generated GTF file.

After executing

pprint.pprint(examiner.parent_child_map(handle))

the output was

{}

Similarly,

examiner.available_limits(handle)

produced

3: {'gff_id': {}, 'gff_source': {}, 'gff_source_type': {}, 'gff_type': {}}

Trying to parse that same file with

from BCBio import GFF
for rec in GFF.parse(handle):
    print rec

produced

ID: chr1
Name: <unknown name>
Description: <unknown description>
Number of features: 2
UnknownSeq(14409, alphabet = Alphabet(), character = '?')

Here are the first 10 lines from the GTF in question

chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1 hg19_knownGene exon 12613 12721 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 12595 12721 0.000000 + 1 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene exon 12595 12721 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 13403 13636 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene stop_codon 13637 13639 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";

opened by spock 10

num_cores: messages; socket.timeout: timed out

Hey Brad, we are trying to use the distributed version of the pipeline.

We have a couple of test sets that we use to quickly see if the pipeline is working. One that takes the normal pipeline about 3 hours to finish, and another much smaller that takes about 7 minutes (this is with 8 cores).

When running the small test set on the messaging variant all files get generated as they should, and the program exits properly. Note that this small set consists of fastq files which are only 12 lines each, and I'm guessing much of the analysis gets skipped due to a lack of data.

When we run the messaging version of the pipeline on the larger set, the programs work for a while (the time varies, but say between 45 minutes and 1 hour 30 minutes), and then one of the jobs crashes with a socket.timeout error. I believe this specific job is some kind of master that coordinates what the other jobs should be doing.

I'll include the output of that job here:

[2012-02-25 02:55:26,856] Found YAML samplesheet, using /proj/a2010002/nobackup/illumina/pipeline_test/archive/000101_SN001_001_AABCD99XX/run_info.yaml instead of Galaxy API
Traceback (most recent call last):
  File "/bubo/home/h10/vale/.virtualenvs/devel/bin/automated_initial_analysis.py", line 7, in <module>
    execfile(__file__)
  File "/bubo/home/h10/vale/bcbb/nextgen/scripts/automated_initial_analysis.py", line 117, in <module>
    main(*args, **kwargs)
  File "/bubo/home/h10/vale/bcbb/nextgen/scripts/automated_initial_analysis.py", line 48, in main
    run_main(config, config_file, fc_dir, work_dir, run_info_yaml)
  File "/bubo/home/h10/vale/bcbb/nextgen/scripts/automated_initial_analysis.py", line 65, in run_main
    lane_items = run_parallel("process_lane", lanes)
  File "/bubo/home/h10/vale/bcbb/nextgen/bcbio/distributed/messaging.py", line 28, in run_parallel
    return runner_fn(fn_name, items)
  File "/bubo/home/h10/vale/bcbb/nextgen/bcbio/distributed/messaging.py", line 67, in _run
    while not result.ready():
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/result.py", line 306, in ready
    return all(result.ready() for result in self.results)
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/result.py", line 306, in <genexpr>
    return all(result.ready() for result in self.results)
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/result.py", line 108, in ready
    return self.status in self.backend.READY_STATES
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/result.py", line 196, in status
    return self.state
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/result.py", line 191, in state
    return self.backend.get_status(self.task_id)
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/backends/base.py", line 237, in get_status
    return self.get_task_meta(task_id)["status"]
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/backends/amqp.py", line 128, in get_task_meta
    return self.poll(task_id)
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/backends/amqp.py", line 153, in poll
    with self.app.pool.acquire_channel(block=True) as (_, channel):
  File "/sw/comp/python/2.7.1_kalkyl/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/connection.py", line 789, in acquire_channel
    yield connection, connection.default_channel
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/connection.py", line 593, in default_channel
    self.connection
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/connection.py", line 586, in connection
    self._connection = self._establish_connection()
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/connection.py", line 546, in _establish_connection
    conn = self.transport.establish_connection()
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/transport/amqplib.py", line 252, in establish_connection
    connect_timeout=conninfo.connect_timeout)
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/transport/amqplib.py", line 62, in __init__
    super(Connection, self).__init__(*args, **kwargs)
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/amqplib/client_0_8/connection.py", line 129, in __init__
    self.transport = create_transport(host, connect_timeout, ssl)
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/amqplib/client_0_8/transport.py", line 281, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/amqplib/client_0_8/transport.py", line 85, in __init__
    raise socket.error, msg
socket.timeout: timed out
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[DEBUG/MainProcess] running the remaining "atexit" finalizers

Have you encountered any issues with socket.timeout? Any ideas what we might be doing wrong?
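The traceback shows the timeout being raised while the coordinating job polls `result.ready()`, at the point where a new broker connection is opened. A generic mitigation for transient connection failures is to retry the polling call a few times before giving up. The sketch below only illustrates that pattern: `call_with_retries` and its parameters are hypothetical, it is not the pipeline's code, and it does nothing about the underlying reason the broker was unreachable.

```python
import socket
import time


def call_with_retries(fn, retries=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call fn(); on socket.timeout wait and retry.

    The exception is re-raised once `retries` retries have been used up.
    The wait grows by a factor of `backoff` after each failed attempt.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except socket.timeout:
            attempt += 1
            if attempt > retries:
                raise
            sleep(delay)
            delay *= backoff


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_ready():
        # Stand-in for a polling call such as result.ready(); fails twice.
        state["calls"] += 1
        if state["calls"] < 3:
            raise socket.timeout("timed out")
        return True

    print(call_with_retries(flaky_ready, delay=0.1))
```

In a polling loop like the one in the traceback, the wrapped call would be the readiness check itself, so a single dropped connection would no longer end the whole run.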

opened by vals 9

GFF parsing fails with most recent version of BioPython

Overview

After upgrading to Biopython 1.68, GFF.parse() is now failing where it had no issues before.

To Reproduce

In a new virtualenv environment, run:

pip install numpy
pip install biopython
pip install bcbio-gff

wget http://tritrypdb.org/common/downloads/release-27/TcruziCLBrenerEsmeraldo-like/gff/data/TriTrypDB-27_TcruziCLBrenerEsmeraldo-like.gff

Next, launch python and run:

>>> from BCBio import GFF
>>> gff = 'TriTrypDB-27_TcruziCLBrenerEsmeraldo-like.gff'
>>> x=list(GFF.parse(gff))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/GFFParser.py", line 737, in parse
    target_lines):
  File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/GFFParser.py", line 327, in parse_in_parts
    cur_dict = self._results_to_features(cur_dict, results)
  File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/GFFParser.py", line 367, in _results_to_features
    results.get('child', []))
  File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/GFFParser.py", line 428, in _add_parent_child_features
    children)
  File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/GFFParser.py", line 471, in _add_children_to_parent
    cur_child, _ = self._add_children_to_parent(cur_child, children)
  File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/GFFParser.py", line 477, in _add_children_to_parent
    cur_parent.sub_features.append(cur_child)
AttributeError: 'SeqFeature' object has no attribute 'sub_features'
>>> import Bio
>>> Bio.__version__
'1.68'

The same code worked with Biopython 1.67, so it seems likely to be an issue resulting from changes made in the 1.68 release.

opened by khughitt 8

docs: Fix a few typos

There are small typos in:

posts/conferences/bosc2018_day1a.md

posts/seminars/tumor_heterogeneity_carter.md

Fixes:

Should read suppressors rather than supressors.

Should read service rather than serivce.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

opened by timgates42 0

glimmergff_to_proteins.py / Alternative Codon Table

Hi,

I'd like to add this issue for those who'd like to use the script with an alternative codon table.

Example: If you want to translate the sequence with a codon table 6 you need to change the script like the following, protein_seq = gene_seq.translate(6)
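To see what switching tables changes, here is a dependency-free sketch that translates the same DNA with NCBI table 1 (standard) and table 6 (ciliate nuclear). It is only an illustration and is not taken from the script: the `translate` helper below is hypothetical, and in the script itself the Biopython call quoted above does the work. In table 6, TAA and TAG encode glutamine (Q) instead of a stop.

```python
from itertools import product

BASES = "TCAG"
# Amino acids in NCBI order: codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    6: "FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def codon_map(table_id):
    """Build a codon -> amino acid dictionary for the given NCBI table id."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, AMINO_ACIDS[table_id]))


def translate(dna, table_id=1):
    """Translate DNA in frame 0; unknown codons become 'X', '*' marks a stop."""
    table = codon_map(table_id)
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    seq = "ATGTAATAGTGA"
    print(translate(seq, 1))  # M***
    print(translate(seq, 6))  # MQQ*
```

Only two codons differ between the two tables, but any gene that uses them internally would be truncated at the first one if the standard table were applied.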

Regards, Zeynep

opened by zeynepkurtw 0

glimmergff_to_proteins.py / Reordering Fasta file

Hi,

Thank you for this very useful script!!

I was wondering if it's possible to create the protein multi fasta file with the order of ref_file (contigs) instead of the glimmer_file (gff) file?

I've re-ordered my assembly from large contigs to smaller ones. However, when I run this script I got my protein multi fasta file in the order of glimmer.gff instead of the contigs.
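One possible approach is to collect the predicted proteins first and then sort them by the position of their contig in the reference file before writing. The sketch below shows only the sorting step on plain tuples. The record layout `(contig_id, protein_id, sequence)` and the function name are hypothetical, not something the script defines.

```python
def reorder_by_contigs(protein_records, contig_order):
    """Sort (contig_id, protein_id, sequence) tuples to follow contig_order.

    contig_order is the list of contig ids as they appear in the reference
    FASTA. sorted() is stable, so proteins on the same contig keep the order
    they had in the GFF. Contigs missing from contig_order sort to the end.
    """
    rank = {contig_id: i for i, contig_id in enumerate(contig_order)}
    return sorted(protein_records, key=lambda rec: rank.get(rec[0], len(rank)))


if __name__ == "__main__":
    # Illustrative records in GFF order; ids and sequences are made up.
    records = [
        ("contig_2", "p1", "MKV"),
        ("contig_1", "p2", "MAA"),
        ("contig_2", "p3", "MTT"),
    ]
    for contig_id, protein_id, seq in reorder_by_contigs(records, ["contig_1", "contig_2"]):
        print(">%s %s\n%s" % (protein_id, contig_id, seq))
```

The contig order itself could be read from the headers of the reference file in a first pass, so the reference sequences would not need to be held in memory for the sort.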

Regards, Zeynep

opened by zeynepkurtw 0

Any chance a new release will be made sometime soon?

I was just wondering if it would be possible for a new release of the gff package to be made sometime soon. The fix from #126 would be really nice to have in a released version.

opened by DavyCats 0

IndexError with NCBI gff

Hi!

I annotated a bacterium (Acidipropionibacterium acidipropionici - strain FAM19036) with NCBI PGAP.

I wanted to create SeqIO-objects from the gff file, but it failed:

import pprint
from BCBio import GFF
from BCBio.GFF import GFFExaminer
examiner = GFFExaminer()
with open('data/FAM19036/annot.gff') as in_handle:
    pprint.pprint(examiner.available_limits(in_handle))
print("------------------------------------------------------------")
with open('FAM19036/annot.gff') as in_handle:
    for rec in GFF.parse(in_handle):
        print(rec)

{'gff_id': {('CP040634.1',): 6772},
 'gff_source': {('.',): 3361,
                ('GeneMarkS-2+',): 360,
                ('Local',): 1,
                ('Protein Homology',): 2916,
                ('cmsearch',): 24,
                ('tRNAscan-SE',): 110},
 'gff_source_type': {('.', 'exon'): 8,
                     ('.', 'gene'): 3208,
                     ('.', 'pseudogene'): 137,
                     ('.', 'rRNA'): 8,
                     ('GeneMarkS-2+', 'CDS'): 360,
                     ('Local', 'region'): 1,
                     ('Protein Homology', 'CDS'): 2916,
                     ('cmsearch', 'RNase_P_RNA'): 1,
                     ('cmsearch', 'SRP_RNA'): 1,
                     ('cmsearch', 'exon'): 7,
                     ('cmsearch', 'rRNA'): 4,
                     ('cmsearch', 'riboswitch'): 10,
                     ('cmsearch', 'tmRNA'): 1,
                     ('tRNAscan-SE', 'exon'): 55,
                     ('tRNAscan-SE', 'tRNA'): 55},
 'gff_type': {('CDS',): 3276,
              ('RNase_P_RNA',): 1,
              ('SRP_RNA',): 1,
              ('exon',): 70,
              ('gene',): 3208,
              ('pseudogene',): 137,
              ('rRNA',): 12,
              ('region',): 1,
              ('riboswitch',): 10,
              ('tRNA',): 55,
              ('tmRNA',): 1}}
------------------------------------------------------------

Error
Traceback (most recent call last):
  File "/usr/lib64/python3.7/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/lib64/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "/project/gene_loci_comparison/test_gene_loci_comparison.py", line 129, in test_recreate_gff_bug
    for rec in GFF.parse(in_handle):
  File "/project/venvs/gene_loci_comparison/lib64/python3.7/site-packages/BCBio/GFF/GFFParser.py", line 746, in parse
    target_lines):
  File "/project/venvs/gene_loci_comparison/lib64/python3.7/site-packages/BCBio/GFF/GFFParser.py", line 327, in parse_in_parts
    cur_dict = self._results_to_features(cur_dict, results)
  File "/project/venvs/gene_loci_comparison/lib64/python3.7/site-packages/BCBio/GFF/GFFParser.py", line 369, in _results_to_features
    base = self._add_directives(base, results.get('directive', []))
  File "/project/venvs/gene_loci_comparison/lib64/python3.7/site-packages/BCBio/GFF/GFFParser.py", line 388, in _add_directives
    val = (val[0], int(val[1]) - 1, int(val[2]))
IndexError: tuple index out of range
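The last frame converts `val[1]` and `val[2]` to integers, so the IndexError means a directive carried fewer than three values. The sketch below assumes, without having checked the file, that the directive involved is a `##sequence-region` line, which the GFF3 specification defines as `##sequence-region seqid start end`. It shows a defensive way to parse such a line; `parse_sequence_region` is a hypothetical helper, not the parser's own code.

```python
def parse_sequence_region(line):
    """Parse '##sequence-region seqid start end' into (seqid, start0, end).

    start0 is the 0-based start, mirroring the `int(val[1]) - 1` in the
    traceback. Returns None for anything that is not a well-formed
    sequence-region directive, instead of raising IndexError.
    """
    parts = line.strip().split()
    if len(parts) != 4 or parts[0] != "##sequence-region":
        return None
    seqid, start, end = parts[1:]
    try:
        return (seqid, int(start) - 1, int(end))
    except ValueError:
        return None  # coordinates were not integers


if __name__ == "__main__":
    # Coordinates are illustrative, not taken from the reported file.
    print(parse_sequence_region("##sequence-region CP040634.1 1 1000"))
    print(parse_sequence_region("##sequence-region CP040634.1"))
```

A caller could skip directives for which the helper returns None, which would let the rest of the file parse even when one directive is incomplete.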

To recreate the bug, here is the relevant gff file.

Thanks in advance.

Edit: bcbio-gff version 0.6.6

opened by MrTomRod 0

Did not find remapped ID location:

I'm trying to parse a GFF file downloaded from NCBI (GCA_001536265). When I iterate over the parser, it gives me this error: Did not find remapped ID location: gene670, [[42143, 44074], [44736, 45087], [45979, 46332], [47064, 47369]], [42143, 47369]

Inspecting the GFF with GFFExaminer gives no error at all.

opened by fbeghini 2
