Incubator for useful bioinformatics code, primarily in Python and R

Collection of useful code related to biological analysis. Much of this is discussed with examples at Blue collar bioinformatics.

All code, images and documents in this repository are freely available for all uses. Code is available under the MIT license and images, documentation and talks under the Creative Commons No Rights Reserved (CC0) license.

Some projects which may be especially interesting:

  • CloudBioLinux -- An automated environment to install useful biological software and libraries. This is used to bootstrap blank machines, such as those you'd find on cloud providers like Amazon, into ready-to-go analysis workstations. See the CloudBioLinux effort for more details. This project moved to its own repository at
  • gff -- A GFF parsing library in Python, aimed for inclusion into Biopython.
  • nextgen -- A Python toolkit providing best-practice pipelines for fully automated high throughput sequencing analysis. This project has moved into its own repository:
  • distblast -- A distributed BLAST analysis for identifying best hits in a wide variety of organisms for downstream phylogenetic analyses. The code is generalized to run on local multi-processor machines and distributed Hadoop clusters.
  • biopython->numpy interactive (y/n) while deploying pipeline

    Even putting numpy >=1.6.1 in's before biopython, the following message pops up:

    Numerical Python (NumPy) is not installed.
    This package is required for many Biopython features.  Please install
    it before you install Biopython. You can install Biopython anyway, but
    anything dependent on NumPy will not work. If you do this, and later
    install NumPy, you should then re-install Biopython.
    You can find NumPy at
    Do you want to continue this installation? (y/N):

    Apparently install_requires packages are not installed in order, so no dependency order can be defined that way... are you aware of any "pre_install_requires" or similar in setuptools ? Couldn't find it after quickly checking docs :-/

    opened by brainstorm 14

    i'm not sure if it's a problem in the latest version of

    After updating to the pipeline with FastQC, I've got an extra base 'A' in the 3' of read 1.

    opened by tanglingfung 13
  • FastQC vs SolexaQA

    Brad, It's not really an issue. But I want to know, from your experience, how much time you would save from switching to FastQC from SolexaQA?

    Thanks, Paul

    opened by tanglingfung 13
  • doing bcl->qseq->fastq->analysis->galaxy in one machine

    We have a different setting here where the drive with the bcl files is mounted to the analysis machine and we would do everything there. Do you recommend we keep the messaging system in the pipeline? Just want to get some advice.

    Thanks, Paul

    opened by tanglingfung 13

    Hi Brad,

    It seems that it keeps finding CreateSequenceDictionary in /usr/share/java/picard even though I have specified another path in my config file. I tried running the setup again after modifying the config files, but it still didn't look up the path I specified.

    Also, I don't seem to have specified the path of hg19.fa for GATK?

    Thanks, Paul

    opened by tanglingfung 12
  • Convert GFF file to Sequin TBL file

    Submitting to GenBank requires converting a GFF file to a Sequin TBL file, which is then converted to ASN.1 using tbl2asn. I have searched, and I have not found a good (or any, really) converter from GFF to Sequin TBL. Would you be interested in adding such a tool? Here's the hacky script that I cobbled together for this purpose: gff3-to-tbl. It's not general purpose, but could be a useful starting point.

    opened by sjackman 11
  • merging of demuxed fastq files and project-based analyses

    Hi Brad,

    More of a question than an issue. I noticed you've added code (bcbio.pipeline.sample.merge_sample) to merge samples across lanes. I've been using save_diskspace=true in order to remove SAM files, but I noticed this also removes the demultiplexed files, right? I just want to make sure, because it affects our data delivery routines, as outlined below.

    In our setup, we have situations when we run several projects on one lane, which we distinguish with an extra "description" tag in run_info, so in principle each barcode could have a description with a different project name. We then partition fastq files in a lane based on the description tag when delivering data to customers.

    On a similar note, when I do analyses for customers, I've been doing it on a project-by-project basis (it makes more sense to me), and therefore written helper scripts (project_*, see EDIT: for this purpose. is almost a copy of, but starts off with demultiplexed files. Have you had this functionality in mind (or is it even already there)?



    opened by percyfal 11
  • Trailing Illumina 'A' and demultiplexing

    Hi Brad,

    We are seeing some issues with unexpectedly many reads ending up in the 'unmatched' category after demultiplexing. After digging around a little, we think that this may be related to the trailing 'A' that the Illumina machines add after the barcode.

    More specifically, we allow one mismatch and no indels for the demuxing. It seems that the reads that are unexpectedly classified as unmatched have one mismatch in the actual 6-nucleotide barcode and, in addition, have the trailing 'A' nucleotide miscalled.

    Reading the code, it does indeed seem that for Illumina reads, the last 7 nucleotides, including the trailing 'A', of each read are matched when demultiplexing. Can you confirm that this is the case?

    Our preference is to match just the 6-mer index sequence, excluding the last nucleotide in the read and it would be nice to have this done by default for Illumina reads, or at least be able to influence this behavior with a configuration option. What do you think?

    Thanks /Pontus

    opened by b97pla 11
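
The matching behaviour requested in the issue above can be sketched roughly as follows. This is only an illustration of the idea, not the pipeline's actual demultiplexing code: the function names, the `trailing_bases` parameter, and the example barcodes are all hypothetical, and it assumes the index sits at the 3' end of the read followed by the trailing 'A'.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read_seq, barcodes, max_mismatches=1, trailing_bases=1):
    """Return the name of the barcode found at the 3' end of read_seq, or None.

    barcodes is a non-empty dict of name -> index sequence (all the same length).
    The last `trailing_bases` nucleotides (the Illumina 'A') are skipped, so a
    miscalled trailing base does not count against the mismatch budget.
    No indels are allowed.
    """
    index_len = len(next(iter(barcodes.values())))
    end = len(read_seq) - trailing_bases
    observed = read_seq[end - index_len:end]
    hits = [name for name, bc in barcodes.items()
            if len(bc) == len(observed)
            and hamming(bc, observed) <= max_mismatches]
    # Treat ambiguous reads (several barcodes within tolerance) as unmatched.
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    example_barcodes = {"s1": "ACGTAC", "s2": "TTGGCC"}  # made-up indexes
    # One mismatch inside the 6-mer plus a miscalled trailing base.
    print(assign_barcode("GGGGACGTAAC", example_barcodes))
```

Exposing `trailing_bases` as a parameter mirrors the suggestion of a configuration option: setting it to 0 would match the final nucleotides of the read directly.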
  • GFFExaminer() displaying empty dict for UCSC GTF

    I tried the following to parse a UCSC-generated GTF file.

    After executing


    the output was





    3: {'gff_id': {}, 'gff_source': {}, 'gff_source_type': {}, 'gff_type': {}}

    Trying to parse that same file with

    from BCBio import GFF
    for rec in GFF.parse(handle):
        print rec


    ID: chr1
    Name: <unknown name>
    Description: <unknown description>
    Number of features: 2
    UnknownSeq(14409, alphabet = Alphabet(), character = '?')

    Here are the first 10 lines from the GTF in question

    chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
    chr1 hg19_knownGene exon 12613 12721 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
    chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
    chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
    chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
    chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
    chr1 hg19_knownGene CDS 12595 12721 0.000000 + 1 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
    chr1 hg19_knownGene exon 12595 12721 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
    chr1 hg19_knownGene CDS 13403 13636 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
    chr1 hg19_knownGene stop_codon 13637 13639 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";

    opened by spock 10
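
As a way to sanity-check a GTF file like the one quoted above independently of the library, the lines can be inspected with a few lines of plain Python. This is a minimal sketch, not how BCBio.GFF parses GTF: the function names are hypothetical, and it assumes the first eight columns contain no internal whitespace and that attributes follow the `key "value";` pattern seen in the sample.

```python
import re
from collections import defaultdict

# Matches GTF-style attributes such as: gene_id "uc001aaa.3";
_GTF_ATTR = re.compile(r'\s*(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line):
    """Split one GTF feature line into a dict with typed coordinates."""
    (seqid, source, ftype, start, end,
     score, strand, frame, attrs) = line.strip().split(None, 8)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),   # 1-based, inclusive, as written in the file
        "end": int(end),
        "strand": strand,
        "attributes": dict(_GTF_ATTR.findall(attrs)),
    }


def exons_by_transcript(lines):
    """Group exon coordinates by transcript_id, skipping comments and blanks."""
    grouped = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        feat = parse_gtf_line(line)
        if feat["type"] == "exon":
            tid = feat["attributes"].get("transcript_id")
            grouped[tid].append((feat["start"], feat["end"]))
    return dict(grouped)
```

Splitting on any whitespace with a limit of eight splits accepts both tab-separated files and space-separated excerpts such as the one pasted in the issue.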
  • num_cores: messages; socket.timeout: timed out

    Hey Brad, we are trying to use the distributed version of the pipeline.

    We have a couple of test sets that we use to quickly see if the pipeline is working. One that takes the normal pipeline about 3 hours to finish, and another much smaller that takes about 7 minutes (this is with 8 cores).

    When running the small test set on the messaging variant all files get generated as they should, and the program exits properly. Note that this small set consists of fastq files which are only 12 lines each, and I'm guessing much of the analysis gets skipped due to a lack of data.

    When we run the messaging version of the pipeline on the larger set, the programs work for a while (the time varies, but say between 45 minutes and 1 hour 30 minutes), but then one of the jobs crashes with a socket.timeout error. I believe this specific job is some master that coordinates what the other jobs should be doing.

    I'll include the output of that job here:

    [2012-02-25 02:55:26,856] Found YAML samplesheet, using /proj/a2010002/nobackup/illumina/pipeline_test/archive/000101_SN001_001_AABCD99XX/run_info.yaml instead of Galaxy API
    Traceback (most recent call last):
      File "/bubo/home/h10/vale/.virtualenvs/devel/bin/", line 7, in <module>
      File "/bubo/home/h10/vale/bcbb/nextgen/scripts/", line 117, in <module>
        main(*args, **kwargs)
      File "/bubo/home/h10/vale/bcbb/nextgen/scripts/", line 48, in main
        run_main(config, config_file, fc_dir, work_dir, run_info_yaml)
      File "/bubo/home/h10/vale/bcbb/nextgen/scripts/", line 65, in run_main
        lane_items = run_parallel("process_lane", lanes)
      File "/bubo/home/h10/vale/bcbb/nextgen/bcbio/distributed/", line 28, in run_parallel
        return runner_fn(fn_name, items)
      File "/bubo/home/h10/vale/bcbb/nextgen/bcbio/distributed/", line 67, in _run
        while not result.ready():
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/", line 306, in ready
        return all(result.ready() for result in self.results)
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/", line 306, in <genexpr>
        return all(result.ready() for result in self.results)
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/", line 108, in ready
        return self.status in self.backend.READY_STATES
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/", line 196, in status
        return self.state
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/", line 191, in state
        return self.backend.get_status(self.task_id)
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/backends/", line 237, in get_status
        return self.get_task_meta(task_id)["status"]
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/backends/", line 128, in get_task_meta
        return self.poll(task_id)
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/celery/backends/", line 153, in poll
        with as (_, channel):
      File "/sw/comp/python/2.7.1_kalkyl/lib/python2.7/", line 17, in __enter__
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/", line 789, in acquire_channel
        yield connection, connection.default_channel
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/", line 593, in default_channel
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/", line 586, in connection
        self._connection = self._establish_connection()
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/", line 546, in _establish_connection
        conn = self.transport.establish_connection()
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/transport/", line 252, in establish_connection
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/kombu/transport/", line 62, in __init__
        super(Connection, self).__init__(*args, **kwargs)
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/amqplib/client_0_8/", line 129, in __init__
        self.transport = create_transport(host, connect_timeout, ssl)
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/amqplib/client_0_8/", line 281, in create_transport
        return TCPTransport(host, connect_timeout)
      File "/bubo/home/h10/vale/.virtualenvs/devel/lib/python2.7/site-packages/amqplib/client_0_8/", line 85, in __init__
        raise socket.error, msg
    socket.timeout: timed out
    [INFO/MainProcess] process shutting down
    [DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
    [DEBUG/MainProcess] running the remaining "atexit" finalizers

    Have you encountered any issues with socket.timeout? Any ideas what we might be doing wrong?

    opened by vals 9
  • GFF parsing fails with most recent version of BioPython

    After upgrading to Biopython 1.68, GFF.parse() is now failing where it had no issues before.

    To Reproduce

    In a new virtualenv environment, run:

    pip install numpy
    pip install biopython
    pip install bcbio-gff

    Next, launch python and run:

    >>> from BCBio import GFF
    >>> gff = 'TriTrypDB-27_TcruziCLBrenerEsmeraldo-like.gff'
    >>> x=list(GFF.parse(gff))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/", line 737, in parse
      File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/", line 327, in parse_in_parts
        cur_dict = self._results_to_features(cur_dict, results)
      File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/", line 367, in _results_to_features
        results.get('child', []))
      File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/", line 428, in _add_parent_child_features
      File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/", line 471, in _add_children_to_parent
        cur_child, _ = self._add_children_to_parent(cur_child, children)
      File "/home/keith/.virtualenvs/gff/lib/python3.5/site-packages/BCBio/GFF/", line 477, in _add_children_to_parent
    AttributeError: 'SeqFeature' object has no attribute 'sub_features'
    >>> import Bio
    >>> Bio.__version__

    The same code worked with Biopython 1.67, so it seems likely to be an issue resulting from changes made in the 1.68 release.

    opened by khughitt 8
  • docs: Fix a few typos

    There are small typos in:

    • posts/conferences/
    • posts/seminars/


    • Should read suppressors rather than supressors.
    • Should read service rather than serivce.

    Semi-automated pull request generated by

    opened by timgates42 0
  • / Alternative Codon Table


    I'd like to add this issue for those who'd like to use the script with an alternative codon table.

    Example: to translate the sequence with codon table 6, change the script as follows: protein_seq = gene_seq.translate(6)

    Regards, Zeynep

    opened by zeynepkurtw 0
  • / Reordering Fasta file


    Thank you for this very useful script!!

    I was wondering if it's possible to create the protein multi fasta file with the order of ref_file (contigs) instead of the glimmer_file (gff) file?

    I've re-ordered my assembly from large contigs to smaller ones. However, when I run this script I got my protein multi fasta file in the order of glimmer.gff instead of the contigs.

    Regards, Zeynep

    opened by zeynepkurtw 0
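
One possible way to get the ordering asked about above is to sort the predicted proteins by the position of their contig in the reference FASTA before writing them out. The sketch below is not part of the script discussed in the issue: the function names and the `(contig_id, protein_id, sequence)` tuple layout are assumptions made for illustration.

```python
def contig_order(ref_fasta_lines):
    """Return contig IDs in the order their headers appear in a FASTA file."""
    order = []
    for line in ref_fasta_lines:
        if line.startswith(">") and line[1:].strip():
            # The ID is the first whitespace-delimited token of the header.
            order.append(line[1:].split()[0])
    return order


def reorder_by_contig(records, order):
    """Sort (contig_id, protein_id, sequence) tuples by reference contig order.

    sorted() is stable, so proteins on the same contig keep their original
    relative order. Contigs missing from the reference are placed last.
    """
    rank = {contig_id: i for i, contig_id in enumerate(order)}
    return sorted(records, key=lambda rec: rank.get(rec[0], len(rank)))


if __name__ == "__main__":
    ref = [">contig2 length=10", "ACGTACGTAC", ">contig1", "GGCC"]
    proteins = [("contig1", "p1", "M"), ("contig2", "p2", "MK")]
    for contig_id, protein_id, seq in reorder_by_contig(proteins, contig_order(ref)):
        print(">%s %s\n%s" % (protein_id, contig_id, seq))
```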
  • Any chance a new release will be made sometime soon?

    I was just wondering if it would be possible for a new release of the gff package to be made sometime soon. The fix from #126 would be really nice to have in a released version.

    opened by DavyCats 0
  • IndexError with NCBI gff

    I annotated a bacterium (Acidipropionibacterium acidipropionici - strain FAM19036) with NCBI PGAP.

    I wanted to create SeqIO-objects from the gff file, but it failed:

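    import pprint
    from BCBio import GFF
    from BCBio.GFF import GFFExaminer

    # Summarize the feature types present in the file.
    examiner = GFFExaminer()
    with open('data/FAM19036/annot.gff') as in_handle:
        pprint.pprint(examiner.available_limits(in_handle))

    # Parse the file into SeqRecord objects; this loop raises the IndexError.
    with open('FAM19036/annot.gff') as in_handle:
        for rec in GFF.parse(in_handle):
            print(rec)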
    {'gff_id': {('CP040634.1',): 6772},
     'gff_source': {('.',): 3361,
                    ('GeneMarkS-2+',): 360,
                    ('Local',): 1,
                    ('Protein Homology',): 2916,
                    ('cmsearch',): 24,
                    ('tRNAscan-SE',): 110},
     'gff_source_type': {('.', 'exon'): 8,
                         ('.', 'gene'): 3208,
                         ('.', 'pseudogene'): 137,
                         ('.', 'rRNA'): 8,
                         ('GeneMarkS-2+', 'CDS'): 360,
                         ('Local', 'region'): 1,
                         ('Protein Homology', 'CDS'): 2916,
                         ('cmsearch', 'RNase_P_RNA'): 1,
                         ('cmsearch', 'SRP_RNA'): 1,
                         ('cmsearch', 'exon'): 7,
                         ('cmsearch', 'rRNA'): 4,
                         ('cmsearch', 'riboswitch'): 10,
                         ('cmsearch', 'tmRNA'): 1,
                         ('tRNAscan-SE', 'exon'): 55,
                         ('tRNAscan-SE', 'tRNA'): 55},
     'gff_type': {('CDS',): 3276,
                  ('RNase_P_RNA',): 1,
                  ('SRP_RNA',): 1,
                  ('exon',): 70,
                  ('gene',): 3208,
                  ('pseudogene',): 137,
                  ('rRNA',): 12,
                  ('region',): 1,
                  ('riboswitch',): 10,
                  ('tRNA',): 55,
                  ('tmRNA',): 1}}
    Traceback (most recent call last):
      File "/usr/lib64/python3.7/unittest/", line 59, in testPartExecutor
      File "/usr/lib64/python3.7/unittest/", line 628, in run
      File "/project/gene_loci_comparison/", line 129, in test_recreate_gff_bug
        for rec in GFF.parse(in_handle):
      File "/project/venvs/gene_loci_comparison/lib64/python3.7/site-packages/BCBio/GFF/", line 746, in parse
      File "/project/venvs/gene_loci_comparison/lib64/python3.7/site-packages/BCBio/GFF/", line 327, in parse_in_parts
        cur_dict = self._results_to_features(cur_dict, results)
      File "/project/venvs/gene_loci_comparison/lib64/python3.7/site-packages/BCBio/GFF/", line 369, in _results_to_features
        base = self._add_directives(base, results.get('directive', []))
      File "/project/venvs/gene_loci_comparison/lib64/python3.7/site-packages/BCBio/GFF/", line 388, in _add_directives
        val = (val[0], int(val[1]) - 1, int(val[2]))
    IndexError: tuple index out of range

    To recreate the bug, here is the relevant gff file.

    Thanks in advance.

    Edit: bcbio-gff version 0.6.6

    opened by MrTomRod 0
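
The last frame of the traceback above indexes `val[1]` and `val[2]` of a directive without checking how many fields it has. A more defensive version of that step might look like the sketch below. It assumes the failure is caused by a `##sequence-region` directive with fewer than three fields, which the issue does not confirm; the function name is hypothetical and this is not the fix that went into bcbio-gff.

```python
def parse_sequence_region(directive):
    """Parse a '##sequence-region seqid start end' directive.

    Returns (seqid, start, end) with the start converted to 0-based, or None
    when the coordinates are missing or not integers, instead of raising.
    """
    parts = directive.split()
    # Accept the directive with or without its leading '##sequence-region'.
    if parts and parts[0].lstrip("#") == "sequence-region":
        parts = parts[1:]
    if len(parts) < 3:
        return None
    try:
        return (parts[0], int(parts[1]) - 1, int(parts[2]))
    except ValueError:
        return None


if __name__ == "__main__":
    # The coordinates here are made up for illustration.
    print(parse_sequence_region("##sequence-region CP040634.1 1 1000"))
    print(parse_sequence_region("##sequence-region CP040634.1"))
```

Returning None rather than raising lets a caller skip a malformed directive and still parse the features that follow.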
  • Did not find remapped ID location:

    I'm trying to parse a GFF file downloaded from NCBI (GCA_001536265), and when I iterate over the parser it gives me this error: Did not find remapped ID location: gene670, [[42143, 44074], [44736, 45087], [45979, 46332], [47064, 47369]], [42143, 47369]

    Inspecting the GFF with GFFExaminer gives no error at all.

    opened by fbeghini 2
Brad Chapman
Biologist and programmer