Cash in on Expressed Barcode Tags (EBTs) from NGS Sequencing Data with Python

Overview

Cash in on Expressed Barcode Tags (EBTs) from NGS Sequencing Data with Python

Cashier is a tool developed by Russell Durrett for the analysis and extraction of expressed barcode tags.

This python implementation offers the same flexibility and simple command line operation.

Like it's predecessor it is a wrapper for the tools cutadapt, fastx-toolkit, and starcode.

Dependencies

  • cutadapt (sequence extraction)
  • starcode (sequence clustering)
  • fastx-toolkit (PHred score filtering)
  • pear (paired end read merging)
  • pysam (sam file convertion to fastq)

Recommended Installation Procedure

It's recommended to use conda to install and manage the dependencies for this package

conda env create -f https://raw.githubusercontent.com/brocklab/pycashier/main/environment.yml # or mamba env create -f ....
conda activate cashierenv
pycashier --help

Additionally you may install with pip. Though it will be up to you to ensure all the non-python dependencies are on the path and installed correctly.

pip install pycashier

Usage

Pycashier has one required argument which is the directory containing the fastq or sam files you wish to process.

conda activate cashierenv
pycashier ./fastqs

For additional parameters see pycashier -h.

As the files are processed two additional directories will be created pipeline and outs.

Currently all intermediary files generated as a result of the program will be found in pipeline.

While the final processed files will be found within the outs directory.

Merging Files

Pycashier can now take paired end reads and perform a merging of the reads to produce a fastq which can then be used with cashier's default feature.

pycashier ./fastqs -m

Processing Barcodes from 10X bam files

Pycashier can also extract gRNA barcodes along with 10X cell and umi barcodes.

Firstly we are only interested in the unmapped reads. From the cellranger bam output you would obtain these reads using samtools.

samtools view -f 4 possorted_genome_bam.bam > unmapped.sam

Then similar to normal barcode extraction you can pass a directory of these unmapped sam files to pycashier and extract barcodes. You can also still specify extraction parameters that will be passed to cutadapt as usual.

Note: The default parameters passed to cutadapt are unlinked adapters and minimum barcode length of 10 bp.

pycashier ./unmapped_sams -sc

When finished the outs directory will have a .tsv containing the following columns: Illumina Read Info, UMI Barcode, Cell Barcode, gRNA Barcode

Usage notes

Pycashier will NOT overwrite intermediary files. If there is an issue in the process, please delete either the pipeline directory or the requisite intermediary files for the sample you wish to reprocess. This will allow the user to place new fastqs within the source directory or a project folder without reprocessing all samples each time.

  • Currently, pycashier expects to find .fastq.gz files when merging and .fastq files when extracting barcodes. This behavior may change in the future.
  • If there are reads from multiple lanes they should first be concatenated with cat sample*R1*.fastq.gz > sample.R1.fastq.gz
  • Naming conventions:
    • Sample names are extracted from files using the first string delimited with a period. Please take this into account when naming sam or fastq files.
    • Each processing step will append information to the input file name to indicate changes, again delimited with periods.
You might also like...
A program made in PYTHON🐍 that automatically performs data insertions into a POSTGRES database 🐘 , using as base a .CSV file 📁 , useful in mass data insertions

A program made in PYTHON🐍 that automatically performs data insertions into a POSTGRES database 🐘 , using as base a .CSV file 📁 , useful in mass data insertions.

Explore-bikeshare-data - GitHub project as part of the Programming for Data Science with Python Nanodegree from Udacity

Date created February 10, 2022 Project Title Explore US Bikeshare Data Descripti

Viewflow is an Airflow-based framework that allows data scientists to create data models without writing Airflow code.
Viewflow is an Airflow-based framework that allows data scientists to create data models without writing Airflow code.

Viewflow Viewflow is a framework built on the top of Airflow that enables data scientists to create materialized views. It allows data scientists to f

resultados (data) de elecciones 2021 y código para extraer data de la ONPE

elecciones-peru-2021-ONPE Resultados (data) de elecciones 2021 y código para extraer data de la ONPE Data Licencia liberal, pero si vas a usarlo por f

 Improve current data preprocessing for FTM's WOB data to analyze Shell and Dutch Governmental contacts.
Improve current data preprocessing for FTM's WOB data to analyze Shell and Dutch Governmental contacts.

We're the hackathon leftovers, but we are Too Good To Go ;-). A repo by Lukas Schubotz and Raymon van Dinter. We aim to improve current data preprocessing for FTM's WOB data to analyze Shell and Dutch Governmental contacts.

Adansons Base is a data management tool that organizes metadata of unstructured data and creates and organizes datasets.

Adansons Base is a data management tool that organizes metadata of unstructured data and creates and organizes datasets. It makes dataset creation more effective and helps find essential insights from training results and improves AI performance.

Open-source data observability for modern data teams
Open-source data observability for modern data teams

Use cases Monitor your data warehouse in minutes: Data anomalies monitoring as dbt tests Data lineage made simple, reliable, and automated dbt operati

Run python scripts and pass data between multiple python and node processes using this npm module

Run python scripts and pass data between multiple python and node processes using this npm module. process-communication has a event based architecture for interacting with python data and errors inside nodejs.

Comments
  • non fastq file

    non fastq file

    Hi I am getting the following error ;

    ERROR! There is a non fastq file in the provided fastq directory: sample_R1.fastq.gz Exiting.

    This seems to be coming from cashier.py

    for f in fastqs: ext = os.path.splitext(f)[-1].lower() if ext != '.fastq': print('ERROR! There is a non fastq file in the provided fastq directory: {}'.format(f)) print('Exiting.')

    Please recommend the file formatting/naming

    opened by aedin 6
  • pysam fails to read file when using `docker`

    pysam fails to read file when using `docker`

    ┤ cashing in...[E::hts_open_format] Failed to open file "/data/sams/possorted_genome_unmapped.sam" : No such file or directory
    Traceback (most recent call last):
      File "/opt/conda/bin/pycashier", line 8, in <module>
        sys.exit(main())
      File "/opt/conda/lib/python3.10/site-packages/pycashier/cli.py", line 149, in main
        cli(prog_name="pycashier")
      File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
      File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1055, in main
        rv = self.invoke(ctx)
      File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 760, in invoke
        return __callback(*args, **kwargs)
      File "/opt/conda/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/opt/conda/lib/python3.10/site-packages/pycashier/cli.py", line 132, in scrna
        pycashier.scrna(**kwargs)
      File "/opt/conda/lib/python3.10/site-packages/pycashier/pycashier.py", line 163, in scrna
        single_cell(
      File "/opt/conda/lib/python3.10/site-packages/pycashier/single_cell.py", line 231, in single_cell
        single_cell_process(
      File "/opt/conda/lib/python3.10/site-packages/pycashier/single_cell.py", line 147, in single_cell_process
        sam_to_name_labeled_fastq(sample, samfile, fastq, status)
      File "/opt/conda/lib/python3.10/site-packages/pycashier/single_cell.py", line 35, in sam_to_name_labeled_fastq
        with pysam.AlignmentFile(
      File "pysam/libcalignmentfile.pyx", line 751, in pysam.libcalignmentfile.AlignmentFile.__cinit__
      File "pysam/libcalignmentfile.pyx", line 950, in pysam.libcalignmentfile.AlignmentFile._open
    FileNotFoundError: [Errno 2] could not open alignment file `/data/sams/possorted_genome_unmapped.sam`: No such file or directory
    

    I've confirmed the file is present and the volume mounted properly.

    opened by daylinmorgan 0
  • Ignoring hidden files

    Ignoring hidden files

    File searching before extraction/merging attempts to prevent the user from using an unexpected data source.

    But it's probably a reasonable assumption there data would not include "hidden" files or anything prepended with a "."

    For instance something like .DS_store or .snakemake_timestamp, harmless files that we can ignore without frustrating the user.

    opened by daylinmorgan 0
  • File detection before extraction

    File detection before extraction

    pycashier extract doesn't detect if there are paired-end reads...should warn user that they should probably use pycashier merge first.

    Also should attempt to detect duplicate files based on sample name extraction.

    Additionally, since the first step is quality filtering with fastp it should be possible to supply .fastq.gz without any additional work besides disabling the exclusive .fastq check.

    opened by daylinmorgan 0
Owner
null
Webcash is an experimental e-cash (electronic cash)

Webcash Webcash is an experimental new electronic cash ("e-cash") that enables decentralized and instant payments to anyone, anywhere in the world. Us

Mark Friedenbach 0 Feb 26, 2022
A class to draw curves expressed as L-System production rules

A class to draw curves expressed as L-System production rules

Juna Salviati 6 Sep 9, 2022
dragmap-meth: Fast and accurate aligner for bisulfite sequencing reads using dragmap

dragmap_meth (dragmap_meth.py) Alignment of BS-Seq reads using dragmap. Intro This works for single-end reads and for paired-end reads from the direct

Shaojun Xie 3 Jul 14, 2022
Simple cash register system made with guizero

Eje-Casher なにこれ guizeroで作った簡易レジシステムです。実際にコミケで使う予定です。 これを誰かがそのまま使うかどうかというよりは、guiz

Akira Ouchi 4 Nov 7, 2022
A Lego Mindstorm robot for dealing out cards based on a birds-eye view of a poker table and given ArUco fiducial tags.

A Lego Mindstorm robot for dealing out cards based on a birds-eye view of a poker table and given ArUco fiducial tags.

null 4 Dec 6, 2021
Creates a release pull request updating changelog and tags with standard-version

standard version release branch Github action to open releases following convent

null 8 Sep 13, 2022
A Pythonic Data Catalog powered by Ray that brings exabyte-level scalability and fast, ACID-compliant, change-data-capture to your big data workloads.

DeltaCAT DeltaCAT is a Pythonic Data Catalog powered by Ray. Its data storage model allows you to define and manage fast, scalable, ACID-compliant dat

null 45 Oct 15, 2022
Data Structures and Algorithms Python - Practice data structures and algorithms in python with few small projects

Data Structures and Algorithms All the essential resources and template code nee

Hesham 13 Dec 1, 2022
An unofficial python API for trading on the DeGiro platform, with the ability to get real time data and historical data.

DegiroAPI An unofficial API for the trading platform Degiro written in Python with the ability to get real time data and historical data for products.

Jorrick Sleijster 5 Dec 16, 2022
Python for downloading model data (HRRR, RAP, GFS, NBM, etc.) from NOMADS, NOAA's Big Data Program partners (Amazon, Google, Microsoft), and the University of Utah Pando Archive System.

Python for downloading model data (HRRR, RAP, GFS, NBM, etc.) from NOMADS, NOAA's Big Data Program partners (Amazon, Google, Microsoft), and the University of Utah Pando Archive System.

Brian Blaylock 194 Jan 2, 2023