Cash in on Expressed Barcode Tags (EBTs) from NGS Sequencing Data with Python

Last update: Sep 11, 2022

Overview

Cashier is a tool developed by Russell Durrett for the analysis and extraction of expressed barcode tags.

This Python implementation offers the same flexibility and simple command-line operation.

Like its predecessor, it is a wrapper for the tools cutadapt, fastx-toolkit, and starcode.

Dependencies

cutadapt (sequence extraction)
starcode (sequence clustering)
fastx-toolkit (Phred score filtering)
pear (paired-end read merging)
pysam (SAM file conversion to FASTQ)

Recommended Installation Procedure

It's recommended to use conda to install and manage the dependencies for this package:

conda env create -f https://raw.githubusercontent.com/brocklab/pycashier/main/environment.yml # or mamba env create -f ....
conda activate cashierenv
pycashier --help

Alternatively, you may install with pip, though it will then be up to you to ensure that all the non-Python dependencies are installed correctly and available on the PATH.

pip install pycashier
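When installing with pip, a quick PATH check can confirm that the external tools are reachable before a run. The following is a minimal sketch, not part of pycashier: the executable names (in particular fastq_quality_filter for fastx-toolkit) are assumptions about how each tool is usually installed.

```python
import shutil

# Executable names are assumptions: each is the command usually installed by
# the corresponding dependency, not a list taken from pycashier itself.
REQUIRED_TOOLS = {
    "cutadapt": "cutadapt",
    "starcode": "starcode",
    "fastx-toolkit": "fastq_quality_filter",
    "pear": "pear",
}


def find_tools(tools=REQUIRED_TOOLS):
    """Map each dependency name to the executable's path, or None if absent."""
    return {name: shutil.which(exe) for name, exe in tools.items()}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the sorted names of dependencies not found on the PATH."""
    return sorted(name for name, path in find_tools(tools).items() if path is None)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

Running the script inside the environment you plan to use reports any tool that still needs to be installed.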

Usage

Pycashier has one required argument: the directory containing the fastq or sam files you wish to process.

conda activate cashierenv
pycashier ./fastqs

For additional parameters see pycashier -h.

As the files are processed, two additional directories will be created: pipeline and outs.

Currently, all intermediary files generated by the program will be found in pipeline, while the final processed files will be found in outs.

Merging Files

Pycashier can also take paired-end reads and merge them to produce a fastq file, which can then be processed with the default barcode extraction.

pycashier ./fastqs -m

Processing Barcodes from 10X bam files

Pycashier can also extract gRNA barcodes along with 10X cell and UMI barcodes.

First, only the unmapped reads are of interest. You can obtain these reads from the cellranger bam output using samtools.

samtools view -f 4 possorted_genome_bam.bam > unmapped.sam
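To illustrate what the -f 4 filter selects: in the SAM format the second column is a bitwise FLAG, and bit 4 marks a read as unmapped. The sketch below applies the same test to SAM text lines in plain Python. It is only an illustration (it does not read BAM files and is not a replacement for samtools), and the example records are made up.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_records(sam_lines):
    """Yield alignment lines whose FLAG has the unmapped bit (4) set.

    Header lines (starting with '@') are skipped, matching the default
    header-less output of `samtools view`.
    """
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second SAM column
        if flag & UNMAPPED_FLAG:
            yield line


# Made-up records: read1 is mapped, read2 and read3 carry the unmapped bit.
example = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tFFFF\n",
    "read3\t77\t*\t0\t0\t*\t*\t0\t0\tGGCA\tFFFF\n",
]
kept = [line.split("\t")[0] for line in unmapped_records(example)]
print(kept)
```

A flag such as 77 (1 + 4 + 8 + 64) is kept because the test is bitwise, not an equality check against 4.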

Then, as with normal barcode extraction, you can pass a directory of these unmapped sam files to pycashier and extract barcodes. You can still specify extraction parameters, which will be passed to cutadapt as usual.

Note: The default parameters passed to cutadapt are unlinked adapters and minimum barcode length of 10 bp.

pycashier ./unmapped_sams -sc

When finished, the outs directory will contain a .tsv file with the following columns: Illumina Read Info, UMI Barcode, Cell Barcode, gRNA Barcode
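A table with these four columns can be summarized with the standard library alone. The sketch below is a hedged example rather than pycashier functionality: it assumes the columns appear in the order listed, that the file has no header row (the text does not say either way), and it uses made-up rows. The helper names are hypothetical.

```python
import csv
import io
from collections import Counter

# Column order follows the list above; the absence of a header row is an
# assumption of this sketch.
COLUMNS = ["read_info", "umi_barcode", "cell_barcode", "grna_barcode"]


def read_barcode_table(handle):
    """Parse tab-separated rows into dictionaries keyed by COLUMNS."""
    reader = csv.reader(handle, delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def umis_per_cell_barcode(rows):
    """Count distinct UMIs for each (cell barcode, gRNA barcode) pair."""
    seen = {(r["cell_barcode"], r["grna_barcode"], r["umi_barcode"]) for r in rows}
    return Counter((cell, grna) for cell, grna, _umi in seen)


# Made-up rows, for illustration only.
demo = io.StringIO(
    "read1\tAAAC\tCELL1\tGATTACA\n"
    "read2\tAAAC\tCELL1\tGATTACA\n"
    "read3\tGGGT\tCELL1\tGATTACA\n"
    "read4\tCCCA\tCELL2\tTTAGGCA\n"
)
counts = umis_per_cell_barcode(read_barcode_table(demo))
for (cell, grna), n_umis in sorted(counts.items()):
    print(cell, grna, n_umis)
```

For a real file, pass an open file handle in place of the io.StringIO object; if the file turns out to have a header row, skip the first row before counting.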

Usage notes

Pycashier will NOT overwrite intermediary files. Because existing intermediary files are reused, you can place new fastqs within the source directory or a project folder without reprocessing all samples each time. If there is an issue in the process, please delete either the pipeline directory or the requisite intermediary files for the sample you wish to reprocess.

- Currently, pycashier expects to find .fastq.gz files when merging and .fastq files when extracting barcodes. This behavior may change in the future.
- If there are reads from multiple lanes, they should first be concatenated, for example with: cat sample*R1*.fastq.gz > sample.R1.fastq.gz
- Naming conventions:
  - Sample names are extracted from file names as the first string delimited by a period. Please take this into account when naming sam or fastq files.
  - Each processing step will append information to the input file name to indicate changes, again delimited by periods.
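The naming convention can be checked on your own file names before a run. The sketch below is an assumption-based illustration of "the first string delimited by a period", not pycashier's actual code, and the helper names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def sample_name(path):
    """Return the text before the first period in the file name."""
    return Path(path).name.split(".")[0]


def group_by_sample(paths):
    """Group file paths by extracted sample name to expose name clashes."""
    groups = defaultdict(list)
    for p in paths:
        groups[sample_name(p)].append(str(p))
    return dict(groups)


# Hypothetical file names, for illustration only.
files = ["fastqs/sampleA.R1.fastq.gz", "fastqs/sampleA.R2.fastq.gz", "fastqs/sample_B.fastq"]
for name, members in group_by_sample(files).items():
    print(name, members)
```

Under this reading, sampleA.R1.fastq.gz and sampleA.R2.fastq.gz share the sample name sampleA, while a file named sample_B.fastq keeps the underscore as part of its sample name.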

Comments

non fastq file

Hi, I am getting the following error:

ERROR! There is a non fastq file in the provided fastq directory: sample_R1.fastq.gz Exiting.

This seems to be coming from cashier.py:

    for f in fastqs:
        ext = os.path.splitext(f)[-1].lower()
        if ext != '.fastq':
            print('ERROR! There is a non fastq file in the provided fastq directory: {}'.format(f))
            print('Exiting.')

Please recommend the file formatting/naming
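A likely explanation, judging only from the snippet quoted in this report: os.path.splitext splits off just the final suffix, so a gzipped file yields ".gz" and fails the comparison with ".fastq". The sketch below demonstrates this; is_fastq is a hypothetical helper shown for comparison, not the project's fix.

```python
import os

name = "sample_R1.fastq.gz"

# splitext only removes the last suffix, so the quoted check sees ".gz".
ext = os.path.splitext(name)[-1].lower()
print(ext)


def is_fastq(filename, allow_gzip=False):
    """Hypothetical check that compares against the end of the whole name."""
    lowered = filename.lower()
    if lowered.endswith(".fastq"):
        return True
    return allow_gzip and lowered.endswith(".fastq.gz")


print(is_fastq(name), is_fastq(name, allow_gzip=True))
```

With the check as quoted, decompressing the file so that its name ends in .fastq should satisfy the test.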

opened by aedin 6

pysam fails to read file when using `docker`

┤ cashing in...[E::hts_open_format] Failed to open file "/data/sams/possorted_genome_unmapped.sam" : No such file or directory
Traceback (most recent call last):
  File "/opt/conda/bin/pycashier", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/pycashier/cli.py", line 149, in main
    cli(prog_name="pycashier")
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pycashier/cli.py", line 132, in scrna
    pycashier.scrna(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pycashier/pycashier.py", line 163, in scrna
    single_cell(
  File "/opt/conda/lib/python3.10/site-packages/pycashier/single_cell.py", line 231, in single_cell
    single_cell_process(
  File "/opt/conda/lib/python3.10/site-packages/pycashier/single_cell.py", line 147, in single_cell_process
    sam_to_name_labeled_fastq(sample, samfile, fastq, status)
  File "/opt/conda/lib/python3.10/site-packages/pycashier/single_cell.py", line 35, in sam_to_name_labeled_fastq
    with pysam.AlignmentFile(
  File "pysam/libcalignmentfile.pyx", line 751, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 950, in pysam.libcalignmentfile.AlignmentFile._open
FileNotFoundError: [Errno 2] could not open alignment file `/data/sams/possorted_genome_unmapped.sam`: No such file or directory

I've confirmed the file is present and the volume mounted properly.

opened by daylinmorgan 0

Ignoring hidden files

File searching before extraction/merging attempts to prevent the user from using an unexpected data source.

But it's probably a reasonable assumption that their data would not include "hidden" files, i.e. anything whose name begins with a ".".

For instance, files like .DS_Store or .snakemake_timestamp are harmless and can be ignored without frustrating the user.
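One way the proposed behavior could look is sketched below. visible_files is a hypothetical helper, not an existing pycashier function; it simply skips any entry whose name starts with a period.

```python
from pathlib import Path


def visible_files(directory):
    """Return sorted non-hidden regular files directly inside directory."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and not p.name.startswith(".")
    )
```

The existing file-type checks could then run on the returned list only, so stray hidden files never trigger an error.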

opened by daylinmorgan 0

File detection before extraction

pycashier extract doesn't detect whether there are paired-end reads. It should warn the user that they should probably use pycashier merge first.

It should also attempt to detect duplicate files based on sample name extraction.

Additionally, since the first step is quality filtering with fastp it should be possible to supply .fastq.gz without any additional work besides disabling the exclusive .fastq check.

opened by daylinmorgan 0
