Pipeline to convert a haploid assembly into diploid

Mikhail Kolmogorov

Last update: Jan 5, 2023

Overview

HapDup

HapDup (haplotype duplicator) is a pipeline to convert a haploid long read assembly into a dual diploid assembly. The reconstructed haplotypes preserve heterozygous structural variants (in addition to small variants) and are locally phased.

Version 0.4

Input requirements

HapDup takes as input a haploid long-read assembly, such as one produced with Flye or Shasta. Currently, ONT reads (Guppy 5+ recommended) and PacBio HiFi reads are supported.

HapDup is currently designed for low-heterozygosity genomes (such as human). The expectation is that the assembly has most of the diploid genome collapsed into a single haplotype. For assemblies with partially resolved haplotypes, alternative alleles can be removed prior to running the pipeline using purge_dups. We expect to add better support for highly heterozygous genomes in the future.

The first stage is to realign the original long reads to the assembly using minimap2. We recommend using the latest minimap2 release.

minimap2 -ax map-ont -t 30 assembly.fasta reads.fastq | samtools sort -@ 4 -m 4G > lr_mapping.bam
samtools index -@ 4 lr_mapping.bam
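When the alignment step is scripted, it can help to build both commands from a single BAM name so that the sorted file and the indexed file always match. The sketch below is illustrative only: the function name `alignment_commands` is hypothetical, and the `map-hifi` preset for HiFi reads is an assumption (it is available in recent minimap2 releases, but this README does not prescribe it).

```python
import shlex

# minimap2 presets per read type. "map-ont" is the preset used in this README;
# "map-hifi" is an assumption for HiFi reads, not taken from this README.
PRESETS = {"ont": "map-ont", "hifi": "map-hifi"}


def alignment_commands(assembly, reads, bam="lr_mapping.bam",
                       rtype="ont", map_threads=30, sort_threads=4):
    """Return the mapping and indexing shell commands as two strings."""
    if rtype not in PRESETS:
        raise ValueError(f"unsupported read type: {rtype}")
    q = shlex.quote  # protect paths that contain spaces or shell characters
    mapping = (
        f"minimap2 -ax {PRESETS[rtype]} -t {map_threads} {q(assembly)} {q(reads)}"
        f" | samtools sort -@ {sort_threads} -m 4G > {q(bam)}"
    )
    indexing = f"samtools index -@ {sort_threads} {q(bam)}"
    return mapping, indexing


if __name__ == "__main__":
    for cmd in alignment_commands("assembly.fasta", "reads.fastq"):
        print(cmd)
```

The function only produces command strings; running them (for example through a shell script or a workflow manager) is left to the caller.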

Quick start using Docker

HapDup is available on the Docker Hub.

If Docker is not installed on your system, you need to set it up first following this guide.

The next steps assume that your assembly.fasta and lr_mapping.bam are in the same directory, which will also be used for HapDup output. If this is not the case, you might need to bind additional directories using Docker's -v / --volume argument. The number of threads (-t argument) should be adjusted according to the available resources. For PacBio HiFi input, use --rtype hifi instead of --rtype ont.

cd directory_with_assembly_and_alignment
HD_DIR=`pwd`
docker run -v $HD_DIR:$HD_DIR -u `id -u`:`id -g` mkolmogo/hapdup:0.4 \
  hapdup --assembly $HD_DIR/assembly.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup -t 64 --rtype ont
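If the inputs live in several directories, each one needs its own -v mount. The following sketch assembles the `docker run` argument list for that case. The helper name `docker_command` and its parameters are hypothetical; the image tag and the hapdup options are the ones shown above.

```python
import os
import shlex


def docker_command(work_dir, extra_dirs=(), threads=64, rtype="ont",
                   image="mkolmogo/hapdup:0.4", uid=None, gid=None):
    """Build a `docker run` argument list for HapDup (illustrative helper)."""
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    cmd = ["docker", "run"]
    # Mount the working directory and any additional directories at the same
    # path inside the container, so host paths stay valid in the container.
    for directory in (work_dir, *extra_dirs):
        cmd += ["-v", f"{directory}:{directory}"]
    cmd += [
        "-u", f"{uid}:{gid}", image,
        "hapdup",
        "--assembly", os.path.join(work_dir, "assembly.fasta"),
        "--bam", os.path.join(work_dir, "lr_mapping.bam"),
        "--out-dir", os.path.join(work_dir, "hapdup"),
        "-t", str(threads), "--rtype", rtype,
    ]
    return cmd


if __name__ == "__main__":
    # Print the command for the current directory instead of running it.
    print(shlex.join(docker_command(os.getcwd())))
```

Returning a list rather than a single string keeps the result usable with `subprocess.run` without a shell.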

Quick start using Singularity

Alternatively, you can use Singularity. First, you will need to install the client as described in the manual. One way to do it is through conda:

conda install singularity

The next steps assume that your assembly.fasta and lr_mapping.bam are in the same directory, which will also be used for HapDup output. If this is not the case, you might need to bind additional directories using the --bind argument. The number of threads (-t argument) should be adjusted according to the available resources. For PacBio HiFi input, use --rtype hifi instead of --rtype ont.

singularity pull docker://mkolmogo/hapdup:0.4
HD_DIR=`pwd`
singularity exec --bind $HD_DIR hapdup_0.4.sif \
  hapdup --assembly $HD_DIR/assembly.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup -t 64 --rtype ont

Output files

The output directory will contain:

haplotype_{1,2}.fasta - final assembled haplotypes
phased_blocks_hp{1,2}.bed - phased blocks coordinates

Haplotypes generated by the pipeline contain homozygous and heterozygous variants (small and structural). Because the pipeline uses only long-read (ONT) data, it does not achieve chromosome-level phasing. Fully phased blocks are given in the phased_blocks* files.
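To get a quick sense of phasing contiguity, the phased block coordinates can be summarized as a block count, total phased length, and block N50. This is a minimal sketch that assumes the phased_blocks files are plain BED with contig, start, and end in the first three columns; the function names are hypothetical and not part of HapDup.

```python
def read_bed_intervals(lines):
    """Parse (contig, start, end) from BED lines, skipping comments and blanks."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        intervals.append((fields[0], int(fields[1]), int(fields[2])))
    return intervals


def block_stats(intervals):
    """Return (number of blocks, total phased bases, block N50)."""
    lengths = sorted((end - start for _, start, end in intervals), reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    # N50: length of the block at which the cumulative sum of the
    # length-sorted blocks first reaches half of the total.
    for length in lengths:
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return len(lengths), total, n50


if __name__ == "__main__":
    demo = ["contig_1\t0\t100", "contig_1\t150\t450", "contig_2\t0\t50"]
    print(block_stats(read_bed_intervals(demo)))
```

For real data, pass an open file handle for one of the phased_blocks files to `read_bed_intervals` instead of the demo list.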

Pipeline overview

HapDup starts by filtering out alignments that likely originate from unassembled parts of the genome. If not removed, such alignments may later create false haplotypes (e.g. reads from a segmental duplication with two copies can create four haplotypes).
Afterwards, PEPPER is used to call SNPs from the filtered alignment file.
Then we use Margin to phase the SNPs and assign reads to haplotypes.
We then use Flye to polish the initial assembly with the reads from each of the two haplotypes independently.
Finally, we find (heterozygous) breakpoints in long-read alignments and apply the corresponding structural changes to the corresponding polished haplotypes. Currently, this makes it possible to recover large heterozygous inversions.
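As an illustration of the last step, applying an inversion to a haplotype sequence amounts to reverse-complementing the interval between two breakpoints. The sketch below shows only that operation on a plain string; it is not HapDup's implementation, and the function names and the 0-based, end-exclusive coordinate convention are assumptions made for the example.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_inversion(seq, start, end):
    """Reverse-complement seq[start:end] (0-based, end-exclusive)."""
    if not 0 <= start <= end <= len(seq):
        raise ValueError("inversion coordinates outside the sequence")
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


if __name__ == "__main__":
    haplotype = "AAACCGTTT"
    print(apply_inversion(haplotype, 3, 6))
```

Applying the same inversion twice restores the original sequence, which is a convenient sanity check.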

Benchmarks

We evaluated HapDup haplotypes in terms of reconstructed structural variant signatures (heterozygous and homozygous) using HG002, for which a curated set of SVs is available. We used recent ONT data basecalled with Guppy 5.

Given the HapDup haplotypes, we called SVs using dipdiff. We also compared the SV set against hifiasm assemblies, even though they were produced from HiFi rather than ONT reads. The evaluation was done using truvari with the -r 2000 option. GT refers to genotype-considered benchmarks.

Method	Precision	Recall	F1-score	GT Precision	GT Recall	GT F1-score
Shasta+HapDup	0.9500	0.9551	0.9525	0.934	0.9543	0.9405
Sniffles	0.9294	0.9143	0.9219	0.8284	0.9051	0.8605
CuteSV	0.9324	0.9428	0.9376	0.9119	0.9416	0.9265
hifiasm	0.9512	0.9734	0.9622	0.9129	0.9723	0.9417
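The F1-score column is the harmonic mean of precision and recall. A small sketch of the calculation, using the non-GT Shasta+HapDup values from the table as the example input (the function name is hypothetical):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Shasta+HapDup, non-GT precision and recall from the table.
    print(round(f1_score(0.9500, 0.9551), 4))
```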

Yak k-mer based evaluations:

Hap	QV	Switch err	Hamming err
1	35	0.0389	0.1862
2	35	0.0385	0.1845
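Assuming the QV column is Phred-scaled (the usual convention for consensus quality, though not stated explicitly here), it converts to a per-base error rate as 10^(-QV/10). The sketch below shows the conversion in both directions; the function names are hypothetical.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    rate = qv_to_error_rate(35)
    print(f"QV 35 corresponds to about {rate:.2e} errors per base")
```

Under that assumption, QV 35 corresponds to roughly one error per 3,000 bases.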

Given a minimap2 alignment, HapDup runs in ~400 CPU hours and uses ~80 GB of RAM.

Source installation

If you prefer, you can install from source as follows:

#create a new conda environment and activate it
conda create -n hapdup python=3.8
conda activate hapdup

#get HapDup source
git clone https://github.com/fenderglass/hapdup
cd hapdup
git submodule update --init --recursive

#build and install Flye
pushd submodules/Flye/ && python setup.py install && popd

#build and install Margin
pushd submodules/margin/ && mkdir build && cd build && cmake .. && make && cp ./margin $CONDA_PREFIX/bin/ && popd

#build and install PEPPER and its dependencies
pushd submodules/pepper/ && python -m pip install . && popd

To run, ensure that the conda environment is activated and then execute:

conda activate hapdup
./hapdup.py --assembly assembly.fasta --bam lr_mapping.bam --out-dir hapdup -t 64 --rtype ont

Acknowledgements

The major parts of the HapDup pipeline are PEPPER, Margin, and Flye.

Authors

The pipeline was developed at UC Santa Cruz genomics institute, Benedict Paten's lab.

Pipeline code contributors:

Mikhail Kolmogorov

PEPPER/Margin/Shasta support:

Kishwar Shafin
Trevor Pesout
Paolo Carnevali

Citation

If you use HapDup in your research, the most relevant papers to cite are:

Kishwar Shafin, Trevor Pesout, Pi-Chuan Chang, Maria Nattestad, Alexey Kolesnikov, Sidharth Goel, Gunjan Baid et al. "Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks." bioRxiv (2021). doi:10.1101/2021.03.04.433952

Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin and Pavel Pevzner, "Assembly of Long Error-Prone Reads Using Repeat Graphs", Nature Biotechnology, 2019 doi:10.1038/s41587-019-0072-8

License

HapDup is distributed under a BSD license. See the LICENSE file for details. Other software included in this distribution is released under either MIT or BSD licenses.

How to get help

The preferred way to report problems or ask questions is the issue tracker.

In case you prefer personal communication, please contact Mikhail at [email protected].

Comments

pthread_setaffinity_np failed Error while running pepper
Hi,

I'm trying to run HapDup on my assembly from Flye. However, an error has occurred while running Pepper:

RuntimeError: /onnxruntime_src/onnxruntime/core/platform/posix/env.cc:173 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed, error code: 0 error msg:

There's also a warning before this runtime error:

/usr/local/lib/python3.8/dist-packages/torch/onnx/symbolic_opset9.py:2095: UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable length with LSTM can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model.
  warnings.warn("Exporting a model to ONNX with a batch_size other than 1, " +

Do you have any idea why this happens?

The commands that I use are like:

reads=NA24385_ONT_Promethion.fastq
outdir=`pwd`
assembly=${outdir}/assembly.fasta
hapdup_sif=../HapDup/hapdup_0.4.sif
time minimap2 -ax map-ont -t 30 ${assembly} ${reads} | samtools sort -@ 4 -m 4G > assembly_lr_mapping.bam
samtools index -@ 4 assembly_lr_mapping.bam
time singularity exec --bind ${outdir} ${hapdup_sif} \
  hapdup --assembly ${assembly} --bam ${outdir}/assembly_lr_mapping.bam --out-dir ${outdir}/hapdup -t 64 --rtype ont

Thank you
bug
opened by LYC-vio 11

Docker fails to mount HD_DIR

Hi! The following error occurs when I try to run:

sudo docker run -v $HD_DIR:$HD_DIR -u `id -u`:`id -g` mkolmogo/hapdup:0.2   hapdup --assembly $HD_DIR/barcode11CBS1878_v2.fasta --bam $HD_DIR/lr_mapping.bam --out-dir $HD_DIR/hapdup

docker: Error response from daemon: error while creating mount source path '/home/user/data/volume_2/hapdup_results': mkdir /home/user/data: file exists.
ERRO[0000] error waiting for container: context canceled

opened by alkminion1 8

Incorrect genotype for large deletion

Hi,

I have used Hapdup to make a haplotype-resolved assembly from Illumina-corrected ONT reads (haploid assembly made with Flye 2.9) and I am particularly interested in a large 32kb deletion. Here is a screenshot of IGV (from top to bottom: HAP1, HAP2 and haploid assembly):

I believe the position and size of the deletion are nearly correct. However, the deletion is homozygous while it should be heterozygous. I have assembled this proband and its parents with Hifiasm using 30x PacBio HiFi: the 3 assemblies support a heterozygous call in the proband. I can also see from the corrected ONT reads that there is support for a heterozygous call. Finally, we can see this additional contig in the haploid assembly, which I guess also supports a heterozygous call.

Hence, my question is: even if MARGIN manages to correctly separate reads with the deletion from reads without the deletion, can the Flye polishing actually "fix" such a large event in one of the haplotype assemblies?

Thanks, Guillaume

opened by GuillaumeHolley 7
invalid contig

Hi, I got an error when I ran the third step, and here is the error reporting information.

Skipped filtering phase
Skipped pepper phase
Skipped margin phase
Skipped Flye phase
Finding breakpoints
Parsed 304552 reads 14590 split reads
Running: flye-minimap2 -ax asm5 -t 64 -K 5G /usr_storage/zyl/SY_haplotype/ZSP192L/ZSP192L.fasta /usr_storage/zyl/SY_haplotype/ZSP192L/hapdup/flye_hap_1/polished_1.fasta 2>/dev/null | flye-samtools sort -m 4G -@4 > /usr_storage/zyl/SY_haplotype/ZSP192L/hapdup/structural/liftover_hp1.bam
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
Traceback (most recent call last):
  File "/usr/local/bin/hapdup", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/hapdup/main.py", line 173, in main
    bed_liftover(inversions_bed, minimap_out, open(inversions_hp, "w"))
  File "/usr/local/lib/python3.8/dist-packages/hapdup/bed_liftover.py", line 76, in bed_liftover
    proj_start_chr, proj_start_pos, proj_start_sign = project(bam_file, chr_id, chr_start)
  File "/usr/local/lib/python3.8/dist-packages/hapdup/bed_liftover.py", line 9, in project
    name, pos, sign = project_flank(bam_path, ref_seq, ref_pos, 1)
  File "/usr/local/lib/python3.8/dist-packages/hapdup/bed_liftover.py", line 23, in project_flank
    for pileup_col in samfile.pileup(ref_seq, max(0, ref_pos - flank), ref_pos + flank, truncate=True,
  File "pysam/libcalignmentfile.pyx", line 1335, in pysam.libcalignmentfile.AlignmentFile.pileup
  File "pysam/libchtslib.pyx", line 685, in pysam.libchtslib.HTSFile.parse_region
ValueError: invalid contig Contig125
bug

opened by tongyin121 7
hapdup fails with multiple primary alignments
I've got one sample in a series which is failing in hapdup with the following error.

Any suggestions?

Thanks

Starting merge
Expected three tokens in header line, got 2
This usually means you have multiple primary alignments with the same read ID.
You can identify whether this is the case with this command:

samtools view -F 0x904 YOUR.bam | cut -f 1 | sort | uniq -c | awk '$1 > 1'

Expected three tokens in header line, got 2
This usually means you have multiple primary alignments with the same read ID.
You can identify whether this is the case with this command:

samtools view -F 0x904 YOUR.bam | cut -f 1 | sort | uniq -c | awk '$1 > 1'

[2022-12-13 17:15:57] ERROR: Missing output: hapdup/margin/MARGIN_PHASED.haplotagged.bam
Traceback (most recent call last):
  File "/data/test_data/GIT/hapdup/hapdup.py", line 24, in <module>
    sys.exit(main())
  File "/data/test_data/GIT/hapdup/hapdup/main.py", line 206, in main
    file_check(haplotagged_bam)
  File "/data/test_data/GIT/hapdup/hapdup/main.py", line 114, in file_check
    raise Exception("Missing output")
Exception: Missing output
opened by mattloose 4
ZeroDivisionError

Hi,

I'm running hapdup version 0.8 on a number of human genomes.

It appears to fail fairly regularly with an error in Flye:

[2022-10-20 15:33:48] INFO: Running Flye polisher
[2022-10-20 15:33:48] INFO: Polishing genome (1/1)
[2022-10-20 15:33:48] INFO: Polishing with provided bam
[2022-10-20 15:33:48] INFO: Separating alignment into bubbles
[2022-10-20 15:37:12] ERROR: Thread exception
[2022-10-20 15:37:12] ERROR: Traceback (most recent call last):
  File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/flye/polishing/bubbles.py", line 79, in _thread_worker
    indels_profile = _get_indel_clusters(ctg_aln, profile, ctg_region.start)
  File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/flye/polishing/bubbles.py", line 419, in _get_indel_clusters
    get_clusters(deletions, add_last=True)
  File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/flye/polishing/bubbles.py", line 410, in get_clusters
    support = len(reads) / region_coverage
ZeroDivisionError: division by zero

The flye version is 2.9-b1778

Has anyone else seen this and have any suggestions on how to fix?

Thanks.
bug

opened by mattloose 4
Super cool tool! Maybe in the README mention that FASTQ files are required if using the container

As far as I know, if the BAM file was produced by aligning FASTA reads (which carry no base qualities), Pepper will fail to find any variants with the default settings of the Docker/Singularity container, because the Pepper config requires a minimum base quality.

opened by jelber2 3

Option to set minimap2 -I flag

Hi,

Ran into this error while trying to run hapdup:

[2022-03-08 10:23:36] INFO: Running: flye-minimap2 -ax asm5 -t 10 -K 5G <PATH>/assembly.fasta <PATH>/hapdup/flye_hap_1/polished_1.fasta 2>/dev/null | flye-samtools sort -m 4G -@4 > <PATH>/hapdup/structural/liftover_hp1.bam
[E::sam_parse1] missing SAM header
[W::sam_read1] Parse error at line 2
samtools sort: truncated file. Aborting
Traceback (most recent call last):
  File "/usr/local/bin/hapdup", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/hapdup/main.py", line 245, in main
    subprocess.check_call(" ".join(minimap_cmd), shell=True)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'flye-minimap2 -ax asm5 -t 10 -K 5G <PATH>/assembly.fasta <PATH>/hapdup/flye_hap_1/polished_1.fasta 2>/dev/null | flye-samtools sort -m 4G -@4 > <PATH>/hapdup/structural/liftover_hp1.bam' returned non-zero exit status 1.

I suspect it could be because of the default minimap2 -I flag being too small (4G)? If this is the case, maybe an option to specify this could be added, or adjust it automatically depending on genome size?

Thanks!

bug

opened by fellen31 3

Using Pepper-MARGIN r0.7?

Hi,

Would it be possible to include the latest version of Pepper-MARGIN (r0.7) in hapdup? I haven't been able to run hapdup on my data so far because of some issues in Pepper-MARGIN r0.6 (now solved in r0.7).

Thank you!

Guillaume
enhancement

opened by GuillaumeHolley 3
Please add CLI option to specify location of BAM index

Could you add an additional command line parameter, allowing the specification of the BAM index location? Eg,

hapdup --bam /some/where/abam.bam --bam-index /other/location/a_bam_index.bam.csi

And then pass that optional value to pysam.AlignmentFile() argument filepath_index?

Motivation: I'm wrapping hapdup and some other steps in a WDL script, and need to pass each file separately (ie, they are localized as individual files, and there is no guarantee they end up in the same directory when hapdup is invoked). The current hapdup assumes the index and bam are in the same directory, and fails.

Thanks!

CC: @0seastar0
enhancement

opened by bkmartinjr 3
Singularity

Hi,

Thank you for this tool, I am really excited to try it! Would it be possible to have hapdup available as a Singularity image or to have the Docker image available online in a container repo (such that it can be converted to a Singularity image with a singularity pull)?

Thanks, Guillaume

opened by GuillaumeHolley 3
--overwrite fails

[2022-11-03 20:16:22] INFO: Filtering alignments
Traceback (most recent call last):
  File "/data/test_data/GIT/hapdup/hapdup.py", line 24, in <module>
    sys.exit(main())
  File "/data/test_data/GIT/hapdup/hapdup/main.py", line 153, in main
    filter_alignments_parallel(args.bam, filtered_bam, min(args.threads, 30),
  File "/data/test_data/GIT/hapdup/hapdup/filter_misplaced_alignments.py", line 188, in filter_alignments_parallel
    pysam.merge("-@", str(num_threads), bam_out, *bams_to_merge)
  File "/home/plzmwl/anaconda3/envs/hapdup/lib/python3.8/site-packages/pysam/utils.py", line 69, in __call__
    raise SamtoolsError(
pysam.utils.SamtoolsError: "samtools returned with error 1: stdout=, stderr=[bam_merge] File 'hapdup/filtered.bam' exists. Please apply '-f' to overwrite. Abort.\n"

Looks as though when you run with --overwrite the command is not being correctly passed through to sub processes.
bug

opened by mattloose 2
phase block number compared to Whatshap

Hello, thank you for the great tool!

I was just testing HapDup v0.7 on our fish genome. Comparing the output with phasing done with WhatsHap (WH), I wondered why there is such a big difference in phased block size and block number between HapDup and the WH pipeline?

For the fish chromosomes, WH was generating 679 blocks using 2'689'114 phased SNPs. Margin (HapDup pipeline) was generating 5352 blocks using 3'862'108 phased SNPs.

The main difference seems to be the prior read filtering and usage of MarginPhase for the phasing in HapDup, but does this explain such a big difference?

I was wondering if the phase blocks of HapDup could be concatenated using WhatsHap SNP and block information to increase contiguity? I imagine it would be a straightforward approach: overlap SNP positions between Margin and WH with phase block IDs and lift over the phase IDs from WH. I will do some visual inspections and scripting to test whether there is overlap of called SNPs and agreement on block borders.

Cheers, Michel

opened by MichelMoser 3

Releases(0.10)

0.10(Oct 29, 2022)
Margin version update

Option to use unphased reads for polishing

Het insertion polishing improvement

A few small bugfixes

0.6(Mar 4, 2022)
Fixed issue with PEPPER model quantization causing the pipeline to hang on some systems

Speed-up of the last breakpoint analysis part of the pipeline, causing bottlenecks on some datasets

0.5(Feb 6, 2022)
Update to PEPPER 0.7

Added new option --pepper-model for custom pepper models

Added new option --bam-index to provide a non-standard path to alignment index file

0.4(Nov 19, 2021)
Added HiFi reads support


Owner

Mikhail Kolmogorov

Postdoc @ UCSC CGL, Paten lab. I work on building algorithms for computational genomics.

