An adaptable Snakemake workflow that uses GATK's best-practice recommendations to perform germline mutation calling starting from BAM files

Overview

Germline Mutation Calling

This Snakemake workflow follows the GATK best-practice recommendations to call small germline variants.

The pipeline requires aligned BAM files as input (e.g. aligned with BWA) in which duplicates are already marked (e.g. with Picard or sambamba). It then performs Base Quality Score Recalibration and joint genotyping of multiple samples, which is automatically parallelized over user-defined intervals (for examples see intervals.txt) and chromosomes.

Filtering is performed with GATK's Variant Quality Score Recalibration (VQSR).

At the end of the workflow, the Variant Effect Predictor (VEP) is used to annotate the identified germline mutations.

A high-level overview of the performed steps is shown below:

DAG

As the execution graph shows, an arbitrary number of samples/BAM files can be processed in parallel up to the joint variant calling step.
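The scatter pattern can be pictured with a few lines of Python. This is only an illustrative sketch: the shard names and the count of 50 are inferred from the log file names shown in the Output section, the chromosome list (1-22, X, Y) is an assumption, and none of the variable names come from the workflow's Snakefile.

```python
# Hypothetical sketch: names and counts are inferred from the example log files,
# not taken from the workflow itself.
samples = ["H005-00ML"]  # any number of samples could be listed here
n_shards = 50            # assumption: shards 0000 .. 0049, as in the example logs
shards = [f"{i:04d}-scattered.interval_list" for i in range(n_shards)]
chromosomes = [str(c) for c in range(1, 23)] + ["X", "Y"]  # assumption: b37 naming

# Per-sample steps are scattered over interval shards.
per_sample_jobs = [
    (rule, sample, shard)
    for rule in ("BaseRecalibrator", "ApplyBQSR", "HaplotypeCaller")
    for sample in samples
    for shard in shards
]

# Cohort-level steps are scattered over chromosomes.
cohort_jobs = [
    (rule, chrom)
    for rule in ("CombineGVCFs", "GenotypeGVCFs")
    for chrom in chromosomes
]

print(len(per_sample_jobs), "per-sample jobs,", len(cohort_jobs), "cohort jobs")
```

Adding a second sample to the samples list doubles the number of per-sample jobs, while the number of cohort jobs stays the same.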

Installation

Required tools:

Most of the listed tools can be installed easily with conda, which is recommended.

Usage

First, modify the config_wgs.yaml and resources.yaml files. Both files contain detailed descriptions of what is expected. The config_wgs.yaml file also contains links to some reference resources. Be careful: they are all specific to the GRCh37/hg19/b37 genome assembly.

After setting up all the config files and installing all tools, you can simply run:

snakemake --latency-wait 300 -j 5 --cluster "sbatch --mem={resources.mem_mb} --time {resources.runtime_min} --cpus-per-task {threads} --job-name={rule}.%j --output snakemake_cluster_submit_log/{rule}.%j.out --mail-type=FAIL"

This assumes that the cluster you are using runs SLURM. If this is not the case, you have to adjust the command after --cluster. The log output of each job is saved in the snakemake_cluster_submit_log directory. This directory is not created automatically, so create it before starting the workflow.
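If you prefer to start the run from a small launcher script, the following sketch creates the log directory and assembles the same command. The function name build_command and its parameters are hypothetical; only the snakemake and sbatch options are copied from the command above.

```python
import shlex
from pathlib import Path

LOG_DIR = Path("snakemake_cluster_submit_log")

# sbatch template copied from the command above; Snakemake fills in the {...} fields.
CLUSTER_CMD = (
    "sbatch --mem={resources.mem_mb} --time {resources.runtime_min} "
    "--cpus-per-task {threads} --job-name={rule}.%j "
    "--output snakemake_cluster_submit_log/{rule}.%j.out --mail-type=FAIL"
)


def build_command(jobs=5, latency_wait=300):
    """Return the snakemake call as an argument list (hypothetical helper)."""
    return [
        "snakemake",
        "--latency-wait", str(latency_wait),
        "-j", str(jobs),
        "--cluster", CLUSTER_CMD,
    ]


# The directory must exist before jobs are submitted.
LOG_DIR.mkdir(exist_ok=True)

command = build_command()
print(shlex.join(command))
```

The script only prints the command. To actually submit, the list could be passed to subprocess.run(command, check=True) from the directory that contains the Snakefile.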

-j specifies the maximum number of jobs/rules that are submitted in parallel.

I recommend running this command in a detached session with tmux or screen.

Output

Below is the output of the tree command after the workflow has finished for one patient, H005-00ML. Usually you would include many patients simultaneously (>50); this example only illustrates which output files are created.

.
├── cohort
│ ├── benchmark
│ │ ├── ApplyVQSR_indel.txt
│ │ ├── ApplyVQSR_snp.txt
│ │ ├── CombineGVCFs.txt
│ │ ├── GenotypeGVCFs.txt
│ │ ├── MergeCohortVCFs.txt
│ │ ├── SelectVariants.txt
│ │ ├── VEP.txt
│ │ ├── VQSR_indel.txt
│ │ └── VQSR_snp.txt
│ ├── cohort.recalibrated.pass.vep.vcf.gz
│ ├── cohort.recalibrated.pass.vep.vcf.gz_summary.html
│ ├── cohort.recalibrated.vcf.gz
│ ├── cohort.recalibrated.vcf.gz.tbi
│ └── logs
│     ├── ApplyVQSR_indel.out
│     ├── ApplyVQSR_snp.out
│     ├── CombineGVCFs
│     ├── CombineGVCFs.1.out
│     ├── CombineGVCFs.2.out
│     ├── ...
│     ├── ...
│     ├── CombineGVCFs.Y.out
│     ├── GenotypeGVCFs.1.out
│     ├── GenotypeGVCFs.2.out
│     ├── ...
│     ├── ...
│     ├── GenotypeGVCFs.Y.out
│     ├── MakeSitesOnly.out
│     ├── MergeCohortVCFs.out
│     ├── SelectVariants.err
│     ├── VEP.out
│     ├── VQSR_indel.out
│     └── VQSR_snp.out
├── config
│ ├── config_wgs.yaml
│ └── resources.yaml
├── H005-00ML
│ ├── benchmark
│ │ ├── ApplyBQSR.txt
│ │ ├── BaseRecalibrator.txt
│ │ ├── GatherBQSRReports.txt
│ │ ├── GatherRecalBamFiles.txt
│ │ ├── HaplotypeCaller.txt
│ │ ├── IndexBam.txt
│ │ ├── MergeHaplotypeCaller.txt
│ │ └── SortBam.txt
│ ├── H005-00ML.germline.merged.g.vcf.gz
│ ├── H005-00ML.germline.merged.g.vcf.gz.tbi
│ └── logs
│     ├── ApplyBQSR
│     ├── ApplyBQSR.0000-scattered.interval_list.out
│     ├── ApplyBQSR.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── ApplyBQSR.0049-scattered.interval_list.out
│     ├── BaseRecalibrator
│     ├── BaseRecalibrator.0000-scattered.interval_list.out
│     ├── BaseRecalibrator.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── BaseRecalibrator.0049-scattered.interval_list.out
│     ├── GatherBQSRReports.out
│     ├── GatherRecalBamFiles.out
│     ├── HaplotypeCaller
│     ├── HaplotypeCaller.0000-scattered.interval_list.out
│     ├── HaplotypeCaller.0001-scattered.interval_list.out
│     ├── ...
│     ├── ...
│     ├── HaplotypeCaller.0049-scattered.interval_list.out
│     ├── IndexBam.out
│     ├── MergeHaplotypeCaller.out
│     └── SortBam.out
├── rules
│ ├── BaseQualityScoreRecalibration.smk
│ ├── JointGenotyping.smk
│ ├── VEP.smk
│ └── VQSR.smk
├── Snakefile
├── snakemake_cluster_submit_log
│ ├── ApplyBQSR.24720887.out
│ ├── ApplyVQSR_snp.24777265.out
│ ├── BaseRecalibrator.24710227.out
│ ├── CombineGVCFs.24772984.out
│ ├── GatherBQSRReports.24715726.out
│ ├── GatherRecalBamFiles.24722478.out
│ ├── GenotypeGVCFs.24773026.out
│ ├── HaplotypeCaller.24769848.out
│ ├── IndexBam.24768728.out
│ ├── MergeCohortVCFs.24776018.out
│ ├── MergeHaplotypeCaller.24772183.out
│ ├── SelectVariants.24777733.out
│ ├── SortBam.24768066.out
│ ├── VEP.24777739.out
│ ├── VQSR_indel.24776035.out
│ └── VQSR_snp.24776036.out

For each analyzed patient, a separate directory is created. Along with the patient-specific gVCF file, this directory contains log files for all processing steps that were performed for that patient (logs directory) as well as benchmarks for each rule, e.g. how long the step took or how much CPU/RAM was used (benchmark directory).
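The benchmark files can be summarized with a short script. The sketch below assumes Snakemake's usual tab-separated benchmark layout (columns such as s and max_rss); the numbers in the example are made up, and summarize_benchmark is a hypothetical helper, not part of the workflow.

```python
import csv
import io

# Made-up example content; a real file would be e.g.
# H005-00ML/benchmark/HaplotypeCaller.txt.
example = (
    "s\th:m:s\tmax_rss\tmax_vms\tmax_uss\tmax_pss\tio_in\tio_out\tmean_load\tcpu_time\n"
    "5400.25\t1:30:00\t8192.50\t16384.00\t8000.00\t8100.00\t1024.00\t512.00\t95.00\t5130.00\n"
)


def summarize_benchmark(handle):
    """Return run count, total wall-clock seconds, and peak RSS (MB)."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    return {
        "runs": len(rows),
        "total_seconds": sum(float(r["s"]) for r in rows),
        "peak_rss_mb": max(float(r["max_rss"]) for r in rows),
    }


summary = summarize_benchmark(io.StringIO(example))
print(summary)
```

For a real file, replace the io.StringIO object with an opened file handle, for example open(path, newline="").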

The cohort directory contains the multi-sample VCF file, which is created by the joint variant calling. The cohort.recalibrated.vcf.gz file is the product of GATK's Variant Quality Score Recalibration. The cohort.recalibrated.pass.vep.vcf.gz file is the filtered and VEP-annotated version of cohort.recalibrated.vcf.gz (only variants with PASS are kept).

For most applications, cohort.recalibrated.pass.vep.vcf.gz is the file you want to continue working with.
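To make the PASS filter concrete, here is a minimal sketch of what keeping only PASS variants means at the level of VCF lines. It is not how the workflow does it (the rule names suggest GATK SelectVariants is used); the tiny VCF and the keep_pass function are made up for illustration.

```python
import io

# Tiny made-up VCF used only for illustration.
example_vcf = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t1000\t.\tA\tG\t50\tPASS\t.
1\t2000\t.\tC\tT\t12\tVQSRTrancheSNP99.90to100.00\t.
2\t3000\t.\tG\tGA\t80\tPASS\t.
"""


def keep_pass(lines):
    """Yield header lines and records whose FILTER column (7th field) is PASS."""
    for line in lines:
        if line.startswith("#"):
            yield line
        elif line.rstrip("\n").split("\t")[6] == "PASS":
            yield line


kept = list(keep_pass(io.StringIO(example_vcf)))
records = [line for line in kept if not line.startswith("#")]
print(len(records), "of 3 records kept")
```

For the compressed output files, the same function could be applied to a handle opened with gzip.open(path, "rt").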
