Location of public benchmarking; primarily final results

HudsonAlpha Institute for Biotechnology

Last update: Jun 13, 2022


Overview

CSL_public_benchmark

This repo is intended to provide a periodically updated, public view into genome sequencing benchmarks managed by HudsonAlpha's Clinical Services Lab (CSL). The benchmarking results primarily provide the CSL a systematic approach to evaluate various reference genome, aligner, and variant caller combinations against each other. All of the datasets we used for testing were generated at HudsonAlpha. The short-read PCR-free datasets were generated using standard clinical processes in the CSL and are currently private datasets. The long-read PacBio datasets were generated by the Genome Sequencing Center and are publicly hosted through the Genome in a Bottle consortium (see below).

The benchmarks or "truth sets" themselves are large-scale publicly available benchmarks created for a handful of reference samples. Most of the benchmarks we use were generated by the Genome in a Bottle (GIAB) Consortium.

Current status

This initial release just includes the final results files that are reviewed after the pipelines have completed.

What is the pipeline?

The benchmarking pipeline itself is maintained in a private repo. Briefly, it is a snakemake pipeline that is built around a systematic final evaluation that mostly uses RTG vcfeval to measure sensitivity and precision. The primary "wildcards" in this evaluation are the reference, the aligner, and the variant caller, with versioning where appropriate. This allows us to quickly add new tools by defining new rules to run a particular tool (typically one per aligner or caller), and then evaluate in a standard way. In general, we try to use docker images or conda environments when these are already available to increase downstream portability; however, these are not always available.

As a result, many rules are tied to our cluster ecosystem through modules and/or file paths to installed software. Additionally, all the metadata (e.g. fastq pairs for a given sample) is tracked using an internal system. This means that this pipeline, even if publicly available, would not run "out-of-the-box" for anyone outside of HudsonAlpha. A very long-term goal would be to create a public version that can run out-of-the-box given user-provided metadata.

However, in the interest of transparency, we will be making efforts to clarify any questions about the implementation over time. This will largely be driven by questions we receive from the community (i.e. create issues if you have questions, so we can begin tackling this). Examples of things already on the TODO radar:

Rules for aligners and callers
Rules for evaluation
Description/links to specific reference files
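
Since the pipeline itself is private, the following is only a minimal Python sketch of the evaluation design described in this section: every (reference, aligner, caller) combination is scored the same way with sensitivity, precision, and F1. The tool labels, the function names evaluation_targets and summarize, and the counts are hypothetical and are not taken from the actual pipeline. RTG vcfeval reports baseline and call true positives separately; this sketch assumes a single true-positive count for simplicity.

```python
from itertools import product

# Hypothetical wildcard values; the real pipeline's lists are not published here.
references = ["hg38_T2T_masked"]
aligners = ["aligner_a-1.0", "aligner_b-2.0"]
callers = ["caller_x-1.0", "caller_y-0.5"]


def evaluation_targets(references, aligners, callers):
    """Return one (reference, aligner, caller) tuple per combination to evaluate."""
    return list(product(references, aligners, callers))


def summarize(tp, fp, fn):
    """Compute sensitivity (recall), precision, and F1 from variant counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = sensitivity + precision
    f1 = 2 * sensitivity * precision / denom if denom else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}


if __name__ == "__main__":
    for reference, aligner, caller in evaluation_targets(references, aligners, callers):
        # Made-up counts, used only to show the shape of the output.
        print(reference, aligner, caller, summarize(tp=90, fp=10, fn=30))
```

Adding a new tool in this scheme only means extending one of the lists; the scoring step stays unchanged, which is the property the wildcard design is meant to provide.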

Comments

truvari options

Hi,

Thanks for testing dysgu. I have a few questions about what's causing the low precision values of dysgu. First, I wanted to compare notes on how you test using truvari. I have tested with:

grep '#\|PASS' HG002.dysgu.vcf | grep '#\|DEL' > HG002.dysgu_pass.del.vcf
bgzip HG002.dysgu_pass.del.vcf
tabix HG002.dysgu_pass.del.vcf.gz
truvari bench -b HG002_SVs_Tier1_v0.6.vcf.gz --includebed HG002_SVs_Tier1_v0.6_include.bed --sizemax 260000000 --giabreport -c HG002.dysgu_pass.del.vcf.gz -o truvari_dysgu_pass_only_del --pctsim 0 -s 50 --passonly

I was also interested to know the coverage and read length of your samples. Finally, I noticed in the dysgu results that the stdev of the precision values seemed very high; for example, on hg38 GIAB masked with dragmap-1.2.1 the precision score was 0.4321+-0.2749. I'm not sure what's causing this, but possibly the insert size metrics are not being worked out properly? This should be available in the log file from dysgu.

Thanks!

opened by kcleal 10
Adds the tandem-repeat option to sniffles
This adds the tandem-repeat option under temporary name "sniffles_tr-{version}"; this will eventually replace "sniffles-{version}"
opened by holtjma 2
Deletion split
Splits deletion benchmark into RESTRICTED (requiring a high-confidence BED file) and UNRESTRICTED (no BED files, it's everything in the benchmark VCF)

RESTRICTED now only contains HG002

UNRESTRICTED has both HG001 and HG002, but with caveats around precision

Sniffles v2.0.2 happened to drop while I was testing this, so I went ahead and added it (it ran quite fast)
opened by holtjma 2
Benchmark metadata
Adds a benchmarks folder to describe where benchmark files came from

Small variants and CMRG both have a simple shell script for downloading the exact files used

Deletion files are originally in hg19, and a semi-manual process was used to liftover the files to hg38. The steps for this process are described in the README and the final VCF files are stored in this repo due to their relatively small size.
opened by holtjma 1
References metadata
Primarily adds reference metadata for the two primary reference files we are currently testing with:

hg38_asm5_alt - includes scripts for download, creating ALT contigs, and the final alt files

hg38_GIAB_masked - includes scripts for download and the dummy alt file
opened by holtjma 1
Add more information regarding the references

We've had some questions about the reference origins. We should probably add links to files and bash scripts, where applicable, describing how the references were acquired.
documentation

opened by holtjma 1
Release 20220225
Adds hg38_T2T_masked reference genome and metadata around generated corresponding reference files

Benchmark results adds hg38_T2T_masked results as well

On the backend, Truvari was updated to v3.1.0; this did not seem to have a significant impact on results
opened by holtjma 0
adding results files
Added haplotyping results for caller cyrius

Updated versions of pbmm2 and pbsv; there are some changes associated with the results of these, so the previous version is maintained for this release
opened by holtjma 0

Releases(2022-06-10)

2022-06-10(Jun 10, 2022)
Tool changes:

Adds DeepVariant v1.4.0 to both short- and long-read tests

results_20220610.pdf(149.96 KB)
small_summary_20220610.csv(6.74 KB)
2022-05-19(May 19, 2022)
This release primarily adds two new samples to the PCR-free datasets. The following updates occurred as a result of this change:

Adds two new samples to our PCR-free datasets corresponding to HG006 and HG007

PCR-free results all shifted slightly (the vast majority to slightly worse performance); we did not notice any drastic changes across the results; all average sensitivities, precisions, and F1-scores shifted <0.001

Updates our expected CYP2D6 outputs to include expectations for HG006 and HG007

results_20220519.pdf(143.74 KB)
small_summary_20220519.csv(5.36 KB)
2022-04-29(Apr 29, 2022)
Adds dysgu v1.3.10 pass-only to the SV callers for all data types

Removed full dysgu results from evaluations, pass-only is recommended

results_20220429.pdf(144.49 KB)
small_summary_20220429.csv(5.31 KB)
2022-04-08(Apr 8, 2022)
Adds clair3-v0.1-r11 to both Illumina and PacBio test sets; thanks to @zhengzhenxian for assistance and quick TAT while debugging some issues!

Illumina results show a slight drop in recall

PacBio results are borderline identical, but only required ~60% of the compute time compared to r9

results_20220408.pdf(150.14 KB)
small_summary_20220408.csv(6.65 KB)
2022-03-18(Mar 18, 2022)
Adds PEPPER-Margin-DeepVariant r0.8

Removes the hg38_GIAB_masked reference; it is now recommended to use hg38_T2T_masked instead

results_20220318.pdf(146.21 KB)
small_summary_20220318.csv(5.93 KB)
2022-02-25(Feb 25, 2022)
Reference changes:

Adds the hg38_T2T_masked reference which is version 2 of the hg38_GIAB_masked reference. A brief description and direct download links are provided with the reference metadata.

The hg38_T2T_masked results tend to be very slightly better than the v1 results, so hg38_GIAB_masked will likely be retired in a future release.

Method changes:

Truvari was updated to v3.1.0 after the release of a Truvari preprint on bioRxiv. This had a negligible impact on results.

results_20220225.pdf(164.79 KB)
small_summary_20220225.csv(8.97 KB)
2022-02-18(Feb 18, 2022)
Software changes:

Updated all other Sentieon-based processes to v202112.01; the vast majority of associated results did not change at all with this update

results_20220218.pdf(147.06 KB)
small_summary_20220218.csv(6.01 KB)
2022-02-11(Feb 11, 2022)
Software changes:

Added dnascope-1.0-202112.01-PO for PCR-free datasets; dnascope-0.5-202010.04-PO will be removed in future releases. Additionally, the pass-only filter (e.g. -PO) is recommended for DNAscope, so the unfiltered version has been removed from reporting. Thanks to @DonFreed for the recommendations!

Added dysgu-1.3.4-PO, which is a pass-only filtered version of dysgu-1.3.4, for PCR-free and PacBio datasets. Additionally, the pass-only filter (e.g. -PO) is recommended for dysgu, so the unfiltered version will be removed in future releases. Thanks to @kcleal for the recommendations!

Other changes:

Added a note in the README on release cadence. In order to reduce overhead, going forward we will limit formal releases to at most once a week. New or partial results may appear through the week with the intention to summarize any changes in the weekly release.

results_20220211.pdf(157.26 KB)
small_summary_20220211.csv(8.80 KB)
2022-02-09(Feb 9, 2022)
Software changes:

Adds a temporary sniffles_tr-2.0.2 that incorporates the --tandem-repeats option using the same file as pbsv. This method is shown beside sniffles-2.0.2 (no repeat file) for this release to demonstrate the impact of the repeat file on variant calling. It will replace sniffles-2.0.2 in the next release. Thanks to @fritzsedlazeck for the suggestion!

results_20220209.pdf(154.79 KB)
small_summary_20220209.csv(8.83 KB)
2022-02-08(Feb 8, 2022)
Method change:

Splits deletion benchmark into RESTRICTED (requiring a high-confidence BED file) and UNRESTRICTED (no BED files, it's everything in the benchmark VCF)

RESTRICTED now only contains HG002 with the Tier1 regions. This set is more fair when judging the precision of the aligner/caller pair.

UNRESTRICTED has both HG001 and HG002, but with caveats around precision. This set includes more total variants and two samples, but precision is less accurate.
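
The reason precision is less reliable in the UNRESTRICTED set can be illustrated with a small Python sketch. It is not the pipeline's method: the actual comparison uses Truvari with its own matching rules, whereas this toy assumes exact (chromosome, position) matching, and the names in_regions and precision as well as all data values are hypothetical. A call that falls outside the high-confidence regions is counted as a false positive when no BED file is applied, even though the benchmark makes no claim there.

```python
def in_regions(chrom, pos, regions):
    """True if a 1-based position falls inside any 0-based, half-open BED interval."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def precision(calls, truth, regions=None):
    """Fraction of calls found in the truth set, optionally restricted to regions."""
    if regions is not None:
        calls = [c for c in calls if in_regions(c[0], c[1], regions)]
    if not calls:
        return 0.0
    true_positives = sum(1 for c in calls if c in truth)
    return true_positives / len(calls)


# Toy data (hypothetical): two calls match the benchmark, and one call lies
# outside the high-confidence regions where its status is simply unknown.
truth = {("chr1", 1000), ("chr1", 5000)}
calls = [("chr1", 1000), ("chr1", 5000), ("chr1", 90000)]
high_conf = [("chr1", 0, 10000)]

restricted = precision(calls, truth, high_conf)  # 2 of 2 calls match
unrestricted = precision(calls, truth)           # 2 of 3 calls match

if __name__ == "__main__":
    print(f"RESTRICTED precision:   {restricted:.3f}")
    print(f"UNRESTRICTED precision: {unrestricted:.3f}")
```

In this toy case the third call may well be a real variant, but without the BED restriction it lowers the reported precision.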

Software addition:

Sniffles v2.0.2 was added as a new caller

results_20220208.pdf(153.97 KB)
small_summary_20220208.csv(8.83 KB)
2022-02-07(Feb 7, 2022)
Primarily adds dysgu v1.3.4 to both PCR-free and PacBio deletion benchmarks

Secondarily added combinations in PCR-free that were not previously available (e.g. octopus is now paired with all active aligner combinations)

results_20220207.pdf(144.96 KB)
small_summary_20220207.csv(8.83 KB)
2022-01-28(Jan 28, 2022)
Added the first haplotyping caller to our results with cyrius-1.1.1; note that this caller is designed to work on short-read datasets and the upstream tooling (both reference and aligner) can have significant impact on its performance

Updated versions of pbmm2 (1.4.0 -> 1.7.0) and pbsv (2.6.2 -> 2.8.0); performance differs somewhat from the previous versions, so the previous versions are retained in this release and will be removed in subsequent releases

2022-01-07(Jan 7, 2022)
Two variant caller updates:

Clair3 was updated to v0.1-r9: Our previous version was v0.1-r5, and it was running in a conda environment after some back-and-forth with the developers. They now have a docker image that is much easier to use, so we have switched to that for both the Illumina and PacBio tests.

PEPPER-Margin-DeepVariant was added as a full caller at version r0.7: Previously, we were treating this process as a BAM modifier (basically for phasing) and ignoring any variant calling results. With this change, it now operates as a variant caller and the VCF is analyzed with the rest of the callers. We are using the developer-released docker image for our analysis.

We have removed old versions of both tools to avoid any confusion around the analysis implementation

results_20220107.pdf(145.82 KB)
2021-12-17(Dec 17, 2021)
Two main results changes:

Adds the dnascope-0.5-202010.04-PO variant caller: This is the same data as dnascope-0.5-202010.04 but with a PASS-only filter applied to the VCF file. The short-read DNAscope caller uses the FILTER field to annotate variants that are rejected by the model as likely false positives. This has a significant impact on the results and is recommended as best practice by the developers. Thanks to @DonFreed for helping diagnose the issue!

Updates minimap2 aligner from v2.22 to v2.23: Overall, this had minimal impact in our benchmark. v2.22 will be retired from the benchmark next release.
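
The PASS-only filter behind the -PO caller variant can be sketched in a few lines of Python. This is an illustration only and not the pipeline's implementation, which is not public: the function name pass_only and the example records are hypothetical, and "LowQual" is a made-up FILTER value standing in for whatever a caller writes for rejected variants.

```python
def pass_only(vcf_lines):
    """Yield header lines plus records whose FILTER column is exactly PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Keep all meta-information and column header lines unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th tab-separated column of a VCF record.
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


# Hypothetical records for illustration.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t10\tLowQual\t.\n",
]

kept = list(pass_only(example))

if __name__ == "__main__":
    print("".join(kept), end="")
```

Checking the FILTER column directly is stricter than matching the text PASS anywhere on a line. Note that this sketch also drops records whose FILTER is "." (no filtering applied), which may or may not be desired for a given caller.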

results_20211217.pdf(145.62 KB)
2021-12-13(Dec 13, 2021)
Metadata changes:

Adds a references folder for tracking references that are used in the analysis

Adds the hg38_asm5_alt reference including links to the reference and a script demonstrating how the ALT contigs were remapped

Adds the hg38_GIAB_masked reference including links to the reference and a dummy ALT file used for the pipeline

Results changes:

Added the recently released SNAP-2.0.0 aligner; this was run with the -hc- option, so GATK-based results are expected to be less accurate

results_20211213.pdf(140.63 KB)
