SSrehab

dependencies:

python 3.8+
a GNU/Linux with bash v4 or 5.
python packages in requirements.txt
bcftools (only for prepare_dbSNPs)
gz-sort (only for prepare_dbSNPs)

Installation and basics

clone this repo

git clone https://github.com/Kukuster/SSrehab.git

install requirements

pip install -r requirements.txt

run the eponymous script in the cloned directory using the following syntax:

python3 SSrehab.py <command> [keys]

Use diagnose to check the validity of entries in the GWAS SS file.

Use fix to restore missing/invalid data in the GWAS SS file.

Use prepare_dbSNPs to preprocess a given dbSNP dataset into 2 datasets, which are used in the fix command.

Use sort to format the input GWAS SS file and sort either by Chr and BP or by rsID.

To use the fix command to its fullest, a user needs:

SNPs datasets in the target build, preprocessed with the prepare_dbSNPs command.
chain file, if the GWAS SS file is provided in build different from the target build

Tutorial

1. Download dbSNP dataset

Download dbSNP datasets from NCBI, in the target build, in vcf, vcf.gz, bcf, or bcf.gz format. The latest versions are recommended. dbSNP datasets are used to restore the following data: Chr, BP, rsID, OA, EA, EAF. Although only builds 37 and 38 are explicitly supported, build 36 may work as well.

For example, curently latest datasets for build 38 and build 37 can be downloaded here:

https://ftp.ncbi.nih.gov/snp/latest_release/VCF/

2. Download the chain file

A chain file is necessary to perform liftover. If a GWAS SS file is provided in the target build, then a chain file is not used.

3. Preprocess dbSNPs datasets

3.1 Download and install bcftools and gz-sort

see instructions on their websites and/or githubs

recommended bcftools version: 1.11

NOTE: after preprocessing of the necessary dbSNPs is finished, these tools are no longer needed

3.2 Run preprocessing

Run prepare_dbSNPs using the following syntax:

python3 SSrehab.py prepare_dbSNPs --dbsnp DBSNP --OUTPUT OUTPUT --gz-sort GZ_SORT --bcftools BCFTOOLS
                                  [--buffer BUFFER]

where:

DBSNP is the dbSNP dataset in vcf, vcf.gz, bcf, or bcf.gz format referencing build 38 or 37
OUTPUT is the base name for the two output dbSNPs datasets
GZ_SORT is a path to the gz-sort executable
BCFTOOLS is a path to the bcftools executable
BUFFER is buffer size for sorting (size of presort), supports k/M/G suffix. Defaults to 1G. Recommended: at least 200M, ideally: 4G or more

Depending on the size of the dataset, specified buffer size, and specs of the machine, preprocessing may take somewhere from 30 minutes to 6 hours.

After preprocessing, steps 4 and 5 may be repeated ad-lib.

4. Create a config file for your GWAS SS file

Config file is used as meta data for GWAS SS file, and contains:

columns' indices (indices start from 0)
input build slug (such as "GRCh38", "GRCh37", "hg18", "hg19")

This config file has to have the same file name as the GWAS SS file but with an additional .json extension.

For example, if your GWAS SS file is named WojcikG_PMID_htn.gz, and the first 5 lines in the unpacked file are:

Chr     Position_hg19   SNP     Other-allele    Effect-allele   Effect-allele-frequency Sample-size     Effect-allele-frequency-cases   Sample-size-cases       Beta    SE      P-val    INFO-score      rsid
1       10539   1:10539:C:A     C       A       0.004378065     49141   0.003603676     27123   -0.1041663      0.1686092       0.5367087       0.46    rs537182016
1       10616   rs376342519:10616:CCGCCGTTGCAAAGGCGCGCCG:C      CCGCCGTTGCAAAGGCGCGCCG  C       0.9916342       49141   0.9901789       27123   -0.1738814      0.109543        0.1124369        0.604   rs376342519
1       10642   1:10642:G:A     G       A       0.006042409     49141   0.007277901     27123   0.1794246       0.1482529       0.226179        0.441   rs558604819
1       11008   1:11008:C:G     C       G       0.1054568       49141   0.1042446       27123   -0.007140072    0.03613677      0.84337 0.5     rs575272151

your config file should have the name WojcikG_PMID_htn.gz.json and the following contents:

{
    "Chr": 0,
    "BP": 1,
    "rsID": 13,
    "OA": 3,
    "EA": 4,
    "EAF": 5,
    "beta": 9,
    "SE": 10,
    "pval": 11,
    "INFO": 12,

    "build": "grch37"
}

Notes:

SSrehab will only consider data from the columns which indices are specified in the config file. If one of the above columns is present in the SS file but wasn't specified in the config file, then SSrehab treats the column as missing.
In this example, all the 10 columns from the list of supported columns are present. But none of the columns above are mandatory. If certain columns are missing, the fix command will attempt to restore them if possible.

5. Run the `fix` command

When the config file is created, and dbSNP datasets are preprocessed, the chain file is downloaded if necessary, then the fix command can use all its features.

Although it is normally a part of the execution of the fix command, a user may choose to manually run diagnose beforehand.

If diagnose is ran without additional arguments, it is "read-only", i.e. doesn't write into the file system.

Run diagnose as follows:

python3 SSrehab.py diagnose --INPUT INPUT_GWAS_FILE

where INPUT_GWAS_FILE is the path to the GWAS SS file with the corresponding config file at *.json

as a result, it will generate the main plot: stacked histogram plot, and an additional bar chart plot for each of the bins in the stacked histogram plot.

These plots will pop up in a new matplotlib window.

The stacked histogram maps the number of invalid SNPs against p-value, allowing assessment of the distribution of invalid SNPs by significance. On the histogram, valid SNPs are shown as blue, and SNPs that have issues are shown as red. The height of the red plot over each bin with the red caption represents the proportion of invalid SNPs in the corresponding bin.

A bar chart is generated for each bin of the stacked histogram plot and reports the number of issues that invalidate the SNP entries in a particular bin.

If a Linux system runs without GUI, the report should be saved on the file system. For this, run the command as follows:

python3 SSrehab.py diagnose --INPUT INPUT_GWAS_FILE --REPORT-DIR REPORT_DIR

where REPORT_DIR is an existing or not existing directory under which the generated report will be contained. When saved onto a disk, the report also includes a small table with exact numbers of invalid fields and other issues in the GWAS SS file.

Finally, a user may want to decide to run the fix command.

A user should run the fix command as follows:

python3 SSrehab.py fix --INPUT INPUT_GWAS_FILE --OUTPUT OUTPUT_FILE
                       [--dbsnp-1 DBSNP1_FILE] [--dbsnp-2 DBSNP2_FILE]
                       [--chain-file CHAIN_FILE]
                       [--freq-db FREQ_DATABASE_SLUG]

where:

INPUT_GWAS_FILE is the input GWAS SS file with the corresponding .json config file create at step 4
OUTPUT_FILE is the base name for the fixed file(s)
DBSNP1_FILE is a path to the preprocessed dbSNP #1
DBSNP2_FILE is a path to the preprocessed dbSNP #2
CHAIN_FILE is a path to the chain file
FREQ_DATABASE_SLUG is a population slug from a frequency database in dbSNP

example:

python3 SSrehab.py fix --INPUT "29559693.tsv" --OUTPUT "SSrehab_fixed/29559693" --dbsnp-1 "dbSNP_155_b38.1.tsv.gz" --dbsnp-2 "dbSNP_155_b38.2.tsv.gz" --chain-file "hg19_to_hg38.chain" --freq-db TOPMED

As the normal process of fix, a report will be generated for the input file, as well as for the file after each step of processing. Depending on the availability of invalid/missing data in the GWAS SS file and the input arguments, a different number of steps may be required for a complete run of the fix command, with 1 or 2 loops performed on the GWAS SS file. All steps are performed automatically without prompt. The process of fixing is represented in logging to the standard output and may take anywhere from 5 minutes to 1.5 hours, depending on the size of the file and the number of steps.

As a result, if 1 loop was required to fix the file, then the resulting file will be available with the suffix .SSrehabed.tsv. If 2 loops were required, then the resulting file is available with the suffix .SSrehabed-twice.tsv.

The report made with a diagnose command will be available in a separate directory for:

the input file
for the file after 1 loop of fixing
for the file after 2 loops of fixing (applicable only if 2 loops were required)

Manual

Please refer to the instructions by running

python3 SSrehab.py -h

python3 SSrehab.py <command> -h

NOTES

"standard" format

file is in the tsv format, i.e. tabular tab-separated format (bare, zipped, or gzipped)
there's a one-line header in the file on the first line. All other lines are the data entries
the file has precisely columns defined as STANDARD_COLUMN_ORDER in lib/standard_column_order.py.
- file has exactly these columns, exactly this number of columns, and no other columns
- columns are in this exact order
- if the original file was missing a column, an empty column should be taking its place (entries are the empty string)

BACKLOG

upon execution of the fix command, a config file has to be generated with all the names of the intermediary files. This will improve refactoring into the actual pipeline.
(maybe) improve restoring alleles by adding checks for an exact match of flipped alleles if other checks didn't help. This requires having all SNPs for a particular ChrBP in the memory and is relevant only for restoring alleles by looping through the file sorted by Chr and BP.
add the ability to specify additional columns from the GWAS SS file that the user wants to include in the end file. This would be an array of integers in the json config file for the input GWAS SS file.
improve code in the main file: SSrehab.py
improve resolver architecture in loop_fix.py: make a separate function loopDB1 and loopDB2 that will loop through enough entries in a DB before every resolver and rewrite a "global" object with properties to be fields from the DB: rsID, Chr, BP, alleles, EAF. So resolvers for rsID and ChrBP will be similar to ones for alleles and EAF. Resolvers for these fields then should operate on fields and that object with fields from a DB. This way a really strong optimization, flexibility, and modularity of resolvers will be achieved. run_all doesn't have to have resolvers and resolvers_args object to be passed, it can just use the global ones.
improve the interface for liftover. SSrehab fix should work for all sorts of liftovers between builds 36, 37, and 38, including back liftover. If the user omits the preprocessed dbSNP databases as input but specifies the chain file, it can perform liftover only.
add support for OR, and, maybe, restoration of OR from beta or vice versa.
add a keyword argument that will cause SSrehab fix to clean up all intermediate files and leave only the last resulting file after the processing.
add a keyword argument that specifies a temp directory for intermediate files. GWAS SS files are usually 1-4 Gigs unpacked.
set alleles column to uppercase during preparation (in prepare_GWASSS_columns.py script).
feature: save a human-readable textual report about the overall results of restoration (e.g. "performed a liftover, n rsIDs restored, n Chrs lost, ...")
add a WARNING that beta will be restored with an accurate sign only when the standard error is signed.
at the moment of 2021.11.14, the following executables are assumed to be available in PATH: bash, cut, paste, sort, awk, gzip, gunzip, head, tail, rm, wc. Need to test SSrehab with a different versions of bash, awk (including gawk, nawk, mawk. E.g. even though gawk is default for GNU/Linux, Ubuntu has mawk by default).
make SSrehab installable via pip

The fix command fails halfway through execution, at Step 7. Error log:

  === Step 6: Analyze the report after REHAB ===
Going to sort the GWAS SS file by Chr and BP
Sorted by Chr and BP
Step 6 finished in 115.71317148208618 seconds 

 === Step 7: REHAB: loopping through the GWAS SS file again and fixing entries ===
An error occured while looping through the SNPs file (see below)
An error occured on line 1 of the GWAS SS file (see below)
Traceback (most recent call last):
 File "/usr/local/bin/SumStatsRehab", line 11, in <module> load_entry_point('SumStatsRehab==1.1.2', 'console_scripts', 'SumStatsRehab')()
 File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.1.2-py3.8.egg/SumStatsRehab.py", line 722, in main fix(args.INPUT_GWAS_FILE, args.OUTPUT_FILE,
 File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.1.2-py3.8.egg/SumStatsRehab.py", line 437, in fix loop_fix(
 File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.1.2-py3.8.egg/lib/loop_fix.py", line 813, in loop_fix raise e
 File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.1.2-py3.8.egg/lib/loop_fix.py", line 804, in loop_fix run_all(resolvers, fields, resolvers_args)
 File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.1.2-py3.8.egg/lib/loop_fix.py", line 696, in run_all resolvers[res_i](fields, *args[res_i])
 File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.1.2-py3.8.egg/lib/loop_fix.py", line 616, in resolve_rsID raise e
 File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.1.2-py3.8.egg/lib/loop_fix.py", line 589, in resolve_rsID chr_snps, bp_snps, rsid, ref, alt, freq = read_dbSNP1_data_row(SNPs_FILE_o)
 File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.1.2-py3.8.egg/lib/loop_fix.py", line 253, in read_dbSNP1_data_row int(words[1]), # BP
ValueError: invalid literal for int() with base 10: 'ID'

Reported by Mahantesh B

bug

[minor] if an issue with p-value filed was identified, then issues in other fields on the same row are not counted in the report

In the diagnose command (validate_GWASSS_entries.py) during the diagnosis of problems with a SNP entry (a row in a table), if an issue with p-value was identified (which is a first field that is checked), then other issues in the same row are not counted and not included in the invalid_entries.csv report.

The key block of the code that implements this behavior is at line 229 in the file validate_GWASSS_entries.py.
bug

opened by Kukuster 0
Fix command fails at step 2 for a file missing rsID and pval

Step 2: Validate entries in the formatted GWAS SS file and save the report === number of lines in the file: 28987535 validating entries : 100%|███████████████████████████████████████████████████████████████████████████████████████████| 28987534/28987534 [03:02<00:00, 159012.85it/s] calculating reports: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 28987534/28987534 [01:04<00:00, 446130.26it/s] generating reports found issues: rsID: 28987534/28987534 (100.0%) EAF: 5145971/28987534 (17.75%) SE: 5145971/28987534 (17.75%) beta: 5145971/28987534 (17.75%) pval: 28987534/28987534 (100.0%) Traceback (most recent call last): File "/usr/local/bin/SumStatsRehab", line 11, in <module> load_entry_point('SumStatsRehab==1.2.1', 'console_scripts', 'SumStatsRehab')() File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.2.1-py3.8.egg/SumStatsRehab.py", line 730, in main fix(args.INPUT_GWAS_FILE, args.OUTPUT_FILE, File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.2.1-py3.8.egg/SumStatsRehab.py", line 199, in fix validate_GWASSS_entries( File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.2.1-py3.8.egg/lib/validate_GWASSS_entries.py", line 551, in validate_GWASSS_entries write_report_to_dir(issues_count, REPORT_ABS_DIR) File "/usr/local/lib/python3.8/dist-packages/SumStatsRehab-1.2.1-py3.8.egg/lib/report_utils.py", line 23, in write_report_to_dir with open(os.path.join(REPORT_DIR, INVALID_ENTRIES_REPORT_FILENAME), 'w') as f: PermissionError: [Errno 13] Permission denied: '/home/carlos/ssfiles/Bronchitis-row1722.bgz_input-report/invalid_entries.csv
bug

opened by carlostellophd 0

Fix command fails at Step 7 when encountered a str value 'ID' in a BP column

opened by Kukuster 0

GWAS summary statistics files QC tool

Related tags

Overview

SSrehab

dependencies:

Installation and basics

Tutorial

1. Download dbSNP dataset

2. Download the chain file

3. Preprocess dbSNPs datasets

3.1 Download and install bcftools and gz-sort

3.2 Run preprocessing

4. Create a config file for your GWAS SS file

5. Run the fix command

Manual

NOTES

"standard" format

BACKLOG

You might also like...

Python library to natively send files to Trash (or Recycle bin) on all platforms.

Viewer for NFO files

python3 scrip for case conversion of source code files writen in fixed form fortran

List of short Codeforces problems with a statement of 1000 characters or less. Python script and data files included.

Plugin to generate BOM + CPL files for JLCPCB

The only purpose of a byte-sized application is to help you create .desktop entry files for downloaded applications.

A simple API to upload notes or files to KBFS

Simple tools to make/dump CPC+ CPR cartridge files

Add-In for Blender to automatically save files when rendering

Comments

[minor] if an issue with p-value filed was identified, then issues in other fields on the same row are not counted in the report

Fix command fails at step 2 for a file missing rsID and pval

Fix command fails at Step 7 when encountered a str value 'ID' in a BP column

Owner

A tool to flash .ofp files in bootloader mode without needing MSM Tool, an alternative to official realme tool

Sample python script for monitoring Rocketchat database and get statistics of users.

Library for Memory Trace Statistics in Python

Osu statistics right on your desktop, made with pyqt

Statistics Calculator module for all types of Stats calculations.

gwcheck is a tool to check .gnu.warning.* sections in ELF object files and display their content.

Tool for working with Direct System Calls in Cobalt Strike's Beacon Object Files (BOF) via Syswhispers2

Small tool to use hero .json files created with Optolith for The Dark Eye/ Das Schwarze Auge 5 to perform talent probes.

bib2xml - A tool for getting Word formatted XML from Bibtex files

This bot uploads telegram files to MixDrop.co,File.io.

5. Run the `fix` command