SARS-CoV-2 Recombinant Finder for fasta sequences

Lena Schimmel

Last update: Oct 3, 2022

Overview

Sc2rf - SARS-CoV-2 Recombinant Finder

Pronounced: Scarf

What's this?

Sc2rf can search genome sequences of SARS-CoV-2 for potential recombinants - new virus lineages that have (partial) genes from more than one parent lineage.

Is it already usable?

This is a very young project, started on March 5th, 2022. As such, proceed with care. Results may be wrong or misleading, and with every update, anything can still change a lot.

Anyway, I'm happy that scientists are already seeing benefits from Sc2rf and using it to prepare lineage proposals for cov-lineages/pango-designation.

Though I already have a lot of ideas and plans for Sc2rf (see the bottom of this document), I'm very open to suggestions and feature requests. Please open an issue, start a discussion, or get in touch via mail or Twitter!

Example output

Requirements and Installation

You need at least Python 3.6, and you need to install the requirements first, for example with python3 -m pip install -r requirements.txt. There's also a setup.py, which you should probably ignore, since it is a work in progress and does not work as intended yet.

Also, you need a terminal which supports ANSI control sequences to display colored text. On Linux, macOS, etc. this should usually work.

On Windows, color support is tricky. On a recent version of Windows 10, it should work, but if it doesn't, install Windows Terminal from GitHub or Microsoft Store and run it from there.

Basic Usage

Start with a .fasta file with one or more sequences which might contain recombinants. Your sequences have to be aligned to the reference.fasta. If they are not, you will get an error message like:

Sequence hCoV-19/Phantasialand/EFWEFWD not properly aligned, length is 29718 instead of 29903.

(For historical reasons, I always used Nextclade to get aligned sequences, but you might also use Nextalign or any other tool. Installing them is easy on Linux or macOS, but not on Windows. You can also use a web-based tool like MAFFT.)
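Before running Sc2rf on a large file, it can help to check the alignment yourself. The following is a minimal sketch, not part of Sc2rf: it assumes that "aligned" simply means every sequence has the reference length of 29903 quoted in the error message above, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Assumption: length of reference.fasta, taken from the error message above.
REFERENCE_LENGTH = 29903


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence name: total sequence length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start counting a new sequence.
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            lengths[name] += len(line)
    return lengths


def find_misaligned(
    lines: Iterable[str], expected: int = REFERENCE_LENGTH
) -> List[Tuple[str, int]]:
    """Return (name, length) for every sequence whose length differs from expected."""
    return [
        (name, length)
        for name, length in read_fasta_lengths(lines).items()
        if length != expected
    ]


# Tiny in-memory demo; pass an open file handle instead to check a real file.
demo = [">ok", "A" * 29903, ">short", "ACGT", "ACGT"]
for name, length in find_misaligned(demo):
    print(f"{name}: length is {length} instead of {REFERENCE_LENGTH}")
```

A length check like this only catches sequences that were not aligned at all; it cannot tell whether an alignment of the right length is a good one.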

Then call:

sc2rf.py <your_filename.fasta>

If you just need some fasta files for testing, you can search the pango-lineage proposals for recombinant issues with fasta-files, or take some files from my shared-sequences repository, which might not contain any actual recombinants, but hundreds of sequences that look like they were!

No output / some sequences not shown

By default, a lot of filters are active so that only the likely recombinants are shown. This lets you input tens of thousands of sequences and get output only for the interesting ones. If you want, you can disable all filters like this, which is only recommended for small input files with fewer than 100 sequences:

sc2rf.py --parents 1-35 --breakpoints 0-100 \
--unique 1 --max-ambiguous 10000 <your_filename.fasta>

or even

sc2rf.py --parents 1-35 --breakpoints 0-100 \
--unique 1 --max-ambiguous 10000 --force-all-parents \
--clades all <your_filename.fasta>

The meaning of these parameters is described below.

Advanced Usage

You can execute sc2rf.py -h to get exactly this help message:

usage: sc2rf.py [-h] [--primers [PRIMER ...]]
                [--primer-intervals [INTERVAL ...]]
                [--parents INTERVAL] [--breakpoints INTERVAL]
                [--clades [CLADES ...]] [--unique NUM]
                [--max-intermission-length NUM]
                [--max-intermission-count NUM]
                [--max-name-length NUM] [--max-ambiguous NUM]
                [--force-all-parents]
                [--select-sequences INTERVAL]
                [--enable-deletions] [--show-private-mutations]
                [--rebuild-examples] [--mutation-threshold NUM]
                [--add-spaces [NUM]] [--sort-by-id [NUM]]
                [--verbose] [--ansi] [--hide-progress]
                [--csvfile CSVFILE]
                [input ...]

Analyse SARS-CoV-2 sequences for potential, unknown recombinant
variants.

positional arguments:
  input                 input sequence(s) to test, as aligned
                        .fasta file(s) (default: None)

optional arguments:
  -h, --help            show this help message and exit

  --primers [PRIMER ...]
                        Filenames of primer set(s) to visualize.
                        The .bed formats for ARTIC and EasySeq
                        are recognized and supported. (default:
                        None)

  --primer-intervals [INTERVAL ...]
                        Coordinate intervals in which to
                        visualize primers. (default: None)

  --parents INTERVAL, -p INTERVAL
                        Allowed number of potential parents of a
                        recombinant. (default: 2-4)

  --breakpoints INTERVAL, -b INTERVAL
                        Allowed number of breakpoints in a
                        recombinant. (default: 1-4)

  --clades [CLADES ...], -c [CLADES ...]
                        List of variants which are considered as
                        potential parents. Use Nextstrain clades
                        (like "21B"), or Pango Lineages (like
                        "B.1.617.1") or both. Also accepts "all".
                        (default: ['20I', '20H', '20J', '21I',
                        '21J', 'BA.1', 'BA.2', 'BA.3'])

  --unique NUM, -u NUM  Minimum of substitutions in a sample
                        which are unique to a potential parent
                        clade, so that the clade will be
                        considered. (default: 2)

  --max-intermission-length NUM, -l NUM
                        The maximum length of an intermission in
                        consecutive substitutions. Intermissions
                        are stretches to be ignored when counting
                        breakpoints. (default: 2)

  --max-intermission-count NUM, -i NUM
                        The maximum number of intermissions which
                        will be ignored. Surplus intermissions
                        count towards the number of breakpoints.
                        (default: 8)

  --max-name-length NUM, -n NUM
                        Only show up to NUM characters of sample
                        names. (default: 30)

  --max-ambiguous NUM, -a NUM
                        Maximum number of ambiguous nucs in a
                        sample before it gets ignored. (default:
                        50)

  --force-all-parents, -f
                        Force to consider all clades as potential
                        parents for all sequences. Only useful
                        for debugging.

  --select-sequences INTERVAL, -s INTERVAL
                        Use only a specific range of input
                        sequences. DOES NOT YET WORK WITH
                        MULTIPLE INPUT FILES. (default: 0-999999)

  --enable-deletions, -d
                        Include deletions in lineage comparision.

  --show-private-mutations
                        Display mutations which are not in any of
                        the potential parental clades.

  --rebuild-examples, -r
                        Rebuild the mutations in examples by
                        querying cov-spectrum.org.

  --mutation-threshold NUM, -t NUM
                        Consider mutations with a prevalence of
                        at least NUM as mandatory for a clade
                        (range 0.05 - 1.0, default: 0.75).

  --add-spaces [NUM]    Add spaces between every N colums, which
                        makes it easier to keep your eye at a
                        fixed place. (default without flag: 0,
                        default with flag: 5)

  --sort-by-id [NUM]    Sort the input sequences by the ID. If
                        you provide NUM, only the first NUM
                        characters are considered. Useful if this
                        correlates with meaning full meta
                        information, e.g. the sequencing lab.
                        (default without flag: 0, default with
                        flag: 999)

  --verbose, -v         Print some more information, mostly
                        useful for debugging.

  --ansi                Use only ASCII characters to be
                        compatible with ansilove.

  --hide-progress       Don't show progress bars during long
                        task.

  --csvfile CSVFILE     Path to write results in CSV format.
                        (default: None)

An Interval can be a single number ("3"), a closed interval
("2-5" ) or an open one ("4-" or "-7"). The limits are inclusive.
Only positive numbers are supported.
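
The interval syntax described at the end of the help text can be illustrated with a short Python sketch. This is not the parser used by sc2rf; the names parse_interval and in_interval are hypothetical, and an open end is represented here as None.

```python
from typing import Optional, Tuple

Interval = Tuple[Optional[int], Optional[int]]


def parse_interval(text: str) -> Interval:
    """Parse "3", "2-5", "4-" or "-7" into (low, high); None means unbounded."""
    text = text.strip()
    if "-" not in text:
        # A single number is an interval containing only that number.
        value = int(text)
        return value, value
    low_text, high_text = text.split("-", 1)
    low = int(low_text) if low_text else None
    high = int(high_text) if high_text else None
    if low is not None and high is not None and low > high:
        raise ValueError(f"lower limit exceeds upper limit in {text!r}")
    return low, high


def in_interval(value: int, interval: Interval) -> bool:
    """Check membership; both limits are inclusive, as stated in the help text."""
    low, high = interval
    return (low is None or value >= low) and (high is None or value <= high)


print(parse_interval("2-5"))                   # closed interval
print(parse_interval("4-"))                    # open upper end
print(in_interval(5, parse_interval("2-5")))   # upper limit is inclusive
```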

Interpreting the output

To be written...

There already is a short Twitter thread which explains the basics.

Source material attribution

virus_properties.json contains data from LAPIS / cov-spectrum which uses data from NCBI GenBank, prepared and hosted by Nextstrain, see blog post.
reference.fasta is taken from Nextstrain's nextclade_data, see NCBI for attribution.
mapping.csv is a modified version of the table on the covariants homepage by Nextstrain.
Example output / screenshot based on Sequences published by the German Robert-Koch-Institut.
Primers:
- ARTIC primers CC-BY-4.0 by the ARTICnetwork project
- ~~EasySeq primers by Coolen, J. P., Wolters, F., Tostmann, A., van Groningen, L. F., Bleeker-Rovers, C. P., Tan, E. C., ... & Melchers, W. J.~~ Removed until I understand the format of the .bed file. There will be an issue soon.
- midnight primers CC-BY-4.0 by Silander, Olin K, Massey University

The initial version of this program was written in cooperation with @flauschzelle.

TODO / IDEAS / PLANS

Comments

ENH: provide output optionally as csv/tsv for automated analysis/sharing

Right now the output is good for interactive human analysis, but there's a lack of csv/tsv machine readable output for sharing/further analysis.

From my experience with Nextclade, the main difficulty here is designing the file format: which columns to include, which separator to use when a column itself needs to hold several values, and so on.

Maybe best to discuss on this issue before implementing something as one will kind of get locked in to the format.
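
To make the separator question concrete, here is a minimal sketch using Python's csv module. The column names (sample, parents, breakpoints) and the choice of ";" as the intra-column separator are assumptions for illustration only, not a format that sc2rf defines.

```python
import csv
import io


def write_results(rows, handle):
    """Write one line per sample; multi-valued columns are joined with ';'."""
    writer = csv.writer(handle)  # comma-separated; fields are quoted when needed
    writer.writerow(["sample", "parents", "breakpoints"])
    for sample, parents, breakpoints in rows:
        writer.writerow([
            sample,
            ";".join(parents),
            ";".join(f"{start}-{end}" for start, end in breakpoints),
        ])


# Hypothetical example data: one sample, two parents, one breakpoint region.
buffer = io.StringIO()
write_results([("sample1", ["BA.1", "BA.2"], [(100, 200)])], buffer)
print(buffer.getvalue())
```

Using a different character inside a column than between columns keeps the file readable by generic CSV tools, at the cost of a second parsing step for the multi-valued columns.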

opened by corneliusroemer 28

Find and use better source for typical mutations of lineages

See this comment by @AngieHinrichs which even contains an alternative.

Thanks a lot for your detailed explanation! I'm trying to move this over here so it's easier to find for me.

(Also, if the comment thread over at pango-designation gets locked down after too many "off topic" comments, I won't be able to comment there at all. This has already happened in other issues.)

opened by lenaschimmel 14

BUG: Problem using covSpectrum mutation share - Ns are treated as reference

There's a bit of a problem with using covSpectrum's current mutation API implementation: Ns in any sample are treated as reference.

This can cause confusion. For example, I thought that this intermission here within Spike was a bad sign: https://github.com/cov-lineages/pango-designation/issues/498

But it isn't! Both 22813 and 22882 are defining for both BA.1 and BA.2. However, both are apparently N in 40% of BA.1 sequences. This causes sc2rf to think that they are not defining mutations of BA.1, which makes spurious intermissions appear.

I'm not sure how to work around this best. Really, this should be fixed in covSpectrum: Ns should be left out of mutation proportion calculations - and not be treated as reference (implicitly).

@chaoran-chen can you think of a workaround? How can one get the share of Ns for a query? Could that maybe be supplied by a new API endpoint?

Usually, Ns don't make up 40% of a site, but sometimes they do and that can cause problems like here, where one falsely thinks there's a non-clean breakpoint.
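
The effect can be shown with a small worked example. This sketch uses made-up counts (60 mutated calls and 40 Ns at one site) and the default --mutation-threshold of 0.75 from the help text; the function name is hypothetical and this is not how covSpectrum or sc2rf compute the share internally.

```python
from collections import Counter


def mutation_share(bases, mutated_base, ignore_n):
    """Share of sequences carrying mutated_base at one site.

    With ignore_n=False, Ns stay in the denominator, which has the same
    effect as counting them as reference.
    """
    counts = Counter(bases)
    total = len(bases) - (counts["N"] if ignore_n else 0)
    if total == 0:
        return 0.0
    return counts[mutated_base] / total


THRESHOLD = 0.75  # default of --mutation-threshold

# Made-up site: every sequence that could be called carries the mutation.
site = ["A"] * 60 + ["N"] * 40

with_n = mutation_share(site, "A", ignore_n=False)
without_n = mutation_share(site, "A", ignore_n=True)
print(with_n >= THRESHOLD)     # Ns counted in the denominator
print(without_n >= THRESHOLD)  # Ns left out of the denominator
```

With Ns in the denominator the share is 60/100 and falls below the threshold; with Ns left out it is 60/60 and the mutation counts as defining.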

opened by corneliusroemer 10

Way to pipe results to png, txt files

This is a fantastic tool, and I've already put it to good use in Arkansas to research some strange lineages. Great work!

I do have to share the visuals, and I am wondering if there is a way to pipe the results to an outside file, such as png or txt. I am more of an applied researcher, so if I missed something, I would appreciate any directions.
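
For plain-text output, one possible approach is to redirect the output to a file and strip the color codes afterwards. The sketch below is an assumption about what would be needed, not a feature of sc2rf: it removes only ANSI SGR (color and style) sequences, and strip_ansi is a hypothetical helper name.

```python
import re

# Matches ANSI SGR sequences such as "\x1b[31m" (red) or "\x1b[0m" (reset).
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text: str) -> str:
    """Remove color/style escape sequences, leaving the visible characters."""
    return ANSI_SGR.sub("", text)


colored = "\x1b[31mA\x1b[0m\x1b[32mC\x1b[0m G"
print(strip_ansi(colored))
```

For images, the help text mentions the --ansi flag, which restricts the output to ASCII characters so that it is compatible with ansilove.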

Again, great tool already!

Thanks,

opened by bdelavan 7

Q: Why show all donors not just the relevant ones?

I'm analyzing one sequence and am wondering why you output all potential donors/parents, not just the two that seem most relevant here: BA.1/21J?

Are my arguments wrong? When I reduce parents to 0-5, I get no output, which is weird. I don't quite understand what's going on here.

opened by corneliusroemer 6

Crash related to tqdm

Originally posted by @Vjimenez-vasquez in https://github.com/lenaschimmel/sc2rf/issues/25#issuecomment-1089053922:

Hi there,

I ran the following command :

python3 sc2rf.py test2.fasta --unique 1

And got the following message :

Traceback (most recent call last):
  File "sc2rf.py", line 987, in <module>
    main()
  File "sc2rf.py", line 132, in main
    reference = read_fasta('reference.fasta', None)['MN908947 (Wuhan-Hu-1/2019)']
  File "sc2rf.py", line 476, in read_fasta
    with my_tqdm(total=os.stat(path).st_size, desc="Read " + path, unit_scale=True) as pbar:
  File "sc2rf.py", line 199, in my_tqdm
    return tqdm(*margs, delay=0.1, colour="green", disable=bool(args.hide_progress), **kwargs)
  File "/home/hp/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 922, in __init__
    TqdmKeyError("Unknown argument(s): " + str(kwargs)))
tqdm.std.TqdmKeyError: "Unknown argument(s): {'delay': 0.1, 'colour': 'green'}"

Do you have any suggestion, please ?
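
The traceback shows the installed tqdm rejecting the delay and colour arguments, which suggests an older tqdm release; upgrading tqdm may be the simplest remedy. Alternatively, a wrapper could retry without the optional arguments. The sketch below is a generic illustration, not the code in sc2rf: call_with_optional_kwargs is a hypothetical helper, legacy_bar is a stand-in for an old progress-bar constructor, and it relies on the assumption that the rejection surfaces as a KeyError or TypeError.

```python
def call_with_optional_kwargs(factory, *args, optional=None, **kwargs):
    """Call factory with optional keyword arguments, retrying without them
    if the callee rejects them.

    Caveat: a KeyError or TypeError raised for any other reason also
    triggers the retry, so this is only suitable for cheap constructors.
    """
    optional = optional or {}
    try:
        return factory(*args, **optional, **kwargs)
    except (KeyError, TypeError):
        return factory(*args, **kwargs)


def legacy_bar(total, **kwargs):
    """Stand-in for an older progress-bar constructor (hypothetical)."""
    if kwargs:
        raise KeyError("Unknown argument(s): " + str(kwargs))
    return {"total": total}


bar = call_with_optional_kwargs(
    legacy_bar, total=10, optional={"delay": 0.1, "colour": "green"}
)
print(bar)
```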

opened by lenaschimmel 4

Make tool pip-installable

This shouldn't be difficult: you need a setup.py and an account on PyPI.

You can have a look at this repo of mine, which can be installed via PyPI as a command line tool (if you install it, the command automatically becomes available on your PATH!): https://github.com/corneliusroemer/fasta_zstd_sqlite/blob/master/setup.py

opened by corneliusroemer 4

--csvfile option does not work

Hey!

First of all, this is a great tool for finding potential recombinants, and it made my life easy. I need to parse the output of sc2rf to get only the potential recombinant sequences and their breakpoints. I see the --csvfile option in the README, but it does not seem to be included in the sc2rf Python executable. I get this error:

sc2rf.py: error: unrecognized arguments: --csvfile output.csv

Any idea how I could get the output in the way I need?

opened by think-o 2

ENH: show progress bar, say how many files were read in, how processing is going

Would be nice to see how things are going

tqdm makes this very easy with python

a bit more logging while the analysis is going would be cool too, just so that I know what's going on, instead of seeing nothing for a minute

opened by corneliusroemer 2

Python version requirement 3.9

Thanks for the tool.

Just had a quick note that I think Python 3.9 is required due to the | operator in dict.

I was getting an error before trying it with 3.9.
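
For context: the | operator for merging two dicts was added to the language in Python 3.9, so on older interpreters it raises a TypeError. A merge that also works on earlier Python 3 versions can be written as in the sketch below. The helper name merge_dicts and the sample data are hypothetical and not taken from sc2rf.

```python
def merge_dicts(first, second):
    """Return a new dict with the entries of both inputs.

    Equivalent to `first | second` on Python 3.9+, but also valid on
    older Python 3 versions. On duplicate keys, `second` wins.
    """
    merged = dict(first)
    merged.update(second)
    return merged


all_samples = {"sample_a": "ACGT"}
read_samples = {"sample_b": "ACGA"}
all_samples = merge_dicts(all_samples, read_samples)
print(sorted(all_samples))
```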

opened by benkraj 2

TypeError: unsupported operand type(s) for |: 'dict' and 'dict'

Getting this error while trying to run the program:

Reading reference genome, lineage definitions... Done.
Reading actual input.
Traceback (most recent call last):
  File "search_recombinants.py", line 539, in <module>
    main()
  File "search_recombinants.py", line 96, in main
    all_samples = all_samples | read_samples
TypeError: unsupported operand type(s) for |: 'dict' and 'dict'

opened by arodzh-sudo 1

Bug/question with --force-all-parents --clades all

Hi there,

I just was wondering why I have no output and tried the second example from here: https://github.com/lenaschimmel/sc2rf#no-output--some-sequences-not-shown

So I added --clades all --force-all-parents to my call, but it seems that the two can't be used together:

The number of allowed parents, the number of selected clades, and the --force-all-parents conflict so that the results must be empty.

Also, --clades all can't be used as the last argument (before the input) because the input won't be recognized

Input sequences must be provided, except when rebuilding the examples. Use --help for more info. Program exits.

I'm not sure if this is only my setup/input problem.

Would you suggest to use -c all or -f? My full command is

python3 sc2rf.py --csvfile ../${name}_sc2rf.csv --parents 1-35 --breakpoints 1-2 \
--max-intermission-count 3 --max-intermission-length 1 \
--unique 1 --max-ambiguous 10000 --max-name-length 55 \
### --clades all --force-all-parents \ ###
../${fasta}

Best Marie

opened by MarieLataretu 3

Bridging the gap between sc2rf result and Pangolin X* lineages

First, thanks to the authors for bringing the useful tool for us.

We have been using sc2rf to scan for recombinant sequences and determine breakpoints, but I found that there is a gap between its results and the Pangolin X* lineage calls. I was wondering whether it is possible to bridge that gap as follows:

1. Take in the lineage designations of the Pangolin X* lineages, then scan and store the profile of each recombinant lineage.
2. For a new query sequence, if the breakpoint profile matches an existing Pangolin X* lineage, report a possible X* lineage call in addition to the suggested parent lineages and breakpoints.

This would work more or less the way Scorpio constellations do.

I expect this would be a more accurate way of assigning recombinant lineages than the current UShER calls, where the breakpoint positions may not match.

Thanks for considering the suggestion.

opened by bioinforME 0

GISAID XT recombinant not detected by sc2rf

Hi, I've noticed that sc2rf.py (version sc2rf-7427d2f94b69c965362034c2597b643c5dfaa1cf) could not find any recombination for the XT samples available on GISAID when run as:

python sc2rf.py nextclade.aligned_XT_Gisaid.fasta

Here are the available aligned sequences: nextclade.aligned_XT_Gisaid.txt

Nextclade: sc2rf:

Thanks for looking into this and other lineages that might be in the same situation.

opened by BenjaminDelisle 4

Option to ignore shared substitutions

I've been experimenting with a flag --ignore-shared that ignores positions that are shared (have the exact same nucleotide) across all parents/examples.

I like this option because it makes the breakpoints visually clearer, as there's a direct color change (red -> green) rather than having the intermediate shared positions (red -> white -> green)
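
The idea can be sketched as a column filter. This is an illustration of the concept only, not the code from the proposed --ignore-shared flag: it assumes each parent is given as a string of nucleotides at the same compared positions, and the function name and sample data are made up.

```python
def non_shared_columns(parent_rows):
    """Return the indices of columns where the parents do not all carry
    the same nucleotide.

    parent_rows maps a parent name to a string of nucleotides, one
    character per compared position (all strings of equal length).
    """
    rows = list(parent_rows.values())
    return [
        index
        for index, column in enumerate(zip(*rows))
        if len(set(column)) > 1
    ]


# Made-up nucleotides at four compared positions.
parents = {"BA.1": "ACGT", "BA.2": "ACTT", "21J": "ACGA"}
keep = non_shared_columns(parents)
print(keep)
for name, row in parents.items():
    print(name, "".join(row[i] for i in keep))
```

Columns 0 and 1 are identical in all three parents and are dropped; only the columns that can distinguish between parents remain.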

For testing, a nextclade fasta alignment of XM-like recombinants (public on genbank): XM.txt

Do you think this is scientifically sound for reporting? And if so, would you be interested in a PR if I tidy up the code?

Default Output:

python3 sc2rf.py XM.fasta --ansi --unique 1

Proposed Option:

python3 sc2rf.py XM.fasta --ansi --unique 1 --ignore-shared
opened by ktmeaton 0

Terminal Ns not recognized as missing

While investigating https://github.com/cov-lineages/pango-designation/issues/590, I noticed that samples with the BA.2 S2M deletion (29734:29759) were being incorrectly visualized as having reference bases in sc2rf:

Consensus View:

sc2rf View:

I think this could be for a couple of reasons:

When --enable-deletions is used, perhaps deletions should not be considered missing data?

missings_matches = ["N"]
if not args.enable_deletions:
    missings_matches.append("-")

I think there is missing logic when detecting a run of Ns, to catch the case where that run proceeds to the end of the genome?

if s in missings_matches:
    # we've been tracking a run of N's, this base marks the end
    if start_n == -1:
        start_n = i  # mark the start of possible run of N's
elif start_n >= 0:
    missings.append((start_n, i-1))  # Python-style (closed, open) interval
    start_n = -1

# Missing logic to catch missing data at the end of the genome?
if i == len(reference) and s in missings_matches:
    missings.append((start_n, i-1))

With these changes, the sc2rf output more closely matches the consensus sequence/my expectation:

I think this is a bug, but if it's the intended behaviour for deletions, please let me know. Thanks!
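
As a self-contained illustration of the point about terminal runs, the sketch below collects runs of missing characters and closes a run that is still open when the sequence ends. It is a simplified stand-in, not the sc2rf source or the patch described above: the function name is hypothetical, and runs are reported as inclusive (start, end) index pairs.

```python
def find_missing_runs(sequence, missing=("N", "-")):
    """Return inclusive (start, end) index pairs for each run of missing data."""
    runs = []
    start = None  # start index of the run currently being tracked, if any
    for i, base in enumerate(sequence):
        if base in missing:
            if start is None:
                start = i  # first missing character of a new run
        elif start is not None:
            runs.append((start, i - 1))  # previous character ended the run
            start = None
    if start is not None:
        # The run reached the end of the sequence and was never closed.
        runs.append((start, len(sequence) - 1))
    return runs


print(find_missing_runs("NNACGNNTNN"))
print(find_missing_runs("AC--", missing=("N",)))  # deletions not treated as missing
```

Passing a different missing tuple mirrors the question of whether "-" should count as missing data when deletions are enabled.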
opened by ktmeaton 1
