Sequence lineage information extracted from RKI sequence data repo

Overview

Pango lineage information for German SARS-CoV-2 sequences

This repository contains a join of the metadata and Pango lineage tables for all German SARS-CoV-2 sequences published by the Robert Koch Institute (RKI) on GitHub.

The data here is updated automatically every hour through a GitHub Action, so whenever new data appears in the RKI repo, it shows up here within at most an hour.

The resulting dataset can be downloaded here (beware: it is currently around 50 MB): https://raw.githubusercontent.com/corneliusroemer/desh-data/main/data/meta_lineages.csv
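
If you work with the file repeatedly, a small caching step avoids re-downloading it on every run. This is only a hedged convenience sketch; the local file name is an arbitrary choice, not something this repo prescribes:

import pathlib
import urllib.request

URL = "https://raw.githubusercontent.com/corneliusroemer/desh-data/main/data/meta_lineages.csv"
LOCAL = pathlib.Path("meta_lineages.csv")

# Download once, then reuse the local copy. The upstream file is refreshed hourly,
# so delete the local copy whenever you want fresh data.
if not LOCAL.exists():
    urllib.request.urlretrieve(URL, LOCAL)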

Omicron share plot

[Figure: Omicron logit plot]

Description of data

Column description:

  • IMS_ID: Unique identifier of the sequence
  • DATE_DRAW: Date the sample was taken from the patient
  • SEQ_REASON: Reason for sequencing (a small helper collapsing these codes is sketched after this list), one of:
    • X: Unknown
    • N: Random sampling
    • Y: Targeted sequencing (exact reason unknown)
    • A[<reason>]: Targeted sequencing because variant PCR indicated VOC
  • PROCESSING_DATE: Date the sample was processed by the RKI and added to the GitHub repo
  • SENDING_LAB_PC: Postcode (PLZ) of the lab that performed the initial PCR
  • SEQUENCING_LAB_PC: Postcode (PLZ) of the lab that performed the sequencing
  • lineage: Pango lineage as reported by pangolin
  • scorpio_call: Coarse variant call produced by scorpio (part of pangolin); less precise but somewhat more robust than the pangolin lineage call
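
As a side note, the SEQ_REASON codes above can be collapsed into a coarse random-vs-targeted flag. The helper below is only an illustrative sketch of that mapping; it is not part of the dataset or of this repo's scripts:

def sampling_category(reason: str) -> str:
    # Collapse a SEQ_REASON code into 'random', 'targeted' or 'unknown'.
    if reason == 'N':                            # random sampling
        return 'random'
    if reason == 'Y' or reason.startswith('A'):  # targeted, incl. A[<reason>]
        return 'targeted'
    return 'unknown'                             # 'X' or anything unexpected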

Excerpt

Here are the first twelve rows of the dataset.

IMS_ID,DATE_DRAW,SEQ_REASON,PROCESSING_DATE,SENDING_LAB_PC,SEQUENCING_LAB_PC,lineage,scorpio_call
IMS-10294-CVDP-00001,2021-01-14,X,2021-01-25,40225,40225,B.1.1.297,
IMS-10025-CVDP-00001,2021-01-17,N,2021-01-26,10409,10409,B.1.389,
IMS-10025-CVDP-00002,2021-01-17,N,2021-01-26,10409,10409,B.1.258,
IMS-10025-CVDP-00003,2021-01-17,N,2021-01-26,10409,10409,B.1.177.86,
IMS-10025-CVDP-00004,2021-01-17,N,2021-01-26,10409,10409,B.1.389,
IMS-10025-CVDP-00005,2021-01-18,N,2021-01-26,10409,10409,B.1.160,
IMS-10025-CVDP-00006,2021-01-17,N,2021-01-26,10409,10409,B.1.1.297,
IMS-10025-CVDP-00007,2021-01-18,N,2021-01-26,10409,10409,B.1.177.81,
IMS-10025-CVDP-00008,2021-01-18,N,2021-01-26,10409,10409,B.1.177,
IMS-10025-CVDP-00009,2021-01-18,N,2021-01-26,10409,10409,B.1.1.7,Alpha (B.1.1.7-like)
IMS-10025-CVDP-00010,2021-01-17,N,2021-01-26,10409,10409,B.1.1.7,Alpha (B.1.1.7-like)
IMS-10025-CVDP-00011,2021-01-17,N,2021-01-26,10409,10409,B.1.389,

Suggested import into pandas

You can import the data into pandas as follows:

#%%
import pandas as pd

#%%
df = pd.read_csv(
    'https://raw.githubusercontent.com/corneliusroemer/desh-data/main/data/meta_lineages.csv',
    index_col=0,                 # use IMS_ID as the index
    parse_dates=[1, 3],          # DATE_DRAW and PROCESSING_DATE
    infer_datetime_format=True,  # speeds up date parsing; deprecated in pandas 2.x
    cache_dates=True,
    dtype={
        'SEQ_REASON': 'category',
        'SENDING_LAB_PC': 'category',
        'SEQUENCING_LAB_PC': 'category',
        'lineage': 'category',
        'scorpio_call': 'category',
    },
)

#%%
# Shorter, lower-case column names for convenience.
df.rename(
    columns={
        'DATE_DRAW': 'date',
        'PROCESSING_DATE': 'processing_date',
        'SEQ_REASON': 'reason',
        'SENDING_LAB_PC': 'sending_pc',
        'SEQUENCING_LAB_PC': 'sequencing_pc',
        'scorpio_call': 'scorpio',
    },
    inplace=True,
)
df
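
As a hedged follow-up example, here is roughly how a daily Omicron share could be computed from the renamed df above. The filter is an assumption for illustration; the actual logic in scripts/omicron_plot.py may differ:

#%%
# Restrict to randomly sampled sequences and compute the daily Omicron share.
random_only = df[df['reason'] == 'N'].copy()
lin = random_only['lineage'].astype(str)
random_only['is_omicron'] = lin.str.startswith('BA.') | (lin == 'B.1.1.529')
daily_share = random_only.groupby('date')['is_omicron'].mean()
daily_share.tail()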

License

The underlying files that I use as input are licensed by the RKI under CC-BY 4.0; see https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland#lizenz for details.

The software here is licensed under the "Unlicense". You can do with it whatever you want.

For the data, just cite the original source; there is no need to cite this repo, since it only performs a trivial join.

Comments
  • feat: add grid lines and minor ticks

    Although I know that this data is only a rough approximation of the actual infections, I found myself looking really closely at the graphs to figure out which day a data point belongs to, or where it sits along the Y-axis.

    This PR enables a major and minor grid, and minor ticks along the X-axis.

    Here's an example of the results (screenshots: omicron_N_linear and omicron_N_logit).
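
    Below is a minimal sketch of the kind of settings this PR describes, assuming a Matplotlib date axis; this is not the PR's actual diff:

    import matplotlib.dates as mdates
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.xaxis.set_major_locator(mdates.WeekdayLocator())   # one major tick per week
    ax.xaxis.set_minor_locator(mdates.DayLocator())       # minor ticks for every day
    ax.grid(which='major', linewidth=0.8)                 # major grid lines
    ax.grid(which='minor', linewidth=0.3, alpha=0.5)      # fainter minor grid lines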

    opened by lenaschimmel 3
  • Write better tick label formatter for logit scale that produces 50% and 99.99% instead of 50.00 or 100%

    The current tick label formatter is a bad hack. We need something more robust that produces the following behaviour: 1%, 10%, 50%, 99.9%, 99.999%, etc.

    This is the current hack (from Stack Overflow, I think): https://github.com/corneliusroemer/desh-data/blob/696469da1e402fc1d30a1740eed26ee4a8e18b80/scripts/omicron_plot.py#L84

    The challenge is to display decimals only for trailing 9s but not for trailing zeros.

    This might do the job, but not 100% sure: https://numpy.org/doc/stable/reference/generated/numpy.format_float_positional.html

    @lenaschimmel interested?

    Edit: np.format_float_positional does the job:

    np.format_float_positional(1.000, trim='-') # 1
    np.format_float_positional(99.99, trim='-') # 99.99
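
    A minimal sketch of how that could be wired into a Matplotlib tick formatter, assuming the logit axis carries shares on a 0-1 scale; this is not the formatter currently in scripts/omicron_plot.py:

    import numpy as np
    from matplotlib.ticker import FuncFormatter

    def percent_label(y, _pos):
        # precision caps the digits so float noise gets rounded away;
        # trim='-' drops trailing zeros and a trailing decimal point,
        # giving '50%' and '99.99%' rather than '50.00%' or '100%'.
        return np.format_float_positional(100 * y, precision=6, trim='-') + '%'

    # ax.yaxis.set_major_formatter(FuncFormatter(percent_label))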
    
    Labels: enhancement, help wanted, good first issue
    opened by corneliusroemer 2
  • What about non-BA.1-variants of Omicron?

    Currently only cases with lineage == 'BA.1' are counted as Omicron (see source).

    There are some cases with lineage BA.2, BA.3, or just B.1.1.529. Shouldn't they be counted as well? Otherwise, I think the wording on the graph should be updated from "Omikron" to "BA.1". (A broader filter along these lines is sketched after the table below.)

    To date, these are all 32 of them (sorted by DATE_DRAW), making up 1.37% of total Omicron cases (including these):

    IMS_ID                                               DATE_DRAW   SEQ_REASON    PROCESSING_DATE  SENDING_LAB_PC  SEQUENCING_LAB_PC  lineage    scorpio_call
    IMS-10183-CVDP-81E05ED2-68B2-45C9-AE92-FE0747BD7C1A  2021-11-30  Y             2021-12-10       22081           22081              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10261-CVDP-0EC19B38-8711-4617-8D20-B19F3C75E2F8  2021-12-01  A[B.1.1.529]  2021-12-13       32105           32105              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10004-CVDP-E64B5426-4FB5-4D41-AFEC-77D84720E886  2021-12-02  A[B.1.1.529]  2021-12-20       21502           21502              BA.3       Omicron (BA.3-like)
    IMS-10338-CVDP-DEB4E3F4-4E65-4E95-9E9B-77EB04A50226  2021-12-03  X             2021-12-17       64283                              B.1.1.529  Omicron (B.1.1.529-like)
    IMS-10641-CVDP-677D2DB5-8A78-4238-BF38-CC4BC8247275  2021-12-03  N             2021-12-27       06120           06120              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10013-CVDP-2857098B-37D6-49EA-B92A-748F97328D42  2021-12-06  N             2021-12-18       01665           04779              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10004-CVDP-17A54357-705F-43BD-81F4-1A87C79F9FA4  2021-12-06  N             2021-12-20       21502           21502              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10209-CVDP-93C23280-BFE2-4DD7-A9DE-460B5420EE08  2021-12-06  X             2021-12-28       78467           78467              BA.2       Omicron (BA.2-like)
    IMS-10036-CVDP-B81B32E6-AD2D-4E05-9109-7B35544A6407  2021-12-07  A             2021-12-21       12247           16321              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10001-CVDP-FD7B08A6-39E9-462A-BB81-34D2D72DE174  2021-12-07  A[Y]          2021-12-25       87435           87435              B.1.1.529  Omicron (B.1.1.529-like)
    IMS-10183-CVDP-DB2FDBCC-5F6A-445D-9F75-20D87840C180  2021-12-09  N             2021-12-17       22081           22081              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10183-CVDP-75514806-B96C-4825-B5FD-EF389CC8D1EA  2021-12-10  Y             2021-12-17       22081           22081              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10183-CVDP-DCCF53C4-C30E-4D1C-A2B7-ECD99B7551EE  2021-12-10  N             2021-12-17       22081           22081              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10261-CVDP-BB754EC4-4185-4B28-A872-DA062436D447  2021-12-13  A[B.1.1.529]  2021-12-22       32105           32105              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10004-CVDP-40148A1B-A7BC-4302-B4EB-9993F89C48F8  2021-12-13  A[B.1.1.529]  2021-12-28       21502           21502              BA.2       Omicron (BA.2-like)
    IMS-10001-CVDP-0EA49D87-CBD9-48B0-8536-7F5AFFAC321F  2021-12-14  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10001-CVDP-36C59D9E-72E8-4B2C-A635-0D69C4B9C9FB  2021-12-14  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10150-CVDP-1D7B1F19-0AA1-486C-BFE2-2DE49596B981  2021-12-16  X             2021-12-22       51375           92637              BA.2       Omicron (BA.2-like)
    IMS-10183-CVDP-FF1E061C-F0E6-41BE-9DA0-35154066D3C0  2021-12-17  N             2021-12-24       22081           22081              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10261-CVDP-DFACA834-5290-4855-BC54-AC7AB9B0B49B  2021-12-17  A[B.1.1.529]  2021-12-27       32105           32105              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10001-CVDP-F576FBA5-8F15-4E9E-8E70-F3287A33FDDB  2021-12-19  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10001-CVDP-909F8C1F-9DF7-47B0-AA3C-C981406B56C0  2021-12-19  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10001-CVDP-4BBE02BC-9479-4E6F-8B28-9F575E60A615  2021-12-19  A[Y]          2021-12-25       87435           87435              B.1.1.529  Omicron (B.1.1.529-like)
    IMS-10001-CVDP-295624E6-E260-4456-9B36-E67512ACEA20  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10001-CVDP-047F5A00-3CE6-4038-8308-6F85FA8E40E5  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10001-CVDP-FEEEA8B2-0F57-40BE-A50D-4D7A6B0031E6  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10001-CVDP-AB0193AA-F6E3-4569-8C70-4E507F1037D0  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10001-CVDP-76508575-1AC0-4E0F-94C6-8FCDE164BE02  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Omicron (B.1.1.529-like)
    IMS-10001-CVDP-1E673F95-62A2-4576-A94F-8A46797FEF14  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10337-CVDP-6549EC0D-4E8F-427A-96ED-6E2F47E00941  2021-12-20  X             2021-12-28       23538           23538              BA.2       Omicron (BA.2-like)
    IMS-10001-CVDP-3F43636E-F55C-4C1C-BD9A-EF792ED6E550  2021-12-21  A[B.1.617.2]  2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
    IMS-10004-CVDP-E9819B99-144D-4AE6-A47C-46042F231AEF  2021-12-22  N             2021-12-28       21502           21502              B.1.1.529  Probable Omicron (B.1.1.529-like)
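
    A hedged sketch of the broader filter suggested here, assuming the df from the pandas snippet above (treat every BA.* sublineage plus the parent B.1.1.529 as Omicron, rather than lineage == 'BA.1' only):

    lin = df['lineage'].astype(str)
    omicron_mask = lin.str.startswith('BA.') | (lin == 'B.1.1.529')
    df.loc[omicron_mask, 'lineage'].value_counts()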
    
    Labels: enhancement
    opened by lenaschimmel 5