Sequence lineage information extracted from RKI sequence data repo

Cornelius Roemer

Last update: Oct 26, 2022


Overview

Pango lineage information for German SARS-CoV-2 sequences

This repository contains a join of the metadata and Pango lineage tables of all German SARS-CoV-2 sequences published by the Robert-Koch-Institut (RKI) on GitHub.
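The join can be pictured with a minimal pandas sketch. The two small frames below are hypothetical stand-ins for the RKI metadata and lineage tables (values borrowed from the excerpt in this README). The left join and the `validate` check are assumptions for illustration, not necessarily what the update script does.

```python
import pandas as pd

# Hypothetical miniature versions of the two RKI tables.
metadata = pd.DataFrame({
    "IMS_ID": ["IMS-10294-CVDP-00001", "IMS-10025-CVDP-00001", "IMS-10025-CVDP-00009"],
    "DATE_DRAW": ["2021-01-14", "2021-01-17", "2021-01-18"],
    "SEQ_REASON": ["X", "N", "N"],
})
lineages = pd.DataFrame({
    "IMS_ID": ["IMS-10294-CVDP-00001", "IMS-10025-CVDP-00001", "IMS-10025-CVDP-00009"],
    "lineage": ["B.1.1.297", "B.1.389", "B.1.1.7"],
})

# Join on the shared sequence identifier. A left join (an assumption here)
# keeps every metadata row even if no lineage has been assigned yet;
# validate= raises if IMS_ID is duplicated in either table.
meta_lineages = metadata.merge(lineages, on="IMS_ID", how="left", validate="one_to_one")
print(meta_lineages)
```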

The data here is updated every hour by a GitHub Action, so new data in the RKI repo shows up here within an hour at most.

The resulting dataset can be downloaded here (note that it is currently around 50 MB in size): https://raw.githubusercontent.com/corneliusroemer/desh-data/main/data/meta_lineages.csv

Omicron share plot

Description of data

Column description:

IMS_ID: Unique identifier of the sequence
DATE_DRAW: Date the sample was taken from the patient
SEQ_REASON: Reason for sequencing, one of:
- X: Unknown
- N: Random sampling
- Y: Targeted sequencing (exact reason unknown)
- A[<reason>]: Targeted sequencing because variant PCR indicated VOC
PROCESSING_DATE: Date the sample was processed by the RKI and added to the GitHub repo
SENDING_LAB_PC: Postcode (PLZ) of lab that did the initial PCR
SEQUENCING_LAB_PC: Postcode (PLZ) of lab that did the sequencing
lineage: Pango lineage as reported by pangolin
scorpio_call: Alternative, coarser variant call as determined by scorpio (part of pangolin). It is less precise but a bit more robust than the pangolin lineage.
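Because SEQ_REASON mixes a one-letter code with an optional bracketed detail, it can help to split the two before analysis. The following is a minimal sketch. The function name `parse_seq_reason` and the choice to raise on unexpected values are assumptions for illustration, not part of this repository.

```python
import re
from typing import Optional, Tuple

# One of the documented codes (X, N, Y, A), optionally followed by "[detail]".
# The pattern is deliberately permissive: it accepts a bracketed detail after any code.
_SEQ_REASON = re.compile(r"^(?P<code>[XNYA])(?:\[(?P<detail>[^\]]*)\])?$")


def parse_seq_reason(value: str) -> Tuple[str, Optional[str]]:
    """Split a SEQ_REASON value into (code, detail); detail is None if absent."""
    match = _SEQ_REASON.match(value.strip())
    if match is None:
        # Assumption: unexpected values are reported rather than silently mapped to "X".
        raise ValueError(f"Unrecognised SEQ_REASON: {value!r}")
    return match.group("code"), match.group("detail")


for raw in ("N", "A", "A[Y]", "A[B.1.1.529]"):
    print(raw, "->", parse_seq_reason(raw))
```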

Excerpt

Here are the first lines of the dataset (the header plus 12 rows).

IMS_ID,DATE_DRAW,SEQ_REASON,PROCESSING_DATE,SENDING_LAB_PC,SEQUENCING_LAB_PC,lineage,scorpio_call
IMS-10294-CVDP-00001,2021-01-14,X,2021-01-25,40225,40225,B.1.1.297,
IMS-10025-CVDP-00001,2021-01-17,N,2021-01-26,10409,10409,B.1.389,
IMS-10025-CVDP-00002,2021-01-17,N,2021-01-26,10409,10409,B.1.258,
IMS-10025-CVDP-00003,2021-01-17,N,2021-01-26,10409,10409,B.1.177.86,
IMS-10025-CVDP-00004,2021-01-17,N,2021-01-26,10409,10409,B.1.389,
IMS-10025-CVDP-00005,2021-01-18,N,2021-01-26,10409,10409,B.1.160,
IMS-10025-CVDP-00006,2021-01-17,N,2021-01-26,10409,10409,B.1.1.297,
IMS-10025-CVDP-00007,2021-01-18,N,2021-01-26,10409,10409,B.1.177.81,
IMS-10025-CVDP-00008,2021-01-18,N,2021-01-26,10409,10409,B.1.177,
IMS-10025-CVDP-00009,2021-01-18,N,2021-01-26,10409,10409,B.1.1.7,Alpha (B.1.1.7-like)
IMS-10025-CVDP-00010,2021-01-17,N,2021-01-26,10409,10409,B.1.1.7,Alpha (B.1.1.7-like)
IMS-10025-CVDP-00011,2021-01-17,N,2021-01-26,10409,10409,B.1.389,

Suggested import into pandas

You can import the data into pandas as follows:

#%%
import pandas as pd

#%%
# Load the joined table; IMS_ID (column 0) becomes the index.
df = pd.read_csv(
    'https://raw.githubusercontent.com/corneliusroemer/desh-data/main/data/meta_lineages.csv',
    index_col=0,
    parse_dates=[1, 3],  # DATE_DRAW and PROCESSING_DATE
    cache_dates=True,
    # Categorical columns are read as strings, so postcodes keep leading zeros.
    dtype={
        'SEQ_REASON': 'category',
        'SENDING_LAB_PC': 'category',
        'SEQUENCING_LAB_PC': 'category',
        'lineage': 'category',
        'scorpio_call': 'category',
    },
)

#%%
# Shorten the column names ('lineage' is kept as-is).
df.rename(columns={
    'DATE_DRAW': 'date',
    'PROCESSING_DATE': 'processing_date',
    'SEQ_REASON': 'reason',
    'SENDING_LAB_PC': 'sending_pc',
    'SEQUENCING_LAB_PC': 'sequencing_pc',
    'scorpio_call': 'scorpio',
    },
    inplace=True
)
df
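Once the columns are renamed, a typical next step is to compute a lineage's daily share. The sketch below uses a few hypothetical rows in the renamed layout instead of downloading the full file. Restricting to random samples (reason `N`) is an assumption about what gives the least biased estimate, not something this repository prescribes.

```python
import pandas as pd

# Hypothetical rows in the renamed column layout (not real data).
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-12-20", "2021-12-20", "2021-12-20", "2021-12-21", "2021-12-21"]
    ),
    "reason": ["N", "N", "A[Y]", "N", "N"],
    "lineage": ["AY.4", "BA.1", "BA.1", "BA.1", "BA.1"],
})

# Keep only randomly sampled sequences; targeted sequencing over-represents VOCs.
random_sample = df[df["reason"] == "N"]

# The mean of a boolean column is the fraction of rows that are True,
# i.e. the share of BA.1 among random samples on each day.
daily_share = random_sample["lineage"].eq("BA.1").groupby(random_sample["date"]).mean()
print(daily_share)
```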

License

The underlying files that I use as input are licensed by RKI under CC-BY 4.0, see more details here: https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland#lizenz.

The software here is licensed under the "Unlicense". You can do with it whatever you want.

For the data, just cite the original source, no need to cite this repo since it's just a trivial join.


Comments

feat: add grid lines and minor ticks

Although I know that this data is only a rough approximation of the actual infections, I found myself looking really close at the graphs to see which day a data point belongs to, or where along the Y-axis it might be.

This PR enables a major and minor grid, and minor ticks along the X-axis.

Here's an example of the results: (image not included)

opened by lenaschimmel 3

Write better tick label formatter for logit scale that produces 50% and 99.99% instead of 50.00 or 100%

The current tick label formatter is a bad hack. We need something more robust that produces the following behaviour: 1%, 10%, 50%, 99.9%, 99.999% etc.

This is the current hack (from SO I think): https://github.com/corneliusroemer/desh-data/blob/696469da1e402fc1d30a1740eed26ee4a8e18b80/scripts/omicron_plot.py#L84

The challenge is to display decimals only for trailing 9s but not for trailing zeros.

This might do the job, but not 100% sure: https://numpy.org/doc/stable/reference/generated/numpy.format_float_positional.html

@lenaschimmel interested?

Edit: np.format_float_positional does the job:

np.format_float_positional(1.000, trim='-')  # 1
np.format_float_positional(99.99, trim='-')  # 99.99

enhancement help wanted good first issue
opened by corneliusroemer 2
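The approach from the edit above could be wrapped into a small label function along these lines. This is a sketch, not the formatter used in `omicron_plot.py`. The name `percent_label` and the `max_decimals` rounding step are assumptions, added so that floating-point noise from the multiplication does not leak into the label.

```python
import numpy as np


def percent_label(fraction: float, max_decimals: int = 6) -> str:
    """Format a share in [0, 1] as a percentage without trailing zeros."""
    # Round first: fraction * 100 can pick up floating-point noise that would
    # otherwise be printed in full.
    percent = round(fraction * 100, max_decimals)
    # trim='-' drops trailing zeros and a dangling decimal point.
    return np.format_float_positional(percent, trim="-") + "%"


for value in (0.01, 0.1, 0.5, 0.999, 0.99999):
    print(percent_label(value))
```

A function of this shape could be passed to matplotlib through `matplotlib.ticker.FuncFormatter`, which calls it with the tick value and position.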

What about non-BA.1-variants of Omicron?

Currently only cases with lineage == 'BA.1' are counted as Omicron (see source).

There are some cases with lineage BA.2, BA.3 or just B.1.1.529. Shouldn't they be counted as well? Otherwise, I think the wording on the graph should be updated from "Omikron" to "BA.1".

To date, there are 32 such cases (listed below, sorted by DATE_DRAW). They make up 1.37% of all Omicron cases when they are included:

IMS_ID                                               DATE_DRAW   SEQ_REASON    PROCESSING_DATE  SENDING_LAB_PC  SEQUENCING_LAB_PC  lineage    scorpio_call
IMS-10183-CVDP-81E05ED2-68B2-45C9-AE92-FE0747BD7C1A  2021-11-30  Y             2021-12-10       22081           22081              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10261-CVDP-0EC19B38-8711-4617-8D20-B19F3C75E2F8  2021-12-01  A[B.1.1.529]  2021-12-13       32105           32105              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10004-CVDP-E64B5426-4FB5-4D41-AFEC-77D84720E886  2021-12-02  A[B.1.1.529]  2021-12-20       21502           21502              BA.3       Omicron (BA.3-like)
IMS-10338-CVDP-DEB4E3F4-4E65-4E95-9E9B-77EB04A50226  2021-12-03  X             2021-12-17       64283                              B.1.1.529  Omicron (B.1.1.529-like)
IMS-10641-CVDP-677D2DB5-8A78-4238-BF38-CC4BC8247275  2021-12-03  N             2021-12-27       06120           06120              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10013-CVDP-2857098B-37D6-49EA-B92A-748F97328D42  2021-12-06  N             2021-12-18       01665           04779              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10004-CVDP-17A54357-705F-43BD-81F4-1A87C79F9FA4  2021-12-06  N             2021-12-20       21502           21502              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10209-CVDP-93C23280-BFE2-4DD7-A9DE-460B5420EE08  2021-12-06  X             2021-12-28       78467           78467              BA.2       Omicron (BA.2-like)
IMS-10036-CVDP-B81B32E6-AD2D-4E05-9109-7B35544A6407  2021-12-07  A             2021-12-21       12247           16321              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10001-CVDP-FD7B08A6-39E9-462A-BB81-34D2D72DE174  2021-12-07  A[Y]          2021-12-25       87435           87435              B.1.1.529  Omicron (B.1.1.529-like)
IMS-10183-CVDP-DB2FDBCC-5F6A-445D-9F75-20D87840C180  2021-12-09  N             2021-12-17       22081           22081              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10183-CVDP-75514806-B96C-4825-B5FD-EF389CC8D1EA  2021-12-10  Y             2021-12-17       22081           22081              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10183-CVDP-DCCF53C4-C30E-4D1C-A2B7-ECD99B7551EE  2021-12-10  N             2021-12-17       22081           22081              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10261-CVDP-BB754EC4-4185-4B28-A872-DA062436D447  2021-12-13  A[B.1.1.529]  2021-12-22       32105           32105              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10004-CVDP-40148A1B-A7BC-4302-B4EB-9993F89C48F8  2021-12-13  A[B.1.1.529]  2021-12-28       21502           21502              BA.2       Omicron (BA.2-like)
IMS-10001-CVDP-0EA49D87-CBD9-48B0-8536-7F5AFFAC321F  2021-12-14  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10001-CVDP-36C59D9E-72E8-4B2C-A635-0D69C4B9C9FB  2021-12-14  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10150-CVDP-1D7B1F19-0AA1-486C-BFE2-2DE49596B981  2021-12-16  X             2021-12-22       51375           92637              BA.2       Omicron (BA.2-like)
IMS-10183-CVDP-FF1E061C-F0E6-41BE-9DA0-35154066D3C0  2021-12-17  N             2021-12-24       22081           22081              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10261-CVDP-DFACA834-5290-4855-BC54-AC7AB9B0B49B  2021-12-17  A[B.1.1.529]  2021-12-27       32105           32105              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10001-CVDP-F576FBA5-8F15-4E9E-8E70-F3287A33FDDB  2021-12-19  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10001-CVDP-909F8C1F-9DF7-47B0-AA3C-C981406B56C0  2021-12-19  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10001-CVDP-4BBE02BC-9479-4E6F-8B28-9F575E60A615  2021-12-19  A[Y]          2021-12-25       87435           87435              B.1.1.529  Omicron (B.1.1.529-like)
IMS-10001-CVDP-295624E6-E260-4456-9B36-E67512ACEA20  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10001-CVDP-047F5A00-3CE6-4038-8308-6F85FA8E40E5  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10001-CVDP-FEEEA8B2-0F57-40BE-A50D-4D7A6B0031E6  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10001-CVDP-AB0193AA-F6E3-4569-8C70-4E507F1037D0  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10001-CVDP-76508575-1AC0-4E0F-94C6-8FCDE164BE02  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Omicron (B.1.1.529-like)
IMS-10001-CVDP-1E673F95-62A2-4576-A94F-8A46797FEF14  2021-12-20  A[Y]          2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10337-CVDP-6549EC0D-4E8F-427A-96ED-6E2F47E00941  2021-12-20  X             2021-12-28       23538           23538              BA.2       Omicron (BA.2-like)
IMS-10001-CVDP-3F43636E-F55C-4C1C-BD9A-EF792ED6E550  2021-12-21  A[B.1.617.2]  2021-12-25       87435           87435              B.1.1.529  Probable Omicron (B.1.1.529-like)
IMS-10004-CVDP-E9819B99-144D-4AE6-A47C-46042F231AEF  2021-12-22  N             2021-12-28       21502           21502              B.1.1.529  Probable Omicron (B.1.1.529-like)
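A broader Omicron filter could look like the sketch below, shown on a handful of hypothetical rows using the original column names. Both selection rules are assumptions for illustration. The lineage prefix rule covers B.1.1.529 and its BA.* aliases only, so later Omicron descendants with other Pango aliases would need additional handling.

```python
import pandas as pd

# Hypothetical rows; the Delta scorpio call is illustrative.
df = pd.DataFrame({
    "lineage": ["BA.1", "BA.2", "BA.3", "B.1.1.529", "AY.4", "B.1.1.7"],
    "scorpio_call": [
        None,
        "Omicron (BA.2-like)",
        "Omicron (BA.3-like)",
        "Probable Omicron (B.1.1.529-like)",
        "Delta (B.1.617.2-like)",
        "Alpha (B.1.1.7-like)",
    ],
})

# Option 1: match on the Pango lineage (parent lineage or any BA.* sublineage).
lineage = df["lineage"].astype(str)
by_lineage = lineage.eq("B.1.1.529") | lineage.str.startswith("BA.")

# Option 2: match on the coarser scorpio call; missing calls count as non-matches.
by_scorpio = df["scorpio_call"].astype(str).str.contains("Omicron", regex=False)

print(df[by_lineage | by_scorpio])
```

Combining both rules catches rows where only one of the two columns identifies the sequence as Omicron.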

enhancement

opened by lenaschimmel 5
