Collapse a set of redundant kmers to use IUPAC degenerate bases

Alex Reynolds

Last update: Jan 6, 2022

kmer-collapse

Overview

Given an input set of kmers, find the smallest set of kmers that encapsulates all diversity in the input set using IUPAC degenerate bases. This aims to solve the problem described here: https://www.biostars.org/p/9498272/
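As a point of reference, the IUPAC nucleotide codes map each set of possible bases to a single letter. The sketch below is not the script's algorithm; it only illustrates the encoding for the simplest case, where all kmers differ at no more than one position. The helper name `merge_single_position` is hypothetical.

```python
# IUPAC nucleotide codes, keyed by the set of bases each code represents.
IUPAC_CODES = {
    frozenset("A"): "A",
    frozenset("C"): "C",
    frozenset("G"): "G",
    frozenset("T"): "T",
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
    frozenset("CGT"): "B",
    frozenset("AGT"): "D",
    frozenset("ACT"): "H",
    frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def merge_single_position(kmers):
    """Hypothetical helper: merge equal-length kmers that vary at most at one position."""
    # Collect the set of bases observed in each column of the kmer set.
    columns = [frozenset(column) for column in zip(*kmers)]
    varying = [i for i, column in enumerate(columns) if len(column) > 1]
    if len(varying) > 1:
        # Merging column-wise across several varying positions could encode
        # kmers that were never in the input, so this sketch refuses to do it.
        raise ValueError("kmers differ at more than one position")
    return "".join(IUPAC_CODES[column] for column in columns)


print(merge_single_position(["AAAAAAAAAA", "TAAAAAAAAA"]))
print(merge_single_position(["ACAAAAAAAA", "AGAAAAAAAA"]))
```

The two calls reproduce the pair of encoded kmers shown under Usage: A/T collapses to W, and C/G collapses to S.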

Usage

Install the marisa-trie library, if necessary (for example, with pip install marisa-trie).

Modify the script's input variable to specify desired sequences, and then run the script:

$ python kmer-collapse.py
{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA",
        "ASAAAAAAAA"
    ]
}
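One way to sanity-check a result like the one above is to expand each encoded kmer back into its concrete sequences and compare against the input. This is a minimal sketch that is not part of the script; the function names `expand` and `covers_exactly` are hypothetical.

```python
from itertools import product

# Each IUPAC code mapped to the concrete bases it stands for.
IUPAC_EXPANSION = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(encoded_kmer):
    """Return the set of concrete kmers represented by one degenerate kmer."""
    choices = (IUPAC_EXPANSION[code] for code in encoded_kmer)
    return {"".join(bases) for bases in product(*choices)}


def covers_exactly(input_kmers, encoded_kmers):
    """True if the encoded kmers expand to exactly the input set, no more and no less."""
    expanded = set()
    for encoded in encoded_kmers:
        expanded |= expand(encoded)
    return expanded == set(input_kmers)


kmers = ["AAAAAAAAAA", "TAAAAAAAAA", "ACAAAAAAAA", "AGAAAAAAAA"]
print(covers_exactly(kmers, ["WAAAAAAAAA", "ASAAAAAAAA"]))  # True
print(covers_exactly(kmers, ["WSAAAAAAAA"]))                # False
```

The second call shows why the four kmers cannot be collapsed into one: WSAAAAAAAA would expand to sequences such as TCAAAAAAAA that are not in the input.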

Notes

This has not been tested with any kmer sets other than the examples provided. However, it aims to be scalable by pruning, along the way, combinations of sub-kmers that would otherwise yield incorrect encodings. It also uses a trie for faster prefix testing. If further performance is needed, an easy win would be to cache sub-kmer tests, since most of these test outcomes are redundant.

Additionally, no error checking is done on the input kmer alphabet or on the consistency of kmer lengths. It may be useful to validate input before using this script.
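A minimal validation step, run before handing sequences to the script, might look like this. It is a sketch rather than part of the script; the function name `validate_kmers` and the default ACGT alphabet are assumptions.

```python
def validate_kmers(kmers, alphabet="ACGT"):
    """Raise ValueError if kmers are empty, of unequal length, or use other letters."""
    kmers = list(kmers)
    if not kmers:
        raise ValueError("no kmers provided")

    # All kmers must share one length.
    lengths = {len(kmer) for kmer in kmers}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent kmer lengths: {sorted(lengths)}")

    # Every character must belong to the expected alphabet.
    allowed = set(alphabet)
    for kmer in kmers:
        unexpected = set(kmer) - allowed
        if unexpected:
            raise ValueError(
                f"kmer {kmer!r} contains unexpected characters: {sorted(unexpected)}"
            )
    return kmers


validate_kmers(["AAAAAAAAAA", "TAAAAAAAAA"])  # passes silently
```

Lowercase or ambiguous input (for example, a kmer that already contains N) would be rejected under these assumptions; widen `alphabet` if such input should be accepted.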

Examples

These examples are available from the script by uncommenting the relevant input.

A

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA"
    ]
}

B

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "GCGAAAAAAA"
    ],
    "encoded_output": [
        "GCGAAAAAAA",
        "WAAAAAAAAA"
    ]
}

C

{
    "input": [
        "AAAAAAAAAA"
    ],
    "encoded_output": [
        "AAAAAAAAAA"
    ]
}

D

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA"
    ]
}

E

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "TTAAAAAAAA",
        "ATAAAAAAAA"
    ],
    "encoded_output": [
        "WWAAAAAAAA"
    ]
}

F

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ]
}

G

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "ASAAAAAAAA",
        "WAAAAAAAAA"
    ]
}
