Simulate genealogical trees and genomic sequence data using population genetic models

Tskit developers

Last update: Dec 14, 2022

Related tags

Deep Learning python genomics genetics simulation trees coalescent tskit msprime

Overview

msprime

msprime is a population genetics simulator based on tskit. Msprime can simulate random ancestral histories for a sample of individuals (consistent with a given demographic model) under a range of different models and evolutionary processes. Msprime can also simulate mutations on a given ancestral history (which can be produced by msprime or other programs supporting tskit) under a variety of genome sequence evolution models.

Please see the documentation for more details, including installation instructions.

Research notice

Please note that this repository is participating in a study into sustainability of open source projects. Data will be gathered about this repository for approximately the next 12 months, starting from 2021-06-04.

Data collected will include number of contributors, number of PRs, time taken to close/merge these PRs, and issues closed.

For more information, please visit our informational page or download our participant information sheet.

Comments

Compare msprime against seq-gen #1018

Closes #1018. Completes initial code for comparing with Seq-Gen for JC69. Will be extended to accommodate the mutation models that will be implemented later on.

opened by GertjanBisschop 67
Finite sites branchwise
I'm having a go at #553. The two methods sketched out there for doing this are roughly:

build the trees and step along site-by-site

throw down mutations and go back and deal with state later (including rejection sampling)

This PR is the latter. It doesn't work yet. Without fully explaining (yet), the idea is that:

the previous mutgen code can almost lay down multiple mutations at the same site...

however, it had no facilities for actually keeping track of those mutations, so I had to add that.

edges come in order by time-ago

compute_mutation_parents wants mutations to come in order by time (ie, reverse of the order that they are produced by mutgen)

So, once I've done this right, I should be able to

generate mutations

compute mutation parents

figure out state (and, there's python code in this PR to do this bit)

This is not yet right, and I've probably done something terribly wrong C-wise, but maybe you'll get the idea. Feel free to ignore for the moment (but suggestions appreciated).
opened by petrelharp 64
add individual table

This is a start at that (and needs to be rebased already, looks like). WIP.

Maybe this wants to be kept separate for now? But, we need it for SLiM, so I thought I'd give it a go.

Of note: For two things (spatial position and genomes) I found myself wanting to put 2d arrays in kastore. I could have done this with several columns, of course (and maybe I should have), but it seemed pretty natural. We could make this more natural by adding a "width" attribute to columns as well as "length".

opened by petrelharp 59
Function in DemographyDebugger to check where lineages are possible

This function in the demography debugger returns epoch times and indicators for whether lineages are possible in each population from the population configurations. There are some simple examples in the tests for this function, and I've tried to make it play nicely with ancient samples.

If no samples are given, it assumes we are sampling from each population at time zero. Otherwise, samples must be given as a list of msprime.Sample objects, which have times and populations.

cc: @grahamgower, @jeromekelleher

opened by apragsdale 56
Add pop size and coalescent rate trajectories to demography debugger

These changes add 3 new methods to msprime.DemographyDebugger

1: pop_size_and_migration_at_t(self, t) - Which returns all population sizes and the migration matrix at any time (ago), t

2: population_size_trajectory(self, end, num_steps=10) - This function allows the user to specify an end time and a number of steps to take between (and including) - [0,end]. The function will return the steps, as well as each population's respective population size at each step.

3: coalescence_rate_trajectory(self, end, num_samples, num_steps=10, min_pop_size=1) - The kicker (brought to light and solved by @petrelharp). This function follows the same regime as population_size_trajectory, except that it instead computes the ground truth coalescent rates for populations with multiple locations and migration. num_samples should be a list the same length as the number of populations, and min_pop_size is there for math reasons (read the docstring for more). It returns the steps and the respective coalescent rates, along with the sum of a matrix P which represents the total probability that two lineages have not yet coalesced.
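As a rough illustration of the second method, the sketch below evaluates one population's size on an evenly spaced grid over [0, end]. It is not the DemographyDebugger code: the function name size_trajectory and its parameters are hypothetical, and it assumes a single population whose size going back in time follows N(t) = initial_size * exp(-growth_rate * t).

```python
import math


def size_trajectory(initial_size, growth_rate, end, num_steps=10):
    """Return (steps, sizes) on an evenly spaced grid over [0, end].

    Hypothetical helper: assumes one population with exponential size
    change, N(t) = initial_size * exp(-growth_rate * t), with t in time ago.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2 to include both endpoints")
    # Both endpoints (0 and end) are part of the grid.
    steps = [end * i / (num_steps - 1) for i in range(num_steps)]
    sizes = [initial_size * math.exp(-growth_rate * t) for t in steps]
    return steps, sizes


# A constant-size population (growth_rate=0) keeps the same size at every step.
steps, sizes = size_trajectory(initial_size=1000, growth_rate=0.0, end=100, num_steps=5)
```

With several populations and demographic events, the size at each step would have to be looked up per epoch, which this sketch does not attempt.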

There are currently only two unit tests for this, and I am working on writing more. In the meantime, feel free to make suggestions.

Below is what we are computing. 2.HEIC.pdf 3.HEIC.pdf 1.HEIC.pdf

opened by jgallowa07 41

Declarative demography

After a discussion with @grahamgower this morning, we decided it would be nice to have a declarative structure for demography, where we describe what our populations are and how they relate to each other. Here's a first pass at doing the OOA model using a toml description:

Warning: the parameters are WRONG - DO NOT COPY THEM!!

description = "Gutenkunst et al three population Out-of-Africa"
generation_time = 25
time_units = "kya"

[populations]
    [populations.ancestral_human]
        size = 7300
        time = 220

    [populations.ancestral_african]
        ancestor = "ancestral_human"
        size = 12300
        time = 140

    [populations.ancestral_eurasian]
        ancestor = "ancestral_african"
        size = 12300
        time = 21.2

    [populations.YRI]
        ancestor = "ancestral_african"
        size = 12300

    [populations.CEU]
        ancestor = "ancestral_eurasian"
        size = 1000
        growth_rate = 0.0055

    [populations.CHB]
        ancestor = "ancestral_eurasian"
        size = 510
        growth_rate = 0.004


# The limitations of toml here become apparent when we try to 
# declare some migration relationships between these populations.
[migration]
    symmetric = [
        # Annoyingly, the rate must be enclosed in a list for toml
        [["ancestral_african", "ancestral_eurasian"], [25e-5]],
        [["YRI", "CEU"], [3e-5]],
        [["YRI", "CHB"], [1.9e-5]],
        [["CEU", "CHB"], [9.6e-5]],
    ]
    # Could have asymmetric migration here also
   
    # Sticking in an admixture event, just to see if we can handle mass migration.
    admixture = [
        # (source, dest), (time, fraction). We need this lameness 
        # because toml won't accept mixed array types. This isn't 
        # great.
        [["YRI", "CEU"], [1.0, 0.1]]
    ]

Some parts of this are quite nice, and others are pretty nasty. I think toml is too restrictive and will have another go in a minute with yaml.

This is an inherently graphical description of the populations, inspired by @apragsdale's approach in the demography package. A "population" in this context is modelled as a unit in which the parameters don't change and no mergers with other populations occur. Each population has an ancestor attribute, which points to the population that it was derived from. There is also a time attribute, which (forwards in time) is the time at which this population goes extinct (if it hasn't split into other populations).

Migration is treated separately to this, and set up as relationships between the populations, outside the population inheritance tree. This ended up being ugly to express in toml, but it might be easier to do in yaml, say.
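To make the population inheritance tree concrete, here is a minimal sketch that treats the declaration as a plain dictionary and checks the ancestor links. The function names and the toy populations are hypothetical, the sizes are placeholders, and nothing here is part of msprime.

```python
def undefined_ancestors(populations):
    """Return sorted (population, ancestor) pairs whose ancestor is not declared."""
    missing = []
    for name, params in populations.items():
        ancestor = params.get("ancestor")
        if ancestor is not None and ancestor not in populations:
            missing.append((name, ancestor))
    return sorted(missing)


def lineage_to_root(populations, name):
    """Follow ancestor links from a population back to the root of the tree.

    Assumes undefined_ancestors() has already reported no problems.
    """
    path = [name]
    while "ancestor" in populations[path[-1]]:
        ancestor = populations[path[-1]]["ancestor"]
        if ancestor in path:
            raise ValueError(f"cycle in ancestor links at {ancestor!r}")
        path.append(ancestor)
    return path


# Hypothetical declaration; sizes are placeholders, not real estimates.
populations = {
    "root": {"size": 100},
    "A": {"ancestor": "root", "size": 100},
    "B": {"ancestor": "A", "size": 100},
    "C": {"ancestor": "rooot", "size": 100},  # deliberate typo in the ancestor name
}

problems = undefined_ancestors(populations)
path = lineage_to_root(populations, "B")
```

A check like this catches misspelled ancestor names early, which is one practical benefit of declaring the relationships rather than scripting events.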

Any thoughts? Is this population inheritance tree a useful tool, or just overly simplistic?

opened by jeromekelleher 40

Likelihood evaluation

Both hudson_recombination_event and common_ancestor_event now begin by setting store_full_arg = 1, which is what I envisage turning into a parameter in the args object.

The calls to store_edge for tree sequence output (as opposed to full ARG) have been placed inside if-checks for store_full_arg = 0.

Correspondingly, both hudson_recombination_event and common_ancestor_event create a new node(s), and corresponding edges, when store_full_arg = 1. That's always accompanied by a while-loop, which makes sure that all segments that used to point to the old node are updated to point to the new one. I'm not sure whether this is the cleanest way of doing it, though don't have a better one in mind either.
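The repointing loop described above can be sketched in Python as follows. This is only an illustration of the idea, not the C code in the PR: the Segment class, the repoint_chain function and the edge tuple layout (left, right, parent, child) are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """Hypothetical ancestral segment: an interval carried by a node."""
    left: int
    right: int
    node: int
    next: Optional["Segment"] = None


def repoint_chain(head, new_node):
    """Walk a segment chain, record one edge per segment, and repoint it.

    Each edge is (left, right, parent, child), where the parent is the newly
    created node and the child is the node the segment used to point to.
    """
    edges = []
    seg = head
    while seg is not None:
        edges.append((seg.left, seg.right, new_node, seg.node))
        seg.node = new_node
        seg = seg.next
    return edges


# Two segments carried by node 3 are handed over to a new node 7.
head = Segment(0, 10, 3, Segment(20, 30, 3))
edges = repoint_chain(head, new_node=7)
```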

opened by JereKoskela 38
generalize mutation models

Here's how I'd like to make mutation models more flexible, as outlined in #1006. Mutation models would only need to have the 'pick root allele' and 'transition allele' methods; what we've done so far is the special case of MutationMatrixModels. I've coded the (small) changes up in python, with the SLiM mutation model.

I've also made the check that we aren't sticking a mutation above an existing one optional (in python only, again).

How's this look?

Note: with this in place, it'll be easy to do models of microsats, etcetera.
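A minimal sketch of what such a two-method interface could look like is given below. The class and method names are hypothetical (they are not the names used in the PR), and the two example models are simplified stand-ins: one with equal rates between nucleotides, and one stepwise repeat-count model of the kind a microsatellite model might use.

```python
import random


class MutationModel:
    """Hypothetical minimal interface: a model only needs these two methods."""

    def root_allele(self, rng):
        raise NotImplementedError

    def transition_allele(self, rng, parent_allele):
        raise NotImplementedError


class EqualRatesNucleotideModel(MutationModel):
    """Uniform root allele; a mutation picks uniformly among the other bases."""

    alleles = ("A", "C", "G", "T")

    def root_allele(self, rng):
        return rng.choice(self.alleles)

    def transition_allele(self, rng, parent_allele):
        return rng.choice([a for a in self.alleles if a != parent_allele])


class StepwiseRepeatModel(MutationModel):
    """Alleles are repeat counts; each mutation adds or removes one repeat."""

    def __init__(self, start=10):
        self.start = start

    def root_allele(self, rng):
        return self.start

    def transition_allele(self, rng, parent_allele):
        if parent_allele <= 1:
            # Never drop below one repeat.
            return parent_allele + 1
        return parent_allele + rng.choice((-1, 1))


rng = random.Random(1)
model = EqualRatesNucleotideModel()
state = model.root_allele(rng)
derived = model.transition_allele(rng, state)
```

The stepwise model has no finite transition matrix, which is the kind of case a matrix-only design cannot express.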

opened by petrelharp 35

Very rarely, waiting time until coalescence = 0. What to do?

In very, very, very rare cases msprime can generate a zero waiting time until the next coalescent event, resulting in an error being thrown. This script, for example, does it:

import msprime as msp

def ancient_sample_test(
        num_modern=1000, anc_pop = 0, anc_num = 1, anc_time = 200, split_time_anc = 400,
        Ne0 = 10000, Ne1 = 10000, length = 1000):
    samples = [msp.Sample(population = 0, time = 0)]*num_modern
    samples.extend([msp.Sample(population = anc_pop, time = anc_time)]*(2*anc_num))
    pop_config = [msp.PopulationConfiguration(initial_size = Ne0), msp.PopulationConfiguration(initial_size = Ne1)]
    divergence = [msp.MassMigration(time = split_time_anc, source = 1, destination = 0, proportion = 1.0)]
    seed = 94320219
    sims = msp.simulate(
        samples=samples,Ne=Ne0,population_configurations=pop_config,
        demographic_events=divergence, length=length,
        random_seed=seed)

if __name__ == "__main__":
    num_ind = 10
    ancient_sample_test(
        num_modern=100,anc_pop=0,anc_num=num_ind,Ne0=3000,Ne1=3000,anc_time=5419,split_time_anc=5919,length=500)

we get

  File "ancient_genotypes_simulation.py", line 19, in <module>
    num_modern=100,anc_pop=0,anc_num=num_ind,Ne0=3000,Ne1=3000,anc_time=5419,split_time_anc=5919,length=500)
  File "ancient_genotypes_simulation.py", line 14, in ancient_sample_test
    random_seed=seed)
  File "/home/jk/work/github/msprime/msprime/simulations.py", line 485, in simulate
    sim, mutation_generator, 1, provenance_dict, end_time))
  File "/home/jk/work/github/msprime/msprime/simulations.py", line 163, in _replicate_generator
    sim.run(end_time)
  File "/home/jk/work/github/msprime/msprime/simulations.py", line 704, in run
    self.ll_sim.run(end_time)
_msprime.LibraryError: The simulation model supplied resulted in a parent node having a time value <= to its child. This can occur either as a result of multiple bottlenecks happening at the same time or because of numerical imprecision with very small population sizes.

What happens is, this line returns exactly zero and so the minimum time until the next event is 0 and naturally enough this gets chosen as the next event. But, we end up with a zero branch length, which tskit won't allow.

We could put in some kludge where we have 1e-200 or something as the time if t_wait is exactly zero, but I don't think this is a good idea. We can easily get to a case where branch lengths become too short for double precision when growth rates are specified, and it's much better that the simulation ends with an error than the user getting back a bunch of completed trees with minuscule branch lengths.
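The precision argument can be checked directly. The sketch below is not msprime code: next_event_time is a hypothetical helper that adds a waiting time to the current time and raises if the result does not move forward. It shows that a tiny positive waiting time such as 1e-200 is absorbed entirely once the current time is of ordinary magnitude.

```python
def next_event_time(current_time, waiting_time):
    """Return the time of the next event, refusing zero-length branches.

    Hypothetical helper: raises instead of silently returning a parent
    time that is <= the child time.
    """
    new_time = current_time + waiting_time
    if new_time <= current_time:
        raise ValueError("waiting time too small: parent time would be <= child time")
    return new_time


# At time 0 a tiny waiting time is still representable...
tiny_at_zero = next_event_time(0.0, 1e-200)

# ...but added to a time in the thousands it vanishes in double precision.
absorbed = (5419.0 + 1e-200 == 5419.0)

try:
    next_event_time(5419.0, 1e-200)
    raised = False
except ValueError:
    raised = True
```

So substituting a small constant for a zero waiting time would not reliably produce a positive branch length in the first place.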

So, I don't really know how to handle this!

Thanks to David Lawrie for the script and reporting the problem. Pinging @molpopgen for thoughts.

opened by jeromekelleher 35

Bottomup simplify
Here (in simplify_work/) is an implementation of the bottomup simplify algorithm that follows the python code in algorithms.py as closely as I could manage.

Good news: it runs!

msp simulate 10 trees.ts
python3 simplify_algorithms.py simplify trees.ts 3

Bad news: it isn't right if there's more than one tree. Also maybe it falls in an infinite loop sometimes. But, this is just debugging. @jeromekelleher : is this close enough to the C to enable easy translation, once it works?

Main points:

This is different to msprime in that recombination and coalescence events happen at the same time. To do this, I wrote my own remove_ancestry function, kinda like recombination_event, and basically used merge_ancestors as-is.

But, to associate ancestors in the internal state to nodes of the edgesets we're reading in, I had to add an additional data structure, A, that stores for each node ID in the input sequence the ID of the head ancestral segment. This then has to be updated in merge_ancestors.

The only other thing is getting the initial state right.

I don't think we actually need the Fenwick tree here, since we don't need to choose segments proportional to their length. But maybe I'm missing something else we need them for.

I'm not clear how this eventually gets into NodeTables and EdgesetTables. I think that happens later on, so I shouldn't worry?

I'll finish tracking down the bugs, but a go-ahead would be helpful.
opened by petrelharp 35

Core dumped error

I got the following error:

python3: lib/object_heap.c:136: object_heap_free_object: Assertion `self->top < self->size' failed.
Aborted (core dumped)

After running the following code:

import matplotlib
import numpy as np
import pandas as pd
import scipy.special
import scipy.stats
from matplotlib import lines as mlines
from matplotlib import pyplot
import msprime
import msprime.cli as cli

sample_size=20
Ne=10**4
r=1*1e-8
m=1.25*1e-8
L=10 ** 8
strengh=10

population_configurations=[msprime.PopulationConfiguration(initial_size=Ne,sample_size=sample_size)]
demographic_events = [
            msprime.PopulationParametersChange(
                time=1000, growth_rate=-0.00025584278811045
            ),
            msprime.PopulationParametersChange(
                time=10000, growth_rate=0
            ),
        ]
for x in range(1,11):
    ts=msprime.simulate(population_configurations=population_configurations,recombination_rate=r,mutation_rate=m,length=L,demographic_events=demographic_events)
    name="Figure_1_Decrease_strength_"+str(strengh)+"_"+str(x)+".vcf"
    with open(name, "w") as vcf_file:
        ts.write_vcf(vcf_file, 2)

The error does not always occur; it might run a couple of iterations before crashing.

bug

opened by TPPSellinger 34

add Software Heritage badges to repo README file

This is a proposal for adding Software Heritage badges to tskit and msprime (and perhaps other) repo README files.

Here's info about Software Heritage badges: https://www.softwareheritage.org/2020/01/13/the-swh-badges-are-here/

I'm happy to create PRs if this is considered an improvement.

I think Software Heritage is a great resource for the research community and am happy to write more about why it is so awesome.

opened by castedo 2

SMC with record_full_arg produces discontinuous nodes

I'm not sure what "SMC" means for a full ARG, so I'm not sure if this is expected, but I thought it worth noting:

ts = msprime.sim_ancestry(
    4,
    sequence_length=1e4,
    recombination_rate=1e-5,
    record_full_arg=True,
    random_seed=14,
    model="SMC",
)
ts.draw_svg(style=".n19 > .sym {fill:red}")

opened by hyanwong 2

add keep_unary for pedigrees and dtwf

This pull request stores all nodes through which some ancestral material passes for pedigrees and dtwf simulations when the flag record_unary=True is set. This is a first pass to allow for testing (see #2132). This pull request requires defining a MSP_NODE_IS_PASS_THROUGH_EVENT (see ). Is this as simple as simply using the next available number in line (#define MSP_NODE_IS_PT_EVENT (1u << 22))? As with #2130 unit tests are still basic and require further work.
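One way to answer the "next available number" question safely is to check a candidate bit against the flags already in use. The sketch below is a Python illustration under loud assumptions: only the (1 << 22) value comes from the text above, while the table name, the placeholder flag names and their bit positions are hypothetical and do not claim to match the C headers.

```python
# Hypothetical flag table. Only the (1 << 22) value is taken from the text
# above; the other names and bit positions are placeholders for illustration.
NODE_FLAGS = {
    "PLACEHOLDER_EVENT_A": 1 << 16,
    "PLACEHOLDER_EVENT_B": 1 << 17,
    "PLACEHOLDER_EVENT_C": 1 << 21,
}


def register_flag(flags, name, value):
    """Add a single-bit flag, rejecting values that overlap existing flags."""
    if value <= 0 or value & (value - 1):
        raise ValueError(f"{name}: flag must have exactly one bit set")
    for other_name, other_value in flags.items():
        if value & other_value:
            raise ValueError(f"{name} collides with {other_name}")
    flags[name] = value


# The proposed pass-through flag does not overlap any placeholder above.
register_flag(NODE_FLAGS, "MSP_NODE_IS_PT_EVENT", 1 << 22)
```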

opened by GertjanBisschop 4
More helpful record_full_arg errors

There are a few cases (e.g. the DTWF) where full_arg recording is not possible. We should give slightly more informative error messages when this is attempted. This PR should make it easy to add extra error messages for different cases like this.

opened by hyanwong 2
Test sim_mutations with unary nodes

Just to be sure, we should add a quick test that checks that sim_mutations works OK when we have unary nodes (probably already exists for full_arg, but no harm in adding another test).

opened by jeromekelleher 1
Test simulating with unary nodes on all ancestry models/processes

In addition to some statistical tests (#2134) we should add some straightforward unit tests that systematically check that we handle things correctly across the different ancestry models and ancestral processes (e.g. gene conversion).

opened by jeromekelleher 0

Releases(1.2.0)

1.2.0(May 19, 2022)
New features

Add the FixedPedigree ancestry model and various infrastructure for importing pedigree information into msprime.

Bug fixes:

Fix rare assertion trip in the single sweep model caused by numerical jitter. (#1966, #2038, @jeromekelleher, @molpopgen)

Fix edge case in Demography.from_old_style() (#2047, #2048, @grahamgower)

Maintenance:

Documentation improvements (#2054, #2033, #2011 @petrelharp, @gregorgorjanc)

1.1.1(Feb 10, 2022)
Minor bugfix release

Bug fixes:

Fix (very) rare assertion trip caused by underlying GSL bug. (#1997, #2000, @chriscrsmith, @molpopgen, @andrewkern)

Maintenance:

Various documentation improvements.

1.1.0(Dec 14, 2021)
[1.1.0] - 2021-12-14

New features

Add support for tree sequence time_units field. The time_units will be set to “generations” for the output of sim_ancestry (and simulate), unless the initial_state argument is used. In this case, the time_units value will be inherited from the input. (#1953, #1951, #1877, #1948, @jeromekelleher).

Bug fixes:

Raise an error if Demography.from_demes() is passed a model with non-zero selfing_rate or cloning_rate values (which msprime does not support). (#1938, #1937, @grahamgower).

Do not assume Population metadata schemas contain the properties and additionalProperties attributes (#1947, #1954, @jeromekelleher).

Read the population name from PopulationConfiguration metadata in Demography.from_old_style (#1950, #1954, @jeromekelleher)

Maintenance:

Update tskit to Python 0.4.0 and C 0.99.15.

1.0.4(Dec 1, 2021)

New features:

Support for Demes 0.2.0, which introduces a change to how pulse sources and proportions are specified. (#1936, #1930, @apragsdale)
1.0.3(Nov 13, 2021)
[1.0.3] - 2021-11-12

This is a bugfix release recommended for all users.

New features:

Support for running full ARG simulations with gene conversion (#1801, #1773, @JereKoskela).

Improved performance when running many small simulations (#1909, @jeromekelleher).

Update to tskit C API 0.99.14 (#1829).

Bug fixes:

Fix bug in full ARG simulation with missing regions of the genome, where ARG nodes were not correctly returned. (#1893, @jeromekelleher, @hyl317)

Fix memory leak when running sim_ancestry in a loop (#1904, #1899, @jeromekelleher, @grahamgower).

Fix printing small values in rate maps (#1906, #1905, @petrelharp).

1.0.2(Jul 29, 2021)
Minor feature release with improved Demes support and a few small bugfixes.

New features:

Support for Demes input and logging in the msp simulate CLI (#1716, @jeromekelleher).

Add Demography.to_demes method for creating a Demes demographic model from an msprime demography (#1724, @grahamgower).

Improved mapping of Demes models to Demography objects (#1758, #1757, #1756 @apragsdale).

Improved numerical algorithms in DemographyDebugger (#1788, @grahamgower, @petrelharp).

Bugfixes:

Raise an error if running full ARG simulations with gene conversion (#1774).

1.0.1(May 10, 2021)
Minor feature release with experimental Demes support.

Change the semantics of Admixture events slightly so that ancestral populations that are inactive are marked as active (#1662, #1657, @jeromekelleher, @apragsdale)

Initial support for Demes via the Demography.from_demes method. (#1662, #1675, @jeromekelleher, @apragsdale, @grahamgower)

1.0.0(Apr 14, 2021)

This is a major update recommended for all users. Please see the changelog for details.
1.0.0b1(Apr 1, 2021)

Beta release for testing. See https://tskit.dev/msprime/docs/latest/CHANGELOG.html for notes.
1.0.0a6(Mar 3, 2021)

Sixth alpha for testing. Includes packaging updates.
1.0.0a5(Mar 2, 2021)

Fifth alpha for testing and evaluation. Includes draft of the 1.0 demography API.
1.0.0a4(Feb 1, 2021)

1.0.0a3(Jan 18, 2021)

Alpha 3 release.
1.0.0a2(Jan 18, 2021)

Second alpha; updates to release artifacts.
1.0.0a1(Jan 15, 2021)

Early release of the new 1.0 APIs for developers and experienced users. This is not stable and the APIs may still change.

See https://tskit-dev.github.io/msprime-docs/main/quickstart.html#upgrading-from-0-x for more information.
0.7.5(May 29, 2020)

This is a dummy release to allow us to update the "stable" docs branch on readthedocs. This is to correct an error in the version of the Out of Africa model described in the tutorial. See here for full details.
0.7.4(Dec 5, 2019)
This release fixes an important bug in the legacy ms compatible interface and is therefore strongly recommended for all mspms users.

Bug fixes:

Fix error in mspms output of tree spans. In previous versions, the length of genome spanned by trees in the newick output was incorrect in certain situations (specifically, when “invisible” recombinations are present so that two or more identical trees are printed out). Thanks to @fbaumdicker for spotting the problem. (@jeromekelleher, #837, #836)

Fix assertion tripped when we have very low recombination rates in the DTWF model. Thanks to @terhorst for the bug report. (@jeromekelleher, #833, #831).

Fix bug in memory allocation when simulating mutations on a tree sequence that already contains many mutations. Thanks to @santaci for the bug report. (@jeromekelleher, @petrelharp, #838, #806)

New features:

Add the new Census event, which allows us to place nodes on all extant branches at a given time (@gtsambos #799).

Improved error reporting for input parameters, in particular demographic events (#829).

Documentation:

Improved container documentation (@agladstein, #822, #809).

Improved developer docs for macs (@gtsambos, @molpopgen, #805).

Clarify meaning of migration matrix (@petrelharp, #830).

0.7.3(Aug 3, 2019)
Bug fixes:

Support for SMC models coupled with the record_full_arg feature was erroneously removed in a previous version (#795). The feature has been reinstated (#796).

0.7.2(Jul 30, 2019)
Minor release fixing a very rare bug and with some new features.

Breaking changes

The random trajectory has been changed slightly to improve handling of ancient sampling events (#782). Thus, simulations for a given random seed will not be identical to previous versions, if ancient samples are used.

New features

Automated Docker builds (@agladstein, #661)

Add mean coalescence time to DemographyDebugger (@petrelharp, #779).

Improve MassMigration descriptions in DemographyDebugger (@marianne-aspbury, #791).

Bug fixes:

In very, very, very rare cases it was possible to generate a zero waiting time until the next coalescent event, leading to zero branch lengths in the output tree sequence and an error being raised (@molpopgen, @DL42, @jeromekelleher, #783, #785).

0.7.1(Jun 8, 2019)
New features

Discrete Time Wright-Fisher simulation model (@DomNelson).

SMC/SMC' simulation models (@jeromekelleher).

Mixed simulation models (@jeromekelleher).

Specify end_time to allow early-finish for simulations (@jeromekelleher).

Calculation of historical coalescence rates in the DemographyDebugger (@jgallowa07, @petrelharp).

Additional information on population sizes in DemographyDebugger (@andrewkern).

Remove support for Python 2 (@hugovk).

Allow specifying metadata for populations (@jeromekelleher).

Bug fixes:

Various minor bug and doc fixes from @hyanwong, @petrelharp, @brianzhang01, @mufernando and @andrewkern.

0.7.1b1(May 31, 2019)

Early release to make the DTWF model available for testing.
0.7.0(Feb 22, 2019)
Separation of tskit from msprime. Msprime is now solely dedicated to simulating the coalescent, and all infrastructure for working with succinct tree sequences is now provided by tskit. To ensure compatibility, msprime now imports code from tskit under the old names, which should ensure that all code continues to work without changes.

New features

Ability to record the full ARG (Jere Koskela; #665)

Bug fixes:

Fix deprecation warning (#695).

0.7.0a1(Jan 14, 2019)

Alpha release for testing the tskit/msprime split.
0.6.2(Dec 4, 2018)
Minor bugfix release.

New features:

Add provenance recording option to simplify (#601)

Minor performance improvement (#598)

Bug fixes:

Fix performance regression in replication (#608)

0.6.1(Aug 25, 2018)
Significant features for integration with forwards-time simulators plus improvements and bugfixes.

Breaking changes:

Change in the semantics of how populations are treated by simplify. By default, populations that are not referenced will now be removed from the data model. This can be avoided by setting filter_populations=False.

Simplify now raises an error if called on a set of tables that contain one or more migrations.

New features:

The simulate() function now supports a from_ts argument allowing msprime to complete the ancestry in tree sequences generated by forward simulations (#503, #541, #572, #581).

Add start_time and end_time parameters to the mutate function (#508).

Add reduce_to_site_topology argument to simplify. This allows us to find the minimal tree sequence that would be visible from a given set of sites, and is also a useful compression method if we are only interested in the observed sequences. (#545, #307).

Simplify generalised to support individuals, and the filter_populations, filter_individuals and filter_sites parameters added to allow filtering of unreferenced objects from the data model. (#567).

Default random seeds are now generated from a sequence initialised by a system source of randomness (#534). Random seeds should also be safely generated across multiple processes.

Full text I/O support for Individuals and Populations (#498, #555)

Substantially improved performance in msprime.load for large tables and significant refactoring of C code (#559, #567, #569).

Improved performance of generating genotypes (#580).

Formal schema for tree sequence provenance (#566, #583).

Many updates to documentation.

Bug fixes:

Throw a more intelligible error during simulation if a topology is produced where the time of a parent is equal to the time of the child. (#570, #87).

Pickle supported in the TableCollection object. (#574, #577).

Deprecated:

The filter_zero_mutation_sites parameter for simplify has been deprecated in favour of filter_sites.

0.6.0(Jun 20, 2018)
This release is focused on ensuring interoperability with the forthcoming SLiM 3.0 release, which has support for outputting tree sequences in msprime's .trees format. The release represents a substantial step towards the goal of separating the tskit code from msprime. It removes the troublesome HDF5 dependency in favour of the much simpler kastore library.

The principal new features are the mutate() function which allows us to easily add mutations to any tree sequence, preliminary support for Individuals and Populations within the data model, and the addition of the new TableCollection object as the central structure in the Tables API.

Breaking changes:

Files stored in the HDF5 format will need to be upgraded using the msp upgrade command.

New features:

The mutate function (#507).

Removed HDF5 library dependency. Now use the embedded kastore library for storing data.

Numpy and h5py are now install time dependencies, solving some installation headaches.

The new TableCollection type gives much tighter integration with the low-level library. Functions like sort_tables and simplify_tables are now methods of this class. The load_tables function has been replaced by TableCollection.tree_sequence. These functions still work, but are deprecated.

Preliminary support for Individual and Population types in the Tables API and for TreeSequences.

Add 'root' argument to SparseTree.newick and support for arbitrary node labels (#510).

Larger numbers of alleles now supported via 16-bit genotypes (#466).

Substantially improved simplify performance when there is a large number of sites (#453).

Bug fixes:

Fix bug in tree drawing with many roots (#486)

Fix segfault in accessing trees with zero roots (#515)

Fix bug where DemographyDebugger was modifying the input sample sizes (#407)

Deprecated:

sort_tables is deprecated in favour of TableCollection.sort().

simplify_tables is deprecated in favour of TableCollection.simplify().

load_tables is deprecated in favour of TableCollection.tree_sequence().

0.6.0b2(Jun 16, 2018)

This release fixes some OSX bugs in 0.6.0b1.
0.6.0b1(Jun 15, 2018)
This is a preview release of the following major changes:

Remove HDF5 and use kastore for tree sequence files

Add Individual and Population types

The mutate() function.

0.5.0(Feb 26, 2018)
This is a major update to the underlying data structures in msprime to generalise the information that can be modelled, and allow for data from external sources to be efficiently processed. The new Tables API enables efficient interchange of tree sequence data using numpy arrays. Many updates have also been made to the tree sequence API to make it more Pythonic and general. Most changes are backwards compatible, however.

Breaking changes:

The SparseTree.mutations() and TreeSequence.mutations() iterators no longer support tuple-like access to values. For example, code like

for x, u, j in ts.mutations():
    print("mutation at position", x, "node = ", u)

will no longer work. Code using the old Mutation.position and Mutation.index will still work through deprecated aliases, but new code should access these values through Site.position and Site.id, respectively.

The TreeSequence.diffs() method no longer works. Please use the TreeSequence.edge_diffs() method instead.

TreeSequence.get_num_records() no longer works. Any code using this or the records() iterator should be rewritten to work with the edges() iterator and num_edges instead.

Files stored in the HDF5 format will need to be upgraded using the msp upgrade command.
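
To illustrate the change to mutation access described above, here is a minimal sketch in plain Python. The Site and Mutation classes below are hypothetical stand-ins that mimic only the attributes named in these notes (Site.position, Site.id and a mutation's node); the mutations list on each site is an assumption made for the example, and nothing here calls msprime itself.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins that mimic only the attributes named above;
# these are not the msprime classes.
@dataclass
class Mutation:
    site: int  # ID of the site this mutation occurs at
    node: int  # node above which the mutation occurs


@dataclass
class Site:
    id: int
    position: float
    mutations: list = field(default_factory=list)  # assumed for this sketch


sites = [
    Site(id=0, position=1.5, mutations=[Mutation(site=0, node=3)]),
    Site(id=1, position=7.25, mutations=[Mutation(site=1, node=4)]),
]

# Attribute access replaces the old tuple unpacking: position and id are
# read from the site rather than from the mutation.
rows = []
for site in sites:
    for mutation in site.mutations:
        rows.append((site.position, site.id, mutation.node))
        print("mutation at position", site.position, "node =", mutation.node)
```

The point of the sketch is only that values are looked up by name on the object that owns them, so code no longer depends on the order of fields in a tuple.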

New features:

The API has been made more Pythonic by replacing (e.g.) tree.get_parent(u) with tree.parent(u), and tree.get_total_branch_length() with tree.total_branch_length. The old forms have been maintained as deprecated aliases. (#64)

Efficient interchange of tree sequence data using the new Tables API. This consists of classes representing the various tables (e.g. NodeTable) and some utility functions (such as load_tables, sort_tables, etc).

Support for a much more general class of tree sequence topologies. For example, trees with multiple roots are fully supported.

Substantially generalised mutation model. Mutations now occur at specific sites, which can be associated with zero to many mutations. Each site has an ancestral state (any character string) and each mutation a derived state (any character string).

Substantially updated documentation to rigorously define the underlying data model and requirements for imported data.

The variants() method now returns a list of alleles for each site, and genotypes are indexes into this array. This is both consistent with existing usage and works with the newly generalised mutation model, which allows arbitrary strings of characters as mutational states.

Add the formal concept of a sample, as distinguished from 'leaves'. Change tracked_leaves, etc. to tracked_samples (#225). Also rename sample_size to num_samples for consistency (#227).

The simplify() method returns subsets of a large tree sequence.

TreeSequence.first() returns the first tree in the sequence.

Windows support. Msprime is now routinely tested on Windows as part of the suite of continuous integration tests.

Newick output is not supported for more general trees. (#117)

The genotype_matrix method allows efficient access to the full genotype matrix. (#306)

The variants iterator no longer uses a single buffer for genotype data, removing a common source of error (#253).

Unicode and ASCII output formats for SparseTree.draw().

SparseTree.draw() renders trees in the more conventional 'square shoulders' format.

SparseTree.draw() by default returns an SVG string, so it can be easily displayed in a Jupyter notebook. (#204)

Preliminary support for a broad class of site-based statistics, including Patterson's f-statistics, has been added, through the SiteStatCalculator, and its branch length analog, BranchLengthStatCalculator. The interface is still in development and may change.
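
The allele-index convention mentioned for variants() in the list above can be shown without msprime. This is a minimal sketch in plain Python: the decode_genotypes helper and the example data are hypothetical, and placing the ancestral state at index 0 is an assumption made for the example.

```python
def decode_genotypes(alleles, genotypes):
    """Map each genotype index to the allele string it refers to."""
    return [alleles[g] for g in genotypes]


# Hypothetical site. States may be arbitrary strings, so a derived state
# such as "GG" is allowed. Index 0 is assumed to be the ancestral state.
alleles = ["A", "T", "GG"]

# One entry per sample; each value is an index into `alleles`.
genotypes = [0, 1, 0, 2]

decoded = decode_genotypes(alleles, genotypes)
print(decoded)
```

Storing small integer indexes rather than the strings themselves keeps the genotype array compact while still allowing alleles of any length.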

Bug fixes:

Duplicate sites are no longer possible (#159)

Fix for incorrect population sizes in DemographyDebugger (#66).

Deprecated:

The records iterator has been deprecated, and the underlying data model has moved away from the concept of coalescence records. The structure of a tree sequence is now defined in terms of a set of nodes and edges, essentially a normalised version of coalescence records.

Changed population_id to population in various DemographicEvent classes for consistency. The old population_id argument is kept as a deprecated alias.

Changed destination to dest in MassMigrationEvent. The old destination argument is retained as a deprecated alias.

Changed sample_size to num_samples in TreeSequence and SparseTree. The older versions are retained as deprecated aliases.

Change get_num_leaves to num_samples in SparseTree. The get_num_leaves method (and other related methods) that have been retained for backwards compatibility are semantically incorrect, in that they now return the number of samples. This should have no effect on existing code, since samples and leaves were synonymous. New code should use the documented num_samples form.

Accessing the position attribute on a Mutation or Variant object is now deprecated, as this is a property of a Site.

Accessing the index attribute on a Mutation or Variant object is now deprecated. Please use variant.site.id instead. In general, objects with IDs (i.e., derived from tables) now have an id field.

Various get_ methods in TreeSequence and SparseTree have been replaced by more Pythonic alternatives.
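
The move from coalescence records to nodes and edges noted in the list above can be sketched as a normalisation step. This is a rough illustration in plain Python, not msprime's implementation: the Record and Edge tuples and the normalise helper are hypothetical, the record is simplified to (left, right, node, children, time), and the example values are made up.

```python
from collections import namedtuple

# Hypothetical stand-ins for the old and new representations; these are
# not msprime classes.
Record = namedtuple("Record", "left right node children time")
Edge = namedtuple("Edge", "left right parent child")


def normalise(records):
    """Split coalescence records into a node-time map and a list of edges."""
    node_time = {}
    edges = []
    for rec in records:
        # Time is a property of the node, so it is stored once per node
        # instead of being repeated in every record that mentions it.
        node_time[rec.node] = rec.time
        # Each parent-child relationship over an interval becomes one edge.
        for child in rec.children:
            edges.append(Edge(rec.left, rec.right, rec.node, child))
    return node_time, edges


records = [
    Record(0.0, 10.0, 3, (0, 1), 0.5),
    Record(0.0, 4.0, 4, (2, 3), 1.2),
    Record(4.0, 10.0, 5, (2, 3), 1.7),
]
node_time, edges = normalise(records)

# Nodes that appear only as children (0, 1 and 2 here) are not assigned a
# time by this sketch.
print(node_time)
print(len(edges))
```

Each record with several children expands into one edge per child, which is why the three records here yield six edges.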

0.5.0b2(Feb 2, 2018)

This release completes the documentation and API changes for the 0.5.0 series, and is a pre-release for testing purposes.

Owner

Tskit developers

Software for the creation and analysis of tree-sequences.

