Sum-Product Probabilistic Language

Overview

SPPL is a probabilistic programming language that delivers exact solutions to a broad range of probabilistic inference queries. It supports continuous, discrete, and mixed-type probability distributions as well as many-to-one numerical transformations, and it provides a query language that includes general predicates on random variables.

Users express generative models as probabilistic programs with standard imperative constructs, such as arrays, if/else branches, for loops, etc. The program is then translated to a sum-product expression (a generalization of sum-product networks) that statically represents the probability distribution of all random variables in the program. This expression is used to deliver answers to probabilistic inference queries.

A system description of SPPL is given in the following paper:

SPPL: Probabilistic Programming with Fast Exact Symbolic Inference. Saad, F. A.; Rinard, M. C.; and Mansinghka, V. K. In PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, June 20-25, Virtual, Canada. ACM, New York, NY, USA. 2021. https://doi.org/10.1145/3453483.3454078.

Installation

This software is tested on Ubuntu 18.04 and requires a Python 3.6+ environment. SPPL is available on PyPI:

$ python -m pip install sppl

To install the Jupyter interface, first obtain the system-wide dependencies in requirements.sh and then run

$ python -m pip install 'sppl[magics]'

Examples

The easiest way to use SPPL is via the browser-based Jupyter interface, which allows for interactive modeling, querying, and plotting. Refer to the .ipynb notebooks under the examples directory.

Benchmarks

Please refer to the artifact at the ACM Digital Library: https://doi.org/10.1145/3453483.3454078

Guide to Source Code

Please refer to GUIDE.md for a description of the main source files in this repository.

Tests

To run the test suite as a user, first install the test dependencies:

$ python -m pip install 'sppl[tests]'

Then run the test suite:

$ python -m pytest --pyargs sppl

To run the test suite as a developer:

  • To run crash tests: $ ./check.sh
  • To run integration tests: $ ./check.sh ci
  • To run a specific test: $ ./check.sh [<pytest-opts>] /path/to/test.py
  • To run the examples: $ ./check.sh examples
  • To build a docker image: $ ./check.sh docker
  • To generate a coverage report: $ ./check.sh coverage

To view the coverage report, open htmlcov/index.html in the browser.

Language Reference

Coming Soon!

Citation

To cite this work, please use the following BibTeX.

@inproceedings{saad2021sppl,
title           = {{SPPL:} Probabilistic Programming with Fast Exact Symbolic Inference},
author          = {Saad, Feras A. and Rinard, Martin C. and Mansinghka, Vikash K.},
booktitle       = {PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation},
pages           = {804--819},
year            = 2021,
location        = {Virtual, Canada},
publisher       = {ACM},
address         = {New York, NY, USA},
doi             = {10.1145/3453483.3454078},
keywords        = {probabilistic programming, symbolic execution, static analysis},
}

License

Apache 2.0; see LICENSE.txt

Acknowledgments

The logo was designed by McCoy R. Becker.

Comments
  • Memory error for big discrete Bayesian networks

    opened by shiruizhao 7
  • Implement an environment for SPN

    Consider the following SPML program:

    X ~ Normal(0, 1);
    if X > 0:
        Z := X**3
    else:
        Z := 1 + X
    
    probability(Z > 0);
    

    Here, each sub-SPN in the overall mixture induced by if needs to be able to map Z differently.

    opened by fsaad 4
  • Fix implementation of Product.logprob_disjoint_union

    https://github.com/probcomp/sum-product-dsl/blob/ba6e29362dec68422924cc23151cad2bce263eea/src/spn.py#L316

    Faulty reasoning in the disjoint-union algorithm for probabilities.

    Key idea:

        Pr[A or B]                  // where A and B are in DNF
          = Pr[A or (B and ~A)]     // disjoint union property
          = Pr[A] + Pr[B and ~A]    // since the events are disjoint

    Since A and B are in DNF, write them as

        A = ϕ(A; X) and ϕ(A; Y)
        B = ϕ(B; X) and ϕ(B; Y)

    To assess B and ~A, we have:

        [ϕ(B; X) and ϕ(B; Y)] and ~[ϕ(A; X) and ϕ(A; Y)]
          = [ϕ(B; X) and ϕ(B; Y)] and [~ϕ(A; X) or ~ϕ(A; Y)]
          = [ϕ(B; X) and ϕ(B; Y) and ~ϕ(A; X)] or [ϕ(B; X) and ϕ(B; Y) and ~ϕ(A; Y)]
          = [ϕ(B; X) and ~ϕ(A; X) and ϕ(B; Y)] or [ϕ(B; X) and ϕ(B; Y) and ~ϕ(A; Y)]

    The issue is that the clauses in this DNF expression are not necessarily disjoint.

    Example:

        ϕ(A; X) = X > 0
        ϕ(A; Y) = Y < 1
        ϕ(B; X) = X < 1
        ϕ(B; Y) = Y < 3

    Then the final expression is:

        [(X < 1) and ~(X > 0) and (Y < 3)] or [(X < 1) and (Y < 3) and ~(Y < 1)]
          = [(X < 1) and (X ≤ 0) and (Y < 3)] or [(X < 1) and (Y < 3) and (Y ≥ 1)]
          = [(X ≤ 0) and (Y < 3)] or [(X < 1) and (1 ≤ Y < 3)]

    which is not disjoint (both clauses hold whenever X ≤ 0 and 1 ≤ Y < 3), and reduces to inclusion-exclusion.
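
    The overlap in this example can be checked numerically. The following is a minimal sketch in plain Python, not SPPL code; the function names are hypothetical. It evaluates the two clauses on a coarse grid, and it also tries one possible repair (an assumption, not necessarily the fix adopted in the repository): conjoining the second clause with ϕ(A; X) so that it cannot hold together with the first.

```python
from itertools import product

# Literals from the example, written as plain Python predicates.
phi_A_X = lambda x: x > 0
phi_A_Y = lambda y: y < 1
phi_B_X = lambda x: x < 1
phi_B_Y = lambda y: y < 3

def clause_1(x, y):
    # phi(B; X) and ~phi(A; X) and phi(B; Y)
    return phi_B_X(x) and not phi_A_X(x) and phi_B_Y(y)

def clause_2(x, y):
    # phi(B; X) and phi(B; Y) and ~phi(A; Y)
    return phi_B_X(x) and phi_B_Y(y) and not phi_A_Y(y)

def clause_2_disjoint(x, y):
    # Assumed repair: additionally require phi(A; X), so this clause
    # cannot hold together with clause_1 (which requires ~phi(A; X)).
    return phi_A_X(x) and clause_2(x, y)

def b_and_not_a(x, y):
    # The target event, evaluated directly from its definition.
    return (phi_B_X(x) and phi_B_Y(y)) and not (phi_A_X(x) and phi_A_Y(y))

# Coarse grid of test points: -4.0, -3.5, ..., 4.0 on each axis.
grid = [i / 2 for i in range(-8, 9)]
points = list(product(grid, grid))

# Points where both original clauses hold at once.
overlap = [(x, y) for x, y in points if clause_1(x, y) and clause_2(x, y)]

# Points where clause_1 and the repaired second clause hold at once.
overlap_fixed = [
    (x, y) for x, y in points if clause_1(x, y) and clause_2_disjoint(x, y)
]

# The repaired pair should still describe exactly the event (B and ~A).
same_event = all(
    b_and_not_a(x, y) == (clause_1(x, y) or clause_2_disjoint(x, y))
    for x, y in points
)

print(len(overlap), len(overlap_fixed), same_event)
```

    A grid check like this only gives evidence on the sampled points; it is not a proof for all real-valued X and Y.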

    opened by fsaad 4
  • Fix __str__ for EventFiniteNominal to quote the actual nominal values

    X << ['a', 'b']
    

    should render as

    X << {'a', 'b'}
    

    so as to distinguish between symbolic entities (X) and string entities ('a', 'b') in the language.

    Similarly, for negation we have

    ~(X << ['a', 'b'])
    

    should render as

    X << UniversalSet() \ {'a', 'b'}
    
    opened by fsaad 3
  • Strange scale parameters to Normal in the fairness-income-model

    The example fairness-income-model.ipynb has strange scale parameters of the two Normal distributions: age and capital_gain. E.g. the first age variable is defined as

    age ~= norm(loc=38.4208, scale=184.9151) 
    

    but this might yield very weird values for an age. Here are some random examples:

    In [.]: scipy.stats.norm.rvs(loc=38.4208, scale=184.9151,size=10)                                          
    Out[.]: 
    array([ 456.25033811,   83.38018331,  160.93861454,  129.31420506,
            -81.63928612, -249.89561325, -168.10803064,  103.88993016,
             83.14042404,   14.17379334])
    In [.]: np.max(scipy.stats.norm.rvs(loc=38.4208, scale=184.9151,size=10000))                               
    Out[.]: 697.0747156159017
    
    

    Similarly, the first capital_gain is defined as

    capital_gain ~= norm(loc=568.4105, scale=sqrt(24248365.5428))
    

    which might yield weird values:

    In [.]: scipy.stats.norm.rvs(loc=568.4105, scale=(24248365.5428),size=10)                                  
    Out[.]: 
    array([  4363679.84553238,  27916806.96928081,  24912266.33616957,
            49814111.74657502, -10570052.00261526, -17910371.43687703,
           -42457176.11145068,  21447098.3397193 , -13593136.75306042,
           -10651368.29633577])
    

    The "decision model" part tests for capital_gain values that are much smaller, in the range of 4500..8300.

    I tested with the sqrt of the scale values, and then it makes a little more sense (the model constrains the age to be > 18, so negative values can be ignored).

    # age
    In [.]: scipy.stats.norm.rvs(loc=38.4208, scale=math.sqrt(184.9151),size=10)                               
    Out[.]: 
    array([43.60703122, 30.07099217, 48.63010393, 47.4886019 , 36.10868353,
           34.54669056, -7.37485058, 35.89182812, 31.11277021, 25.77872902])
    
    # capital_gain
    In [.]: scipy.stats.norm.rvs(loc=568.4105, scale=math.sqrt(24248365.5428),size=10)                         
    Out[.]: 
    array([  4058.16881999,   1920.27026625,  -1806.45867856,  -2287.77208551,
            -2710.24830138, -10841.8993238 ,   3112.10957644,  -4933.77496474,
             3259.75181009,  -7738.50462773])
    

    However, after taking the sqrt of all the scale values in the model, the result is quite different from the original encoding. Instead of

    female_prior: 0.33066502427854466
    female_given_no_hire: 0.33412533774074804
    p_female_given_no hire / p_female: 1.046470962495416
    

    the adjusted model gives the following:

    female_prior: 0.33076164956248716
    female_given_no_hire: 0.21962121326454123
    p_female_given_no hire / p_female:  -33.60136716120391
    

    However, I'm not sure if this is the proper way of fixing the model. Or perhaps I have misunderstood something here...
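
    For reference, the scale argument of scipy.stats.norm is the standard deviation, not the variance. The following is a minimal sketch that uses the standard library's statistics.NormalDist as a stand-in for scipy.stats.norm (its sigma argument is likewise a standard deviation). It compares how much probability mass falls on negative ages under the two readings of 184.9151. Whether the notebook's values were meant as variances is an assumption here, not something the notebook confirms.

```python
from math import sqrt
from statistics import NormalDist

loc = 38.4208
scale_value = 184.9151  # value passed as `scale` for the first age component

# Reading 1: the value is used directly as the standard deviation.
age_raw = NormalDist(mu=loc, sigma=scale_value)

# Reading 2 (assumption): the value is a variance, so take its square root.
age_sqrt = NormalDist(mu=loc, sigma=sqrt(scale_value))

# Probability mass on impossible (negative) ages under each reading.
p_negative_raw = age_raw.cdf(0)
p_negative_sqrt = age_sqrt.cdf(0)

print(f"P(age < 0) with scale=184.9151:       {p_negative_raw:.4f}")
print(f"P(age < 0) with scale=sqrt(184.9151): {p_negative_sqrt:.4f}")
```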

    opened by hakank 2
  • Implement general discrete real distribution (like nominal)

    In the syntax "x >> dict", we should check whether the keys of the dictionary are numeric or symbolic to determine which constructor to call.
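
    A minimal sketch of such a key check in plain Python. The function name classify_keys and the labels 'real' and 'nominal' are hypothetical and only illustrate the dispatch decision; they are not part of the SPPL API.

```python
from numbers import Real

def classify_keys(dist):
    """Return 'real' if every key is a real number, 'nominal' if every key is a string."""
    keys = list(dist)
    if not keys:
        raise ValueError('empty distribution')
    # bool is a subclass of int; this sketch does not treat it as numeric.
    if all(isinstance(k, Real) and not isinstance(k, bool) for k in keys):
        return 'real'
    if all(isinstance(k, str) for k in keys):
        return 'nominal'
    raise TypeError('keys must be all numeric or all strings')

print(classify_keys({1: 0.5, 2.5: 0.5}))
print(classify_keys({'a': 0.5, 'b': 0.5}))
```

    Rejecting mixed keys outright is a design choice of this sketch; a real implementation might prefer to coerce them or report which keys are inconsistent.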

    opened by fsaad 2
  • Implement compiler from the IMP language

    It should suffice to specify the SPN using Python if-else notation and to leverage the ast module.

    For example, we can define the Indian GPA problem as:

    
    nationality = choose({'India': 0.5, 'USA': 0.5})
    score = choose({'imperfect': 0.99, 'perfect': 0.01})
    
    if (nationality == 'India'):
        if (score == 'perfect'):
            gpa = Atomic(10)
        if (score == 'imperfect'):
            gpa = Uniform(0, 10)
    
    if (nationality == 'USA'):
        if (score == 'perfect'):
            gpa = Atomic(4)
        if (score == 'imperfect'):
            gpa = Uniform(0, 4)
    

    In general, any SPN will have this specific structure.
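
    As a rough illustration of how the ast module could recover this structure, the sketch below parses one arm of the program above and lists, for each assignment, the chain of equality guards that leads to it. The helper names parse_guard and collect are hypothetical, only guards of the form `name == constant` are handled, and ast.Constant assumes Python 3.8 or later. It is not the compiler that SPPL ships.

```python
import ast

source = """
if (nationality == 'India'):
    if (score == 'perfect'):
        gpa = Atomic(10)
    if (score == 'imperfect'):
        gpa = Uniform(0, 10)
"""

def parse_guard(test):
    """Return (variable, value) for a test of the form `name == constant`."""
    if (isinstance(test, ast.Compare)
            and isinstance(test.left, ast.Name)
            and len(test.ops) == 1
            and isinstance(test.ops[0], ast.Eq)
            and isinstance(test.comparators[0], ast.Constant)):
        return (test.left.id, test.comparators[0].value)
    raise NotImplementedError('only `name == constant` guards are handled')

def collect(stmts, path=()):
    """Return a list of (guards, target, distribution_name, args) tuples."""
    results = []
    for stmt in stmts:
        if isinstance(stmt, ast.If):
            guard = parse_guard(stmt.test)
            results.extend(collect(stmt.body, path + (guard,)))
            # An else branch would need the negated guard; omitted in this sketch.
        elif isinstance(stmt, ast.Assign):
            target, call = stmt.targets[0], stmt.value
            if (isinstance(target, ast.Name)
                    and isinstance(call, ast.Call)
                    and isinstance(call.func, ast.Name)):
                # Arguments are expected to be literals such as 0 or 10.
                args = tuple(ast.literal_eval(a) for a in call.args)
                results.append((path, target.id, call.func.id, args))
    return results

branches = collect(ast.parse(source).body)
for guards, target, dist, args in branches:
    print(guards, target, dist, args)
```

    Each collected tuple corresponds to one leaf of the nested if structure, which is the information a translator would need to build the mixture.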

    opened by fsaad 2
  • Make graphviz optional

    Users report that installing graphviz is a pain. We should re-activate the extras in setup.py https://github.com/probcomp/sppl/blob/e77eba5408a749fe4811bb23bf00ff7f912114ea/setup.py#L70

    opened by fsaad 1
  • Investigate sympy.Intersection error on Interval and Range with floats

    When running spn.prob(GPA <= 0.0) on Indian GPA, error is encountered at this line https://github.com/probcomp/sum-product-dsl/blob/aafed1006b8f28708290911dbf946765139dd495/src/spn.py#L961

    Note the use of 0.0 instead of 0 in the conditioning event.

    It appears to be an error from Sympy.

    >>> sympy.Intersection(sympy.Range(10,11,1), sympy.Interval(-sympy.oo, 0.0))
    Traceback (most recent call last):
      File "/home/fsaad/sum-product-dsl/.venv/lib/python3.6/site-packages/sympy/core/compatibility.py", line 410, in as_int
        raise TypeError
    TypeError
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/fsaad/sum-product-dsl/.venv/lib/python3.6/site-packages/sympy/sets/fancysets.py", line 426, in __new__
        for w in (start, stop, step)]
      File "/home/fsaad/sum-product-dsl/.venv/lib/python3.6/site-packages/sympy/sets/fancysets.py", line 426, in <listcomp>
        for w in (start, stop, step)]
      File "/home/fsaad/sum-product-dsl/.venv/lib/python3.6/site-packages/sympy/core/compatibility.py", line 416, in as_int
        raise ValueError('%s is not an integer' % (n,))
    ValueError: 1.00000000000000 is not an integer
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/fsaad/sum-product-dsl/.venv/lib/python3.6/site-packages/sympy/sets/sets.py", line 1213, in __new__
        return simplify_intersection(args)
      File "/home/fsaad/sum-product-dsl/.venv/lib/python3.6/site-packages/sympy/sets/sets.py", line 1966, in simplify_intersection
        new_set = intersection_sets(s, t)
      File "/home/fsaad/sum-product-dsl/.venv/lib/python3.6/site-packages/sympy/multipledispatch/dispatcher.py", line 198, in __call__
        return func(*args, **kwargs)
      File "/home/fsaad/sum-product-dsl/.venv/lib/python3.6/site-packages/sympy/sets/handlers/intersection.py", line 104, in intersection_sets
        return intersection_sets(a, Range(start, end + 1))
      File "/home/fsaad/sum-product-dsl/.venv/lib/python3.6/site-packages/sympy/sets/fancysets.py", line 431, in __new__
        [0, 1/10, 1/5].'''))
    ValueError: 
    Finite arguments to Range must be integers; `imageset` can define
    other cases, e.g. use `imageset(i, i/10, Range(3))` to give [0, 1/10,
    1/5].
    
    opened by fsaad 1
  • Fix condition on DiscreteReal Atomic for non-Range data type

    spn = (X << poisson(mu=5))
    w = spn.condition(X << {1,2,3, 10})
    

    The conditional distribution needs to be a mixture of atoms, instead of assuming full support on the bounds.
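
    To make the intended result concrete, here is a minimal sketch in plain Python (no SPPL or scipy calls) that computes the atom weights for this example. The helper names poisson_pmf and condition_on_finite_set are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """Probability mass of a Poisson(mu) variable at the integer k."""
    return exp(-mu) * mu**k / factorial(k)

def condition_on_finite_set(mu, values):
    """Return {atom: weight} for Poisson(mu) conditioned on X in `values`."""
    masses = {k: poisson_pmf(k, mu) for k in sorted(values)}
    total = sum(masses.values())
    if total == 0:
        raise ValueError('conditioning event has probability zero')
    # Renormalize so that the atom weights sum to one.
    return {k: m / total for k, m in masses.items()}

weights = condition_on_finite_set(5, {1, 2, 3, 10})
for atom, weight in weights.items():
    print(atom, round(weight, 4))
```

    The result places mass only on 1, 2, 3, and 10. Values between 3 and 10, such as 4, receive no mass, which is the difference from treating the event as the full range of integers from 1 to 10.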

    opened by fsaad 1
  • Remove prune_events from BranchSPN and make ProductSPN check for conjunctions first

    Rather than iterate through all 2^m subsets, first compute the probabilities of the m conjunctions and immediately return 0 if each conjunction in the union has probability zero, which removes the duplicate computation inside prune_events.
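
    The short-circuit can be sketched on a toy finite sample space, where each event set stands in for one conjunction of the union. This is only an illustration under those assumptions: the names prob and prob_union are hypothetical, and the actual ProductSPN code operates on symbolic events rather than Python sets.

```python
from itertools import combinations

def prob(event, pmf):
    """Probability of a set of outcomes under a finite pmf."""
    return sum(pmf[w] for w in event)

def prob_union(events, pmf):
    """Probability of a union of events via inclusion-exclusion, with an early exit."""
    # Step 1: compute the m single-event probabilities first.
    singles = [prob(e, pmf) for e in events]
    if all(p == 0 for p in singles):
        return 0.0  # no need to enumerate any of the 2^m subsets
    # Events with probability zero do not change the probability of the union.
    live = [e for e, p in zip(events, singles) if p > 0]
    total = 0.0
    for r in range(1, len(live) + 1):
        sign = 1 if r % 2 == 1 else -1
        for subset in combinations(live, r):
            total += sign * prob(set.intersection(*subset), pmf)
    return total

pmf = {1: 0.25, 2: 0.25, 3: 0.5, 4: 0.0}
events = [{1, 2}, {2, 3}, {4}]
print(prob_union(events, pmf))
print(prob_union([{4}], pmf))
```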

    opened by fsaad 1
  • Question/feature-request: combining model source code with pre-existing SPE-models

    Is there a good way to execute SPPL model source code while having it rely on a previously existing model? For example, assume I have an SPPL model m (of type ProductSPE) which models variables a and b. I also have this source code, source, in the form of a string:

    if a < 17:
        c ~= bernoulli(p=0.1)
    else:
        c ~= bernoulli(p=0.9)
    

    which relies on a being defined but which would otherwise compile fine if I run:

    SPPL_Compiler(source).execute_module().model
    

    Obviously, without knowing about m it won't compile because the compiler doesn't know about a. Is there a good way to tell the compiler about a model (like m) as a starting point? i.e. something like this:

    SPPL_Compiler(m, source).execute_module().model
    
    opened by Schaechtle 1
  • SPPL compiler replaces `==` and `in` with `<<` incorrectly at times

    MWE

    >>> source = """
    i = 0.5
    Y ~= bernoulli(p=0 if i == 0.5 else 1)
    """
    >>> compiler = SPPL_Compiler(source)
    >>> namespace = compiler.execute_module()
    TypeError: unsupported operand type(s) for <<: 'float' and 'set'
    

    The reason is that the generated Python is

    # MODEL DEFINITION
    command = Sequence(
        Sample(Y, bernoulli(p=(0 if (i << {0.5}) else 1))),
    )
    

    The offending line is here: https://github.com/probcomp/sppl/blob/8b0fe0c37ed15dd19936d13e0fa652c3b5237cac/src/compilers/sppl_to_python.py#L145

    opened by fsaad 0
  • Possibility of using locked semantic versioned dependencies?

    Currently sppl depends on hard-coded strict-equality versions of its dependencies.

    This makes its inclusion in ongoing projects somewhat problematic.

    We always have the choice of vendoring the code instead of installing it through pip, and ignoring restrictions on the setup.py, but I think it'd be best if sppl used a system similar to poetry or pip-tools, with loosely-restricted versions for use as a library, and strict, locked versions for experiment reproducibility in the papers, etc.

    Is this something you have considered?

    opened by IanTayler 1
  • Fix relying on `.name` attribute of `rv_discrete`

    The .name attribute of a frozen rv_discrete.dist object does not correspond to the underlying class, i.e.,

    >>> from scipy.stats import norm
    >>> d = norm(loc=0, scale=1)
    >>> d.dist.name
    'norm'
    
    >>> from scipy.stats import rv_discrete
    >>> d = rv_discrete(values=((1, 2), (.5, .5))).freeze()
    >>> d.dist.name
    'Distribution'
    

    This behavior will cause issues when serializing, since we rely on the name attribute to correspond to a scipy class: https://github.com/probcomp/sppl/blob/efff34fb3d3703247dd7001c36970069c5ac3825/src/compilers/spe_to_dict.py#L48

    Related #121

    The constructor of rv_discrete does accept a name attribute, which will need to be handled correctly when loading the scipy dist.

    opened by fsaad 0
  • Fix equality checking for instances of `rv_discrete`

    >>> from sppl.distributions import rv_discrete
    >>> from sppl.spe import DiscreteLeaf
    >>> d1 = rv_discrete(values=((1, 2), (.5, .5)))
    >>> d2 = rv_discrete(values=((1, 2), (.8, .2)))
    >>> l1 = Id("X") >> d1
    >>> l2 = Id("X") >> d2
    >>> l1 == l2
    True
    

    The problem is that scipy does not store the values under the kwds in the frozen rv_discrete object, that is

    >>> l1.dist.kwds
    {}
    >>> l2.dist.kwds
    {}
    

    and so the equality logic checking passes: https://github.com/probcomp/sppl/blob/efff34fb3d3703247dd7001c36970069c5ac3825/src/spe.py#L828-L836

    The kwds field is populated by scipy for all other distributions, such as norm, poisson, etc.

    opened by fsaad 0
  • `scipy == 1.4.1` fails to build on `osx`

    Can we bump this version to >=1.7 in the requirements? I tried a few different things here, such as installing into a fresh conda environment. For whatever reason, this old version of scipy fails to build, but the newest version is okay.

    opened by femtomc 1
Owner
MIT Probabilistic Computing Project