Python Library for learning (Structure and Parameter) and inference (Statistical and Causal) in Bayesian Networks.

Overview

pgmpy

pgmpy is a Python library for working with Probabilistic Graphical Models.

Documentation and a list of supported algorithms are available on our official site: http://pgmpy.org/
Examples on using pgmpy: https://github.com/pgmpy/pgmpy/tree/dev/examples
Basic tutorial on Probabilistic Graphical models using pgmpy: https://github.com/pgmpy/pgmpy_notebook

Our mailing list is at https://groups.google.com/forum/#!forum/pgmpy

Our community chat is on Gitter: https://gitter.im/pgmpy/pgmpy

Dependencies

pgmpy has the following required dependencies:

  • Python 3.6 or higher
  • NetworkX
  • SciPy
  • NumPy
  • PyTorch

Some functionality also requires:

  • tqdm
  • pandas
  • pyparsing
  • statsmodels
  • joblib
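
A quick way to see which of these packages are present in the current environment is sketched below. It uses only the standard library. The import names are assumptions, since a package's import name can differ from its distribution name (for example, PyTorch is normally imported as torch), and missing_modules is a hypothetical helper, not part of pgmpy.

```python
from importlib.util import find_spec

# Assumed import names for the dependencies listed above.
REQUIRED = ["networkx", "scipy", "numpy", "torch"]
OPTIONAL = ["tqdm", "pandas", "pyparsing", "statsmodels", "joblib"]


def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("missing required:", missing_modules(REQUIRED))
    print("missing optional:", missing_modules(OPTIONAL))
```

An empty list for the required group suggests that the core of the library should import cleanly; it does not check package versions.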

Installation

pgmpy is available on both PyPI and Anaconda. To install through Anaconda, use:

$ conda install -c ankurankan pgmpy

For installing through pip:

$ pip install -r requirements.txt  # only if you want to run unittests
$ pip install pgmpy

To install pgmpy from the source code:

$ git clone https://github.com/pgmpy/pgmpy 
$ cd pgmpy/
$ pip install -r requirements.txt
$ python setup.py install

If you run into any problems during installation, let us know via issues, the mailing list, or our Gitter channel.

Development

Code

Our latest codebase is available on the dev branch of the repository.

Contributing

Issues can be reported at our issues section.

Before opening a pull request, please have a look at our contributing guide.

The contributing guide contains some points that will make our lives easier when reviewing and merging your PR.

If you face any problems with your pull request, feel free to ask about them on the mailing list or Gitter.

If you want to implement any new features, please have a discussion about it on the issue tracker or the mailing list before starting to work on it.

Testing

After installation, you can launch the tests from the pgmpy source directory (you will need to have the pytest package installed):

$ pytest -v

To see the coverage of the existing code, use the following command:

$ pytest --cov-report html --cov=pgmpy

Documentation and usage

The documentation is hosted at: http://pgmpy.org/

We use sphinx to build the documentation. To build the documentation on your local system use:

$ cd /path/to/pgmpy/docs
$ make html

The generated docs will be in _build/html

Examples

We have a few example Jupyter notebooks here: https://github.com/pgmpy/pgmpy/tree/dev/examples
For more detailed Jupyter notebooks and basic tutorials on graphical models, see: https://github.com/pgmpy/pgmpy_notebook/

Citing

Please use the following BibTeX entry to cite pgmpy in your research:

@inproceedings{ankan2015pgmpy,
  title={pgmpy: Probabilistic graphical models using python},
  author={Ankan, Ankur and Panda, Abinash},
  booktitle={Proceedings of the 14th Python in Science Conference (SCIPY 2015)},
  year={2015},
  organization={Citeseer}
}

License

pgmpy is released under the MIT License; see the license file in the repository for the full text.

Issues
  • Adds base class for continuous node representation

    This PR deals with the basic continuous node representation feature. It will comprise a base class for Continuous node representation along with various methods to discretize the continuous variables into discrete factors. This involves the first three weeks of my GSoC project.
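
    One common way to discretize a continuous variable is to split its range into bins and assign each bin the probability mass that falls inside it. The following is a minimal standard-library sketch of that idea for a normal distribution. The names normal_cdf and discretize are hypothetical and are not the methods introduced by this PR.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """Cumulative distribution function of a univariate normal."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def discretize(cdf, low, high, n_bins):
    """Split [low, high] into equal-width bins and return the probability
    mass of each bin, renormalised so that the masses sum to 1."""
    width = (high - low) / n_bins
    edges = [low + i * width for i in range(n_bins + 1)]
    masses = [cdf(b) - cdf(a) for a, b in zip(edges, edges[1:])]
    total = sum(masses)
    return [m / total for m in masses]


# Six bins over [-3, 3] for a standard normal variable.
probs = discretize(normal_cdf, low=-3.0, high=3.0, n_bins=6)
```

    The resulting list can be used as the values of a discrete factor with one state per bin. The renormalisation step discards the small tail mass outside the chosen range.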

    opened by yashu-seth 102
  • Hamiltonian Monte Carlo

    This PR implements HMC with dual averaging. The implementation is still open for discussion; if you find anything ambiguous, please comment on the relevant line.
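
    For readers new to the method, here is a minimal sketch of plain HMC for a one-dimensional target, using the leapfrog integrator and a Metropolis correction. It uses a fixed step size, so the dual averaging step-size adaptation is left out, and the function names are hypothetical rather than taken from this PR.

```python
import math
import random


def leapfrog(position, momentum, grad_log_pdf, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    momentum = momentum + 0.5 * step_size * grad_log_pdf(position)
    for step in range(n_steps):
        position = position + step_size * momentum
        # Full momentum step, except after the last position update,
        # where only the closing half step is due.
        scale = 1.0 if step < n_steps - 1 else 0.5
        momentum = momentum + scale * step_size * grad_log_pdf(position)
    return position, momentum


def hmc_sample(log_pdf, grad_log_pdf, start, n_samples,
               step_size=0.2, n_steps=10, seed=0):
    """Draw samples with fixed-step-size HMC (no dual averaging)."""
    rng = random.Random(seed)
    samples, position = [], start
    for _ in range(n_samples):
        momentum = rng.gauss(0.0, 1.0)
        new_pos, new_mom = leapfrog(position, momentum, grad_log_pdf,
                                    step_size, n_steps)
        # Accept or reject based on the change in total energy.
        log_accept = ((log_pdf(new_pos) - 0.5 * new_mom ** 2)
                      - (log_pdf(position) - 0.5 * momentum ** 2))
        if rng.random() < math.exp(min(0.0, log_accept)):
            position = new_pos
        samples.append(position)
    return samples


# Standard normal target: log density -x^2/2 (up to a constant), gradient -x.
samples = hmc_sample(lambda x: -0.5 * x * x, lambda x: -x,
                     start=0.0, n_samples=2000)
```

    Dual averaging would replace the fixed step_size with one tuned during a warm-up phase to reach a target acceptance rate.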

    opened by khalibartan 74
  • ContinousFactor and Joint Gaussian Representation

    This PR deals with:

    • the creation of a base class, ContinuousFactor, for multivariate representations.
    • the creation of the class JointGaussianDistribution, a model to represent Gaussian random variables.

    opened by yashu-seth 55
  • Added BIF.py into readwrite

    Not all functions are implemented yet; only get_variable, get_states and get_property are.

    I am creating this pull request for easy review. I have tested these methods on munin2.bif and dog-problem.bif, and they work fine. The functions follow BIF v0.15 as given here. Without importing pgmpy and numpy, the average run time was 0.06 s, including printing the complete variable_states for munin2.bif (1003 nodes). See issue #506.

    opened by khalibartan 45
  • updates in check_model method

    @ankurankan I have removed cardinalities from the model attributes but it seems it is being used at other places as well. Should cardinalities be computed every single time it is required? Is there a particular problem in having it as an attribute?

    opened by yashu-seth 32
  • Improving Variable Elimination (VE)

    Here, we finish the implementation of VE by adding a few missing steps: computing good elimination orderings (with 4 heuristics: min neighbors, min fill, min weight, and weighted min fill) and safely removing irrelevant variables from the model (barren nodes and nodes made independent by the evidence). To make these improvements, a few new methods were necessary, along with modifications to other members. In our preliminary experimental results, queries that took up to 30 minutes using VE now take less than 2 minutes. Please help us test this new code for a robust implementation, and suggest better ways of coding the algorithms. Thanks.
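
    To illustrate two of the heuristics named above, the following is a minimal sketch of a greedy elimination ordering on an undirected graph stored as a dict of neighbour sets. The function names and the example graph are hypothetical, and this is not the code added by this PR.

```python
from itertools import combinations


def min_neighbors(graph, node):
    """Cost for the min-neighbors heuristic: the node's current degree."""
    return len(graph[node])


def min_fill(graph, node):
    """Cost for the min-fill heuristic: edges that would have to be added
    to connect all neighbours of `node` to each other."""
    return sum(1 for u, v in combinations(sorted(graph[node]), 2)
               if v not in graph[u])


def elimination_order(graph, cost):
    """Greedy ordering: repeatedly eliminate the node with the lowest cost.
    Ties are broken by sorted node name."""
    graph = {node: set(nbrs) for node, nbrs in graph.items()}  # work on a copy
    order = []
    while graph:
        node = min(sorted(graph), key=lambda n: cost(graph, n))
        neighbours = graph.pop(node)
        # Eliminating a node connects its remaining neighbours to each other.
        for u in neighbours:
            graph[u].discard(node)
            graph[u].update(neighbours - {u})
        order.append(node)
    return order


# A 4-cycle A-B-D-C-A with a pendant node E attached to D.
example = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}
order = elimination_order(example, min_fill)
```

    The min weight and weighted min fill heuristics fit the same cost interface if the cost function is also given the variable cardinalities.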

    opened by jhonatanoliveira 27
  • Replaced recursive call with `while` loop.

    • Replaces the recursive version of fun with an iterative one, which is much easier to read in my opinion. It should also be slightly faster, because recursive calls are expensive in Python (each call creates a new stack frame), and Python has a default limit of 1000 on recursion depth (though hitting it is highly unlikely in our case).
    • I am not sure why the library uses Python 2 style super() calls, given that the dependencies include Python 3.3. In Python 3, thanks to the cell variable __class__, we can simply use super().method_name(...) (PEP 3135 -- New Super).
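
    Both points can be illustrated with a small standalone sketch. The ancestor search and the two classes below are hypothetical examples chosen for illustration; they are not the function or the classes touched by this PR.

```python
def ancestors_recursive(parents, node):
    """Collect all ancestors of `node`; each call adds a stack frame."""
    found = set()
    for parent in parents.get(node, ()):
        found.add(parent)
        found |= ancestors_recursive(parents, parent)
    return found


def ancestors_iterative(parents, node):
    """Same result with an explicit stack and a `while` loop, so the depth
    is not bounded by the interpreter's recursion limit."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


class Base:
    def describe(self):
        return "base"


class Child(Base):
    def describe(self):
        # Python 3 zero-argument form, equivalent to
        # super(Child, self).describe() in the Python 2 style.
        return "child of " + super().describe()
```

    The iterative version also skips nodes it has already visited, which avoids repeated work when several paths lead to the same ancestor.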
    opened by ashwch 25
  • Added model.predict_probability

    Added a new method that gives the probabilities of missing variables given prediction data (#794).

            B_0         B_1
        80  0.439178    0.560822
        81  0.581970    0.418030
        82  0.488275    0.511725
        83  0.581970    0.418030
        84  0.510794    0.489206
        85  0.439178    0.560822
        86  0.439178    0.560822
        87  0.417124    0.582876
        88  0.407978    0.592022
        89  0.429905    0.570095
        90  0.581970    0.418030
        91  0.407978    0.592022
        92  0.429905    0.570095
        93  0.429905    0.570095
        94  0.439178    0.560822
        95  0.407978    0.592022
        96  0.559904    0.440096
        97  0.417124    0.582876
        98  0.488275    0.511725
        99  0.407978    0.592022
    

    Also added a new error test for predict that increases the coverage.

    opened by raghavg7796 24
  • Strange Behavior HillClimbSearch

    Subject of the issue

    I want to reproduce the example here

    Your environment

    • pgmpy version: 0.1.12
    • Python version: 3.6.9
    • Operating System: Ubuntu 18.04.5 LTS

    Steps to reproduce

    import pandas as pd
    import numpy as np
    from pgmpy.estimators import HillClimbSearch, BicScore
    data = pd.DataFrame(np.random.randint(0, 5, size=(5000, 9)), columns=list('ABCDEFGHI'))
    data['J'] = data['A'] * data['B']
    est = HillClimbSearch(data, scoring_method=BicScore(data))
    best_model = est.estimate()
    best_model.edges()
    

    Expected behaviour

    [('B', 'J'), ('A', 'J')]

    Actual behaviour

    [('A', 'B'), ('J', 'A'), ('J', 'B')]

    opened by ivanDonadello 23
  • Hamiltonian Monte Carlo & Hamiltonian Monte Carlo with dual averaging

    @ankurankan I have sent this PR to aid our discussion. I was experimenting with how to handle gradients (removing the gradient argument). I tested two approaches:

    • First, I tried handling the grad_log_pdf argument in place, depending on how the user passed it: if None was passed, I created a lambda function to call model.get_gradient_log_pdf; otherwise I created a lambda function to use the custom class. But this got messy, because I had to handle the parameter in two places: once in the sampling class and once in the BaseSimulateHamiltonianDynamics class.
    • Second (this PR implements it), handle everything in model.get_gradient_log_pdf. This code is less messy, because every call goes to model.get_gradient_log_pdf and the method handles the rest internally, so there is no need to make matching changes in different places.

    How do you suggest I handle the gradients? You can look at the last commit to see specifically the changes I made: https://github.com/pgmpy/pgmpy/pull/702/commits/748eb1fe13488bb8f0cf27a7064a67384ec3315e

    After the discussion I'll close one of the PRs.

    opened by khalibartan 21
  • Efficient factor product

    A factor product following "Probabilistic Graphical Models" (Koller 09) on page 359, Algorithm 10.A.1 - Efficient implementation of a factor product operation. Koller's algorithm was modified to fit the configuration used in pgmpy. For example, in pgmpy the configurations of Factor are supposed to be like (0,0,0) (0,0,1) (0,1,0) (1,0,0) and so on, instead of (0,0,0) (1,0,0) (0,1,0) (0,0,1) as expected for Koller's algorithm.
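
    As a reference point for what a factor product computes, here is a naive standard-library sketch that stores values as flat lists in which the last variable changes fastest. It is not Koller's Algorithm 10.A.1 and not the code in this PR, and naive_factor_product is a hypothetical name.

```python
from itertools import product


def naive_factor_product(vars1, card1, vals1, vars2, card2, vals2):
    """Multiply two discrete factors whose values are flat sequences in
    row-major order (the last variable changes fastest)."""
    cards = dict(zip(vars1, card1))
    cards.update(zip(vars2, card2))
    out_vars = list(vars1) + [v for v in vars2 if v not in vars1]

    def flat_index(variables, cardinalities, assignment):
        # Convert a full assignment into an index into one factor's values.
        index = 0
        for var, card in zip(variables, cardinalities):
            index = index * card + assignment[var]
        return index

    out_vals = []
    # itertools.product also varies the last variable fastest.
    for states in product(*(range(cards[v]) for v in out_vars)):
        assignment = dict(zip(out_vars, states))
        out_vals.append(vals1[flat_index(vars1, card1, assignment)]
                        * vals2[flat_index(vars2, card2, assignment)])
    return out_vars, [cards[v] for v in out_vars], out_vals


# Two factors that share the variable x2.
out_vars, out_card, out_vals = naive_factor_product(
    ["x1", "x2"], [2, 2], [0, 1, 2, 3],
    ["x2", "x3"], [2, 2], [0, 1, 2, 3],
)
```

    This version recomputes both indices for every assignment. A stride-based approach avoids that by updating the indices incrementally as the assignment advances.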

    Koller's implementation is around 98% faster than the current one in pgmpy. The benchmark used a simple Python script:

    from pgmpy.factors import Factor
    from pgmpy.factors import factor_product
    from time import time
    
    phi = Factor(['x1', 'x2'], [2, 2], range(4))
    phi1 = Factor(['x3', 'x4'], [2, 2], range(4))
    t0 = time()
    prod = factor_product(phi, phi1)
    t1 = time()
    print(t1-t0)
    

    After running each implementation 6 times, here are the results:

    [Image: Comparison]

    Unfortunately, we don't know how to use JobLib. But we leave this TODO with the hope that using parallel computation can improve this implementation even further.

    opened by jhonatanoliveira 21
  • failed to load model

    #failed to load model
    model2.save('D:\xxky\model2.bif', filetype='bif')
    model3 = BayesianNetwork.load('D:\xxky\model2.bif', filetype='bif')

    'save' works, but 'load' raises an error:

    File "D:\Anaconda\lib\site-packages\pyparsing\results.py", line 193, in __getitem__
        return self._toklist[i]
    IndexError: list index out of range

    Your environment

    • pgmpy version: 0.1.17
    • Python version: 3.8
    • Operating System: Windows

    opened by JonathanOase 6
  • SIGKILL generated during VariableElimination query

    Subject of the issue

    Hi,

    I have a model with 250 binary-valued nodes. Most nodes are linked by deterministic functions, and I am trying to run a query on 8 nodes to find their marginal PMF. When I run the query, after some time the program is terminated with SIGKILL. Why is that? According to the system monitor, the program's memory usage tops out at 6 GB.

    I should add: this issue only happens when I add evidence to certain nodes. If I remove the evidence, the solver finds a solution in less than a second, so I am really unsure what is going on and how to find more information about the cause of the early termination.
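
    As general background on memory use in variable elimination, and not a diagnosis of this report: a dense table over k binary variables has 2^k entries, so intermediate factors can grow very quickly. The sketch below does that arithmetic, assuming 8-byte floating-point values. factor_bytes is a hypothetical helper.

```python
def factor_bytes(cardinalities, bytes_per_value=8):
    """Memory needed for a dense table over variables with these cardinalities."""
    entries = 1
    for card in cardinalities:
        entries *= card
    return entries * bytes_per_value


# A dense factor over 30 binary variables, assuming 8-byte values.
size_gib = factor_bytes([2] * 30) / 2 ** 30
```

    Here size_gib evaluates to 8.0, so a single intermediate factor over 30 binary variables would already need 8 GiB.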

    Your environment

    • pgmpy version = 0.1.19
    • Python version = python 3.8.10
    • Operating System Ubuntu 20.04 LTS

    Steps to reproduce

    Can send code directly upon request - would not like to post it publicly

    Expected behaviour

    Successful inference

    Actual behaviour

    SIGKILL

    opened by rishubn 3
  • Batch query for inference methods

    Your checklist for this pull request

    Please review the guidelines for contributing to this repository.

    • [ ] Make sure you are requesting to pull a topic/feature/bugfix branch (right side). Don't request your master!
    • [ ] Make sure you are making a pull request against the dev branch (left side). Also you should start your branch off our dev.
    • [ ] Check the commit's or even all commits' message styles matches our requested structure.

    Issue number(s) that this pull request fixes

    • Fixes #

    List of changes to the codebase in this pull request

    opened by ankurankan 1
  • Bug in EM when latent variables are present

    Subject of the issue

    The EM algorithm seems to assign equal probability values to all the states of latent variables.

    Your environment

    • pgmpy version: 0.1.18
    • Python version
    • Operating System

    Steps to reproduce

    from pgmpy.utils import get_example_model
    from pgmpy.models import BayesianNetwork
    from pgmpy.estimators import ExpectationMaximization as EM
    
    model = get_example_model('cancer')
    model_latent = BayesianNetwork(model.edges(), latents={'Smoker'})
    model_latent.add_cpds(*model.cpds)
    samples = model_latent.simulate(int(1e4))
    est = EM(model_latent, samples)
    cpds = est.get_parameters(seed=42, show_progress=False)
    print(cpds[2])
    +-----------+----------+
    | Smoker(0) | 0.500001 |
    +-----------+----------+
    | Smoker(1) | 0.499999 |
    +-----------+----------+
    
    
    opened by ankurankan 1
  • BN of multi-sensor

    I have multi-sensor data with 8 channels. I partition it into 1000 samples, each with dimensions 1024*8. I use the score-based method to learn the structure, and I want to get 1000 BNs, each with 8 nodes. But the method does not seem to learn any structure: the adjacency matrices are all empty.
    The following is my code:

        data_graph = pd.DataFrame(x.T, columns=['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8'])
        hc = HillClimbSearch(data_graph)
        adj_tmp = hc.estimate(scoring_method="bicscore")
    

    x: 8*1024, original time-domain data
    adj_tmp: DAG with no edges

    opened by XMAHA 1
  • structure_score gradually taking longer and longer to execute

    Hello,

    Thanks for the wonderful library. I am creating a number of random Bayesian networks and calculating the BIC for each network using the structure_score method. Specifically, I run 500 loops, where each loop creates 100 networks, and I score each of the networks. I noticed that my script starts off fast (it estimates 1 hour to complete) but then gradually gets slower and slower (eventually it estimates 50 hours to complete). Profiling my code, the bottleneck appears to be the state_counts method in StructureScore.

    environment

    • pgmpy 0.1.17
    • Python 3.8.12
    • MacOS Catalina 10.15.7

    opened by Allen8838 0