PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning

Warning: This is a rapidly evolving research prototype.

PClean was created at the MIT Probabilistic Computing Project.

If you use PClean in your research, please cite our 2021 AISTATS paper:

PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. Lew, A. K.; Agrawal, M.; Sontag, D.; and Mansinghka, V. K. (2021, March). In International Conference on Artificial Intelligence and Statistics (pp. 1927-1935). PMLR. (pdf)

Using PClean

To use PClean, create a Julia file with the following structure:

using PClean
using DataFrames: DataFrame
import CSV

# Load data
data = CSV.File(filepath) |> DataFrame

# Define PClean model
PClean.@model MyModel begin
    @class ClassName1 begin
        ...
    end

    ...
    
    @class ClassNameN begin
        ...
    end
end

# Align column names of CSV with variables in the model.
# Format is ColumnName CleanVariable DirtyVariable, or, if
# there is no corruption for a certain variable, one can omit
# the DirtyVariable.
query = @query MyModel.ClassNameN [
  HospitalName hosp.name             observed_hosp_name
  Condition    metric.condition.desc observed_condition
  ...
]

# Configure observed dataset
observations = [ObservedDataset(query, data)]

# Configuration
config = PClean.InferenceConfig(1, 2; use_mh_instead_of_pg=true)

# SMC initialization
state = initialize_trace(observations, config)

# Rejuvenation sweeps
run_inference!(state, config)

# Evaluate accuracy, if ground truth is available
ground_truth = CSV.File(filepath) |> DataFrame
results = evaluate_accuracy(data, ground_truth, state, query)

# Can print results.f1, results.precision, results.accuracy, etc.
println(results)

# Even without ground truth, can save the entire latent database to CSV files:
PClean.save_results(dir, dataset_name, state, observations)

Then, from this directory, run the Julia file.

JULIA_PROJECT=. julia my_file.jl
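
If the project's dependencies have not been installed yet, instantiate the environment first (standard Julia package management, nothing PClean-specific):

JULIA_PROJECT=. julia -e 'using Pkg; Pkg.instantiate()'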

To learn to write a PClean model, see our paper, but note the surface syntax changes described below.

Differences from the paper

As a DSL embedded into Julia, our implementation of the PClean language has some differences, in terms of surface syntax, from the stand-alone syntax presented in our paper:

(1) Instead of latent class C ... end, we write @class C begin ... end.

(2) Instead of subproblem begin ... end, inference hints are given using ordinary Julia begin ... end blocks.

(3) Instead of parameter x ~ d(...), we use @learned x :: D{...}. The set of distributions D for parameters is somewhat restricted.

(4) Instead of x ~ d(...) preferring E, we write x ~ d(..., E).

(5) Instead of observe x as y, ... from C, write @query ModelName.C [x y; ...]. Clauses of the form x z y are also allowed, and tell PClean that the model variable C.z represents a clean version of x, whose observed (dirty) version is modeled as C.y. This is used when automatically reconstructing a clean, flat dataset.

The names of built-in distributions may also be different, e.g. AddTypos instead of typos, and ProportionsParameter instead of dirichlet.
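
To make the correspondence concrete, here is a minimal sketch exercising several of the differences above. The class, variable, and column names are made up for illustration, and only distributions already mentioned in this document are used:

PClean.@model CityModel begin
    @class City begin
        # Latent clean city names; known_city_names is an assumed candidate list.
        name ~ StringPrior(2, 30, known_city_names)
    end

    @class Obs begin
        begin  # an ordinary begin...end block serves as a subproblem hint (difference 2)
            city ~ City
            observed_city ~ AddTypos(city.name)  # observed, possibly-corrupted copy
        end
    end
end

# An `x z y` clause (difference 5): CityColumn's clean version is city.name,
# and its observed dirty version is modeled by observed_city.
query = @query CityModel.Obs [
    CityColumn city.name observed_city
]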

Comments
  • How to handle when a column is an array of struct

    Hello,

    I wonder if anyone has insights on handling a table column that is an array of structs, which is very common in real-world databases. For example, consider a person's profile with the following schema:

      {
        "type": "array",
        "items": [
          {
            "type": "record",
            "name": "schools",
            "fields": [
              {
                "name": "school_name",
                "type": [
                  "string"
                ]
              },
              {
                "name": "degree",
                "type": [
                  "string"
                ]
              }
            ]
          }
        ]
      }
    

    one example is:

        Name | schools
        A    | [[MIT, Phd], [MIT, master], [MIT, bachrlo]]
        B    | [[Boston U, master], [Boston U, bachelor]]
        C    | [[CMU, bachelor]]
    

    The most naive way I can think of is to break each school array into one element per row, like the following:

        Name | schools
        A    | [MIT, Phd]
        A    | [MIT, master]
        A    | [MIT, bachrlo]
    
    

    but I think this method would lose the relational information for one person. For example, if I know A completed a master's at MIT, then the earlier MIT degree is probably "bachelor" rather than the literal string with a typo ("bachrlo").

    So my questions are:

    1. How can we properly handle a column that is an array of a data structure?
    2. If the columns have a hierarchical structure, I just want to confirm whether I did this correctly: should I simply build the hierarchy into the PClean program (like a Bayesian network) and then align the hierarchical column names? (A tentative sketch for question 1 follows below.)
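
    One direction I was considering for question 1 (a sketch only; the class and column names are made up) is to keep the row-per-degree flattening but recover the cross-row relational information through a shared latent Person class, since rows with the same name can reuse the same latent Person entity:

    PClean.@model ProfileModel begin
        @class School begin
            name ~ StringPrior(1, 60, school_names)    # school_names: assumed candidate list
        end

        @class Person begin
            name ~ StringPrior(1, 60, person_names)    # person_names: assumed candidate list
        end

        @class DegreeRow begin
            # One row per (person, degree) pair after flattening the array column.
            person ~ Person
            school ~ School
            degree ~ StringPrior(3, 10, degree_names)  # e.g. ["bachelor", "master", "PhD"]
            observed_name ~ AddTypos(person.name)
            observed_degree ~ AddTypos(degree)
        end
    end

    but I am not sure whether this is the intended way to express it.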

    Thank you!

    opened by sufengniu 3
  • Runtime error: LoadError: MethodError: no method matching logdensity(::AddTypos, ::String3, ::String3)

    Hello, I am new to both PClean and Julia. When I tried to run the PClean experiments (julia --project=. experiments/hospital/run.jl), I got the following error:

    ERROR: LoadError: MethodError: no method matching logdensity(::AddTypos, ::String3, ::String3)
    Closest candidates are:
      logdensity(::FormatName, ::Any, ::Any) at ~/lab/PClean/src/distributions/format_name.jl:33
      logdensity(::FormatName, ::Any, ::Any, ::Any, ::Any) at ~/lab/PClean/src/distributions/format_name.jl:14
      logdensity(::ExpandOnShortVersion, ::Any, ::Any, ::Any) at ~/lab/PClean/src/distributions/expand_on_short_version.jl:30
      ...
    Stacktrace:
     [1] var"##663"(state#321::PClean.ProposalRowState)
       @ PClean ~/lab/PClean/src/inference/proposal_compiler.jl:101
     [2] #invokelatest#2
       @ ./essentials.jl:716 [inlined]
     [3] invokelatest
       @ ./essentials.jl:714 [inlined]
     [4] make_block_proposal!(state::PClean.ProposalRowState, block_index::Int64, config::PClean.InferenceConfig)
       @ PClean ~/lab/PClean/src/inference/block_proposal.jl:175
     [5] extend_particle!(particle::PClean.SMCParticle, config::PClean.InferenceConfig)
       @ PClean ~/lab/PClean/src/inference/row_inference.jl:70
     [6] run_smc!(trace::PClean.PCleanTrace, class::Symbol, key::Int64, config::PClean.InferenceConfig)
       @ PClean ~/lab/PClean/src/inference/row_inference.jl:146
     [7] initialize_trace(observations::Vector{ObservedDataset}, config::PClean.InferenceConfig)
       @ PClean ~/lab/PClean/src/inference/inference.jl:37
     [8] macro expansion
       @ ~/lab/PClean/experiments/hospital/run.jl:79 [inlined]
     [9] top-level scope
       @ ./timing.jl:220
    in expression starting at /Users/sniu/lab/PClean/experiments/hospital/run.jl:78
    

    My Julia version is 1.7.2. I searched online and found one similar issue (https://discourse.julialang.org/t/error-loaderror-methoderror-no-method-matching-setindex-shape-check-int64-int64-int64/18249) that mentions newer Julia versions needing broadcasting, but from the error messages I haven't figured out where to adjust this, or perhaps I am going in the wrong direction. Any insights are appreciated. Thank you!
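
    For what it's worth, one direction I considered but have not verified (assuming the missing logdensity method is only defined for plain String): newer CSV.jl versions return fixed-width InlineStrings types such as String3 for short string columns by default, and CSV.jl can be asked to materialize plain String columns instead:

    # Unverified workaround sketch: request plain String columns rather than String3/String7.
    data = CSV.File(filepath; stringtype=String) |> DataFrame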

    Sufeng

    opened by sufengniu 2
  • Consider changing default subproblem blocking

    In the paper, it is implied that by default, all attributes and reference slots of a class belong to the same subproblem. However, the current implementation uses the opposite convention: that by default, each attribute or reference slot is in its own subproblem, requiring manual blocking in order to define bigger subproblems. We should consider changing this to match the paper.
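
    For reference, under the current convention an explicit block is what groups statements into a single subproblem (hypothetical attributes shown):

    @class Record begin
        begin  # explicit grouping: both statements share one subproblem
            city ~ StringPrior(2, 30, city_names)
            observed_city ~ AddTypos(city)
        end
    end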

    opened by alex-lew 1
  • Treat contiguous statements as belonging to the same subproblem by default (resolves #2)

    Resolves #2 by changing the default subproblem-building strategy.

    The result is that users can be 'more unthinking' and still get good inference, but that they may need to add subproblem hints (including to latent classes) to get acceptable performance. (I had to add hints in a couple places to recover the performance from before this change.)

    opened by alex-lew 0
  • 20210321 marcoct getitrunning

    Fixes to example data loading and some internal source code involving data frames due to breaking changes in the DataFrames and CSV packages (e.g. symbols versus strings being returned). Examples now running with latest versions of these packages:

    (PClean) pkg> st
    Project PClean v0.1.0
    Status `~/dev/PClean/Project.toml`
      [6e4b80f9] BenchmarkTools v0.5.0
      [336ed68f] CSV v0.8.4
      [a93c6f00] DataFrames v0.22.5
      [31c24e10] Distributions v0.24.15
      [a98d9a8b] Interpolations v0.13.1
      [682c06a0] JSON v0.21.1
      [093fc24a] LightGraphs v1.3.5
      [1914dd2f] MacroTools v0.5.6
      [c03570c3] Memoize v0.4.4
      [91a5bcdd] Plots v1.10.6
      [f27b6e38] Polynomials v1.2.0
      [d330b81b] PyPlot v2.9.0
      [2913bbd2] StatsBase v0.33.3
      [88034a9c] StringDistances v0.10.0
      [ade2ca70] Dates
      [8bb1440f] DelimitedFiles
    
    opened by marcoct 0
  • Accurate name prior

    A good prior distribution on person names (first names, last names, etc.) -- and on many other types of names, including place names -- seems important for cases where it is useful to model the possibility of typos occurring in names. If we model an observed name field using a typo model without an accurate name prior, it is easy for the model to infer that a correctly spelled name is actually a version of another name with typos introduced. I encountered this when writing a simple model of first names. Here is a minimal example:

    PClean.@model CustomerModel begin
    
        @class FirstNames begin
            name ~ StringPrior(1, 60, all_given_names)
        end
    
        @class Person begin
            given_name ~ FirstNames
        end;
    
        @class Obs begin
            begin
                person ~ Person
                given_name ~ AddTypos(person.given_name.name)
            end
        end;
    
    end;
    
    query = @query CustomerModel.Obs [
        given_name person.given_name.name given_name
    ];
    
    observations = [ObservedDataset(query, df)]
    config = PClean.InferenceConfig(5, 2; use_mh_instead_of_pg=true)
    @time begin 
        tr = initialize_trace(observations, config);
        run_inference!(tr, config)
    end
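
    (Here all_given_names is a list of candidate first names passed to StringPrior, and df is the observed DataFrame; both are assumed to be defined elsewhere.)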
    

    Coming up with a good name prior seems like a very nontrivial task. Intuitively, if a human were performing this task, they would rely on their prior experience with names, including common spellings, translations/transliterations, and knowledge of the variety of closely related names with common phonetic origins, etc. A name expert would have a much more accurate name prior than a random person. Also, the statistics of names (frequency distributions, etc.) might vary widely across populations and sub-populations. One longer-term goal could be to develop an accurate name prior that represents the knowledge of a "global name expert".

    Intermediate steps could be to

    • Train a more accurate n-gram text model on a data set of names.

    • Train or find an existing deep generative model for names.

    Other steps that don't involve coming up with a name prior, but that might mitigate the issue mentioned above, could be:

    • Come up with a more precise typo model, or an approximate typo model that somehow alleviates the issue (e.g. by upper-bounding the number of typos in a name). (This should be a separate issue).

    • Use a large data set of names as a directly-observed table in the model (see the sketch after this list). This is equivalent to using a name prior that is a frequency-weighted distribution over these names. (A likely issue with this approach: if a name is not observed at least once within the data set, it might be likely to be corrected to a name that is.)

    • Change the Pitman-Yor parameters for the underlying name table to better match statistics of real names, and more generally admit more rare names.
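
    A sketch of the directly-observed-table idea above, assuming (as the plural observations vector in the example suggests) that multiple ObservedDataset bindings are supported; names_df is a hypothetical one-column DataFrame of known names:

    name_query = @query CustomerModel.FirstNames [
        KnownName name   # no dirty variable: these names are observed without corruption
    ]
    observations = [ObservedDataset(query, df), ObservedDataset(name_query, names_df)]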

    Also, a review of the potential consequences of a biased name prior, and of approaches to reduce bias in name priors and/or mitigate its downstream consequences, could be valuable.

    opened by marcoct 0
  • Support linear combinations of MeanParameters

    The goal is to allow parameters (possibly from different classes) to be transformed before they are used as arguments to distributions. For example, linear combinations of normally-distributed parameters can still be used as the mean of a Gaussian observation. This may be useful in cases where we want to model multiple causes; for example, the rent of an apartment may be normally distributed around a linear combination of parameters representing the effects of (1) the apartment's location, (2) the apartment's size, (3) the apartment's landlord, etc.

    There are (at least) two general strategies we could take for supporting this:

    1. We could maintain as global state during inference a mean vector and covariance matrix for the multivariate Gaussian posterior over all MeanParameters in a program, updating it as necessary when values are observed or unobserved. Then, we could resample all MeanParameters jointly in a single blocked Gibbs update. This is probably the best approach from an inference perspective, as it fully takes into account all posterior correlations between the variables. I haven't yet worked out the math for what the update rules would be, or how expensive they'd be. A useful reference would be Dario Stein's recent paper, which describes the implementation of a language with Gaussian variables and affine transformations that supports exact conditioning, and uses the "maintain a covariance matrix and mean vector" approach: https://arxiv.org/pdf/2101.11351.pdf

    2. We could perform individual Gibbs updates separately on each MeanParameter. Then when observing that N(x1+x2, sigma) == y, we think of it as an observation of x1 as y-x2 when updating x1, and of x2 as y-x1 when updating x2. This requires fewer changes to the current architecture, at the cost of possibly worse inference (more Gibbs samples are needed to converge to the same local posterior that the blocked update would have sampled from directly).
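
    For concreteness, the single-site update in strategy 2 is standard conjugate-Gaussian algebra: with prior $x_1 \sim \mathcal{N}(\mu_1, \tau_1^2)$, $x_2$ held fixed, and observation $y \sim \mathcal{N}(x_1 + x_2, \sigma^2)$, the Gibbs conditional is

    $$ x_1 \mid y, x_2 \;\sim\; \mathcal{N}\left( \frac{\mu_1/\tau_1^2 + (y - x_2)/\sigma^2}{1/\tau_1^2 + 1/\sigma^2},\ \left( \frac{1}{\tau_1^2} + \frac{1}{\sigma^2} \right)^{-1} \right), $$

    which is exactly the update obtained by treating $y - x_2$ as a direct Gaussian observation of $x_1$.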

    opened by alex-lew 0
  • Investigate performance of Flights model after fixing issue #2

    Performance of Flights model suffers without the subproblem block at

    https://github.com/probcomp/PClean/commit/f51c9489dda76a6dbfd7c64fc166a5c94b13db7a#diff-2a3b7234fcda10bae8f2e3e677e2add7dc29ea841a266f8a13708c4e57ac069bR14

    but it is unclear to me why this should be the case: the flight ID is always observed.

    bug 
    opened by alex-lew 0
  • Add code for all experiments to repository (or create new repository for paper experiments)

    This includes

    • Runtime + accuracy-over-time measurements against baseline inference algorithms (Figure 6)
    • Configuration for baseline systems (HoloClean + NADEEF)
    • Uncertainty-aware analysis of Rents dataset
    opened by alex-lew 0