Data Efficient Decision Making

Project Azua

0. Overview

Many modern AI algorithms are known to be data-hungry, whereas human decision-making is much more efficient. Humans can reason under uncertainty, actively acquire valuable information from the world to reduce uncertainty, and make personalized decisions given incomplete information. How can we replicate those abilities in machine intelligence?

In project Azua, we build AI algorithms to aid efficient decision-making with minimum data requirements. To achieve optimal trade-offs and enable the human in the loop, we combine state-of-the-art methods in deep learning, probabilistic inference, and causality. We provide easy-to-use deep learning tools that can perform efficient multiple imputation on partially observed, mixed-type data, discover the underlying causal relationships behind the data, and suggest the next best step for decision making. Our technology has enabled personalized decision-making in real-world systems. It wraps multiple advanced research methods in simple APIs, suitable for research and development in the research community and for commercial use by data scientists and developers.

References

If you have used the models in our code base, please consider citing the corresponding paper:

[1] (PVAE and information acquisition) Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez-Lobato, Sebastian Nowozin, and Cheng Zhang. "EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE." In International Conference on Machine Learning, pp. 4234-4243. PMLR, 2019.

[2] (VAEM) Chao Ma, Sebastian Tschiatschek, Richard Turner, José Miguel Hernández-Lobato, and Cheng Zhang. "VAEM: a Deep Generative Model for Heterogeneous Mixed Type Data." Advances in Neural Information Processing Systems 33 (2020).

[3] (VICause) Pablo Morales-Alvarez, Angus Lamb, Simon Woodhead, Simon Peyton Jones, Miltos Allamanis, and Cheng Zhang. "VICAUSE: Simultaneous missing value imputation and causal discovery." ICML 2021 workshop on the Neglected Assumptions in Causal Inference.

[4] (Eedi dataset) Zichao Wang, Angus Lamb, Evgeny Saveliev, Pashmina Cameron, Yordan Zaykov, Jose Miguel Hernandez-Lobato, Richard E. Turner et al. "Results and Insights from Diagnostic Questions: The NeurIPS 2020 Education Challenge." arXiv preprint arXiv:2104.04034 (2021).

[5] (CoRGi) Jooyeon Kim, Angus Lamb, Simon Woodhead, Simon Peyton Jones, Cheng Zhang, and Miltiadis Allamanis. "CORGI: Content-Rich Graph Neural Networks with Attention." In GReS: Workshop on Graph Neural Networks for Recommendation and Search, 2021.

Resources

For a quick introduction to our work, check out our NeurIPS 2020 tutorial, from 2:17:11. For a more in-depth technical introduction, check out our ICML 2020 tutorial.

1. Core functionalities

Azua has three core functionalities: missing data imputation, active information acquisition, and causal discovery.

1.1. Missing Value Prediction (MVP)

In many real-life scenarios, we need to make decisions under incomplete information. Therefore, it is crucial to make accurate estimates of the missing information. To this end, Azua provides state-of-the-art missing value prediction methods based on deep learning algorithms. These methods are able to learn from incomplete data and then perform missing value imputation. Instead of producing only a single value for each missing entry, as common software does, most of our methods can return multiple imputed values, which provide valuable information about imputation uncertainty. We support data of a single type [1], as well as mixed-type data and different missingness patterns [2].

1.2. Personalized active information acquisition/Next best question (NBQ)

Azua can be used not only as a powerful data imputation tool, but also as an engine to actively acquire valuable information [1]. Given an incomplete data instance, Azua is able to suggest which unobserved variable is the most informative one to collect (subject to the specific data instance and the task), using information-theoretic approaches. This allows users to focus on collecting only the most important information, and thus make decisions with a minimum amount of data.

Our active information acquisition functionality has two modes: i) if there is a specific variable (called the target variable) that the user wants to predict, then Azua will suggest the next variable to collect that is most valuable for predicting that particular target variable; ii) otherwise, Azua will make decisions using a built-in criterion.
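As a rough illustration of the information-theoretic selection in mode i), the sketch below scores each candidate variable by its mutual information with the target on a toy discrete distribution. This is not Azua's implementation, which relies on a learned Partial VAE rather than an explicit probability table; the function names and the toy joint distribution are hypothetical, and conditioning on already observed values is omitted for brevity.

```python
import math
from collections import defaultdict


def information_gain(joint, candidate, target):
    """Mutual information (in nats) between two positions of a discrete joint.

    `joint` maps full assignments (tuples) to probabilities. The result equals
    the expected KL divergence between p(target | candidate) and p(target).
    """
    p_c, p_t, p_ct = defaultdict(float), defaultdict(float), defaultdict(float)
    for assignment, prob in joint.items():
        c, t = assignment[candidate], assignment[target]
        p_c[c] += prob
        p_t[t] += prob
        p_ct[(c, t)] += prob
    return sum(
        prob * math.log(prob / (p_c[c] * p_t[t]))
        for (c, t), prob in p_ct.items()
        if prob > 0.0
    )


def next_best_question(joint, unobserved, target):
    """Return the unobserved variable with the largest information gain."""
    return max(unobserved, key=lambda i: information_gain(joint, i, target))


# Toy joint over (x0, x1, y): y copies x0, while x1 is an independent coin flip.
toy_joint = {(x0, x1, x0): 0.25 for x0 in (0, 1) for x1 in (0, 1)}
best = next_best_question(toy_joint, unobserved=[0, 1], target=2)
print(best)
```

In this toy case x0 determines the target completely while x1 carries no information about it, so x0 is the variable suggested for collection.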

1.3. Causal discovery under missing data (CD)

The underlying causal relationships behind data are crucial for real-world decision making. However, discovering causal structures under incomplete information is difficult. Azua provides a state-of-the-art solution based on graph neural networks, which is able to tackle missing value imputation and causal discovery at the same time.

2. Getting started

Set-up Python environment

A conda environment is used to manage system requirements. To install conda, check the installation instructions here. To create the azua environment, after initializing submodules, run

conda env create -f environment.yml

And to activate the environment run

conda activate azua

Download dataset

You need to download the dataset you want to use for your experiment and put it under the relevant subdirectory of data/, e.g. putting the yahoo dataset under the data/yahoo directory. For the list of supported datasets, please refer to the Supported datasets section.

For some of the UCI datasets, you can use the download_dataset.py script to download the dataset, e.g.:

python download_dataset.py boston

Run experiment

The run_experiment.py script runs any combination of model training, imputation and active learning. An example of running an experiment is:

python run_experiment.py boston -mt pvae -i -a eddi rand

In this example, we train a PVAE model (i.e. the "-mt" parameter) on the Boston Housing dataset (i.e. the first parameter), evaluate the imputation performance on the test set (i.e. the "-i" parameter) and compare the sequential feature acquisition performance between the EDDI policy and the random policy (i.e. the "-a" parameter). For more information on running experiments, available parameters etc., please run the following command:

python run_experiment.py --help

We also provide more examples of running different experiments in the section below.

3. Model overview

Below we summarize the models currently available in Azua, their descriptions, functionalities (MVP = missing value prediction, NBQ = personalized information acquisition/next best question, CD = causal discovery), and an example command that shows how to run each model (which will also reproduce one experiment from the paper).

| Model | Description | Functionalities | Example usage |
| --- | --- | --- | --- |
| Partial VAE (PVAE) | An extension of VAEs for partially observed data. See our paper. | MVP, NBQ | python run_experiment.py boston -mt pvae -a eddi rand |
| VAE Mixed (VAEM) | An extension of PVAE for heterogeneous mixed type data. See our paper. | MVP, NBQ | python run_experiment.py bank -mt vaem_predictive -a sing |
| MNAR Partial VAE (MNAR-PVAE) | An extension of VAE that handles missing-not-at-random (MNAR) data. More details in the future. | MVP, NBQ | python run_experiment.py yahoo -mt mnar_pvae -i |
| Bayesian Partial VAE (B-PVAE) | PVAE with a Bayesian treatment. | MVP, NBQ | python run_experiment.py boston -mt bayesian_pvae -a eddi rand |
| Transformer PVAE | A PVAE in which the encoder and decoder contain transformers. See our paper. | MVP, NBQ | python run_experiment.py boston -mt transformer_pvae -a eddi rand |
| Transformer encoder PVAE | A PVAE in which the encoder contains a transformer. See our paper. | MVP, NBQ | python run_experiment.py boston -mt transformer_encoder_pvae -a eddi rand |
| Transformer imputer/Rupert | A simple transformer model. See our paper. | MVP, NBQ | python run_experiment.py boston -mt transformer_imputer -a variance rand |
| VICause | Causal discovery from data with missing features and imputation. Link to paper. | MVP, CD | python run_experiment.py eedi_task_3_4_topics -mt vicause |
| CoRGi | GNN-based imputation with emphasis on item-related data based on Kim et al. | MVP | See 5.7.1-5.7.4 for details. |
| Graph Convolutional Network (GCN) | GNN-based imputation based on Kipf et al. | MVP | See 5.7.2-5.7.4 for details. |
| GRAPE | GNN-based imputation based on You et al. | MVP | See 5.7.2-5.7.4 for details. |
| Graph Convolutional Matrix Completion (GC-MC) | GNN-based imputation based on van den Berg et al. | MVP | See 5.7.2-5.7.4 for details. |
| GraphSAGE | GNN-based imputation based on Hamilton et al. | MVP | See 5.7.2-5.7.4 for details. |
| Graph Attention Network (GAT) | Attention-based GNN imputation based on Veličković et al. | MVP | See 5.7.2-5.7.4 for details. |
| Deep Matrix Factorization (DMF) | Matrix factorization with NN architecture. See deep matrix factorization. | MVP | python run_experiment.py eedi_task_3_4_binary -mt deep_matrix_factorization |
| Mean imputing | Replace missing value with mean. | MVP | python run_experiment.py boston -mt mean_imputing |
| Zero imputing | Replace missing value with zeros. | MVP | python run_experiment.py boston -mt zero_imputing |
| Min imputing | Replace missing value with min value. | MVP | python run_experiment.py boston -mt min_imputing |
| Majority vote | Replace missing value with majority vote. | MVP | python run_experiment.py boston -mt majority_vote |
| MICE | Multiple Imputation by Chained Equations, see this paper. | MVP | python run_experiment.py boston -mt mice |
| MissForest | An iterative imputation method (missForest) based on random forests. See this paper. | MVP | python run_experiment.py boston -mt missforest |

Objectives

| Next Best Question Objectives | Description |
| --- | --- |
| EDDI | It uses information gain given observed values to predict the next best feature to query. |
| SING | It uses a fixed information gain ordering based on no questions being asked. |
| Random | It randomly selects the next best feature to query. |
| Variance | It queries the feature that is expected to reduce predictive variance in the target variable the most. |

4. Reference results

Supported datasets

We provide variables.json files and model configurations for the following datasets:

  • UCI datasets: webpage
  • MNIST: webpage
  • CIFAR-10: webpage
  • NeurIPS 2020 Education Challenge datasets: webpage
    • eedi_task_1_2_binary: The data for the first two tasks. It uses only correct (1) or wrong (0) answers.
    • eedi_task_1_2_categorical: The data for the first two tasks. It uses A, B, C, D answers.
    • eedi_task_3_4_binary: The data for the last two tasks. It uses only correct(1) or wrong (0) answers.
    • eedi_task_3_4_categorical: The data for the last two tasks. It uses A, B, C, D answers.
    • eedi_task_3_4_topics: The data for the last two tasks. To produce the experimental results in VICause, binary answers are used. It has additional topic metadata.
  • Neuropathic Pain Diagnosis Simulator Dataset: webpage
    • denoted as "Neuropathic_pain" below. You need to use the simulator to generate the data.
  • Synthetic relationships: synthetic data generated by sampling the underlying true causal structure, and then, generating the data points from it.
  • Yahoo: webpage
  • Goodreads: webpage. Refer to section 5.7.3 for more details.

Missing Value Prediction (MVP)

Test Data Normalized RMSE

For evaluation, we apply row-wise splitting and hold out 30% of the data for testing.

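| Dataset | Partial VAE | VAEM | Predictive VAEM | MNAR Partial VAE | B-PVAE | Mean imputing | Zero imputing | Min imputing | Majority vote | MICE | MissForest |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bank | 0.51 | 0.66 | 0.56 | -- | -- | -- | -- | -- | 0.51 | -- | -- |
| Boston | 0.17 | 0.18 | -- | -- | 0.18 | 0.23 | -- | -- | 0.37 | -- | 0.15 |
| Concrete | 0.18 | 0.19 | -- | -- | -- | 0.22 | -- | -- | 0.27 | -- | 0.13 |
| Energy | 0.22 | 0.32 | -- | -- | 0.25 | 0.35 | -- | -- | 0.48 | -- | 0.24 |
| Iris | 0.59 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| Kin8nm | 0.27 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| Wine | 0.17 | 0.17 | -- | -- | -- | 0.24 | -- | -- | 0.31 | -- | 0.17 |
| Yacht | 0.24 | 0.23 | -- | -- | -- | 0.3 | -- | -- | 0.3 | -- | 0.16 |
| Yahoo | 0.36 | -- | -- | 0.3 | -- | -- | -- | -- | -- | -- | -- |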

Accuracy

Please note that for binary data (e.g. eedi_task_3_4_binary), we report accuracy to compare with the literature.

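| Dataset | Partial VAE | VICause | CoRGi | GRAPE | GC-MC | Graph Convolutional Network | Graph Attention Network | GraphSAGE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| eedi_task_3_4_binary | 0.72 | -- | 0.71 | 0.71 | 0.69 | 0.71 | 0.6 | 0.69 |
| eedi_task_3_4_categorical | 0.57 | -- | -- | -- | -- | -- | -- | -- |
| eedi_task_3_4_topics | 0.71 | 0.69 | -- | -- | -- | -- | -- | -- |
| Neuropathic_pain | 0.94 | 0.95 | -- | -- | -- | -- | -- | -- |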

Next Best Question (NBQ): area under information curve (AUIC)

To evaluate the performance of different models on the NBQ task, we compare the area under the information curve (AUIC). See our paper for details. AUIC is calculated as follows: at each step of the NBQ task, each model proposes one variable to collect and makes new predictions for the target variable. We can then calculate the predictive error (e.g., RMSE) of the target variable at each step. This creates the information curve as the NBQ task progresses. The area under the information curve (AUIC) can then be used to compare performance across models and strategies. A smaller AUIC value indicates better performance.
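The calculation described above can be sketched as follows. This is a minimal illustration, assuming the area is approximated with the trapezoidal rule and a unit step between acquisitions; the exact integration scheme used in the paper may differ, and the error values are invented for the example.

```python
def area_under_information_curve(errors):
    """Trapezoidal area under a per-step error curve with unit step size."""
    if len(errors) < 2:
        raise ValueError("need the error at two or more steps")
    return sum((a + b) / 2.0 for a, b in zip(errors, errors[1:]))


# Hypothetical RMSE of the target after 0, 1, 2 and 3 acquired variables.
eddi_errors = [1.0, 0.6, 0.4, 0.3]
random_errors = [1.0, 0.9, 0.7, 0.3]

eddi_auic = area_under_information_curve(eddi_errors)
random_auic = area_under_information_curve(random_errors)
print(eddi_auic < random_auic)
```

Both hypothetical policies start and end at the same error, but the one whose error drops earlier has the smaller area, which is why AUIC rewards acquiring informative variables first.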

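| Dataset | Partial VAE | VAEM | Predictive VAEM | MNAR Partial VAE | B-PVAE |
| --- | --- | --- | --- | --- | --- |
| Bank | 6.6 | 6.49 | 5.91 | -- | -- |
| Boston | 2.03 | 2.0 | -- | -- | 1.96 |
| Concrete | 1.48 | 1.47 | -- | -- | -- |
| Energy | 1.18 | 1.3 | -- | -- | 1.44 |
| Iris | 2.8 | -- | -- | -- | -- |
| Kin8nm | 1.28 | -- | -- | -- | -- |
| Wine | 2.14 | 2.45 | -- | -- | -- |
| Yacht | 0.94 | 0.9 | -- | -- | -- |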

Causal discovery (CD)

We provide F1 scores for adjacency and orientation to measure the causal discovery results. Please refer to the VICause paper for details.
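For readers unfamiliar with these two metrics, the sketch below shows one common way to compute them from adjacency matrices. It assumes that adjacency F1 compares the undirected skeletons of the predicted and true graphs, while orientation F1 compares the directed edges; the precise definitions used for the reported numbers are given in the VICause paper, and the helper names here are hypothetical.

```python
def f1_score(predicted, true):
    """F1 between two sets of edges (0.0 when there is no overlap)."""
    true_positives = len(predicted & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


def directed_edges(adjacency):
    """Set of (source, destination) pairs with a non-zero matrix entry."""
    n = len(adjacency)
    return {(i, j) for i in range(n) for j in range(n) if adjacency[i][j]}


def adjacency_and_orientation_f1(predicted_adj, true_adj):
    """Return (adjacency F1 on the skeleton, orientation F1 on directed edges)."""
    predicted, true = directed_edges(predicted_adj), directed_edges(true_adj)
    skeleton = lambda edges: {frozenset(edge) for edge in edges}
    return f1_score(skeleton(predicted), skeleton(true)), f1_score(predicted, true)


# True graph: 0 -> 1 -> 2. The prediction reverses the first edge.
true_adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
predicted_adj = [[0, 0, 0], [1, 0, 1], [0, 0, 0]]
adjacency_f1, orientation_f1 = adjacency_and_orientation_f1(predicted_adj, true_adj)
print(adjacency_f1, orientation_f1)
```

In this toy case the skeleton is recovered exactly while only one of the two edge directions is right, so the adjacency score is higher than the orientation score.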

| Dataset | VICause Adjacency.F1 | VICause Orientation.F1 |
| --- | --- | --- |
| Neuropathic_pain | 0.28 | 0.26 |
| Synthetic_relationships | 0.82 | 0.47 |

5. Model details

5.1 Partial VAE

Model Description

Partial VAE (PVAE) is an unsupervised deep generative model that is specifically designed to handle missing data. We mainly use this model to learn the underlying structure (latent representation) of partially observed data and to perform missing data imputation. Like any vanilla VAE, PVAE consists of an encoder and a decoder. The PVAE encoder is parameterized by the so-called set-encoder (point-net, see our paper for details), which is able to extract the latent representation from partially observed data. The PVAE decoder then takes the extracted latent representation as input and generates values for both missing entries (imputation) and observed entries (reconstruction).

The partial encoder

One of the major differences between PVAE and VAE is that the PVAE encoder can handle missing data in a principled way. The PVAE encoder is parameterized by the so-called set-encoder, which processes partially observed data in three steps: 1) feature embedding; 2) permutation-invariant aggregation; and 3) encoding into statistics of the latent representation. These are implemented in feature_embedder.py, point_net.py, and encoder.py, respectively. See our paper, Section 3.2, for technical details.
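The three steps can be sketched in plain Python as below. This is a simplified illustration rather than the code in feature_embedder.py, point_net.py, or encoder.py: embeddings are random instead of learned, each observed value is multiplied by its feature embedding, the per-feature network applied before aggregation is omitted, and the final step is a single linear map. All function names are hypothetical.

```python
import random


def make_feature_embeddings(num_features, embedding_dim, seed=0):
    """Step 1: one embedding vector per feature (random here, learned in practice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(num_features)]


def aggregate_observed(values, mask, embeddings):
    """Step 2: sum value-scaled embeddings over the observed features only."""
    pooled = [0.0] * len(embeddings[0])
    for value, observed, embedding in zip(values, mask, embeddings):
        if observed:
            for k, weight in enumerate(embedding):
                pooled[k] += value * weight
    return pooled


def latent_statistics(pooled, mean_weights, logvar_weights):
    """Step 3: linear maps from the set embedding to a latent mean and log-variance."""
    mean = [sum(w * h for w, h in zip(row, pooled)) for row in mean_weights]
    logvar = [sum(w * h for w, h in zip(row, pooled)) for row in logvar_weights]
    return mean, logvar


embeddings = make_feature_embeddings(num_features=3, embedding_dim=4)
row_a = aggregate_observed([1.0, 99.0, 2.0], [1, 0, 1], embeddings)
row_b = aggregate_observed([1.0, -5.0, 2.0], [1, 0, 1], embeddings)
print(row_a == row_b)  # the unobserved middle entry never enters the sum
```

Because the aggregation is a sum over whichever features are observed, the encoder accepts any missingness pattern without needing a placeholder value for the missing entries.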

Model configs

  • "embedding_dim": dimensionality of embedding (referred to as e in the paper) for each input to PVAE encoder. See our paper for details.
  • "set_embedding_dim": dimensionality of output set embedding (referred to as h in the paper) in PVAE encoder. See our paper for details.
  • "set_embedding_multiply_weights": Whether or not to take the product of x with embedding weights when feeding through. Default: true.
  • "latent_dim": dimensionality of the PVAE latent representation
  • "encoder_layers": structure of encoder network (excluding input and output layers)
  • "decoder_layers": structure of decoder network (excluding input and output layers)
  • "non_linearity": Choice of non-linear activation functions for hidden layers of PVAE decoder. Possible choice: "ReLU", "Sigmoid", and "Tanh". Default is "ReLU".
  • "activation_for_continuous": Choice of non-linear activation functions for the output layer of PVAE decoder. Possible choice: "Identity", "ReLU", "Sigmoid", and "Tanh". Default is "Sigmoid".
  • "init_method": Initialization method for PVAE weights. Possible choice: "default" (Pytorch default), "xavier_uniform", "xavier_normal", "uniform", and "normal". Default is "default".
  • "encoding_function": The permutation invariant operator of PVAE encoder. Default is "sum".
  • "decoder_variances": Variance of the observation noise added to the PVAE decoder output (for continuous variables only).
  • "random_seed": Random seed used for initializing the model. Default: [0].
  • "categorical_likelihood_coefficient": The coefficient for the likelihood terms of categorical variables. Default: 1.0.
  • "kl_coefficient": The Beta coefficient for the KL term. Default: 1.0.
  • "variance_autotune": Automatically learn the variance of the observation noise or not. Default: false.
  • "use_importance_sampling": Use importance sampling or not, when calculating the PVAE ELBO. When turned on, the PVAE becomes an importance-weighted version of PVAE. See IWAE for more details. Default: false.
  • "squash_input": When preprocessing the data, squash the data to be between 0 and 1 or not. Default: true. Note that when false, you should change the config of "activation_for_continuous" accordingly (from "Sigmoid" to "Identity").
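To make the options above concrete, here is a sketch of what a PVAE model config could look like, written as a Python dictionary and serialized to JSON. The keys come from the list above, but every value is an illustrative placeholder rather than a recommended or shipped default (in particular, the layer sizes, dimensions, and the scalar form of "decoder_variances" are assumptions). The helper function is hypothetical and only encodes the note about "squash_input".

```python
import json

# Keys are taken from the list above; every value is an illustrative placeholder.
example_pvae_config = {
    "embedding_dim": 10,
    "set_embedding_dim": 20,
    "set_embedding_multiply_weights": True,
    "latent_dim": 10,
    "encoder_layers": [50, 50],
    "decoder_layers": [50, 50],
    "non_linearity": "ReLU",
    "activation_for_continuous": "Sigmoid",
    "init_method": "default",
    "encoding_function": "sum",
    "decoder_variances": 0.02,
    "random_seed": [0],
    "categorical_likelihood_coefficient": 1.0,
    "kl_coefficient": 1.0,
    "variance_autotune": False,
    "use_importance_sampling": False,
    "squash_input": True,
}


def check_output_activation(config):
    """Flag the squash_input / activation_for_continuous mismatch noted above."""
    squashed = config["squash_input"]
    activation = config["activation_for_continuous"]
    if not squashed and activation == "Sigmoid":
        return "squash_input is false: consider activation_for_continuous = 'Identity'"
    return None


print(json.dumps(example_pvae_config, indent=2))
```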

5.2 VAEM

Model Description

Real-world datasets often contain variables of different types (categorical, ordinal, continuous, etc.) and different marginal distributions. Although PVAE is able to cope with missing data, it does not handle heterogeneous mixed-type data very well. Azua provides a new model called VAEM to handle such scenarios.

The marginal VAEs and the dependency network

In short, VAEM is an extension to VAE that can handle such heterogeneous data. It is a deep generative model that is trained in a two stage manner.

  • In the first stage, we model the marginal distributions of each single variable separately. This is done by fitting a different vanilla VAE independently to each data dimension. This is implemented in marginal_vaes.py. Those one-dimensional VAEs will capture the marginal properties of each variable and provide a latent representation that is more homogeneous across dimensions.

  • In the second stage, we capture the dependencies among the variables. To this end, another Partial VAE, called the dependency network, is built on top of the latent representations provided by the first-stage VAEs. This is implemented in dependency_network_creator.

To summarize, we can think of the first stage of VAEM as a data pre-processing step, that transforms heterogeneous mixed-type data into a homogeneous version of the data. Then, we can perform missing data imputation and personalized information acquisition on the pre-processed data.

Model configs

Since the main components of VAEM are VAEs and PVAE, the model configs of VAEM mostly inherit from PVAE (but with proper prefixes). For example, in the config files of VAEM, "marginal_encoder_layers" stands for the structure of the encoder network of the marginal VAEs; "dep_embedding_dim" stands for the dimensionality of the embedding of the dependency network. Note, however, that since the marginal VAEs are vanilla VAEs rather than PVAEs, the config arguments corresponding to set-encoders are disabled.

5.3 Predictive VAEM

Model Description

In some scenarios, when performing missing data imputation and information acquisition, the user might have a supervised learning problem in mind. That is, the observable variables can be classified into two categories: the input variables (covariates) and the output variable (target variable). Both PVAE and VAEM treat the input variables and the output variable (target) equally, and learn a joint distribution over them. In contrast, predictive VAEM simultaneously learns a joint distribution over the input variables and a conditional distribution of the target given the input variables. We found that this approach generally yields better predictive performance on the target variable in practice.

The predictive model

The conditional distribution of the target, given the input variables (as well as the latent representation), is parameterized by a feed-forward neural network. This is implemented in marginal_vaes_with_predictive_vae.

Model configs

The predictive VAEMs share the same configs as VAEMs.

5.4 MNAR Partial VAE

Real-world missing values are often associated with complex generative processes, where the cause of the missingness may not be fully observed. This is known as missing not at random (MNAR) data. However, many standard imputation methods, such as our PVAE and VAEM, do not take the missingness mechanism into account, resulting in biased imputation values when MNAR data is present. Also, many practical methods for MNAR data do not have identifiability guarantees: their parameters cannot be uniquely determined from partially observed data, even with access to infinite samples. Azua provides a new deep generative model, called MNAR Partial VAE, that addresses both of these issues.

Mask net and identifiable PVAE

MNAR PVAE has two main components: a mask net and an identifiable PVAE. The mask net is a neural network (implemented in mask_net) that models the conditional probability distribution of the missingness mask given the data (and latent representations). This helps correct the bias introduced by the MNAR mechanism. The identifiable PVAE is a variant of VAE that, when combined with the mask net, provides identifiability guarantees under certain assumptions. Unlike vanilla PVAEs, the identifiable PVAE uses a neural network, called the prior net, to define the prior distribution on the latent space. The prior net takes some fully observed auxiliary variables as inputs (you may think of them as side information) and generates the distribution on the latent space. By default, unless specified otherwise, we automatically treat fully observed variables as auxiliary variables. For more details, please see our paper (link will be available in the future).

Model configs

Most of the model configs are the same as PVAE, except the following:

  • "mask_net_config": This object contains the model configuration of the mask net.

    • "decoder_layers": The neural network structure of mask net.
    • "mask_net_coefficient": weight of the mask net loss function.
    • "latent connection": if true, the mask net will also take as input the latent representations.
  • "prior_net_config": This object contains the model configuration of the prior net.

    • "use_prior_net_to_train": if true, we will use prior net to train the PVAE, instead of the standard normal prior.
    • "encoder_layers": the neural network structure of prior net.
    • "use_prior_net_to_impute": use prior net to perform imputation or not. By default, we will always set this to false.
    • "degenerate_prior": As mentioned before, we automatically treat fully observed variables as auxiliary variables. However, in some cases, fully observed variables might not be available (for example, in recommender data). "degenerate_prior" determines how we handle such degenerate cases. Currently, we only support the "mask" method, which uses the missingness mask itself as the auxiliary variables.

5.5 Bayesian partial VAE (B-PVAE)

Standard training of PVAE produces the point estimates for the neural network parameters in the decoder. This approach does not quantify the epistemic uncertainty of our model. B-PVAE is a variant of PVAE, that applies a fully Bayesian treatment to the weights. The model setting is the same as in BELGAM, whereas the approximate inference is done using the inducing weights approach.

Implementation

Implementation-wise, B-PVAE is based on Bayesianize, a lightweight Bayesian neural network (BNN) wrapper in PyTorch, which allows easy conversion of neural networks in existing scripts to their Bayesian versions with minimal changes. For more details, please see our GitHub repo.

5.6 VICause

Missing values constitute an important challenge in real-world machine learning for both prediction and causal discovery tasks. However, only a few methods in causal discovery can handle missing data in an efficient way, while existing imputation methods are agnostic to causality. In this work we propose VICAUSE, a novel approach to simultaneously tackle missing value imputation and causal discovery efficiently with deep learning. In particular, we propose a generative model with a structured latent space and a graph neural network-based architecture, scaling to a large number of variables. Moreover, our method can discover relationships between groups of variables, which is useful in many real-world applications. VICAUSE shows improved performance compared to popular and recent approaches in both missing value imputation and causal discovery.

For more information, please refer to the paper (https://www.microsoft.com/en-us/research/publication/vicause-simultaneous-missing-value-imputation-and-causal-discovery/).

5.7 CoRGi, Graph Convolutional Network (GCN), GRAPE, Graph Convolutional Matrix Completion (GC-MC), GraphSAGE, and Graph Attention Network (GAT)

5.7.1 CoRGi and baselines

CoRGi

CoRGi is a GNN model that considers the rich data within nodes in the context of their neighbors. This is achieved by endowing CoRGi's message passing with a personalized attention mechanism over the content of each node. In this way, CoRGi assigns user-item-specific attention scores with respect to the words that appear in items. More detailed information can be found in our paper:

CORGI: Content-Rich Graph Neural Networks with Attention. J. Kim, A. Lamb, S. Woodhead, S. Peyton Jones, C. Zhang, M. Allamanis. RecSys: Workshop on Graph Neural Networks for Recommendation and Search, 2021.

Graph Convolutional Network (GCN)

Azua provides a re-implementation of GCN. As a default, "average" is used for the aggregation function and nodes are randomly initialized. We adopt dropout with probability 0.5 for node embedding updates as well as for the prediction MLPs.

GRAPE

GRAPE is a GNN model that employs edge embeddings (please refer to this paper for details). It also adopts edge dropout, applied throughout all message-passing layers. Unlike the GRAPE proposed in the original paper, we do not initialize nodes with one-hot vectors or constants (ones) because of memory constraints.

Graph Convolutional Matrix Completion (GC-MC)

Compared to GCN, this model has a single message-passing layer. Also, for classification, each label is endowed with a separate message-passing channel. Here, we do not implement weight sharing. For more details, please refer to this paper.

GraphSAGE

GraphSAGE extends GCN by allowing the model to be trained on part of the graph, which makes the model usable in inductive settings. For more details, please refer to this paper.

Graph Attention Network (GAT)

During message aggregation, GAT uses the attention mechanism to allow the target nodes to distinguish the weights of multiple messages from the source nodes for aggregation. For more details, please refer to this paper.

5.7.2 Different node initializations

All GNN models allow different kinds of node initializations. This can be done by modifying the model config file. For example, to run CoRGi with SBERT initialization, change "node_init": "random" to "node_init": "sbert_init" in configs/defaults/model_config_corgi.json.

The allowed node initializations include:

"random", "grape", "text_init" (TF-IDF),"sbert_init", "neural_bow_init", "bert_cls_init", "bert_average_init"

For example, the test performance of GCN Init: NeuralBOW in Table 2 of the paper on the Eedi dataset can be obtained by running:

python run_experiment.py eedi -mt graph_convolutional_network -dc configs/defaults/model_config_graph_convolutional_network.json

with "node_init": "neural_bow_init" in the corresponding model config file.
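Instead of editing the config by hand, the change can be scripted. The sketch below is a hypothetical helper, not part of Azua: it assumes "node_init" is a top-level key of the JSON model config and validates the value against the initializations listed in this section. It is demonstrated on a throwaway file so that it runs outside the repository.

```python
import json
import tempfile
from pathlib import Path

# Values listed in this section.
ALLOWED_NODE_INITS = {
    "random", "grape", "text_init", "sbert_init",
    "neural_bow_init", "bert_cls_init", "bert_average_init",
}


def set_node_init(config_path, node_init):
    """Rewrite the "node_init" entry of a JSON model config in place."""
    if node_init not in ALLOWED_NODE_INITS:
        raise ValueError(f"unknown node_init: {node_init!r}")
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["node_init"] = node_init  # assumed to be a top-level key
    path.write_text(json.dumps(config, indent=4))
    return config


# Demonstrated on a throwaway file; in the repository the target would be e.g.
# configs/defaults/model_config_corgi.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "model_config_demo.json"
    demo_path.write_text(json.dumps({"node_init": "random"}))
    updated = set_node_init(demo_path, "sbert_init")
    reloaded = json.loads(demo_path.read_text())
```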

5.7.3 Datasets

CoRGi operates on content-augmented graph data.

Goodreads

Download the data from this link and place it under the data directory with the name goodreads.

The Goodreads dataset from the Goodreads website contains users and books. The content of each book-node is its natural language description. The dataset includes integer ratings from 1 to 5 between some books and users.

The pre-processing of this data can be found at

research_experiments/GNN/create_goodreads_dataset.py

Eedi

Download the data from this link and place it under the data directory with the name eedi.

This dataset is from the Diagnostic Questions - NeurIPS 2020 Education Challenge. It contains anonymized student and question identities with the student responses to some questions. The content of each question-node is the text of the question. Edge labels are binary: one and zero for correct and incorrect answers.

The pre-processing code for the datasets to be used with CoRGi can be found at:

research_experiments/eedi/

5.7.4 Running CoRGi

To run the CoRGi code with Eedi dataset, first locate the preprocessed data at

data/eedi/

Then, run the following code:

python run_experiment.py eedi -mt corgi

This can be done with different datasets and different GNN models. The train and validation performances can be tracked using tensorboard which is logged under the runs directory. Also, the trained model is saved with .pt extension.

6. Other engineering details

Reproducibility

As the project uses PyTorch, we can't guarantee completely reproducible results across different platforms and devices. However, on a given platform/device, the results should be completely reproducible, i.e. running an experiment twice should give exactly the same results. More about the limitations on reproducibility in PyTorch can be found here.

Add your own dataset

To add a new dataset, a new directory should be added to the data folder, containing either all of the dataset in a file named all.csv, or a train/test split in files named train.csv and test.csv. In the former case, a train/test split will be generated, in an 80%/20% split by default.

Data can be specified in two formats using the --data_type flag in the entrypoint scripts. The default format is "csv", which assumes that each column represents a feature and each row represents a data point. The alternative format is "sparse_csv", where there are 3 columns representing the row ID, column ID and value of a particular matrix element, as in a coordinate-list (COO) sparse matrix. All values not specified are assumed to be missing. In both cases, no header row should be included in the CSV file.
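The relationship between the two formats can be illustrated with a small converter. This is a sketch for intuition only, not Azua's loader: the function name is hypothetical, IDs are assumed to be zero-based integers, and missing cells are represented by None.

```python
import csv
import io


def sparse_csv_to_dense(text, num_rows, num_cols):
    """Expand (row ID, column ID, value) triples into a dense grid.

    Unlisted cells are returned as None to mark them as missing.
    """
    dense = [[None] * num_cols for _ in range(num_rows)]
    for row_id, col_id, value in csv.reader(io.StringIO(text)):
        dense[int(row_id)][int(col_id)] = float(value)
    return dense


# No header row, as required for both formats. IDs are assumed to be zero-based.
sparse_text = "0,0,1.5\n0,2,3.0\n1,1,4.0\n"
dense = sparse_csv_to_dense(sparse_text, num_rows=2, num_cols=3)
print(dense)
```

The sparse format is convenient for data such as ratings matrices, where most entries are unobserved and listing only the observed triples keeps the file small.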

Variable metadata for each variable in a dataset can be specified in an optional file named variables.json. This file is an array of dictionaries, one for each variable in the dataset. For each variable, the following values may be specified:

  • id: int, index of the variable in the dataset
  • query: bool, whether this variable can be queried during active learning (True) or is a target (False).
  • type: string, type of variable - either "continuous", "binary" or "categorical".
  • lower: numeric, lower bound for the variable.
  • upper: numeric, upper bound for the variable.
  • name: string, name of the variable.

Any field that is not specified will be inferred where possible. Note: all features are assumed to be queryable, and thus not active learning targets, unless explicitly specified otherwise. Lower and upper values are inferred from the training data, and the type is inferred based on whether the variable takes exclusively integer values.
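As an illustration of the fields above, the following sketch builds the content of a small variables.json with two queryable features and one target. The variable names and bounds are invented, the file layout follows the description in this section (an array of dictionaries), and the helper function is hypothetical.

```python
import json

# Two queryable features and one target; all names and bounds are made up.
variables = [
    {"id": 0, "query": True, "type": "continuous", "lower": 0.0, "upper": 100.0, "name": "age"},
    {"id": 1, "query": True, "type": "categorical", "lower": 0, "upper": 3, "name": "region"},
    {"id": 2, "query": False, "type": "binary", "lower": 0, "upper": 1, "name": "outcome"},
]


def target_names(entries):
    """Names of variables marked as targets, i.e. with query set to False.

    A missing "query" field is treated as True, matching the note that
    features are assumed to be queryable unless specified otherwise.
    """
    return [entry["name"] for entry in entries if not entry.get("query", True)]


# This string is what would be saved as variables.json in the dataset directory.
variables_json = json.dumps(variables, indent=2)
print(target_names(json.loads(variables_json)))
```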

Split type for the dataset

The source data can be split into train/validation/test datasets either based on rows or elements. The former is split by rows of the matrix, whereas the latter is split by individual elements of the matrix, so that different elements of a row can appear in different data splits (i.e. train or validation or test).
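The difference between the two split types can be sketched as follows. This is an illustration under simplifying assumptions, not Azua's splitting code: the function names are hypothetical, only a train/test split is shown (a validation set would be carved out the same way), and the data matrix is represented by its row indices or by the coordinates of its observed cells.

```python
import random


def row_split(num_rows, test_fraction, seed=0):
    """Assign whole rows to train or test."""
    rng = random.Random(seed)
    rows = list(range(num_rows))
    rng.shuffle(rows)
    num_test = round(num_rows * test_fraction)
    return sorted(rows[num_test:]), sorted(rows[:num_test])


def element_split(observed_cells, test_fraction, seed=0):
    """Assign individual observed (row, column) cells to train or test."""
    rng = random.Random(seed)
    cells = list(observed_cells)
    rng.shuffle(cells)
    num_test = round(len(cells) * test_fraction)
    return sorted(cells[num_test:]), sorted(cells[:num_test])


train_rows, test_rows = row_split(num_rows=10, test_fraction=0.2)
cells = [(r, c) for r in range(4) for c in range(5)]
train_cells, test_cells = element_split(cells, test_fraction=0.2)
print(test_rows, test_cells)
```

With a row split, every test data point is entirely unseen during training; with an element split, a test cell may belong to a row whose other cells were used for training, which is the usual setting for recommender-style data.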

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Comments
  • DECI demo simulation error

    I'm new to DECI and wanted to replicate the exact steps described in the "DECI Demo" file.

    When I first ran the script research_experiments/DECI/data_generation/csuite/simulate.py, I got the error "ValueError: cannot reindex from a duplicate axis". It is raised by the first call to the "two_node_lin" method, while iterating over: Observational, Interventional, Counterfactual. This is where it got stuck.

    Could you please attach the data in the azua/data/ folder so that the data generation step can be skipped? I spent a lot of time debugging this.

    Also, when I ran the code below, it went into the except block. Any ideas on this one?

    try:
        from evaluation_pipeline.aml_azua_context import setup_azua_context_in_aml
        azua_context = setup_azua_context_in_aml()
    except ImportError:
        from azua.azua.experiment.azua_context import AzuaContext
        azua_context = AzuaContext()

    @ae-foster @pawni @themanojkumar @cvarun16 @ChengZhangMSRC @makukl

    opened by saichaitanyamolabanti 8
  • Error with running model CoRGi

    When I run the CoRGi model on the Goodreads dataset, I get several errors. I created the Goodreads dataset with the script 'research_experiences/GNN/create_goodreads_dataset.py'.

    First, at line 325 of graph_neural_network.py, an assertion fails because the data is not in GraphData format, so I tried the to_graph function of the Dataset class.

    Then, at line 842 of graph_neural_network.py, an AttributeError is raised because the data has no x attribute:

    AttributeError: 'Dataset' object does not have attribute 'x'.

    How do I solve this?

    opened by nyongja 4
  • Model of GINA

    Hi, has the GINA model (Identifiable Generative Models for Missing Not at Random Data Imputation) already been released? I cannot find it in this repository.

    opened by Sam1224 2
  • Error when running on CPU

    I observed this error when running on CPU:

    RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

    I made this change in project-azua/azua/models/torch_model.py and now it works:

    # model.load_state_dict(torch.load(model_path))
    model.load_state_dict(torch.load(model_path, map_location=torch.device(torch_device)))

    opened by inchiosa 1
  • How do I “choose the best action” without having to specify treatments?

    In real-world applications, people tend to have a goal in mind and look for ways to achieve it (given a model, they then generate counterfactuals). Is this within the scope of Project Azua? I think Azua does an amazing job of proposing an end-to-end framework for estimating CATE, but we also need to know which variables to intervene upon in order to calculate the CATE. If I have potentially many treatments, does that mean I need to run through all the treatment combinations to find the one that maximizes the CATE?

    What do you think? To better support “choose the best action”, it might help to have an interface similar to what DiCE does, generating a series of counterfactual recourses, or there may be an opportunity to propose an even more unified model for this counterfactual CATE generation.

    opened by tonyabracadabra 0
  • run_experiment: ‘Provide’ object has no attribute ‘metrics_logger’

    ISSUE

    When running python run_experiment.py boston -mt pvae -i -a eddi rand, the program fails at lines 216-218 of run_experiment.py.

    Lines 216-218:

    kwargs_file = azua_context.aml_step(run_aggregation, pipeline_creation_mode)(
        input_dirs=input_dirs,
        output_dir=models_dir,
        experiment_name=experiment_name,
        aml_tags=aml_tags
    )
    

    Error message

    File "../project-azua/azua/experiment/run_aggregation.py", line 16, in run_aggregation
        metrics_logger = azua_context.metrics_logger()
    AttributeError: 'Provide' object has no attribute 'metrics_logger'

    The error refers to this function:

    def run_aggregation(
        input_dirs: List[str],
        output_dir: str,
        experiment_name: str,
        aml_tags=Dict[str, Any],
        azua_context: AzuaContext = Provide[AzuaContext],
    ) 
    

    SOLUTION

    To fix the issue, I passed the azua_context parameter in the call at lines 216-218:

    kwargs_file = azua_context.aml_step(run_aggregation, pipeline_creation_mode)(
        input_dirs=input_dirs,
        output_dir=models_dir,
        experiment_name=experiment_name,
        aml_tags=aml_tags,
        azua_context=azua_context
    )
    

    Ran successfully.

    opened by yashua-ovando 1
Owner
Microsoft
Open source projects and samples from Microsoft