Project Azua
0. Overview
Many modern AI algorithms are known to be data-hungry, whereas human decision-making is much more efficient. Humans can reason under uncertainty, actively acquire valuable information from the world to reduce that uncertainty, and make personalized decisions given incomplete information. How can we replicate those abilities in machine intelligence?
In Project Azua, we build AI algorithms to aid efficient decision-making with minimal data requirements. To achieve optimal trade-offs and keep the human in the loop, we combine state-of-the-art methods in deep learning, probabilistic inference, and causality. We provide easy-to-use deep learning tools that can perform efficient multiple imputation on partially observed, mixed-type data, discover the underlying causal relationships behind the data, and suggest the next best step for decision-making. Our technology has enabled personalized decision-making in real-world systems. It wraps several advanced research methods in simple APIs that are suitable both for research and development in the research community and for commercial use by data scientists and developers.
References
If you have used the models in our code base, please consider citing the corresponding paper:
[1], (PVAE and information acquisition) Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez-Lobato, Sebastian Nowozin, and Cheng Zhang. "EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE." In International Conference on Machine Learning, pp. 4234-4243. PMLR, 2019.
[2], (VAEM) Chao Ma, Sebastian Tschiatschek, Richard Turner, José Miguel Hernández-Lobato, and Cheng Zhang. "VAEM: a Deep Generative Model for Heterogeneous Mixed Type Data." Advances in Neural Information Processing Systems 33 (2020).
[3], (VICause) Pablo Morales-Alvarez, Angus Lamb, Simon Woodhead, Simon Peyton Jones, Miltos Allamanis, and Cheng Zhang, "VICAUSE: Simultaneous missing value imputation and causal discovery", ICML 2021 workshop on the Neglected Assumptions in Causal Inference
[4], (Eedi dataset) Zichao Wang, Angus Lamb, Evgeny Saveliev, Pashmina Cameron, Yordan Zaykov, Jose Miguel Hernandez-Lobato, Richard E. Turner et al. "Results and Insights from Diagnostic Questions: The NeurIPS 2020 Education Challenge." arXiv preprint arXiv:2104.04034 (2021).
[5], (CoRGi) Jooyeon Kim, Angus Lamb, Simon Woodhead, Simon Peyton Jones, Cheng Zhang, and Miltiadis Allamanis. "CORGI: Content-Rich Graph Neural Networks with Attention." In GReS: Workshop on Graph Neural Networks for Recommendation and Search, 2021.
Resources
For a quick introduction to our work, check out our NeurIPS 2020 tutorial, from 2:17:11. For a more in-depth technical introduction, check out our ICML 2020 tutorial.
1. Core functionalities
Azua has three core functionalities: missing data imputation, active information acquisition, and causal discovery.
1.1. Missing Value Prediction (MVP)
In many real-life scenarios, we need to make decisions under incomplete information. Therefore, it is crucial to make accurate estimates of the missing information. To this end, Azua provides state-of-the-art missing value prediction methods based on deep learning algorithms. These methods are able to learn from incomplete data and then perform missing value imputation. Instead of producing only a single value for each missing entry, as common software does, most of our methods can return multiple imputation values, which provide valuable information about imputation uncertainty. We support data of a single type [1], as well as mixed-type data and different missingness patterns [2].
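To make the value of multiple imputation concrete, here is a minimal sketch of how several imputed draws for the same missing entry can be summarized into a point estimate and an uncertainty estimate. The draws below are hypothetical numbers, not output from Azua, and `summarize_imputations` is an illustrative helper rather than part of the code base.

```python
import statistics

def summarize_imputations(samples):
    """Collapse several imputed draws for one missing entry into (mean, std)."""
    mean = statistics.fmean(samples)
    # The sample standard deviation needs at least two draws.
    std = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, std

# Hypothetical imputation draws for two missing entries of one data row.
imputed = {
    "age": [41.0, 43.0, 39.0, 41.0],
    "income": [20.0, 60.0, 35.0, 85.0],
}

summary = {name: summarize_imputations(draws) for name, draws in imputed.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.2f}, std={std:.2f}")
```

A wide spread across draws, as for `income` here, signals that the imputed value should be treated with caution. A single-value imputer cannot convey this.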
1.2. Personalized active information acquisition/Next best question (NBQ)
Azua can not only be used as a powerful data imputation tool, but also as an engine to actively acquire valuable information [1]. Given an incomplete data instance, Azua is able to suggest which unobserved variable is the most informative one (subject to the specific data instance and the task) to collect, using information-theoretic approaches. This allows the users to focus on collecting only the most important information, and thus make decisions with minimum amount of data.
Our active information acquisition functionality has two modes: i) if there is a specific variable (called the target variable) that the user wants to predict, Azua will suggest the next variable to collect that is most valuable for predicting that target variable; ii) otherwise, Azua will make decisions using a built-in criterion.
1.3 Causal discovery under missing data (CD)
The underlying causal relationships behind data are crucial for real-world decision-making. However, discovering causal structures under incomplete information is difficult. Azua provides a state-of-the-art solution based on graph neural networks, which is able to tackle missing value imputation and causal discovery at the same time.
2. Getting started
Set up the Python environment
A conda environment is used to manage system requirements. To install conda, check the installation instructions here. To create the azua environment, after initializing submodules, run
conda env create -f environment.yml
And to activate the environment run
conda activate azua
Download dataset
You need to download the dataset you want to use for your experiment and put it under the relevant subdirectory of data/, e.g. put the yahoo dataset under the data/yahoo directory. For the list of supported datasets, please refer to the Supported datasets section.
For some of the UCI datasets, you can use the download_dataset.py script to download the dataset, e.g.:
python download_dataset.py boston
Run experiment
The `run_experiment.py` script runs any combination of model training, imputation and active learning. An example of running an experiment is:
python run_experiment.py boston -mt pvae -i -a eddi rand
In this example, we train a PVAE model (the "-mt" parameter) on the Boston Housing dataset (the first parameter), evaluate the imputation performance on the test set (the "-i" parameter) and compare the sequential feature acquisition performance of the EDDI policy and the random policy (the "-a" parameter). For more information on running experiments, available parameters, etc., please run the following command:
python run_experiment.py --help
We also provide more examples of running different experiments in the section below.
3. Model overview
Below we summarize the models currently available in Azua, their descriptions, functionalities (MVP = missing value prediction, NBQ = personalized information acquisition/next best question, CD = causal discovery), and an example command that shows how to run each model (which will also reproduce one experiment from the paper).
Model | Description | Functionalities | Example usage |
---|---|---|---|
Partial VAE (PVAE) | An extension of VAEs for partially observed data. See our paper. | MVP, NBQ | python run_experiment.py boston -mt pvae -a eddi rand |
VAE Mixed (VAEM) | An extension of PVAE for heterogeneous mixed type data. See our paper. | MVP, NBQ | python run_experiment.py bank -mt vaem_predictive -a sing |
MNAR Partial VAE (MNAR-PVAE) | An extension of VAE that handles missing-not-at-random (MNAR) data. More details in the future. | MVP, NBQ | python run_experiment.py yahoo -mt mnar_pvae -i |
Bayesian Partial VAE (B-PVAE) | PVAE with a Bayesian treatment. | MVP, NBQ | python run_experiment.py boston -mt bayesian_pvae -a eddi rand |
Transformer PVAE | A PVAE in which the encoder and decoder contain transformers. See our paper. | MVP, NBQ | python run_experiment.py boston -mt transformer_pvae -a eddi rand |
Transformer encoder PVAE | A PVAE in which the encoder contains a transformer. See our paper. | MVP, NBQ | python run_experiment.py boston -mt transformer_encoder_pvae -a eddi rand |
Transformer imputer/Rupert | A simple transformer model. See our paper. | MVP, NBQ | python run_experiment.py boston -mt transformer_imputer -a variance rand |
VICause | Causal discovery from data with missing features and imputation. Link to paper. | MVP, CD | python run_experiment.py eedi_task_3_4_topics -mt vicause |
CoRGi | GNN-based imputation with emphasis on item-related data based on Kim et al. | MVP | See 5.7.1-5.7.4 for details. |
Graph Convolutional Network (GCN) | GNN-based imputation based on Kipf et al. | MVP | See 5.7.2-5.7.4 for details. |
GRAPE | GNN-based imputation based on You et al. | MVP | See 5.7.2-5.7.4 for details. |
Graph Convolutional Matrix Completion (GC-MC) | GNN-based imputation based on van den Berg et al. | MVP | See 5.7.2-5.7.4 for details. |
GraphSAGE | GNN-based imputation based on Hamilton et al. | MVP | See 5.7.2-5.7.4 for details. |
Graph Attention Network (GAT) | Attention-based GNN imputation based on Veličković et al. | MVP | See 5.7.2-5.7.4 for details. |
Deep Matrix Factorization (DMF) | Matrix factorization with NN architecture. See deep matrix factorization. | MVP | python run_experiment.py eedi_task_3_4_binary -mt deep_matrix_factorization |
Mean imputing | Replace missing value with mean. | MVP | python run_experiment.py boston -mt mean_imputing |
Zero imputing | Replace missing value with zeros. | MVP | python run_experiment.py boston -mt zero_imputing |
Min imputing | Replace missing value with min value. | MVP | python run_experiment.py boston -mt min_imputing |
Majority vote | Replace missing value with majority vote. | MVP | python run_experiment.py boston -mt majority_vote |
MICE | Multiple Imputation by Chained Equations, see this paper. | MVP | python run_experiment.py boston -mt mice |
MissForest | An iterative imputation method (missForest) based on random forests. See this paper. | MVP | python run_experiment.py boston -mt missforest |
Objectives
Next Best Question Objectives | Description |
---|---|
EDDI | It uses information gain given observed values to predict the next best feature to query. |
SING | It uses a fixed information gain ordering based on no questions being asked. |
Random | It randomly selects the next best feature to query. |
Variance | It queries the feature that is expected to reduce predictive variance in the target variable the most. |
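As an illustration of how such objectives drive a sequential acquisition loop, here is a minimal sketch comparing a variance-based policy with the random baseline. It is a toy under loudly stated assumptions, not Azua's implementation: the target is assumed to be a linear function of independent features, so observing feature i removes exactly `w_i**2 * var(x_i)` from the target's predictive variance. All function names are hypothetical.

```python
import random

def variance_policy(weights, variances, unobserved):
    """Pick the feature whose observation removes the most target variance.

    Toy model (assumption): y = sum_i w_i * x_i + noise with independent x_i,
    so observing x_i removes w_i**2 * var(x_i) from Var[y | observed].
    """
    return max(unobserved, key=lambda i: weights[i] ** 2 * variances[i])

def random_policy(unobserved, rng):
    """Baseline: pick any unobserved feature uniformly at random."""
    return rng.choice(sorted(unobserved))

def acquisition_order(select, n_features):
    """Sequential loop: query one feature per step until none remain."""
    unobserved = set(range(n_features))
    order = []
    while unobserved:
        chosen = select(unobserved)
        order.append(chosen)
        unobserved.remove(chosen)
    return order

weights = [0.5, 2.0, 1.0]      # hypothetical regression weights
variances = [1.0, 1.0, 9.0]    # hypothetical feature variances

greedy = acquisition_order(lambda u: variance_policy(weights, variances, u), 3)
rng = random.Random(0)
baseline = acquisition_order(lambda u: random_policy(u, rng), 3)
print("variance policy order:", greedy)
print("random policy order:", baseline)
```

In this toy the variance policy queries feature 2 first (score 9.0), then feature 1 (4.0), then feature 0 (0.25). In Azua the corresponding scores come from the trained model rather than from a closed-form expression.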
4. Reference results
Supported datasets
We provide `variables.json` files and model configurations for the following datasets:
- UCI datasets: webpage
- MNIST: webpage
- CIFAR-10: webpage
- NeurIPS 2020 Education Challenge datasets: webpage
  - eedi_task_1_2_binary: The data for the first two tasks. It uses only correct (1) or wrong (0) answers.
  - eedi_task_1_2_categorical: The data for the first two tasks. It uses A, B, C, D answers.
  - eedi_task_3_4_binary: The data for the last two tasks. It uses only correct (1) or wrong (0) answers.
  - eedi_task_3_4_categorical: The data for the last two tasks. It uses A, B, C, D answers.
  - eedi_task_3_4_topics: The data for the last two tasks. To produce the experimental results in VICause, binary answers are used. It has additional topic metadata.
- Neuropathic Pain Diagnosis Simulator Dataset: webpage
  - Denoted as "Neuropathic_pain" below. You need to use the simulator to generate the data.
- Synthetic relationships: synthetic data generated by sampling the underlying true causal structure and then generating the data points from it.
- Yahoo: webpage
- Goodreads: webpage. Refer to section 5.7.3 for more details.
Missing Value Prediction (MVP)
Test Data Normalized RMSE
For evaluation, we apply row-wise splitting and hold out 30% of the data for testing.
Dataset | Partial VAE | VAEM | Predictive VAEM | MNAR Partial VAE | B-PVAE | Mean imputing | Zero imputing | Min imputing | Majority vote | MICE | MissForest |
---|---|---|---|---|---|---|---|---|---|---|---|
Bank | 0.51 | 0.66 | 0.56 | -- | -- | -- | -- | -- | 0.51 | -- | -- |
Boston | 0.17 | 0.18 | -- | -- | 0.18 | 0.23 | -- | -- | 0.37 | -- | 0.15 |
Concrete | 0.18 | 0.19 | -- | -- | -- | 0.22 | -- | -- | 0.27 | -- | 0.13 |
Energy | 0.22 | 0.32 | -- | -- | 0.25 | 0.35 | -- | -- | 0.48 | -- | 0.24 |
Iris | 0.59 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
Kin8nm | 0.27 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
Wine | 0.17 | 0.17 | -- | -- | -- | 0.24 | -- | -- | 0.31 | -- | 0.17 |
Yacht | 0.24 | 0.23 | -- | -- | -- | 0.3 | -- | -- | 0.3 | -- | 0.16 |
Yahoo | 0.36 | -- | -- | 0.3 | -- | -- | -- | -- | -- | -- | -- |
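The exact normalization behind the numbers above is not spelled out in this README. As a minimal sketch, assuming the errors are computed after min-max scaling each variable with its lower and upper bounds (the same bounds that `variables.json` can carry), a normalized RMSE could be computed as follows. The function name and the toy numbers are hypothetical.

```python
import math

def normalized_rmse(y_true, y_pred, lower, upper):
    """RMSE over all entries after scaling each column by its (upper - lower) range.

    Assumes upper[j] > lower[j] for every column j.
    """
    squared_errors = []
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p, lo, hi in zip(true_row, pred_row, lower, upper):
            squared_errors.append(((t - p) / (hi - lo)) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Two held-out rows with two variables each (toy values).
y_true = [[10.0, 0.0], [20.0, 1.0]]
y_pred = [[12.0, 0.5], [20.0, 1.0]]

score = normalized_rmse(y_true, y_pred, lower=[0.0, 0.0], upper=[20.0, 1.0])
print(f"normalized RMSE: {score:.4f}")
```

Scaling by the range makes errors comparable across variables with very different units, which is why a single number per dataset is meaningful.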
Accuracy
Please note that for binary data (e.g. eedi_task_3_4_binary), we report accuracy to compare with the literature.
Dataset | Partial VAE | VICause | CORGI | GRAPE | GCMC | Graph Convolutional Network | Graph Attention Network | GraphSAGE |
---|---|---|---|---|---|---|---|---|
eedi_task_3_4_binary | 0.72 | -- | 0.71 | 0.71 | 0.69 | 0.71 | 0.6 | 0.69 |
eedi_task_3_4_categorical | 0.57 | -- | -- | -- | -- | -- | -- | -- |
eedi_task_3_4_topics | 0.71 | 0.69 | -- | -- | -- | -- | -- | -- |
Neuropathic_pain | 0.94 | 0.95 | -- | -- | -- | -- | -- | -- |
Next Best Question (NBQ): area under information curve (AUIC)
To evaluate the performance of different models on the NBQ task, we compare the area under the information curve (AUIC). See our paper for details. AUIC is calculated as follows: at each step of the NBQ task, each model proposes one variable to collect and makes new predictions for the target variable. We can then calculate the predictive error (e.g., RMSE) of the target variable at each step. This creates the information curve as the NBQ task progresses. The area under the information curve (AUIC) can then be used to compare performance across models and strategies. A smaller AUIC value indicates better performance.
Dataset | Partial VAE | VAEM | Predictive VAEM | MNAR Partial VAE | B-PVAE |
---|---|---|---|---|---|
Bank | 6.6 | 6.49 | 5.91 | -- | -- |
Boston | 2.03 | 2.0 | -- | -- | 1.96 |
Concrete | 1.48 | 1.47 | -- | -- | -- |
Energy | 1.18 | 1.3 | -- | -- | 1.44 |
Iris | 2.8 | -- | -- | -- | -- |
Kin8nm | 1.28 | -- | -- | -- | -- |
Wine | 2.14 | 2.45 | -- | -- | -- |
Yacht | 0.94 | 0.9 | -- | -- | -- |
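The AUIC described above can be sketched in a few lines. This is a minimal illustration that assumes unit-width steps and a plain sum of the per-step errors. The integration rule actually used for the table is not stated here, and the two error curves are made-up numbers.

```python
def area_under_information_curve(errors_per_step):
    """Sum the predictive errors recorded after each acquisition step (unit step width)."""
    return sum(errors_per_step)

# Hypothetical target RMSE after each of four acquisition steps.
informed_curve = [1.0, 0.6, 0.4, 0.3]   # a policy that picks informative variables early
random_curve = [1.0, 0.9, 0.7, 0.3]     # a policy that picks variables at random

informed_auic = area_under_information_curve(informed_curve)
random_auic = area_under_information_curve(random_curve)
print(f"informed policy AUIC: {informed_auic:.2f}")
print(f"random policy AUIC:   {random_auic:.2f}")
```

Both curves end at the same error once everything is observed, but the informed policy reduces the error sooner, so its area is smaller.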
Causal discovery (CD)
We provide F1 scores for adjacency and orientation to measure the causal discovery results. Please refer to the VICause paper for details.
Dataset | VICause Adjacency.F1 | VICause Orientation.F1 |
---|---|---|
Neuropathic_pain | 0.28 | 0.26 |
Synthetic_relationships | 0.82 | 0.47 |
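For readers unfamiliar with these two metrics, the following minimal sketch shows one common way to compute them. It assumes that adjacency F1 compares the undirected skeletons of the predicted and true graphs, and that orientation F1 compares the directed edges. The precise definitions used for the table are in the VICause paper and may differ. The graphs are toy examples.

```python
def f1_score(predicted, actual):
    """F1 between two sets of edges."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)

def adjacency_f1(pred_edges, true_edges):
    """Compare skeletons: an edge counts regardless of its direction."""
    def skeleton(edges):
        return {frozenset(edge) for edge in edges}
    return f1_score(skeleton(pred_edges), skeleton(true_edges))

def orientation_f1(pred_edges, true_edges):
    """Compare directed edges: an edge counts only if its direction matches."""
    return f1_score(set(pred_edges), set(true_edges))

# Toy graphs, edges written as (cause, effect).
true_edges = {("A", "B"), ("B", "C")}
pred_edges = {("A", "B"), ("C", "B"), ("A", "C")}

print(f"adjacency F1:   {adjacency_f1(pred_edges, true_edges):.2f}")
print(f"orientation F1: {orientation_f1(pred_edges, true_edges):.2f}")
```

The predicted graph finds both true adjacencies but reverses one edge and adds a spurious one, so its orientation score is lower than its adjacency score.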
5. Model details
5.1 Partial VAE
Model Description
Partial VAE (PVAE) is an unsupervised deep generative model that is specifically designed to handle missing data. We mainly use this model to learn the underlying structure (latent representation) of partially observed data and to perform missing data imputation. Like a vanilla VAE, PVAE consists of an encoder and a decoder. The PVAE encoder is parameterized by the so-called set-encoder (point-net, see our paper for details), which is able to extract the latent representation from partially observed data. The PVAE decoder then takes the extracted latent representation as input and generates values for both missing entries (imputation) and observed entries (reconstruction).
The partial encoder
One of the major differences between PVAE and VAE is that the PVAE encoder can handle missing data in a principled way. The PVAE encoder is parameterized by the so-called set-encoder, which processes partially observed data in three steps: 1, feature embedding; 2, permutation-invariant aggregation; and 3, encoding into statistics of the latent representation. These are implemented in `feature_embedder.py`, `point_net.py`, and `encoder.py`, respectively. See our paper, Section 3.2 for technical details.
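The three steps can be sketched in plain Python as follows. This is a toy illustration with random, untrained weights, not the code in the files named above: `ToySetEncoder` and its helpers are hypothetical names, and the embedding is assumed to be the product of the observed value with a per-feature vector.

```python
import random

def linear(weights, bias, vector):
    """Dense layer on plain Python lists: weights @ vector + bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def random_layer(rng, n_out, n_in):
    """Random (untrained) weights and bias, for illustration only."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, bias

class ToySetEncoder:
    def __init__(self, n_features, embedding_dim, set_embedding_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.embeddings = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
                           for _ in range(n_features)]
        self.h_layer = random_layer(rng, set_embedding_dim, embedding_dim)
        self.out_layer = random_layer(rng, 2 * latent_dim, set_embedding_dim)
        self.set_embedding_dim = set_embedding_dim
        self.latent_dim = latent_dim

    def encode(self, observed):
        """observed: list of (feature_index, value) pairs; missing entries are simply absent."""
        pooled = [0.0] * self.set_embedding_dim
        for index, value in observed:
            # Step 1: feature embedding, here the value times a per-feature vector.
            s = [value * e for e in self.embeddings[index]]
            # Step 2: shared map followed by a sum, which ignores the order of the pairs.
            h = [max(0.0, a) for a in linear(*self.h_layer, s)]
            pooled = [p + a for p, a in zip(pooled, h)]
        # Step 3: map the pooled vector to the mean and log-variance of the latent code.
        stats = linear(*self.out_layer, pooled)
        return stats[:self.latent_dim], stats[self.latent_dim:]

encoder = ToySetEncoder(n_features=4, embedding_dim=3, set_embedding_dim=5, latent_dim=2)
# Feature 1 is missing; the same observed entries are passed in two different orders.
mean_a, logvar_a = encoder.encode([(0, 1.5), (2, -0.3), (3, 0.7)])
mean_b, logvar_b = encoder.encode([(3, 0.7), (0, 1.5), (2, -0.3)])
print(mean_a, mean_b)
```

Because the aggregation is a sum over whichever entries happen to be observed, the encoder accepts any missingness pattern without placeholder values, and the result does not depend on the order in which the observed entries are presented.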
Model configs
- `"embedding_dim"`: dimensionality of embedding (referred to as e in the paper) for each input to PVAE encoder. See our paper for details.
- `"set_embedding_dim"`: dimensionality of output set embedding (referred to as h in the paper) in PVAE encoder. See our paper for details.
- `"set_embedding_multiply_weights"`: Whether or not to take the product of x with embedding weights when feeding through. Default: `true`.
- `"latent_dim"`: dimensionality of the PVAE latent representation.
- `"encoder_layers"`: structure of encoder network (excluding input and output layers).
- `"decoder_layers"`: structure of decoder network (excluding input and output layers).
- `"non_linearity"`: Choice of non-linear activation functions for hidden layers of PVAE decoder. Possible choices: `"ReLU"`, `"Sigmoid"`, and `"Tanh"`. Default is `"ReLU"`.
- `"activation_for_continuous"`: Choice of non-linear activation functions for the output layer of PVAE decoder. Possible choices: `"Identity"`, `"ReLU"`, `"Sigmoid"`, and `"Tanh"`. Default is `"Sigmoid"`.
- `"init_method"`: Initialization method for PVAE weights. Possible choices: `"default"` (PyTorch default), `"xavier_uniform"`, `"xavier_normal"`, `"uniform"`, and `"normal"`. Default is `"default"`.
- `"encoding_function"`: The permutation invariant operator of PVAE encoder. Default is `"sum"`.
- `"decoder_variances"`: Variance of the observation noise added to the PVAE decoder output (for continuous variables only).
- `"random_seed"`: Random seed used for initializing the model. Default: `[0]`.
- `"categorical_likelihood_coefficient"`: The coefficient for the likelihood terms of categorical variables. Default: `1.0`.
- `"kl_coefficient"`: The Beta coefficient for the KL term. Default: `1.0`.
- `"variance_autotune"`: Automatically learn the variance of the observation noise or not. Default: `false`.
- `"use_importance_sampling"`: Use importance sampling or not, when calculating the PVAE ELBO. When turned on, the PVAE will turn into the importance weighted version of PVAE. See IWAE for more details. Default: `false`.
- `"squash_input"`: When preprocessing the data, squash the data to be between 0 and 1 or not. Default: `true`. Note that when `false`, you should change the config of `"activation_for_continuous"` accordingly (from `"Sigmoid"` to `"Identity"`).
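To show how these options fit together, here is a sketch of a PVAE configuration written as a Python dictionary and serialized to JSON. The keys and the stated defaults come from the list above. The flat layout, every value marked "hypothetical", and the helper `output_activation_matches_squashing` are assumptions made for illustration, so check the files under `configs/` for the layout Azua actually expects.

```python
import json

# Values marked "hypothetical" have no default stated in the text above.
pvae_config = {
    "embedding_dim": 10,                      # hypothetical
    "set_embedding_dim": 20,                  # hypothetical
    "set_embedding_multiply_weights": True,
    "latent_dim": 10,                         # hypothetical
    "encoder_layers": [50, 100],              # hypothetical
    "decoder_layers": [50, 100],              # hypothetical
    "non_linearity": "ReLU",
    "activation_for_continuous": "Sigmoid",
    "init_method": "default",
    "encoding_function": "sum",
    "decoder_variances": 0.02,                # hypothetical
    "random_seed": [0],
    "categorical_likelihood_coefficient": 1.0,
    "kl_coefficient": 1.0,
    "variance_autotune": False,
    "use_importance_sampling": False,
    "squash_input": True,
}

def output_activation_matches_squashing(config):
    """Return False for the combination the note warns about:
    unsquashed inputs combined with a Sigmoid output activation."""
    return config["squash_input"] or config["activation_for_continuous"] != "Sigmoid"

# Turning squashing off also requires switching the output activation.
unsquashed = dict(pvae_config, squash_input=False, activation_for_continuous="Identity")
config_json = json.dumps(unsquashed, indent=2)
print(config_json)
```

A check like this can catch the mismatch before training: a Sigmoid output can only produce values in (0, 1), so it cannot reconstruct data that was never squashed into that range.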
5.2 VAEM
Model Description
Real-world datasets often contain variables of different types (categorical, ordinal, continuous, etc.) and with different marginal distributions. Although PVAE is able to cope with missing data, it does not handle heterogeneous mixed-type data very well. Azua provides a new model called VAEM to handle such scenarios.
The marginal VAEs and the dependency network
In short, VAEM is an extension of VAE that can handle such heterogeneous data. It is a deep generative model that is trained in two stages.
- In the first stage, we model the marginal distribution of each variable separately. This is done by fitting a different vanilla VAE independently to each data dimension. This is implemented in `marginal_vaes.py`. Those one-dimensional VAEs capture the marginal properties of each variable and provide a latent representation that is more homogeneous across dimensions.
- In the second stage, we capture the dependencies among the variables. To this end, another Partial VAE, called the dependency network, is built on top of the latent representations provided by the first-stage VAEs. This is implemented in `dependency_network_creator`.

To summarize, we can think of the first stage of VAEM as a data pre-processing step that transforms heterogeneous mixed-type data into a homogeneous version of the data. Then, we can perform missing data imputation and personalized information acquisition on the pre-processed data.
Model configs
Since the main components of VAEM are VAEs and PVAE, the model configs of VAEM mostly inherit from PVAE (but with proper prefixes). For example, in the config files of VAEM, `"marginal_encoder_layers"` stands for the structure of the encoder network of the marginal VAEs, and `dep_embedding_dim` stands for the dimensionality of the embedding of the dependency network. Note, however, that since the marginal VAEs are vanilla VAEs rather than PVAEs, the config arguments corresponding to set-encoders are disabled.
5.3 Predictive VAEM
Model Description
In some scenarios, when performing missing data imputation and information acquisition, the user might have a supervised learning problem in mind. That is, the observable variables can be classified into two categories: the input variables (covariates) and the output variable (target variable). Both PVAE and VAEM treat the input variables and the output variable equally, and learn a joint distribution over them. In contrast, predictive VAEM simultaneously learns a joint distribution over the input variables and a conditional distribution of the target given the input variables. We found that this approach generally yields better predictive performance on the target variable in practice.
The predictive model
The conditional distribution of the target, given the input variables (as well as the latent representation), is parameterized by a feed-forward neural network. This is implemented in `marginal_vaes_with_predictive_vae`.
Model configs
The predictive VAEMs share the same configs as VAEMs.
5.4 MNAR Partial VAE
Real-world missing values are often associated with complex generative processes, where the cause of the missingness may not be fully observed. This is known as missing not at random (MNAR) data. However, many standard imputation methods, such as our PVAE and VAEM, do not take the missingness mechanism into account, resulting in biased imputation values when MNAR data is present. Also, many practical methods for MNAR do not have identifiability guarantees: their parameters cannot be uniquely determined by partially observed data, even with access to infinite samples. Azua provides a new deep generative model, called MNAR Partial VAE, that addresses both of these issues.
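The bias mentioned above is easy to reproduce in a small simulation. The sketch below is a toy experiment, not part of Azua: it assumes a self-masking mechanism in which larger values are more likely to be missing, and shows that the mean of the observed entries (what mean imputation would fill in) underestimates the true mean.

```python
import math
import random

rng = random.Random(0)

# Ground-truth values drawn from a standard normal distribution.
values = [rng.gauss(0.0, 1.0) for _ in range(20000)]

def is_missing(x):
    """Self-masking MNAR (assumed mechanism): larger values are more likely to be missing."""
    return rng.random() < 1.0 / (1.0 + math.exp(-2.0 * x))

observed = [x for x in values if not is_missing(x)]

true_mean = sum(values) / len(values)
observed_mean = sum(observed) / len(observed)
print(f"true mean:     {true_mean:.3f}")
print(f"observed mean: {observed_mean:.3f}")
print(f"fraction observed: {len(observed) / len(values):.2f}")
```

Any imputer that learns only from the observed entries inherits this shift, which is why the missingness mechanism itself has to be modeled.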
Mask net and identifiable PVAE
MNAR PVAE has two main components: a mask net and an identifiable PVAE. The mask net is a neural network (implemented in `mask_net`) that models the conditional probability distribution of the missingness mask, given the data (and latent representations). This helps to debias the MNAR mechanism. The identifiable PVAE is a variant of VAE that, when combined with the mask net, provides identifiability guarantees under certain assumptions. Unlike vanilla PVAEs, the identifiable PVAE uses a neural network, called the prior net, to define the prior distribution on the latent space. The prior net takes some fully observed auxiliary variables as inputs (you may think of them as side information) and generates the distribution on the latent space. By default, unless specified otherwise, we automatically treat fully observed variables as auxiliary variables. For more details, please see our paper (link will be available in the future).
Model configs
Most of the model configs are the same as PVAE, except the following:
- `"mask_net_config"`: This object contains the model configuration of the mask net.
  - `"decoder_layers"`: The neural network structure of the mask net.
  - `"mask_net_coefficient"`: Weight of the mask net loss function.
  - `"latent connection"`: If `true`, the mask net will also take the latent representations as input.
- `"prior_net_config"`: This object contains the model configuration of the prior net.
  - `"use_prior_net_to_train"`: If `true`, we will use the prior net to train the PVAE, instead of the standard normal prior.
  - `"encoder_layers"`: The neural network structure of the prior net.
  - `"use_prior_net_to_impute"`: Whether to use the prior net to perform imputation. By default, we always set this to `false`.
  - `"degenerate_prior"`: As mentioned before, we automatically treat fully observed variables as auxiliary variables. However, in some cases, fully observed variables might not be available (for example, in recommender data). `"degenerate_prior"` determines how we handle such degenerate cases. Currently, we only support the `"mask"` method, which uses the missingness masks themselves as auxiliary variables.
5.5 Bayesian partial VAE (B-PVAE)
Standard training of PVAE produces the point estimates for the neural network parameters in the decoder. This approach does not quantify the epistemic uncertainty of our model. B-PVAE is a variant of PVAE, that applies a fully Bayesian treatment to the weights. The model setting is the same as in BELGAM, whereas the approximate inference is done using the inducing weights approach.
Implementation
Implementation-wise, B-PVAE is based on Bayesianize, a lightweight Bayesian neural network (BNN) wrapper in PyTorch, which allows easy conversion of neural networks in existing scripts to their Bayesian versions with minimal changes. For more details, please see our GitHub repo.
5.6 VICause
Missing values constitute an important challenge in real-world machine learning for both prediction and causal discovery tasks. However, only few methods in causal discovery can handle missing data in an efficient way, while existing imputation methods are agnostic to causality. In this work we propose VICAUSE, a novel approach to simultaneously tackle missing value imputation and causal discovery efficiently with deep learning. Particularly, we propose a generative model with a structured latent space and a graph neural network-based architecture, scaling to large number of variables. Moreover, our method can discover relationship between groups of variables which is useful in many real-world applications. VICAUSE shows improved performance compared to popular and recent approaches in both missing value imputation and causal discovery.
For more information, please refer to the [paper](https://www.microsoft.com/en-us/research/publication/vicause-simultaneous-missing-value-imputation-and-causal-discovery/).
5.7 CoRGi, Graph Convolutional Network (GCN), GRAPE, Graph Convolutional Matrix Completion (GC-MC), and GraphSAGE
5.7.1 CoRGi and baselines
CoRGi is a GNN model that considers the rich data within nodes in the context of their neighbors. This is achieved by endowing CORGI’s message passing with a personalized attention mechanism over the content of each node. This way, CORGI assigns user-item-specific attention scores with respect to the words that appear in items. More detailed information can be found in our paper:
CORGI: Content-Rich Graph Neural Networks with Attention. J. Kim, A. Lamb, S. Woodhead, S. Peyton Jones, C. Zhang, M. Allamanis. RecSys: Workshop on Graph Neural Networks for Recommendation and Search, 2021
Graph Convolutional Network (GCN)
Azua provides a re-implementation of GCN. As a default, "average" is used for the aggregation function and nodes are randomly initialized. We adopt dropout with probability 0.5 for node embedding updates as well as for the prediction MLPs.
GRAPE
GRAPE is a GNN model that employs edge embeddings (please refer to this paper for details). It also adopts edge dropout, applied throughout all message-passing layers. Unlike the GRAPE proposed in the original paper, because of memory constraints, we do not initialize nodes with one-hot vectors or constants (ones).
Graph Convolutional Matrix Completion (GC-MC)
Compared to GCN, this model has a single message-passing layer. Also, for classification, each label is endowed with a separate message-passing channel. Here, we do not implement weight sharing. For more details, please refer to this paper.
GraphSAGE
GraphSAGE extends GCN by allowing the model to be trained on part of the graph, so that the model can be used in inductive settings. For more details, please refer to this paper.
Graph Attention Network (GAT)
During message aggregation, GAT uses the attention mechanism to let target nodes weight the messages from multiple source nodes differently. For more details, please refer to this paper.
5.7.2 Different node initializations
All GNN models allow different kinds of node initializations. This can be done by modifying the model config file. For example, to run CoRGi with SBERT initialization, change `"node_init": "random"` to `"node_init": "sbert_init"` in `configs/defaults/model_config_corgi.json`.
The list of allowed node initializations includes:
"random", "grape", "text_init" (TF-IDF), "sbert_init", "neural_bow_init", "bert_cls_init", "bert_average_init"
For example, the test performance of `GCN Init: NeuralBOW` in Table 2 of the paper on the Eedi dataset can be obtained by running:
python run_experiment.py eedi -mt graph_convolutional_network -dc configs/defaults/model_config_graph_convolutional_network.json
with `"node_init": "neural_bow_init"` in the corresponding model config file.
5.7.3 Datasets
CoRGi operates on content-augmented graph data.
Goodreads
Download the data from this link and place it under the `data` directory with the name `goodreads`.
The Goodreads dataset from the Goodreads website contains users and books. The content of each book-node is its natural language description. The dataset includes integer ratings from 1 to 5 between some books and users.
The pre-processing of this data can be found at
research_experiments/GNN/create_goodreads_dataset.py
Eedi
Download the data from this link and place it under the `data` directory with the name `eedi`.
This dataset is from the Diagnostic Questions - NeurIPS 2020 Education Challenge. It contains anonymized student and question identities with the student responses to some questions. The content of each question-node is the text of the question. Edge labels are binary: one and zero for correct and incorrect answers.
The pre-processing code for the datasets to be used with CoRGi can be found at:
research_experiments/eedi/
5.7.4 Running CoRGi
To run the CoRGi code with the Eedi dataset, first place the preprocessed data at
data/eedi/
Then, run the following code:
python run_experiment.py eedi -mt corgi
This can be done with different datasets and different GNN models. The training and validation performance can be tracked using TensorBoard, with logs written under the `runs` directory. Also, the trained model is saved with the `.pt` extension.
6. Other engineering details
Reproducibility
As the project uses PyTorch, we can't guarantee completely reproducible results across different platforms and devices. However, on a given platform and device, the results should be completely reproducible, i.e. running an experiment twice should give exactly the same results. More about the limitations on reproducibility in PyTorch can be found here.
Add your own dataset
To add a new dataset, add a new directory to the `data` folder, containing either the whole dataset in a file named `all.csv`, or a train/test split in files named `train.csv` and `test.csv`. In the former case, a train/test split will be generated, with an 80%/20% split by default.
Data can be specified in two formats using the `--data_type` flag in the entrypoint scripts. The default format is "csv", which assumes that each column represents a feature, and each row represents a data point. The alternative format is "sparse_csv", where there are 3 columns representing the row ID, column ID and value of a particular matrix element, as in a coordinate-list (COO) sparse matrix. All values not specified are assumed to be missing. In both cases, no header row should be included in the CSV file.
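To illustrate how the two formats relate, the following minimal sketch expands "sparse_csv" content into the dense row-per-data-point layout, leaving unspecified entries as `None`. It is not Azua's loader: the helper name is hypothetical, and zero-based row and column IDs are an assumption that the text above does not confirm.

```python
import csv
import io

def sparse_csv_to_rows(text, n_rows, n_cols):
    """Expand (row ID, column ID, value) triplets into a dense table.

    Entries that never appear in the triplets stay None, i.e. missing.
    Zero-based IDs are assumed here.
    """
    dense = [[None] * n_cols for _ in range(n_rows)]
    for row_id, col_id, value in csv.reader(io.StringIO(text)):
        dense[int(row_id)][int(col_id)] = float(value)
    return dense

# Three observed elements of a 2 x 3 matrix, with no header row.
sparse_text = "0,0,1.0\n0,2,3.5\n1,1,2.0\n"

dense = sparse_csv_to_rows(sparse_text, n_rows=2, n_cols=3)
for row in dense:
    print(row)
```

The sparse format is the natural choice for recommender-style data, where most matrix entries are unobserved and a dense CSV would consist mostly of empty cells.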
Variable metadata for each variable in a dataset can be specified in an optional file named `variables.json`. This file is an array of dictionaries, one for each variable in the dataset. For each variable, the following values may be specified:
- id: int, index of the variable in the dataset
- query: bool, whether this variable can be queried during active learning (True) or is a target (False).
- type: string, type of variable - either "continuous", "binary" or "categorical".
- lower: numeric, lower bound for the variable.
- upper: numeric, upper bound for the variable.
- name: string, name of the variable.
Any field that is not specified will be inferred where possible. Note: all features are assumed to be queriable, and thus not active learning targets, unless explicitly specified otherwise. Lower and upper values are inferred from the training data, and the type is inferred based on whether the variable takes exclusively integer values.
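As an illustration of the metadata format and of the kind of inference described above, here is a minimal sketch that builds a `variables.json`-style array from raw columns. The inference rule is an assumption, not Azua's actual logic: integer-only columns with values in {0, 1} are labelled "binary", other integer-only columns "categorical", and everything else "continuous". The default names and the helper `infer_variable` are hypothetical.

```python
import json

def infer_variable(index, values):
    """Build one metadata dictionary for a column, guessing the unspecified fields."""
    all_integers = all(float(v).is_integer() for v in values)
    if not all_integers:
        var_type = "continuous"
    elif set(values) <= {0, 1}:
        var_type = "binary"
    else:
        var_type = "categorical"
    return {
        "id": index,
        "query": True,               # queriable unless stated otherwise
        "type": var_type,
        "lower": min(values),        # bounds taken from the (training) data
        "upper": max(values),
        "name": f"Column {index}",   # hypothetical default name
    }

# Toy training data, one list per column.
columns = [[0.5, 1.25, 3.0], [0, 1, 1], [1, 2, 5]]

variables = [infer_variable(i, column) for i, column in enumerate(columns)]
variables[-1]["query"] = False       # mark the last column as the prediction target
variables_json = json.dumps(variables, indent=2)
print(variables_json)
```

Writing the file explicitly is the safer option whenever inference could guess wrong, for example for a continuous variable that happens to contain only whole numbers in the training data.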
Split type for the dataset
The source data can be split into train/validation/test datasets either based on rows or elements. The former is split by rows of the matrix, whereas the latter is split by individual elements of the matrix, so that different elements of a row can appear in different data splits (i.e. train or validation or test).
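The difference between the two split types can be sketched as follows. For brevity this toy produces only a train/test split and uses hypothetical helper names. It is not the splitting code used by Azua.

```python
import random

def split_by_rows(n_rows, test_fraction, rng):
    """Assign whole rows to either the train or the test set."""
    rows = list(range(n_rows))
    rng.shuffle(rows)
    n_test = round(n_rows * test_fraction)
    return set(rows[n_test:]), set(rows[:n_test])

def split_by_elements(n_rows, n_cols, test_fraction, rng):
    """Assign individual (row, column) elements to either the train or the test set."""
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    rng.shuffle(cells)
    n_test = round(len(cells) * test_fraction)
    return set(cells[n_test:]), set(cells[:n_test])

rng = random.Random(0)
train_rows, test_rows = split_by_rows(10, 0.2, rng)
train_cells, test_cells = split_by_elements(10, 4, 0.2, rng)

print("test rows:", sorted(test_rows))
print("rows with at least one test element:", sorted({r for r, _ in test_cells}))
```

With a row-wise split, a test data point is never seen during training. With an element-wise split, a single row can contribute some elements to training and others to testing, which suits matrix-completion settings such as recommender data.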
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.