[ICML 2022] The official implementation of Graph Stochastic Attention (GSAT).

Last update: Nov 27, 2022

Related tags

Overview

Graph Stochastic Attention (GSAT)

The official implementation of GSAT for our paper: Interpretable and Generalizable Graph Learning via Stochastic Attention Mechanism, to appear in ICML 2022.

Introduction

Commonly used attention mechanisms do not impose any constraints during training (besides normalization), and thus may lack interpretability. GSAT is a novel attention mechanism for building interpretable graph learning models. It injects stochasticity to learn attention, where a higher attention weight means a higher probability of the corresponding edge being kept during training. Such a mechanism will push the model to learn higher attention weights for edges that are important for prediction accuracy, which provides interpretability. To further improve the interpretability for graph learning tasks and avoid trivial solutions, we derive regularization terms for GSAT based on the information bottleneck (IB) principle. As a by-product, IB also helps model generalization. Fig. 1 shows the architecture of GSAT.

Figure 1. The architecture of GSAT.

Installation

We have tested our code on Python 3.9 with PyTorch 1.10.0, PyG 2.0.3 and CUDA 11.3. Please follow the following steps to create a virtual environment and install the required packages.

Create a virtual environment:

conda create --name gsat python=3.9
conda activate gsat

Install dependencies:

conda install -y pytorch==1.10.0 torchvision cudatoolkit=11.3 -c pytorch
pip install torch-scatter==2.0.9 torch-sparse==0.6.12 torch-cluster==1.5.9 torch-spline-conv==1.2.1 torch-geometric==2.0.3 -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
pip install -r requirements.txt

In case a lower CUDA version is required, please use the following command to install dependencies:

conda install -y pytorch==1.9.0 torchvision==0.10.0 torchaudio==0.9.0 cudatoolkit=10.2 -c pytorch
pip install torch-scatter==2.0.9 torch-sparse==0.6.12 torch-cluster==1.5.9 torch-spline-conv==1.2.1 torch-geometric==2.0.3 -f https://data.pyg.org/whl/torch-1.9.0+cu102.html
pip install -r requirements.txt

Run Examples

We provide examples with minimal code to run GSAT in ./example/example.ipynb. We have tested the provided examples on Ba-2Motifs (GIN), Mutag (GIN) and OGBG-Molhiv (PNA). Yet, to implement GSAT* one needs to load a pre-trained model first in the provided example.

It should be able to run on other datasets as well, but some hard-coded hyperparameters might need to be changed accordingly. To reproduce results for other datasets, please follow the instructions in the following section.

Reproduce Results

We provide the source code to reproduce the results in our paper. The results of GSAT can be reproduced by running run_gsat.py. To reproduce GSAT*, one needs to run pretrain_clf.py first and change the configuration file accordingly (from_scratch: false).

To pre-train a classifier:

cd ./src
python pretrain_clf.py --dataset [dataset_name] --backbone [model_name] --cuda [GPU_id]

To train GSAT:

cd ./src
python run_gsat.py --dataset [dataset_name] --backbone [model_name] --cuda [GPU_id]

dataset_name can be choosen from ba_2motifs, mutag, mnist, Graph-SST2, spmotif_0.5, spmotif_0.7, spmotif_0.9, ogbg_molhiv, ogbg_moltox21, ogbg_molbace, ogbg_molbbbp, ogbg_molclintox, ogbg_molsider.

model_name can be choosen from GIN, PNA.

GPU_id is the id of the GPU to use. To use CPU, please set it to -1.

Training Logs

Standard output provides basic training logs, while more detailed logs and interpretation visualizations can be found on tensorboard:

tensorboard --logdir=./data/[dataset_name]/logs

Hyperparameter Settings

All settings can be found in ./src/configs.

Instructions on Acquiring Datasets

Ba_2Motifs
- Raw data files can be downloaded automatically, provided by PGExplainer and DIG.
Spurious-Motif
- Raw data files can be generated automatically, provide by DIR.
OGBG-Mol
- Raw data files can be downloaded automatically, provided by OGBG.
Mutag
- Raw data files need to be downloaded here, provided by PGExplainer.
- Unzip Mutagenicity.zip and Mutagenicity.pkl.zip.
- Put the raw data files in ./data/mutag/raw.
Graph-SST2
- Raw data files need to be downloaded here, provided by DIG.
- Unzip the downloaded Graph-SST2.zip.
- Put the raw data files in ./data/Graph-SST2/raw.
MNIST-75sp
- Raw data files need to be generated following the instruction here.
- Put the generated files in ./data/mnist/raw.

FAQ

Does GSAT encourage sparsity?

No, GSAT doesn't encourage generating sparse subgraphs. We find r = 0.7 (Eq.(9) in our paper) can generally work well for all datasets in our experiments, which means during training roughly 70% of edges will be kept (kind of still large). This is because GSAT doesn't try to provide interpretability by finding a small/sparse subgraph of the original input graph, which is what previous works normally do and will hurt performance significantly for inhrently interpretable models (as shown in Fig. 7 in the paper). By contrast, GSAT provides interpretability by pushing the critical edges to have relatively lower stochasticity during training.

How to choose the value of `r`?

A grid search in [0.5, 0.6, 0.7, 0.8, 0.9] is recommended, but r = 0.7 is a good starting point. Note that in practice we would decay the value of r gradually during training from 0.9 to the chosen value.

`p` or `α` to implement Eq.(9)?

Recall in Fig. 1, p is the probability of dropping an edge, while α is the sampled result from Bern(p). In our provided implementation, as an empirical choice, α is used to implement Eq.(9) (the Gumbel-softmax trick makes α essentially continuous in practice). We find that when α is used it may provide more regularization and makes the model more robust to hyperparameters. Nonetheless, using p can achieve the same performance, but it needs some more tuning.

Can you show an example of how GSAT works?

Below we show an example from the ba_2motifs dataset, which is to distinguish five-node cycle motifs (left) and house motifs (right). To make good predictions (minimize the cross-entropy loss), GSAT will push the attention weights of those critical edges to be relatively large (ideally close to 1). Otherwise, those critical edges may be dropped too frequently and thus result in a large cross-entropy loss. Meanwhile, to minimize the regularization loss (the KL divergence term in Eq.(9) of the paper), GSAT will push the attention weights of other non-critical edges to be close to r, which is set to be 0.7 in the example. This mechanism of injecting stochasticity makes the learned attention weights from GSAT directly interpretable, since the more critical an edge is, the larger its attention weight will be (the less likely it can be dropped). Note that ba_2motifs satisfies our Thm. 4.1 with no noise, and GSAT achieves perfect interpretation performance on it.

Figure 2. An example of the learned attention weights.

Reference

If you find our paper and repo useful, please cite our paper:

@article{miao2022interpretable,
  title={Interpretable and Generalizable Graph Learning via Stochastic Attention Mechanism},
  author={Miao, Siqi and Liu, Miaoyuan and Li, Pan},
  journal={arXiv preprint arXiv:2201.12987},
  year={2022}
}

Comments

directed edge weights on undirected graphs

Hello,

First of all, thank you for your paper and your code, it is a pleasure to work with it.

However, I have a question about the following line : https://github.com/Graph-COM/GSAT/blob/ea900dbe8d27fa64c30f2fe46ab8b5a68ef719ca/src/run_gsat.py#L79 When I run it, this line does not seem to do anything more than edge_att = (att + att) / 2. As a result, edge weights are different depending on the direction of the edge (0 to 1 != 1 to 0). Have I missed anything?

opened by simoons95 6
Issues with implementation

Hi all, I tried to implement the info loss in my own GNN. I am using a custom convolution in a custom dataset that might have leakage, so this might be the source of error. But I am trying to understand why the model would behave the way it is behaving. I would appreciate any ideas/feedback.

My model is for link prediction on small subgraphs, where each for each edge I wanna predict, I sample a subgraph around it.

I am implementing the info_loss just like in your code: info_loss = (edge_att * torch.log(edge_att/r + 1e-6) + (1-edge_att) * torch.log((1-edge_att)/(1-r+1e-6) + 1e-6)).mean()

If I don't use any sort of info loss, when I train my model, my edge attention looks like this:

If I use l1 loss (just minimizing edge_att.mean()), my edge attention looks like this:

If I use l1 loss, but multiply by 1e-3, it looks like this:

However, if I use the info loss proposed in your paper, my edge attention agglutinates in values close to r. For example, for r = 0.3, I get the attention distribution below. If I use r=0.5, then the dense part of the historgram moves to the middle, and if I use something like r=0.7 or r=0.9, then all my attention weights are closer to 1.

I tried to understand the intuition behind it by plotting the curve att x info_loss for different values of r

So basically the info_loss is approximately zero when closer to r, and positive everywhere else. This is forcing my model to try to have the attention always close to r (which I am not sure if I understand why), and apparently this is exactly what my model is doing. What confuses me is that in your paper, r is recommended to be between 0.5 and 0.9. However, in my current setting, this forces the majority of my edge attention to be > 0.5, instead of making them sparse.

I wonder if I am doing something wrong, if info_loss should have a smaller weight, or if my concrete_sampler should have a higher temperature to force a bernoulli-like distribution, or if maybe my model simply doesnt really need the edges, and it is ok with using any edge_attention value, hacking a way to get the same solution just based on node embedding, for example, without message passing. Maybe I have excess of dropout during training? (I do both node and edge dropout).

Please let me know if you have any ideas. Thanks in advance!

opened by fmellomascarenhas 4
Unstable accuracy on test set

Hello,

I also have a second question (see https://github.com/Graph-COM/GSAT/issues/5 for the first one). When I run your code on ba_2motifs, I obtain the following graph: Fortunately, the validation set have the exact same behaviour as the test set, so when the validation set accuracy is high, the test set accuracy is high too, hence the good results. Is it expected that it is so unstable? What could I do to avoid that?

opened by simoons95 2
about subgraph Gs

Thanks for your amazing work! But where does the code reflect the process of obtaining subgraphs (Gs) through sampling.

If you see this issue, please tell me the answer.Thanks in advance!

opened by WuZongzhen 2
Add `reorder_like` to speed up the averaging of edge weights for undirected graphs

As kindly pointed out by @simoons95 in https://github.com/Graph-COM/GSAT/issues/5#issuecomment-1318688708, reorder_like addresses https://github.com/Graph-COM/GSAT/issues/5 and is much faster.
enhancement

opened by siqim 0
Bug fix: `torch_sparse.transpose` outputs are coalesced

torch_sparse.transpose would coalesce outputs by default and mess up results when input edge_index is not sorted. This issue does not affect any reported results, but it may cause problems potentially. Setting coalesced=False resolves the issue.
bug

opened by siqim 0

Owner

GitHub https://arxiv.org/abs/2201.12987

An implementation of "MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing" (ICML 2019).

MixHop and N-GCN ⠀ A PyTorch implementation of "MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing" (ICML 2019)

393 Dec 13, 2022

Imposter-detector-2022 - HackED 2022 Team 3IQ - 2022 Imposter Detector

HackED 2022 Team 3IQ - 2022 Imposter Detector By Aneeljyot Alagh, Curtis Kan, Jo

3 Aug 20, 2022

The official implementation of You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient.

You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient (paper) @misc{zhang2021compress,

46 Dec 7, 2022

Official implementation of "SinIR: Efficient General Image Manipulation with Single Image Reconstruction" (ICML 2021)

SinIR (Official Implementation) Requirements To install requirements: pip install -r requirements.txt We used Python 3.7.4 and f-strings which are in

47 Oct 11, 2022

Official implementation for "QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation" (CVPR 2022)

QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation (CVPR2022) https://arxiv.org/abs/2203.08483 Unpaired image-to-image (I2I

50 Dec 16, 2022

Official Pytorch Implementation of Relational Self-Attention: What's Missing in Attention for Video Understanding

Relational Self-Attention: What's Missing in Attention for Video Understanding This repository is the official implementation of "Relational Self-Atte

43 Dec 7, 2022

Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention

cosFormer Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention Update log 2022/2/28 Add core code License This

120 Dec 15, 2022

[ICML 2021] "Graph Contrastive Learning Automated" by Yuning You, Tianlong Chen, Yang Shen, Zhangyang Wang

Graph Contrastive Learning Automated PyTorch implementation for Graph Contrastive Learning Automated [talk] [poster] [appendix] Yuning You, Tianlong C

80 Nov 23, 2022

Official code for Score-Based Generative Modeling through Stochastic Differential Equations

Score-Based Generative Modeling through Stochastic Differential Equations This repo contains the official implementation for the paper Score-Based Gen

818 Jan 6, 2023

Official implementation of Self-supervised Graph Attention Networks (SuperGAT), ICLR 2021.

SuperGAT Official implementation of Self-supervised Graph Attention Networks (SuperGAT). This model is presented at How to Find Your Friendly Neighbor

127 Dec 28, 2022

Official PyTorch implementation of "AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks"

AASIST This repository provides the overall framework for training and evaluating audio anti-spoofing systems proposed in 'AASIST: Audio Anti-Spoofing

56 Jan 2, 2023

Official Code for ICML 2021 paper "Revisiting Point Cloud Shape Classification with a Simple and Effective Baseline"

Revisiting Point Cloud Shape Classification with a Simple and Effective Baseline Ankit Goyal, Hei Law, Bowei Liu, Alejandro Newell, Jia Deng Internati

115 Jan 4, 2023

Official code for UnICORNN (ICML 2021)

UnICORNN (Undamped Independent Controlled Oscillatory RNN) [ICML 2021] This repository contains the implementation to reproduce the numerical experime

21 Dec 22, 2022

Implementation of Stochastic Image-to-Video Synthesis using cINNs.

Stochastic Image-to-Video Synthesis using cINNs Official PyTorch implementation of Stochastic Image-to-Video Synthesis using cINNs accepted to CVPR202

135 Dec 28, 2022

PyTorch implementation for SDEdit: Image Synthesis and Editing with Stochastic Differential Equations

SDEdit: Image Synthesis and Editing with Stochastic Differential Equations Project | Paper | Colab PyTorch implementation of SDEdit: Image Synthesis a

536 Jan 5, 2023

PyTorch implementation for Stochastic Fine-grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition.

Stochastic CSLR This is the PyTorch implementation for the ECCV 2020 paper: Stochastic Fine-grained Labeling of Multi-state Sign Glosses for Continuou

28 Dec 19, 2022

PyTorch implementation for Score-Based Generative Modeling through Stochastic Differential Equations (ICLR 2021, Oral)

Score-Based Generative Modeling through Stochastic Differential Equations This repo contains a PyTorch implementation for the paper Score-Based Genera

757 Jan 4, 2023

An official source code for paper Deep Graph Clustering via Dual Correlation Reduction, accepted by AAAI 2022

Dual Correlation Reduction Network An official source code for paper Deep Graph Clustering via Dual Correlation Reduction, accepted by AAAI 2022. Any

109 Dec 23, 2022

The implementation of the algorithm in the paper "Safe Deep Semi-Supervised Learning for Unseen-Class Unlabeled Data" published in ICML 2020.

DS3L This is the code for paper "Safe Deep Semi-Supervised Learning for Unseen-Class Unlabeled Data" published in ICML 2020. Setups The code is implem

36 Oct 19, 2022