[WWW 2021 GLB] New Benchmarks for Learning on Non-Homophilous Graphs

Overview

New Benchmarks for Learning on Non-Homophilous Graphs

Here are the code and datasets accompanying the paper:
New Benchmarks for Learning on Non-Homophilous Graphs
Derek Lim (Cornell), Xiuyu Li (Cornell), Felix Hohne (Cornell), and Ser-Nam Lim (Facebook AI).
Workshop on Graph Learning Benchmarks, WWW 2021.
[PDF link]

There is code to load our proposed datasets, compute our measure of the presence of homophily, and train various graph machine learning models in our experimental setup.

Organization

main.py contains the main experimental scripts.

dataset.py loads our datasets.

models.py contains implementations of the graph machine learning models, though C&S is implemented in separate files (correct_smooth.py, cs_tune_hparams.py). Also, gcn-ogbn-proteins.py contains code for running GCN and GCN+JK on ogbn-proteins. Running several of the GNN models on the larger datasets may require at least 24 GB of VRAM.

homophily.py contains functions for computing homophily measures, including the measure we introduce, implemented in our_measure.
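
For reference, below is a minimal sketch of a class-adjusted edge homophily measure in the spirit of the one described in the paper. It is not the repository's exact implementation: the assumed tensor layout (a 2 x E edge_index tensor and a 1-D label vector) and the normalization details are assumptions, so consult our_measure in homophily.py for the authoritative version.

import torch

def class_adjusted_homophily_sketch(edge_index, label, num_classes=None):
    # Sketch only: for each class k, take the fraction of edges leaving class-k
    # nodes that land on same-class neighbors (h_k), subtract the class
    # proportion |C_k| / n, keep only the positive excess, and average over classes.
    label = label.squeeze()
    if num_classes is None:
        num_classes = int(label.max()) + 1
    src, dst = edge_index[0], edge_index[1]
    score = 0.0
    for k in range(num_classes):
        out_of_k = label[src] == k                        # edges whose source node has class k
        if int(out_of_k.sum()) == 0:
            continue
        h_k = (label[dst][out_of_k] == k).float().mean()  # same-class fraction among those edges
        p_k = (label == k).float().mean()                 # class proportion |C_k| / n
        score += max(float(h_k - p_k), 0.0)               # only class-wise excess over chance counts
    return score / max(num_classes - 1, 1)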

Datasets

As discussed in the paper, our proposed datasets are "twitch-e", "yelp-chi", "deezer", "fb100", "pokec", "ogbn-proteins", "arxiv-year", and "snap-patents", which can be loaded by load_nc_dataset in dataset.py by passing in their respective string names. Many of these datasets are included in the data/ directory, but due to their size, yelp-chi, snap-patents, and pokec are automatically downloaded from a Google Drive link when loaded from dataset.py. The arxiv-year and ogbn-proteins datasets are downloaded using OGB downloaders. load_nc_dataset returns an NCDataset, the documentation for which is also provided in dataset.py. It is functionally equivalent to OGB's Library-Agnostic Loader for Node Property Prediction, except that it returns torch tensors. See the OGB website for more specific documentation. Just like the OGB function, dataset.get_idx_split() returns a fixed dataset split for training, validation, and testing.
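
As a minimal usage sketch (run from the repository root so that dataset.py is importable; the split key names 'train'/'valid'/'test' mirror OGB's convention and are an assumption here, so check dataset.py for the authoritative interface):

from dataset import load_nc_dataset

dataset = load_nc_dataset('arxiv-year')   # single-graph dataset, fetched via the OGB downloader
split_idx = dataset.get_idx_split()       # fixed split, analogous to OGB's get_idx_split()
train_idx = split_idx['train']            # assumed key names, following the OGB convention
valid_idx = split_idx['valid']
test_idx = split_idx['test']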

When there are multiple graphs (as in the case of twitch-e and fb100), different ones can be loaded by passing in the sub_dataname argument to load_nc_dataset in dataset.py.

twitch-e consists of seven graphs ["DE", "ENGB", "ES", "FR", "PTBR", "RU", "TW"]. In the paper we test on DE.

fb100 consists of 100 graphs. We only include ["Amherst41", "Cornell5", "Johns Hopkins55", "Penn94", "Reed98"] in this repo, although others may be downloaded from the Internet Archive. In the paper we test on Penn94.
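
For example, to load the graphs tested in the paper (whether sub_dataname is passed positionally or as a keyword is an assumption; see the signature of load_nc_dataset in dataset.py):

from dataset import load_nc_dataset

twitch_de = load_nc_dataset('twitch-e', 'DE')    # the twitch-e graph used in the paper
penn94 = load_nc_dataset('fb100', 'Penn94')      # the fb100 graph used in the paper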

Installation instructions

  1. Create a new conda environment using python=3.8 (e.g. conda create --name non-hom python=3.8)
  2. Activate the conda environment (e.g. conda activate non-hom)
  3. Check CUDA version using nvidia-smi
  4. In the root directory of this repository, run bash install.sh cu110, replacing cu110 with your CUDA version (e.g. CUDA 11 -> cu110, CUDA 10.2 -> cu102, CUDA 10.1 -> cu101). We tested on Ubuntu 18.04 with CUDA 11.0.

Running experiments

  1. Make sure a results folder exists in the root directory.
  2. Our experiments are in the experiments/ directory. There are bash scripts for running methods on single and multiple datasets. Please note that the experiments must be run from the root directory. For instance, to run the MixHop experiments on snap-patents, use:
bash experiments/mixhop_exp.sh snap-patents

Some datasets require a second sub_dataset argument; e.g., to run the MixHop experiments on the DE sub-dataset of twitch-e, run:

bash experiments/mixhop_exp.sh twitch-e DE

Otherwise, run python main.py --help to see the full list of options for running experiments. As one example, to train a GAT with max jumping knowledge connections on (directed) arxiv-year with 32 hidden channels and 4 attention heads, run:

python main.py --dataset arxiv-year --method gatjk --hidden_channels 32 --gat_heads 4 --directed
Comments
  • cannot load snap-patents dataset

    Hi,

    Thank you for contributing the datasets! I tried to run your predefined experiments directly but got the following error when I ran bash experiments/mixhop_exp.sh snap-patents:

    Namespace(dataset='snap-patents', sub_dataset='', hidden_channels=8, dropout=0.5, lr=0.01, method='mixhop', epochs=500, cpu=False, weight_decay=0.001, display_step=25, hops=2, num_layers=2, runs=5, cached=False, gat_heads=8, lp_alpha=0.1, gpr_alpha=0.1, directed=True, jk_type='max', rocauc=False, num_mlp_layers=1, print_prop=False, train_prop=0.5, valid_prop=0.25, rand_split=False, no_bn=False)

    Traceback (most recent call last):
      File "/home/niepert-adm/Downloads/Non-Homophily-Benchmarks/main.py", line 32, in <module>
        dataset = load_nc_dataset(args.dataset, args.sub_dataset)
      File "/home/niepert-adm/Downloads/Non-Homophily-Benchmarks/dataset.py", line 102, in load_nc_dataset
        dataset = load_snap_patents_mat()
      File "/home/niepert-adm/Downloads/Non-Homophily-Benchmarks/dataset.py", line 256, in load_snap_patents_mat
        fulldata = scipy.io.loadmat(f'{DATAPATH}snap_patents.mat')
      File "/home/niepert-adm/miniconda3/envs/non-hom/lib/python3.9/site-packages/scipy/io/matlab/_mio.py", line 225, in loadmat
        MR, _ = mat_reader_factory(f, **kwargs)
      File "/home/niepert-adm/miniconda3/envs/non-hom/lib/python3.9/site-packages/scipy/io/matlab/_mio.py", line 74, in mat_reader_factory
        mjv, mnv = _get_matfile_version(byte_stream)
      File "/home/niepert-adm/miniconda3/envs/non-hom/lib/python3.9/site-packages/scipy/io/matlab/_miobase.py", line 251, in _get_matfile_version
        raise ValueError('Unknown mat file type, version %s, %s' % ret)
    ValueError: Unknown mat file type, version 32, 99

    bash experiments/mixhop_exp.sh twitch-e DE works!

    Best, Min

    opened by MinWang1997 2
  • Reproduce results on pokec and snap-patents

    Hello,

    Thank you so much for putting these datasets together for public access. Very interesting and well-written paper as well!

    I have encountered some issues when trying to reproduce the results on the pokec and snap-patents datasets. For the simplest GCN model, my results on these two datasets are ~62% and ~41% respectively. However, the accuracies in the paper are ~75% and ~45%. For both cases, I used hidden_dim = 32 and searched over lr = [0.1, 0.01, 0.001]. May I ask what hyperparameters I should use to achieve the accuracy reported in the paper? Also, after how many epochs did your training converge?

    In appendix B1 of the paper, it says that the best results were also searched over hidden_dim = [4, 8, 16, 32]. However, my training accuracy is similar to the validation/test accuracy, so I am not sure reducing hidden_dim will help. Also, since these two datasets are large, running the hyperparameter search again would be expensive. Could you please kindly share the exact hyperparameters you used?

    By the way, my results are on the first fixed split. My other guess is that the 5 fixed splits are very different from each other, so the averaged result could be high if the other splits produce higher accuracies. However, if that were the case, the variance across the 5 splits would seem too high. It would be great if you could also confirm that the accuracies across the 5 splits should be similar.

    I really appreciate your help.

    opened by ShichangZh 2
  • Embedding Method

    Hi, thank you for your great work! I have a question about the Twitch dataset (the same as the issue here). You wrote: "Vertex features are extracted based on the games played and liked, location, and streaming habits." but you didn't mention how the node feature vectors are constructed from those features. Could you please let us know?

    Thanks,

    opened by HAI-syz 0
Owner
Cornell University Artificial Intelligence