[NeurIPS 2021] Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods

Last update: Jan 3, 2023

Overview

Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods

Derek Lim*, Felix Hohne*, Xiuyu Li*, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, Ser-Nam Lim

Published at NeurIPS 2021

This repository contains code to load our proposed datasets, compute our measure of homophily, and train various graph machine learning models in our experimental setup. It also includes an implementation of LINKX, the new graph neural network that we develop.

Organization

main.py contains the main full batch experimental scripts.

main_scalable.py contains the minibatching experimental scripts.

parse.py contains flags for running models with specific settings and hyperparameters.

dataset.py loads our datasets.

models.py contains implementations of graph machine learning models, including our LINKX model. C&S is implemented in separate files (correct_smooth.py, cs_tune_hparams.py). Running several of the GNN models on larger datasets may require at least 24GB of VRAM.

homophily.py contains functions for computing homophily measures, including the one that we introduce in our_measure.

experiments/ contains the bash files to reproduce full batch experiments.

scalable_experiments/ contains the bash files to reproduce minibatching experiments.

wiki_scraping/ contains the Python scripts to reproduce the "wiki" dataset by querying the Wikipedia API and cleaning up the data.
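
To give a sense of what the measures in homophily.py compute, here is a minimal pure-Python sketch of the standard edge homophily ratio and of a class-insensitive measure following the formula in the paper, h_hat = (1 / (C - 1)) * sum_k max(h_k - |C_k| / n, 0). The function names are hypothetical, the graph is assumed to be undirected and given as an edge list, and this is not the repository's our_measure implementation, which remains the authoritative version.

```python
from collections import Counter


def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def class_insensitive_homophily(edges, labels):
    """Sketch of (1 / (C - 1)) * sum_k max(h_k - |C_k| / n, 0).

    h_k is the class-wise homophily of class k: among all edge endpoints
    incident to nodes of class k, the fraction whose neighbor is also in
    class k. Each undirected edge (u, v) is counted from both endpoints.
    """
    n = len(labels)
    classes = sorted(set(labels))
    if len(classes) < 2:
        return 0.0
    class_size = Counter(labels)
    degree = Counter()       # total degree summed over the nodes of each class
    same_degree = Counter()  # same-class neighbors summed over the nodes of each class
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            degree[labels[a]] += 1
            if labels[a] == labels[b]:
                same_degree[labels[a]] += 1
    total = 0.0
    for k in classes:
        if degree[k] == 0:
            continue  # a class with no edges contributes nothing
        h_k = same_degree[k] / degree[k]
        total += max(h_k - class_size[k] / n, 0.0)
    return total / (len(classes) - 1)


# Toy graph: two classes, two intra-class edges and one inter-class edge.
toy_labels = [0, 0, 1, 1]
toy_edges = [(0, 1), (2, 3), (1, 2)]
print(edge_homophily(toy_edges, toy_labels))
print(class_insensitive_homophily(toy_edges, toy_labels))
```

Subtracting the class proportion |C_k| / n is what makes the second measure insensitive to class imbalance: a class only contributes when its nodes connect to their own class more often than its size alone would suggest.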

Datasets

As discussed in the paper, our proposed datasets are "genius", "twitch-gamer", "fb100", "pokec", "wiki", "arxiv-year", and "snap-patents". Each can be loaded with load_nc_dataset in dataset.py by passing in its string name. Many of these datasets are included in the data/ directory, but wiki, twitch-gamer, snap-patents, and pokec are automatically downloaded from a Google Drive link when loaded from dataset.py. The arxiv-year dataset is downloaded using OGB downloaders. load_nc_dataset returns an NCDataset, which is documented in dataset.py. It is functionally equivalent to OGB's Library-Agnostic Loader for Node Property Prediction, except that it returns torch tensors. See the OGB website for more specific documentation. Just like the OGB function, dataset.get_idx_split() returns a fixed dataset split for training, validation, and testing.

When there are multiple graphs (as in the case of fb100), different ones can be loaded by passing the sub_dataname argument to load_nc_dataset in dataset.py. In particular, fb100 consists of 100 graphs. We only include ["Amherst41", "Cornell5", "Johns Hopkins55", "Penn94", "Reed98"] in this repo, although the others may be downloaded from the Internet Archive. In the paper we test on Penn94.
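
The loading steps described above can be sketched as follows. This is an illustration under stated assumptions rather than documented usage: it assumes the OGB-style convention that dataset[0] returns a (graph, label) pair whose graph is a dict with "edge_index" and "num_nodes" entries, and that get_idx_split() returns a dict keyed by split name. The helper names summarize and load_and_summarize, and the stand-in class TinyDataset, are hypothetical. TinyDataset exists only so the sketch runs without the repository.

```python
def summarize(dataset):
    """Collect basic statistics from an OGB-style node classification dataset."""
    graph, label = dataset[0]            # assumed OGB-style (graph dict, labels) pair
    split_idx = dataset.get_idx_split()  # assumed dict of split name -> node indices
    return {
        "num_nodes": graph["num_nodes"],
        "num_edges": len(graph["edge_index"][0]),
        "num_labels": len(label),
        "split_sizes": {name: len(idx) for name, idx in split_idx.items()},
    }


def load_and_summarize(name, sub_name=""):
    """Load a dataset via the repository's loader; run from the repo root."""
    from dataset import load_nc_dataset  # the repository's dataset.py
    return summarize(load_nc_dataset(name, sub_dataname=sub_name))


class TinyDataset:
    """Hypothetical stand-in that mimics the assumed interface with plain lists."""

    def __getitem__(self, idx):
        if idx != 0:
            raise IndexError("this stand-in holds a single graph")
        graph = {"edge_index": [[0, 1, 2], [1, 2, 3]], "num_nodes": 4}
        label = [0, 0, 1, 1]
        return graph, label

    def get_idx_split(self):
        return {"train": [0, 1], "valid": [2], "test": [3]}


print(summarize(TinyDataset()))
# Inside the repository one would instead call, for example:
# load_and_summarize("fb100", "Penn94")
```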

References

The datasets come from a variety of sources, as listed here:

Penn94. Traud et al. 2012. Social Structure of Facebook Networks
pokec. Leskovec et al. Stanford Network Analysis Project
arXiv-year. Hu et al. 2020. Open Graph Benchmark
snap-patents. Leskovec et al. Stanford Network Analysis Project
genius. Lim and Benson 2020. Expertise and Dynamics within Crowdsourced Musical Knowledge Curation: A Case Study of the Genius Platform
twitch-gamers. Rozemberczki and Sarkar 2021. Twitch Gamers: a Dataset for Evaluating Proximity Preserving and Structural Role-based Node Embeddings
wiki. Collected by the authors of this work in 2021.

Installation instructions

Create a new conda environment using python=3.8 (e.g. conda create --name non-hom python=3.8)
Activate your conda environment (e.g. conda activate non-hom)
Check your CUDA version using nvidia-smi
Run bash install.sh cu110, replacing cu110 with your CUDA version (CUDA 11 -> cu110, CUDA 10.2 -> cu102, CUDA 10.1 -> cu101). We tested on Ubuntu 18.04, CUDA 11.0.

Running experiments

Make sure a results folder exists in the root directory.
Our experiments are in the experiments/ and scalable_experiments/ directories. There are bash scripts for running methods on single and multiple datasets. Please note that the experiments must be run from the root directory (e.g. bash experiments/mixhop_exp.sh snap-patents). For instance, to run the MixHop experiments on arxiv-year, use:

bash experiments/mixhop_exp.sh arxiv-year

To run LINKX on pokec, use:

bash experiments/linkx_exp.sh pokec

To run LINK on Penn94, use:

bash experiments/link_exp.sh fb100 Penn94

To run GCN-cluster on twitch-gamers, use:

bash scalable_experiments/gcn_cluster.sh twitch-gamer

To run LINKX minibatched on wiki, use:

bash scalable_experiments/linkx_exp.sh wiki

To run LINKX with the full hyperparameter grid on chameleon (one of the Geom-GCN datasets), use:

bash experiments/linkx_tuning.sh chameleon
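
If you prefer to launch the scripts above from Python, a small wrapper along the following lines may help. It is a sketch only: the helper names build_command and run_experiment are hypothetical, and it simply creates the results folder and invokes the bash script from the repository root, as the steps above require.

```python
import os
import subprocess


def build_command(script, dataset, sub_dataset=None):
    """Build the argument list for one experiment script."""
    cmd = ["bash", script, dataset]
    if sub_dataset is not None:
        cmd.append(sub_dataset)  # e.g. "Penn94" when the dataset is "fb100"
    return cmd


def run_experiment(script, dataset, sub_dataset=None, root="."):
    """Ensure results/ exists under root, then run the script from root."""
    os.makedirs(os.path.join(root, "results"), exist_ok=True)
    return subprocess.run(
        build_command(script, dataset, sub_dataset), cwd=root, check=True
    )


# Commands corresponding to two of the examples above.
print(build_command("experiments/linkx_exp.sh", "pokec"))
print(build_command("experiments/link_exp.sh", "fb100", "Penn94"))
```

Calling run_experiment("experiments/linkx_exp.sh", "pokec") from the repository root would then start the LINKX run on pokec.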

Comments

Wrong test result on Cora

I modified the shell file for the Cora dataset and ran the command bash experiments/gcn_exp.sh Cora. The test accuracy only reaches around 47~48, whereas GCN, as a classical model, is generally known to reach around 81 on Cora.

I have tried hard to debug and modify the code to find out what is wrong, but I am still confused. Could you please give me an answer or a solution?

opened by llooFlashooll 8

Can't Reproduce LINKX Full Batch results

Hi, I tried to reproduce the LINKX full-batch results on the Actor, Cornell, Texas, and arxiv-year datasets, but I only get a test accuracy of 32 on Actor, 64 on Cornell, 67 on Texas, and 54 on arxiv-year.

I followed the hyperparameters given in the paper: hidden_channels ∈ {16, 32, 128, 256}, MLP_final ∈ {1, 2, 3}, MLP_A ∈ {1, 2}, MLP_X ∈ {1, 2}, learning rate ∈ {0.05, 0.01, 0.002}, dropout ∈ {0.0, 0.5}. I tried many combinations of these parameters but failed to reproduce the results. Could you please share the parameters used for these datasets?

opened by zrhhhhh123 5

pokec cannot be downloaded directly, either

Hi authors,

I again found that pokec cannot be downloaded using your gdd script. The file I downloaded is an HTML file. I can still download the data from that page, but I wanted to let you know.

The html file:

(Contents of the Google Drive "Download warning" page, with markup and inline CSS stripped)
Google Drive has detected issues with your download
This file is too large for Google to scan for viruses.
This file is executable and may harm your computer.
pokec.mat (1.3G)
Download form action: https://docs.google.com/uc?export=download&id=1dNs5E7BrWJbgcHeQ_zuy5Ozp2tRCWG0y&confirm=t (button: "Download anyway")

opened by devnkong 4

Datasets clarification

Hey authors,

Thanks for the great work!

I saw that you have another repo for a WWW paper. Are the datasets identical? Which repo should I use?

Also, I haven't dived deeply into the specs of the datasets. Could you tell me how the input node features of the datasets are constructed?

Thank you!

opened by devnkong 2

about provided wiki dataset

Hi, for some reason I cannot use gdown to download the datasets, so I chose to download the wiki dataset provided on your GitHub page. However, I find that it doesn't match the code. In the code, it seems that there are 3 files to be downloaded, while on the GitHub page there are only 2 files, with different names.

opened by yangzhao1230 0

switch to gdown to download large files

This PR switches google_drive_downloader to gdown for downloading data from Google Drive, aiming to solve the large files downloading issue of the python script mentioned in #4, #5, https://github.com/CUAI/Non-Homophily-Benchmarks/issues/2.

opened by Xiuyu-Li 0

Running issue of LINKX model

Hello,

I ran into an issue when running the LINKX model for the scalable experiments. In the forward function of the LINKX model, a sparse tensor is constructed (line 39, models.py) whose dimensions are set to (m, self.num_nodes), where m is the number of unique nodes in the parsed data batch. However, since the edge indices are not relabeled, some node indices will certainly exceed m. Is this a bug, or am I misunderstanding something?

Thanks!

opened by VeritasYin 0

Embedding Method

Hi, thank you for your great work! I have a question about the Twitch dataset. You wrote: "Vertex features are extracted based on the games played and liked, location, and streaming habits." However, you didn't mention how you build the node feature vectors from those features. Could you please let us know?

Thanks,

opened by HAI-syz 2

batch training on linkx

I tried to run an experiment with LINKX, but the row sampling suggested in the paper does not work. I think there is an error in how the sparse matrix is constructed. Do you plan to fix or update it?

opened by KimKyuSik 5
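
The two LINKX minibatching reports above both concern how the (m, num_nodes) sparse matrix is built for a batch. As a hedged illustration only, and not the repository's code or an official fix, the sketch below shows one way to row-sample an adjacency structure in plain Python: row indices are relabeled to 0..m-1 for the batch, while column indices keep their global node ids. The function name row_sample and the edge-list representation are assumptions made for this example.

```python
def row_sample(edges, batch_nodes, num_nodes):
    """Select the adjacency rows of batch_nodes from a directed edge list.

    Returns COO-style (rows, cols, shape), where rows are local batch
    positions in [0, m) and cols are global node ids in [0, num_nodes).
    """
    # Map each global node id in the batch to its local row position.
    local_row = {node: i for i, node in enumerate(batch_nodes)}
    rows, cols = [], []
    for src, dst in edges:
        if src in local_row:
            rows.append(local_row[src])  # relabeled row index, always < m
            cols.append(dst)             # column index stays global
    shape = (len(batch_nodes), num_nodes)
    return rows, cols, shape


# Toy example: 5 nodes, batch made of global nodes 4 and 2.
toy_edges = [(0, 3), (3, 0), (2, 4), (4, 2), (4, 1)]
rows, cols, shape = row_sample(toy_edges, batch_nodes=[4, 2], num_nodes=5)
print(rows, cols, shape)
```

With this relabeling, every row index is below m by construction, so the triplets fit a sparse matrix of shape (m, num_nodes) even when the global ids of the batch nodes are larger than m.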

