A large-scale database for graph representation learning

Scott Freitas

Last update: Nov 25, 2022

Overview

MalNet: Advancing State-of-the-art Graph Databases

Recent research focusing on developing graph kernels, neural networks and spectral methods to capture graph topology has revealed a number of shortcomings of existing graph benchmark datasets, which often contain graphs that are relatively:

limited in number,
small in scale in terms of nodes and edges, and
restricted in class diversity.

To address these issues, we have been developing the world's largest public graph representation learning database to date at Georgia Tech’s Polo Club of Data Science. We release MalNet, which contains over 1.2 million function call graphs, averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families of classes (see Figure 1 below).

Compared to the popular REDDIT-12K database, MalNet offers 105x more graphs, 44x larger graphs on average, and 63x more classes.
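As a rough sanity check on these ratios, the sketch below divides the MalNet figures quoted in this README by the stated multipliers to see what baseline sizes they imply. The 17,000 value is a rounding of "over 17k" and is an assumption, so the results are only approximate and are not official REDDIT-12K statistics.

```python
# MalNet figures as quoted in this README; avg_nodes is rounded from "over 17k".
malnet = {"graphs": 1_262_024, "avg_nodes": 17_000, "classes": 696}

# Multipliers stated in the comparison with REDDIT-12K.
ratios = {"graphs": 105, "avg_nodes": 44, "classes": 63}

# Baseline sizes implied by the quoted ratios (approximate by construction).
implied_baseline = {key: malnet[key] / ratios[key] for key in malnet}

for key, value in implied_baseline.items():
    print(f"{key}: ~{value:,.0f}")
```

The implied baseline is on the order of twelve thousand graphs, a few hundred nodes per graph, and about eleven classes.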

What is a function call graph (FCG)?

Function call graphs represent the control flow of programs (see Figure 2 below), and can be statically extracted from many types of software (e.g., EXE, PE, APK). We use the Android ecosystem due to its large market share, easy accessibility, and diversity of malicious software. With the generous permission of AndroZoo, we collected 1,262,024 Android APK files, specifically selecting APKs that carry both a family and a type label obtained from the Euphony classification structure.
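To make the data structure concrete, here is a minimal sketch of a function call graph as a directed graph, where each node is a function and each edge points from a caller to a callee. The function names and the helpers `build_call_graph` and `graph_size` are hypothetical, chosen for illustration; this is not the extraction pipeline used to build MalNet.

```python
from collections import defaultdict


def build_call_graph(calls):
    """Build a directed graph (adjacency sets) from (caller, callee) pairs."""
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
        # Make sure functions that call nothing still appear as nodes.
        graph.setdefault(callee, set())
    return dict(graph)


def graph_size(graph):
    """Return (number of nodes, number of directed edges)."""
    num_nodes = len(graph)
    num_edges = sum(len(callees) for callees in graph.values())
    return num_nodes, num_edges


# Hypothetical call relationships, as a static analyzer might report them.
calls = [
    ("main", "parse_args"),
    ("main", "run"),
    ("run", "load_config"),
    ("run", "send_request"),
    ("send_request", "encrypt"),
    ("load_config", "parse_args"),
]

fcg = build_call_graph(calls)
nodes, edges = graph_size(fcg)
print(f"{nodes} nodes, {edges} edges")
```

Adjacency sets keep the sketch dependency-free and collapse repeated calls between the same pair of functions into a single edge. A graph library would be a more practical choice at MalNet's scale.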

How do we download and explore MalNet?

We have designed and developed MalNet Explorer, an interactive graph exploration and visualization tool that lets people explore the data before downloading it. Figure 3 shows MalNet Explorer’s desktop web interface and its main components. MalNet Explorer and the data are available online at www.mal-net.org.

How to run the code?

The experiments we conducted in the arXiv paper can be run using dm_experiments.py.
