GraPE
GraPE (Graph Processing and Embedding) is a fast graph processing and embedding library, designed to scale with big graphs and to run on both off-the-shelf laptop and desktop computers and High Performance Computing clusters of workstations.
The library is written in the Rust and Python programming languages, and has been developed by AnacletoLAB (Dept. of Computer Science of the University of Milan), in collaboration with the RobinsonLab (Jackson Laboratory for Genomic Medicine) and the BPOP (Lawrence Berkeley National Laboratory).
GraPE is composed of two main modules: Ensmallen (ENabler of SMALL runtimE and memory Needs) and Embiggen (EMBeddInG GENerator), which run synergistically using parallel computation and efficient data structures.
Ensmallen efficiently executes graph processing operations, including large-scale first and second-order random walks, while Embiggen leverages the large amounts of sampled random walks generated by Ensmallen to compute effective node and edge embeddings. Besides being helpful for unsupervised exploratory analysis of graphs, the computed embeddings can be used to train any of the flexible neural models for edge and node label prediction provided by Embiggen itself.
The following figure shows the main relationships between Ensmallen and Embiggen modules:
[Figure: relationships between the Ensmallen and Embiggen modules of GraPE]
Installation
For most computers, you can just install it using pip:
pip install grape
Since Ensmallen is written in Rust, on PyPI we distribute pre-compiled packages for Windows, Linux and macOS, for Python versions 3.6, 3.7, 3.8 and 3.9, for x86_64 CPUs.
For the Linux binaries we follow Python's ManyLinux2010 (PEP 571) standard, which requires libc version >= 2.12. This version was released on 03/08/2010, so any Linux system from the last ten years should be compatible. To check your current libc version you can run ldd --version.
We also assume that the CPU supports the following features: sse, sse2, ssse3, sse4_1, sse4_2, avx, avx2, bmi1, bmi2, popcnt. If these features are not present, you cannot use the PyPI pre-compiled binaries and you have to manually compile Ensmallen (Guide). On Linux you can check whether your CPU supports these features by running cat /proc/cpuinfo and ensuring that they are all present under the flags section. While these features are not strictly required, they significantly speed up execution and should be supported by any x86_64 CPU newer than Intel's Haswell architecture (2013).
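If you prefer not to read through the cat /proc/cpuinfo output by hand, the short Linux-only snippet below performs the same check using only the Python standard library (it is not part of GraPE):
# Linux-only check (plain Python standard library, not part of GraPE):
# verify that the CPU flags required by the PyPI binaries are present.
REQUIRED_FLAGS = {
    "sse", "sse2", "ssse3", "sse4_1", "sse4_2",
    "avx", "avx2", "bmi1", "bmi2", "popcnt",
}
with open("/proc/cpuinfo") as cpuinfo:
    # The first "flags" line lists every feature supported by the CPU.
    flags_line = next(line for line in cpuinfo if line.startswith("flags"))
available = set(flags_line.split(":", 1)[1].split())
missing = REQUIRED_FLAGS - available
print("Missing CPU flags:", ", ".join(sorted(missing)) if missing else "none")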
If your CPU doesn't support them you will get, on import, a ValueError exception with the following message:
This library was compiled assuming that SIMD instruction commonly available in CPU hardware since 2013 are present
on the machine where this library is intended to run.
On the current machine, the flags <MISSING_FLAGS> are not available.
You could still compile Ensmallen on this machine and have a version of the library that can execute here, but the
library has been extensively designed to use SIMD instructions, so you would have a version slower than the one
provided on Pypi.
These requirements were chosen to provide a good tradeoff between compatibility and performance. If your system is not compatible, you can manually compile Ensmallen for any OS, libc version and CPU architecture (such as Arm, AArch64, RISC-V or MIPS) supported by Rust and LLVM. Manually compiling Ensmallen might require more than half an hour and around 10 GB of RAM. If you encounter any error during the installation and/or compilation, feel free to open an issue here on GitHub and we will help troubleshoot it.
Main functionalities of the library
Robust graph loading and automatic graph retrieval:
- More than 13,000 graphs directly available from the library for benchmarking (see the retrieval sketch after this list)
- Support for multiple graph formats
- Automatic human readable reports of format errors
- Automatic human readable reports of the main graph characteristics
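For instance, the automatic graph retrieval listed above takes a couple of lines; the import is the same one used in the recommender-system example later in this README, while the assumption that printing the loaded graph produces its human-readable report is ours:
from ensmallen.datasets.string import HomoSapiens

# Download (on first use) and load the STRING Homo sapiens graph.
graph = HomoSapiens()
# Assumption: the graph's string representation is the automatic
# human-readable report of its main characteristics.
print(graph)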
Random walks:
- Exact and approximated first and second order random walks
- Massive generation of sampled random walks for graph embedding (see the sketch after this list)
- Automatic dispatching of 8 optimized random walk algorithms depending on the parameters of the random walk and the type (weighted/unweighted) of the graph
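A minimal sketch of direct walk sampling is given below. The complete_walks method name and its walk_length parameter are assumptions on our part (only return_weight and explore_weight are taken from the quick-start example later in this README), so please check the online documentation for the exact API:
from ensmallen.datasets.string import HomoSapiens

graph = HomoSapiens()
# Hypothetical call: sample second-order random walks starting from every node.
# Method name and walk_length are assumed, not the verified Ensmallen API.
walks = graph.complete_walks(
    walk_length=100,
    return_weight=2.0,   # bias towards returning to the previous node
    explore_weight=0.1,  # bias against exploring far from the start
)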
Node embedding models (see the usage sketch after this list):
- SkipGram
- CBOW
- GloVe
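All three models are selected through the compute_node_embedding pipeline shown in the quick-start example below; the sketch here simply swaps in CBOW, assuming the model is selected with the string "CBOW" (only "SkipGram" is verified by the example in this README):
from ensmallen.datasets.linqs import Cora
from ensmallen.datasets.linqs.parse_linqs import get_words_data
from embiggen.pipelines import compute_node_embedding

# Download and load the Cora graph together with its node features.
graph, node_features = get_words_data(Cora())
# Compute a CBOW node embedding; the "CBOW" model name string is assumed
# to mirror the "SkipGram" one used in the quick-start example.
node_embedding, training_history = compute_node_embedding(
    graph,
    node_embedding_method_name="CBOW",
)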
Edge and node prediction models:
- Perceptron
- Multi-Layer Perceptron
- Deep Neural Networks
Preprocessing for node embedding and edge prediction:
- Lazy generation of skip-grams from random walks
- Lazy generation of balanced batches for edge prediction
- GloVe co-occurrence matrix computation
Graph processing operations:
- Optimized filtering by node, edge and components characteristics
- Optimized algebraic set operations on graphs
- Automatic generation of reports summarizing graph features in natural language
Graph algorithms:
- Breadth-first and depth-first search
- Dijkstra, Tarjan's strongly connected components
- Efficient diameter computation, spanning arborescence and connected components (see the sketch after this list)
- Approximated vertex cover, triad counting, transitivity, clustering coefficient and triangle counting
- Betweenness and stress centrality, closeness and harmonic centrality
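As a small illustration of the algorithms above, the method names in the sketch below are taken verbatim from the recommender-system output shown later in this README; calling them without arguments is an assumption on our part, so verify the exact signatures with help() as described in the Documentation section:
from ensmallen.datasets.string import HomoSapiens

graph = HomoSapiens()
# Method names taken from the suggestions listed later in this README;
# the argument-free calls are assumed, not verified signatures.
print("Connected components:", graph.get_connected_components_number())
print("Maximum edge weight:", graph.get_maximum_edge_weight())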
Graph visualization tools: visualization of node and edge properties
Tutorials
You can find tutorials covering various aspects of the GraPE library here. All tutorials are as self-contained as possible and can be immediately executed on COLAB.
If you want to get started quickly, after having installed GraPE from PyPI as described above, you can try running the following example, which uses the SkipGram embedding model on the Cora graph:
from ensmallen.datasets.linqs import Cora
from ensmallen.datasets.linqs.parse_linqs import get_words_data
from embiggen.pipelines import compute_node_embedding
from embiggen.visualizations import GraphVisualization
import matplotlib.pyplot as plt
# Download and load the graph and its node features
graph, node_features = get_words_data(Cora())
# Compute a SkipGram node embedding, using second-order random walk sampling
node_embedding, training_history = compute_node_embedding(
    graph,
    node_embedding_method_name="SkipGram",
    # Let's increase the probability of exploring the local neighbourhood
    return_weight=2.0,
    explore_weight=0.1
)
# Visualize the obtained node embeddings
visualizer = GraphVisualization(graph, node_embedding_method_name="SkipGram")
visualizer.fit_transform_nodes(node_embedding)
visualizer.plot_node_types()
plt.show()
You can see a tutorial detailing the above script here, and you can run it on COLAB from here.
Documentation
Online documentation
The online documentation of the library is available here. Since Ensmallen is written in Rust and PyO3 (the crate we use for the Python bindings) doesn't support typing, the documentation is generated from an empty skeleton package. This provides proper documentation, but you won't be able to see the source code in it.
Using the automatic method suggestions utility
To make working with the library easier, GraPE provides an integrated recommender system meant to help you either find a method or, if a method has been renamed for any reason, find its new name.
As an example, after having loaded the STRING Homo Sapiens graph, the function for computing the connected components can be retrieved by simply typing components as follows:
from ensmallen.datasets.string import HomoSapiens
graph = HomoSapiens()
graph.components
The code above will raise the following error, and will suggest methods with a similar or related name:
AttributeError Traceback (most recent call last)
<ipython-input-3-52fac30ac7f6> in <module>()
----> 2 graph.components
AttributeError: The method 'components' does not exists, did you mean one of the following?
* 'remove_components'
* 'connected_components'
* 'strongly_connected_components'
* 'get_connected_components_number'
* 'get_total_edge_weights'
* 'get_mininum_edge_weight'
* 'get_maximum_edge_weight'
* 'get_unchecked_maximum_node_degree'
* 'get_unchecked_minimum_node_degree'
* 'get_weighted_maximum_node_degree'
In our example the method we need for computing the graph components would be connected_components.
Now the easiest way to get the method documentation is to use Python's help as follows:
help(graph.connected_components)
And the above will return:
connected_components(verbose) method of builtins.Graph instance
Compute the connected components building in parallel a spanning tree using [bader's algorithm](https://www.sciencedirect.com/science/article/abs/pii/S0743731505000882).
**This works only for undirected graphs.**
The returned quadruple contains:
- Vector of the connected component for each node.
- Number of connected components.
- Minimum connected component size.
- Maximum connected component size.
Parameters
----------
verbose: Optional[bool]
Whether to show a loading bar or not.
Raises
-------
ValueError
If the given graph is directed.
ValueError
If the system configuration does not allow for the creation of the thread pool.
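Putting the documentation above to use, the returned quadruple can be unpacked as follows (verbose is optional according to the docstring, so we omit it here):
# Unpack the quadruple documented above.
components, n_components, min_size, max_size = graph.connected_components()
print("Number of connected components:", n_components)
print("Smallest component size:", min_size)
print("Largest component size:", max_size)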
You can try to run the code described above on COLAB.
Cite GraPE
Please cite the following paper if it was useful for your research:
@misc{cappelletti2021grape,
title={GraPE: fast and scalable Graph Processing and Embedding},
author={Luca Cappelletti and Tommaso Fontana and Elena Casiraghi and Vida Ravanmehr and Tiffany J. Callahan and Marcin P. Joachimiak and Christopher J. Mungall and Peter N. Robinson and Justin Reese and Giorgio Valentini},
year={2021},
eprint={2110.06196},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
If you believe that any further example may be of help, do feel free to open a GitHub issue describing what we are missing in this tutorial.