subgraph-extractor

Overview

Extracts data from the database for a graph-node and stores it in parquet files.
Installation

For developing, it's recommended to use conda to create an environment.

Create one with Python 3.9

conda create --name subgraph-extractor python=3.9

Now activate it

conda activate subgraph-extractor

Install the dev packages (note there is no space after the .)

pip install -e .[dev]

Use

Now you can use the main entrypoint; see the help output for more details

subgraph_extractor --help

Creating a config file

The easiest way to start is to use the interactive subgraph config generator.

Start by launching the subgraph config generator with the location you want to write the config file to.

subgraph_config_generator --config-location subgraph_config.yaml

It will default to using a local graph-node with the default username & password (postgresql://graph-node:let-me-in@localhost:5432/graph-node). If you are connecting to something else, you need to specify the database connection string with --database-string.

You will then be asked to select:

  • The relevant subgraph
  • From the subgraph, which tables to extract (multi-select)
  • For each table, which column to partition on (this is typically the block number or timestamp)
  • Any numeric columns that require mapping to another type (see the note below)

Numeric column mappings

Uint256 is a common data type in Solidity contracts but is rare in most data processing tools. The graph node creates a Postgres Numeric column for any field marked as a BigInt, as Numeric is capable of accurately storing uint256 values.

However, many downstream tools cannot handle these as numbers.

By default, these columns will be exported as bytes - a lossless representation but one that is not as usable for sums, averages, etc. This is fine for some data, such as addresses or where the field is used to pack data (e.g. the tokenIds for decentraland).
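If you keep the default bytes representation, the values can still be recovered downstream. Below is a minimal Python sketch; the file path, the token_id column name and the big-endian byte order are assumptions for illustration, not guarantees about the extractor's output.

import pandas as pd

# Read a parquet file produced by the extractor (path is illustrative)
df = pd.read_parquet("my_table/part-0.parquet")

# Assumption: the uint256 column is stored as big-endian unsigned bytes;
# check the actual encoding before relying on this
df["token_id_int"] = df["token_id"].apply(
    lambda b: int.from_bytes(b, byteorder="big", signed=False)
)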

For other use cases, the data must be converted to a more convenient type. In the config file, you can specify numeric columns that need to be mapped to another type:

column_mappings:
  my_original_column_name:
    my_new_column_name:
      type: uint64

However, if the conversion does not work (e.g. the number is too large), the extraction will stop with an error. This is fine for cases where you know the range (e.g. timestamp or block number). For other cases you can specify a maximum value, a default, and a column to store whether the row's value was at most the maximum:

column_mappings:
  my_original_column_name:
    my_new_column_name:
      type: uint64
      max_value: 18446744073709551615
      default: 0
      validity_column: my_new_column_name_valid

If the number is over 18446744073709551615, a 0 will be stored in the column my_new_column_name and FALSE will be stored in my_new_column_name_valid.
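The behaviour is roughly equivalent to the following sketch (an illustration of the mapping rule, not the extractor's actual code):

def map_with_validity(value, max_value, default):
    """Map one value to a bounded integer plus a validity flag."""
    if value <= max_value:
        return value, True    # fits in the target type, keep it
    return default, False     # too large: store the default and mark the row invalid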

If your numbers are too large but can be safely lowered for your use case (e.g. converting from wei to gwei), you can provide a downscale value:

column_mappings:
  transfer_fee_wei:
    transfer_fee_gwei:
      downscale: 1000000000
      type: uint64
      max_value: 18446744073709551615
      default: 0
      validity_column: transfer_fee_gwei_valid

This will perform an integer division (divide and floor) on the original value. WARNING: this is a lossy conversion.
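For example, downscaling wei to gwei as configured above is equivalent to the following (illustrative values):

value_wei = 1_234_567_890_123_456_789     # original value in wei
value_gwei = value_wei // 1_000_000_000   # integer division by the downscale factor
# value_gwei == 1_234_567_890; the remaining 123_456_789 wei are discarded (lossy)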

You may have as many mappings for a single column as you want, and the original will always be present as bytes.

The following numeric types are allowed:

  • int8, int16, int32, int64
  • uint8, uint16, uint32, uint64
  • float32, float64
  • Numeric38 (this is a numeric/Decimal column with 38 digits of precision)

Contributing

Please format everything with black and isort

black . && isort --profile=black .

Comments
  • Parquet metadata

    Builds on #2

    The key change here is that there is also an extra file written out.

    Parquet files all have a metadata block in them at the end. This allows planners to identify whether or not the file needs to be used, as it contains things like min/max values for columns. It also contains sub-file level detail if there are enough rows.

    However, to make use of this, a system needs to process all of the files in storage to read the metadata blocks. That is not an issue when all the files are local, but it is a more significant one when they're on S3, as it would need to crawl everything. In addition, we have a non-standard partitioning scheme with multiple levels.

    We can collect all of the metadata from each file and put it into one single metadata file that points off to the actual file locations where the data is stored. This should entirely remove the need to generate lists of required files in the other packages (the most common repeated code) and remove the need to pull the files locally before running queries. This is a standard feature in pyarrow, and should be usable outside of our code and outside of just python.

    https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files
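    A minimal sketch of that pyarrow feature, adapted from the linked docs (paths, schema and table contents are placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"block_number": pa.array([1, 2, 3], type=pa.uint32())})

    # Collect per-file metadata while writing the dataset...
    metadata_collector = []
    pq.write_to_dataset(table, root_path="out/my_table", metadata_collector=metadata_collector)

    # ...then write a single _metadata file that points at every data file
    pq.write_metadata(table.schema, "out/my_table/_metadata",
                      metadata_collector=metadata_collector)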

    When these are available, we can delete most of https://github.com/cardstack/cardstack/blob/main/packages/cardpay-reward-programs/cardpay_reward_programs/utils.py#L11-L78 and replace it with a single call to pyarrow dataset construction, which we can register in duckdb and remove the parquet_scan manual construction.

    opened by IanCal 4
  • Fix partition break

    The current logic in the package when writing out new partitions is this:

    • Generate a list of all partitions that should exist
    • Iterate backwards to find the latest one that has been written
    • Write all newer ones

    Example:

    Existing partitions
    * blocks 1-10
    * blocks 10-20
    
    Run again, partitions to exist are 1-10, 10-20, 20-30, 30-40
    
    30-40 does not exist
    20-30 does not exist
    10-20 does exist - STOP
    
    Write 30-40
    Write 20-30
    
    End
    

    There is an issue here in that the final step writes from newest to oldest. If the process fails partway through, this happens:

    Existing partitions
    * blocks 1-10
    * blocks 10-20
    
    Run again, partitions to exist are 1-10, 10-20, 20-30, 30-40
    
    30-40 does not exist
    20-30 does not exist
    10-20 does exist - STOP
    
    Write 30-40
    Write 20-30 - FAILS
    
    
    Run again, partitions to exist are 1-10, 10-20, 20-30, 30-40
    
    30-40 does exist - STOP
    
    End
    

    The partition for 20-30 is never written, causing downstream breakages.

    We could write the partitions out in the opposite order; however, there is a better approach. The partitions from the last successful run can be calculated from what's in the latest.yaml file. We can then take the difference between the last successful run and the required partitions and write out only the missing ones, as sketched below. This also saves having to check whether lots of files exist.
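    A rough sketch of that approach; the latest.yaml layout and the [start, end] partition representation here are assumptions for illustration:

    import yaml

    def partitions_to_write(latest_yaml_path, required):
        """Return the required partitions that were not part of the last successful run."""
        with open(latest_yaml_path) as f:
            latest = yaml.safe_load(f)
        # Assumption: the last successful run recorded its partitions as [start, end] pairs
        already_written = {tuple(p) for p in latest.get("partitions", [])}
        return required - already_written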

    This change adds some tests for running the thing end to end, and a more complicated test to check for the above scenario (which correctly fails on the old code and passes on this).

    opened by IanCal 1
  • Fix block column in empty partitions, force uint32

    The type of the generated _block_number column is set inconsistently in the output files, depending on whether or not there is any data.

    This breaks anything that tries to read across multiple files when one of them is empty (possibly more cases).

    We force uint32 instead.
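    A minimal illustration of pinning the type with pyarrow so that even an empty partition keeps the same schema (column names other than _block_number are assumptions):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Declare the schema explicitly so an empty table still carries uint32
    schema = pa.schema([("_block_number", pa.uint32()), ("value", pa.binary())])

    empty_table = pa.table({"_block_number": [], "value": []}, schema=schema)
    pq.write_table(empty_table, "empty_partition.parquet")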

    opened by IanCal 0