subgraph-extractor

Overview

Extracts data from the database for a graph-node and stores it in parquet files.
Installation

For developing, it's recommended to use conda to create an environment.

Create one with Python 3.9

conda create --name subgraph-extractor python=3.9

Now activate it

conda activate subgraph-extractor

Install the dev packages (note there is no space after the .)

pip install -e .[dev]

Use

Now you can use the main entrypoint; see the help output for more details

subgraph_extractor --help

Creating a config file

The easiest way to start is to use the interactive subgraph config generator.

Start by launching the subgraph config generator with the location you want to write the config file to.

subgraph_config_generator --config-location subgraph_config.yaml

It will default to using a local graph-node with the default username & password (postgresql://graph-node:let-me-in@localhost:5432/graph-node). If you are connecting to something else, you need to specify the database connection string with --database-string.

You will then be asked to select:

  • The relevant subgraph
  • From the subgraph, which tables to extract (multi-select)
  • For each table, which column to partition on (this is typically the block number or timestamp)
  • Any numeric columns that require mapping to another type (see the note below)

Numeric column mappings

Uint256 is a common data type in Solidity contracts but is rare in most data processing tools. The graph node creates a Postgres Numeric column for any field marked as a BigInt, as Numeric is capable of accurately storing uint256 values.

However, many downstream tools cannot handle these as numbers.

By default, these columns will be exported as bytes - a lossless representation but one that is not as usable for sums, averages, etc. This is fine for some data, such as addresses or where the field is used to pack data (e.g. the tokenIds for decentraland).
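If you keep the default bytes representation, the values can still be recovered downstream. Below is a minimal Python sketch; the file path, the token_id column name and the big-endian byte order are assumptions for illustration, not guarantees about the extractor's output.

import pandas as pd

# Read a parquet file produced by the extractor (path is illustrative)
df = pd.read_parquet("my_table/part-0.parquet")

# Assumption: the uint256 column is stored as big-endian unsigned bytes;
# check the actual encoding before relying on this
df["token_id_int"] = df["token_id"].apply(
    lambda b: int.from_bytes(b, byteorder="big", signed=False)
)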

For other use cases, the data must be converted to a more convenient type. In the config file, you can specify numeric columns that need to be mapped to another type:

column_mappings:
  my_original_column_name:
    my_new_column_name:
      type: uint64

However, if the conversion does not work (e.g. the number is too large), the extraction will stop with an error. This is fine for cases where you know the range (e.g. timestamp or block number). For other cases you can specify a maximum value, a default, and a column to store whether the row's value was at most the maximum:

column_mappings:
  my_original_column_name:
    my_new_column_name:
      type: uint64
      max_value: 18446744073709551615
      default: 0
      validity_column: my_new_column_name_valid

If the number is over 18446744073709551615, a 0 will be stored in the column my_new_column_name and FALSE will be stored in my_new_column_name_valid.
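The behaviour is roughly equivalent to the following sketch (an illustration of the mapping rule, not the extractor's actual code):

def map_with_validity(value, max_value, default):
    """Map one value to a bounded integer plus a validity flag."""
    if value <= max_value:
        return value, True    # fits in the target type, keep it
    return default, False     # too large: store the default and mark the row invalid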

If your numbers are too large but can be safely lowered for your use case (e.g. converting from wei to gwei), you can provide a downscale value:

column_mappings:
  transfer_fee_wei:
    transfer_fee_gwei:
      downscale: 1000000000
      type: uint64
      max_value: 18446744073709551615
      default: 0
      validity_column: transfer_fee_gwei_valid

This will perform an integer division (divide and floor) on the original value. WARNING: this is a lossy conversion.
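For example, downscaling wei to gwei as configured above is equivalent to the following (illustrative values):

value_wei = 1_234_567_890_123_456_789     # original value in wei
value_gwei = value_wei // 1_000_000_000   # integer division by the downscale factor
# value_gwei == 1_234_567_890; the remaining 123_456_789 wei are discarded (lossy)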

You may have as many mappings for a single column as you want, and the original will always be present as bytes.

The following numeric types are allowed:

  • int8, int16, int32, int64
  • uint8, uint16, uint32, uint64
  • float32, float64
  • Numeric38 (this is a numeric/Decimal column with 38 digits of precision)

Contributing

Please format everything with black and isort

black . && isort --profile=black .

Comments
  • Parquet metadata

    Builds on #2

    The key change here is that there is also an extra file written out.

    Parquet files all have a metadata block in them at the end. This allows planners to identify whether or not the file needs to be used, as it contains things like min/max values for columns. It also contains sub-file level detail if there are enough rows.

    However, to make use of this, a system needs to process all of the files in storage to read the metadata blocks. That is not an issue when all the files are local, but it is a more significant one when they're on S3, as it would need to crawl everything. In addition, we have a non-standard partitioning scheme with multiple levels.

    We can collect all of the metadata from each file and put it into one single metadata file that points off to the actual file locations where the data is stored. This should entirely remove the need to generate lists of required files in the other packages (the most common repeated code) and remove the need to pull the files locally before running queries. This is a standard feature in pyarrow, and should be usable outside of our code and outside of just python.

    https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files
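    A minimal sketch of that pyarrow feature, adapted from the linked docs (paths, schema and table contents are placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"block_number": pa.array([1, 2, 3], type=pa.uint32())})

    # Collect per-file metadata while writing the dataset...
    metadata_collector = []
    pq.write_to_dataset(table, root_path="out/my_table", metadata_collector=metadata_collector)

    # ...then write a single _metadata file that points at every data file
    pq.write_metadata(table.schema, "out/my_table/_metadata",
                      metadata_collector=metadata_collector)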

    When these are available, we can delete most of https://github.com/cardstack/cardstack/blob/main/packages/cardpay-reward-programs/cardpay_reward_programs/utils.py#L11-L78 and replace it with a single call to pyarrow dataset construction, which we can register in duckdb and remove the parquet_scan manual construction.

    opened by IanCal 4
  • Fix partition break

    The current logic in the package when writing out new partitions is this:

    • Generate a list of all partitions that should exist
    • Iterate backwards to find the latest one that has been written
    • Write all newer ones

    Example:

    Existing partitions
    * blocks 1-10
    * blocks 10-20
    
    Run again, partitions to exist are 1-10, 10-20, 20-30, 30-40
    
    30-40 does not exist
    20-30 does not exist
    10-20 does exist - STOP
    
    Write 30-40
    Write 20-30
    
    End
    

    There is an issue here in that the final step writes from newest to oldest. If the process fails partway through, this happens:

    Existing partitions
    * blocks 1-10
    * blocks 10-20
    
    Run again, partitions to exist are 1-10, 10-20, 20-30, 30-40
    
    30-40 does not exist
    20-30 does not exist
    10-20 does exist - STOP
    
    Write 30-40
    Write 20-30 - FAILS
    
    
    Run again, partitions to exist are 1-10, 10-20, 20-30, 30-40
    
    30-40 does exist - STOP
    
    End
    

    The partition for 20-30 is never written, causing downstream breakages.

    We could write the partitions out in the opposite order; however, there is a better approach. The partitions from the last successful run can be calculated from what's in the latest.yaml file. We can then take the difference between the last successful run and the required partitions and write out only the missing ones, as sketched below. This also saves having to check whether lots of files exist.
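    A rough sketch of that approach; the latest.yaml layout and the [start, end] partition representation here are assumptions for illustration:

    import yaml

    def partitions_to_write(latest_yaml_path, required):
        """Return the required partitions that were not part of the last successful run."""
        with open(latest_yaml_path) as f:
            latest = yaml.safe_load(f)
        # Assumption: the last successful run recorded its partitions as [start, end] pairs
        already_written = {tuple(p) for p in latest.get("partitions", [])}
        return required - already_written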

    This change adds some tests for running the thing end to end, and a more complicated test to check for the above scenario (which correctly fails on the old code and passes on this).

    opened by IanCal 1
  • Fix block column in empty partitions, force uint32

    The type of the generated _block_number column is set inconsistently in the output files, depending on whether or not there is any data.

    This breaks anything that tries to read across multiple files when one of them is empty (possibly more cases).

    We force uint32 instead.
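    A minimal illustration of pinning the type with pyarrow so that even an empty partition keeps the same schema (column names other than _block_number are assumptions):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Declare the schema explicitly so an empty table still carries uint32
    schema = pa.schema([("_block_number", pa.uint32()), ("value", pa.binary())])

    empty_table = pa.table({"_block_number": [], "value": []}, schema=schema)
    pq.write_table(empty_table, "empty_partition.parquet")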

    opened by IanCal 0