PLUR

PLUR (Programming-Language Understanding and Repair) is a collection of source code datasets suitable for graph-based machine learning. We provide scripts for downloading, processing, and loading the datasets, all exposed through a unified API and common data structures.

Installation

SRC_DIR=${PWD}/src
mkdir -p ${SRC_DIR} && cd ${SRC_DIR}
# For CuBERT, which PLUR depends on.
git clone https://github.com/google-research/google-research --depth=1
export PYTHONPATH=${PYTHONPATH}:${SRC_DIR}/google-research
git clone https://github.com/google-research/plur && cd plur
python -m pip install -r requirements.txt
python setup.py install
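
As an optional sanity check (an addition here, not part of the original instructions), confirm that the package imports:

python -c "import plur"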

Test execution on a small dataset

cd plur
python3 plur_data_generation.py --dataset_name=manysstubs4j_dataset \
  --stage_1_dir=/tmp/manysstubs4j_dataset/stage_1 \
  --stage_2_dir=/tmp/manysstubs4j_dataset/stage_2 \
  --train_data_percentage=40 \
  --validation_data_percentage=30 \
  --test_data_percentage=30

Usage

Basic usage

Data generation (step 1)

Data generation is done by calling plur.plur_data_generation.create_dataset(). The data generation runs in two stages:

  1. Convert raw data to plur.utils.GraphToOutputExample.
  2. Convert plur.utils.GraphToOutputExample to TFExample.

Stage 1 is unique for each dataset, but stage 2 is the same for almost all datasets.

from plur.plur_data_generation import create_dataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
stage_2_kwargs = dict()
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

plur_data_generation.py also provides a command line interface, but it offers less flexibility.

python3 plur_data_generation.py --stage_1_dir=/tmp/code2seq_dataset/stage_1 --stage_2_dir=/tmp/code2seq_dataset/stage_2

Data loader (step 2)

After the data is generated, you can use PlurDataLoader to load the data. The data loader loads TFExamples but returns them as numpy arrays.

from plur.plur_data_loader import PlurDataLoader
from plur.utils import constants

dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
split = constants.TRAIN_SPLIT_NAME
batch_size = 32
repeat_count = -1
drop_remainder = True
train_data_generator = PlurDataLoader(dataset_stage_2_directory, split, batch_size, repeat_count, drop_remainder)

for batch_data in train_data_generator:
  # Your training loop consumes one batch of numpy arrays at a time.
  pass
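
What a batch contains depends on the dataset. As a hedged sketch (assuming each batch is a dict-like container of numpy arrays, which may vary across datasets and PLUR versions), you can inspect one batch before writing the training loop:

# Pull a single batch and print the field names and array shapes.
batch_data = next(iter(train_data_generator))
print(type(batch_data))
if isinstance(batch_data, dict):
  for field_name, array in batch_data.items():
    print(field_name, getattr(array, 'shape', None))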

Training (step 3)

This is the part where you use your own model to train on the PLUR data.

The models and the training code from the PLUR paper are not yet part of the current release. We plan to release them in the near future.

Evaluating (step 4)

Once the training is finished, you can generate predictions on the test data and use plur_evaluator.py to evaluate the performance. plur_evaluator.py works in offline mode: it expects a file (or file pattern) containing the ground truths and a file (or file pattern) containing the predictions.

python3 plur_evaluator.py --dataset_name=code2seq_dataset --target_file_pattern=/tmp/code2seq_dataset/targets.txt --prediction_file_pattern=/tmp/code2seq_dataset/predictions.txt

Transforming and filtering data

If there is something fundamental you want to change in the dataset, apply it in stage 1 of data generation; otherwise apply it in stage 2. The idea is that stage 1 should only be run once per dataset (to create the plur.utils.GraphToOutputExample data), while stage 2 should be run each time you want to train on different data (to create the TFRecords).

All transformation and filtering functions are applied to plur.utils.GraphToOutputExample; see plur.utils.GraphToOutputExample for more information.

For example, a transformation that belongs in stage 1: if your model expects graphs without loops, write a transformation function that removes loops. Stage 2 will then only ever read loop-free graphs.

For example, a filter that belongs in stage 2: to check your model's performance on different graph sizes (in number of nodes), write a filter function that drops graphs above a node-count threshold, as in the snippet below.

from plur.plur_data_generation import create_dataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
def _filter_graph_size(graph_to_output_example, graph_size=1024):
  return len(graph_to_output_example.get_nodes()) <= graph_size
stage_2_kwargs = dict(
    train_filter_funcs=(_filter_graph_size,),
    validation_filter_funcs=(_filter_graph_size,)
)
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)
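
By analogy with the filter example above, transformation functions can be passed the same way. The sketch below is hedged: the train_transformation_funcs keyword and the get_edges()/set_edges() accessors are assumptions, so verify them against plur.plur_data_generation.create_dataset() and plur.utils.GraphToOutputExample. With that caveat, a loop-removing transformation like the one described earlier could look like this:

from plur.plur_data_generation import create_dataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'

def _remove_self_loops(graph_to_output_example):
  # get_edges()/set_edges() are hypothetical accessors; check
  # plur.utils.GraphToOutputExample for the real API.
  edges = [edge for edge in graph_to_output_example.get_edges()
           if edge[0] != edge[1]]
  graph_to_output_example.set_edges(edges)
  return graph_to_output_example

stage_1_kwargs = dict()
# 'train_transformation_funcs' is assumed by analogy with the
# '*_filter_funcs' arguments above.
stage_2_kwargs = dict(
    train_transformation_funcs=(_remove_self_loops,),
)
create_dataset(dataset_name, dataset_stage_1_directory,
               dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)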

Advanced usage

plur.plur_data_generation.create_dataset() is just a thin wrapper around plur.stage_1.plur_dataset and plur.stage_2.graph_to_output_example_to_tfexample.

from plur.plur_data_generation import create_dataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
stage_2_kwargs = dict()
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

is equivalent to

from plur.stage_1.code2seq_dataset import Code2seqDataset
from plur.stage_2.graph_to_output_example_to_tfexample import GraphToOutputExampleToTfexample

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
dataset = Code2seqDataset(dataset_stage_1_directory)
dataset.stage_1_mkdirs()
dataset.download_dataset()
dataset.run_pipeline()

dataset = GraphToOutputExampleToTfexample(dataset_stage_1_directory, dataset_stage_2_directory, dataset_name)
dataset.stage_2_mkdirs()
dataset.run_pipeline()

See plur.stage_1.code2seq_dataset for the arguments relevant to the code2seq dataset. For example, the code2seq dataset provides java-small, java-med, and java-large datasets, so you can create the java-large dataset in this way.

from plur.stage_1.code2seq_dataset import Code2seqDataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'

dataset = Code2seqDataset(dataset_stage_1_directory, dataset_size='large')
dataset.stage_1_mkdirs()
dataset.download_dataset()
dataset.run_pipeline()

Adding a new dataset

All datasets should inherit from plur.stage_1.plur_dataset.PlurDataset and be placed under plur/stage_1/. PlurDataset requires you to implement the following (a hedged skeleton follows the list):

  • download_dataset(): Code to download the dataset. We provide download_dataset_using_git() to download from a git repository and download_dataset_using_requests() to download from a URL, which also works with a Google Drive URL. download_dataset_using_git() downloads the dataset at a specific commit id, and download_dataset_using_requests() checks the sha1sum of the downloaded files; this ensures that the same version of PLUR always downloads the same raw data.
  • get_all_raw_data_paths(): It should return a list of paths, where each path is a file containing raw data for the dataset.
  • raw_data_paths_to_raw_data_do_fn(): It should return a beam.DoFn class that overrides process(). process() tells Beam how to open the files returned by get_all_raw_data_paths(); it is also where each datum is assigned to a split (train/validation/test).
  • raw_data_to_graph_to_output_example(): This function transforms raw data produced by raw_data_paths_to_raw_data_do_fn() into a GraphToOutputExample.
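
To make this concrete, here is a hedged skeleton of a hypothetical FooDataset. It follows the descriptions above, but the exact method signatures, import paths, and helper arguments are assumptions; compare against plur.stage_1.plur_dataset.PlurDataset and the existing datasets under plur/stage_1/ before relying on it.

import apache_beam as beam

from plur.stage_1.plur_dataset import PlurDataset
from plur.utils import constants
from plur.utils.graph_to_output_example import GraphToOutputExample


class FooDataset(PlurDataset):
  """Hypothetical skeleton showing the four required overrides."""

  def download_dataset(self):
    # Download the raw data pinned to a fixed version, e.g. via the provided
    # helpers (the arguments here are placeholders, not the real signatures):
    #   self.download_dataset_using_git(<git_url>, <commit_id>)
    #   self.download_dataset_using_requests(<url>, <sha1sum>)
    pass

  def get_all_raw_data_paths(self):
    # Return a list of paths, one per file of raw data.
    return ['/tmp/foo_dataset/raw/data.jsonl']

  def raw_data_paths_to_raw_data_do_fn(self):
    # process() opens each file from get_all_raw_data_paths() and assigns
    # each datum to a split. Returned as an instance here; check the existing
    # datasets under plur/stage_1/ for the exact convention.
    class _ReadFooRawData(beam.DoFn):

      def process(self, path):
        with open(path) as raw_file:
          for line in raw_file:
            yield {'data': line, 'split': constants.TRAIN_SPLIT_NAME}

    return _ReadFooRawData()

  def raw_data_to_graph_to_output_example(self, raw_data):
    # Convert one raw datum into a GraphToOutputExample. The no-argument
    # constructor and the population step are illustrative only.
    graph_to_output_example = GraphToOutputExample()
    # ... add nodes, edges and the expected output here ...
    return graph_to_output_example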

Then add/change the following lines in plur/plur_data_generation.py:

from plur.stage_1.foo_dataset import FooDataset

flags.DEFINE_enum('dataset_name', 'dummy_dataset',
                  ['code2seq_dataset', 'dummy_dataset',
                   'funcom_dataset', 'great_var_misuse_dataset',
                   'hoppity_single_ast_diff_dataset',
                   'manysstubs4j_dataset', 'foo_dataset'],
                  'Name of the dataset to generate data.')


def get_dataset_class(dataset_name):
  """Get the dataset class based on dataset_name."""
  if dataset_name == 'code2seq_dataset':
    return Code2SeqDataset
  elif dataset_name == 'dummy_dataset':
    return DummyDataset
  elif dataset_name == 'funcom_dataset':
    return FuncomDataset
  elif dataset_name == 'great_var_misuse_dataset':
    return GreatVarMisuseDataset
  elif dataset_name == 'hoppity_single_ast_diff_dataset':
    return HoppitySingleAstDiffDataset
  elif dataset_name == 'manysstubs4j_dataset':
    return ManySStuBs4JDataset
  elif dataset_name == 'foo_dataset':
    return FooDataset
  else:
    raise ValueError('{} is not supported.'.format(dataset_name))

Evaluation details

The details of how evaluation is performed are in plur/eval/README.md.

License

Licensed under the Apache 2.0 License.

Disclaimer

This is not an officially supported Google product.

Citation

Please cite the PLUR paper: Chen et al., "PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair", NeurIPS 2021. https://proceedings.neurips.cc//paper/2021/hash/c2937f3a1b3a177d2408574da0245a19-Abstract.html

Comments
  • UnicodeDecodeError Query

While following the installation process I encountered errors showing:

    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 2625450: character maps to <undefined> [while running 'Read all raw data']

    I am attaching a screenshot for convenience. I am not sure whether this is a problem with my device. I am running the code on Windows.

    Is this a problem with the OS? Do I have to use a Linux-based OS to run PLUR (or is it recommended)?

    Thank you.

    opened by akibzaman 2
  • Correct PLUR Evaluation description

    Corrects the example command to include the _pattern suffix for both prediction and target files, and adds a brief description of how to use wildcards in the file patterns.

    opened by VHellendoorn 0
  • Fix package versions to be compatible

Installing without freezing package versions leads to compatibility issues between (the latest versions of) the required packages. The version numbers proposed here were arrived at through trial and error, by iteratively raising or lowering version numbers as required, and may not represent the most up-to-date compatible versions for all packages. The proposed configuration successfully installs PLUR following the steps in the README in a clean Python:3.9 Docker container.

    opened by VHellendoorn 0
  • FailedPreconditionError error during evaluation

When running evaluation for the hoppity dataset, we encounter the following error:

    Traceback (most recent call last):
      File "train.py", line 340, in <module>
      File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/absl/app.py", line 308, in run
        _run_main(main, args)
      File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
        sys.exit(main(argv))
      File "train.py", line 281, in main
      File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 132, in evaluate
      File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 245, in generate_predictions
      File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 160, in evaluate_chunk
      File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 329, in _evaluate_chunk
      File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/plur_data_loader.py", line 459, in __next__
      File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/data_generation.py", line 33, in __call__
      File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4635, in __next__
        return nest.map_structure(to_numpy, next(self._iterator))
      File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 766, in __next__
        return self._next_internal()
      File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 749, in _next_internal
        ret = gen_dataset_ops.iterator_get_next(
      File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3017, in iterator_get_next
        _ops.raise_from_not_ok_status(e, name)
      File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7209, in raise_from_not_ok_status
        raise core._status_to_exception(e) from None  # pylint: disable=protected-access
    tensorflow.python.framework.errors_impl.FailedPreconditionError: {{function_node __wrapped__IteratorGetNext_output_types_10_device_/job:localhost/replica:0/task:0/device:CPU:0}} /arc/project/st-amesbah-1/plur-data/stage_2/tfrecords/test/hoppity_single_ast_diff_dataset-00761-of-01000.tfrecord; Bad file descriptor [Op:IteratorGetNext]
    

    Could you please assist us in debugging this error @smoitra-g?

    opened by nashid 1
  • Hoppity dataset generation

    Hi,

I'm trying to run the data generation script using the cooked graphs from my dataset. However, I see in your code that you use something called hoppity_cg.tar.gz (https://github.com/google-research/plur/blob/main/plur/stage_1/hoppity_single_ast_diff_dataset.py#L61) to get some json files. What is this used for? It was not available in the hoppity repo; is this some pre-processing that you have done on your end?

    opened by msintaha 1
  • Query about evaluation script

The README mentions running plur_evaluator.py:

python3 plur_evaluator.py \
   --dataset_name=manysstubs4j_dataset \
   --target_file_pattern=/tmp/manysstubs4j_dataset/targets.txt \
   --prediction_file_pattern=/tmp/manysstubs4j_dataset/predictions.txt
    

Here the ground truth is targets.txt (--target_file_pattern=/tmp/manysstubs4j_dataset/targets.txt).

    I have two queries:

    • What's the format of this file?
• Shouldn't this targets.txt file be created after running the data generation script (plur_data_generation.py)?

    But after running plur_data_generation.py, I don't find targets.txt in the /tmp/manysstubs4j_dataset/ folder.

    @smoitra87 @VHellendoorn @dan-zheng can you please help me with this query?

    opened by smith-co 3
  • Output Token Generation Capability | Program Repairing

Hello. I am working on a program repair project using your platform. The paper states the following:

    The TOCOPO output can be intuitively viewed as a script that describes the task output in terms of tokens, drawn from the output vocabulary, and pointers pointing to some input node, concluding with a DONE token marking the end of the output. Every task can make its own use of these facilities to express a grammar for its output. For example, a classification task can use token outputs, one per expected class; a sequence-prediction task can produce a sequence of tokens; a repair task can use a pointer to point at a particular input node, and a token output to replace that input node.

My query is regarding the highlighted line. Can the output handle the occurrence of multiple errors in an input node, i.e., can it produce multiple tokens to handle the errors? Or is it limited to a single token per node? Could you clarify this point?

    opened by akibzaman 3
  • Preserving the Characteristics of A Custom Graph

The PLUR framework can accept a custom graph and produce the task output accordingly. The framework is also capable of incorporating relational information that can represent syntax, data flow, and control flow.

    I had the following query regarding this point while exploring the PLUR framework:

If a graph with various relational edge types is given as input, does PLUR preserve the characteristics of those edges when feeding them into the model, or does it create a common representation for all types of graphs? In summary, does PLUR preserve the characteristics of a custom graph?

    opened by akibzaman 3