PLUR is a collection of source code datasets suitable for graph-based machine learning.

Google Research

Last update: Nov 25, 2022

Overview

PLUR

PLUR (Programming-Language Understanding and Repair) is a collection of source code datasets suitable for graph-based machine learning. We provide scripts for downloading, processing, and loading the datasets, all through a unified API and shared data structures.

Installation

SRC_DIR=${PWD}/src
mkdir -p ${SRC_DIR} && cd ${SRC_DIR}
# For Cubert.
git clone https://github.com/google-research/google-research --depth=1
export PYTHONPATH=${PYTHONPATH}:${SRC_DIR}/google-research
git clone https://github.com/google-research/plur && cd plur
python -m pip install -r requirements.txt
python setup.py install

Test execution on small dataset

cd plur
python3 plur_data_generation.py --dataset_name=manysstubs4j_dataset \
  --stage_1_dir=/tmp/manysstubs4j_dataset/stage_1 \
  --stage_2_dir=/tmp/manysstubs4j_dataset/stage_2 \
  --train_data_percentage=40 \
  --validation_data_percentage=30 \
  --test_data_percentage=30

Usage

Basic usage

Data generation (step 1)

Data generation is done by calling plur.plur_data_generation.create_dataset(). The data generation runs in two stages:

1. Convert raw data to plur.utils.GraphToOutputExample.
2. Convert plur.utils.GraphToOutputExample to TFExample.

Stage 1 is unique for each dataset, but stage 2 is the same for almost all datasets.
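The split between a dataset-specific stage 1 and a shared stage 2 can be pictured with a toy sketch. Everything in it is hypothetical: the dict-based intermediate record, the two converter functions and the JSON serializer are stand-ins chosen for illustration, not PLUR's GraphToOutputExample or TFExample code.

```python
import json


def make_record(nodes, edges, output):
    """Hypothetical intermediate record shared by all toy datasets."""
    return {"nodes": list(nodes), "edges": list(edges), "output": list(output)}


# Stage 1: one converter per dataset, each understanding its own raw format.
def stage_1_from_token_line(raw_line):
    """Toy dataset A: 'tok tok tok<TAB>label'; tokens are chained by edges."""
    source, label = raw_line.split("\t")
    tokens = source.split()
    edges = [(i, i + 1) for i in range(len(tokens) - 1)]
    return make_record(tokens, edges, [label])


def stage_1_from_pair(raw_pair):
    """Toy dataset B: (input_tokens, output_tokens) tuples."""
    tokens, output_tokens = raw_pair
    edges = [(i, i + 1) for i in range(len(tokens) - 1)]
    return make_record(tokens, edges, output_tokens)


# Stage 2: a single serializer that works on any intermediate record.
def stage_2_serialize(record):
    return json.dumps(record, sort_keys=True)


record_a = stage_1_from_token_line("int x = 0\tdeclaration")
record_b = stage_1_from_pair((["return", "x"], ["get", "x"]))
serialized = [stage_2_serialize(r) for r in (record_a, record_b)]
```

Because both converters emit the same record shape, the serializer never needs to know which dataset a record came from, which is the property that lets stage 2 be shared.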

from plur.plur_data_generation import create_dataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
stage_2_kwargs = dict()
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

plur_data_generation.py also provides a command line interface, but it offers less flexibility.

python3 plur_data_generation.py --dataset_name=code2seq_dataset --stage_1_dir=/tmp/code2seq_dataset/stage_1 --stage_2_dir=/tmp/code2seq_dataset/stage_2

Data loader (step 2)

After the data is generated, you can use PlurDataLoader to load the data. The data loader loads TFExamples but returns them as numpy arrays.

from plur.plur_data_loader import PlurDataLoader
from plur.utils import constants

dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
split = constants.TRAIN_SPLIT_NAME
batch_size = 32
repeat_count = -1
drop_remainder = True
train_data_generator = PlurDataLoader(dataset_stage_2_directory, split, batch_size, repeat_count, drop_remainder)

for batch_data in train_data_generator:
  # your training loop...
  pass
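The roles of batch_size, repeat_count and drop_remainder can be illustrated without PLUR installed. The generator below is a hypothetical stand-in, not PlurDataLoader itself: it assumes that a negative repeat_count means "repeat indefinitely" and that drop_remainder discards a final partial batch, and it yields plain lists rather than numpy arrays.

```python
import itertools


def batch_generator(examples, batch_size, repeat_count=1, drop_remainder=False):
    """Yield lists of up to `batch_size` examples, pass after pass."""
    # Assumption: a negative repeat_count means "repeat forever".
    if repeat_count < 0:
        passes = itertools.count()
    else:
        passes = range(repeat_count)
    for _ in passes:
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            if drop_remainder and len(batch) < batch_size:
                continue  # skip the final, smaller batch
            yield batch


examples = list(range(10))
one_pass = list(batch_generator(examples, batch_size=4, drop_remainder=True))
with_remainder = list(batch_generator(examples, batch_size=4))
# An endless generator has to be cut off by the caller.
endless = batch_generator(examples, 4, repeat_count=-1, drop_remainder=True)
first_five = list(itertools.islice(endless, 5))
```

Under these assumptions, a loop over a generator created with repeat_count = -1 never ends on its own, so the training loop needs its own stopping condition, such as a maximum number of steps.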

Training (step 3)

This is the part where you use your own model to train on the PLUR data.

The models and the training code from the PLUR paper are not yet part of the current release. We plan to release them in the near future.

Evaluating (step 4)

Once the training is finished, you can generate the predictions on the test data and use plur_evaluator.py to evaluate the performance. plur_evaluator.py works in offline mode, meaning that it expects a file containing the ground truths, and a file containing the predictions.

python3 plur_evaluator.py --dataset_name=code2seq_dataset --target_file_pattern=/tmp/code2seq_dataset/targets.txt --prediction_file_pattern=/tmp/code2seq_dataset/predictions.txt
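To make the idea of offline evaluation concrete, here is a minimal sketch that compares a ground-truth file with a prediction file. It rests on assumptions that this README does not confirm: that each file holds one example per line, that the two files are aligned by line number, and that exact match is the metric of interest. The function name is hypothetical, and the actual formats and metrics used by plur_evaluator.py are described in plur/eval/README.md.

```python
import os
import tempfile


def exact_match_accuracy(target_path, prediction_path):
    """Fraction of lines where the prediction equals the target."""
    with open(target_path, encoding="utf-8") as f:
        targets = [line.rstrip("\n") for line in f]
    with open(prediction_path, encoding="utf-8") as f:
        predictions = [line.rstrip("\n") for line in f]
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions differ in number of lines")
    if not targets:
        return 0.0
    matches = sum(t == p for t, p in zip(targets, predictions))
    return matches / len(targets)


# Write two small files and evaluate them without touching the model.
with tempfile.TemporaryDirectory() as tmp_dir:
    target_file = os.path.join(tmp_dir, "targets.txt")
    prediction_file = os.path.join(tmp_dir, "predictions.txt")
    with open(target_file, "w", encoding="utf-8") as f:
        f.write("get name\nset value\nto string\n")
    with open(prediction_file, "w", encoding="utf-8") as f:
        f.write("get name\nset name\nto string\n")
    accuracy = exact_match_accuracy(target_file, prediction_file)
```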

Transforming and filtering data

If you want to change something fundamental in the dataset, apply the change in stage 1 of data generation; otherwise apply it in stage 2. The idea is that stage 1 should only be run once per dataset (to create the plur.utils.GraphToOutputExample), and stage 2 should be run each time you want to train on different data (to create the TFRecords).

All transformation and filtering functions are applied to plur.utils.GraphToOutputExample; see plur.utils.GraphToOutputExample for more information.

An example of a transformation suited to stage 1: your model expects the graphs in the dataset to have no loops, so you write a transformation function that removes them. This ensures that stage 2 only ever reads graphs without loops.

An example of a filter suited to stage 2: you want to check your model's performance on different graph sizes, measured in number of nodes, so you write a filter function that filters out graphs with a large number of nodes.
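A loop-removing transformation could look roughly like the sketch below. It makes two assumptions: "loop" is read as a self-loop (an edge whose source and destination are the same node), and edges are represented as plain (source, destination, edge_type) tuples. That representation and the edge type names are hypothetical; a real transformation would operate on plur.utils.GraphToOutputExample instead.

```python
def remove_self_loops(edges):
    """Return the edges whose source and destination nodes differ."""
    return [edge for edge in edges if edge[0] != edge[1]]


edges = [
    (0, 1, "NEXT_TOKEN"),
    (1, 1, "SELF"),        # self-loop, will be dropped
    (1, 2, "NEXT_TOKEN"),
    (2, 2, "SELF"),        # self-loop, will be dropped
]
cleaned = remove_self_loops(edges)
```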

from plur.plur_data_generation import create_dataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'


def _filter_graph_size(graph_to_output_example, graph_size=1024):
  # Keep only examples whose graph has at most graph_size nodes.
  return len(graph_to_output_example.get_nodes()) <= graph_size


stage_2_kwargs = dict(
    train_filter_funcs=(_filter_graph_size,),
    validation_filter_funcs=(_filter_graph_size,)
)
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

Advanced usage

plur.plur_data_generation.create_dataset() is just a thin wrapper around plur.stage_1.plur_dataset and plur.stage_2.graph_to_output_example_to_tfexample.

from plur.plur_data_generation import create_dataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
stage_2_kwargs = dict()
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

is equivalent to

from plur.stage_1.code2seq_dataset import Code2SeqDataset
from plur.stage_2.graph_to_output_example_to_tfexample import GraphToOutputExampleToTfexample

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'
dataset_stage_2_directory = '/tmp/code2seq_dataset/stage_2'
dataset = Code2SeqDataset(dataset_stage_1_directory)
dataset.stage_1_mkdirs()
dataset.download_dataset()
dataset.run_pipeline()

dataset = GraphToOutputExampleToTfexample(dataset_stage_1_directory, dataset_stage_2_directory, dataset_name)
dataset.stage_2_mkdirs()
dataset.run_pipeline()

See plur.stage_1.code2seq_dataset for the arguments specific to the code2seq dataset. For example, the code2seq dataset provides java-small, java-med and java-large datasets, so you can create the java-large dataset this way:

from plur.stage_1.code2seq_dataset import Code2SeqDataset

dataset_name = 'code2seq_dataset'
dataset_stage_1_directory = '/tmp/code2seq_dataset/stage_1'

dataset = Code2SeqDataset(dataset_stage_1_directory, dataset_size='large')
dataset.stage_1_mkdirs()
dataset.download_dataset()
dataset.run_pipeline()

Adding a new dataset

All datasets should inherit from plur.stage_1.plur_dataset.PlurDataset and be placed under plur/stage_1/. Inheriting from PlurDataset requires you to implement:

download_dataset(): Code to download the dataset. We provide download_dataset_using_git() to download from git and download_dataset_using_requests() to download from a URL, which also works with a Google Drive URL. download_dataset_using_git() downloads the dataset at a specific commit id, and download_dataset_using_requests() checks the sha1sum of the downloaded files. This ensures that the same version of PLUR downloads the same raw data.
get_all_raw_data_paths(): Should return a list of paths, where each path is a file containing raw data from the dataset.
raw_data_paths_to_raw_data_do_fn(): Should return a beam.DoFn class that overrides process(). process() should tell Beam how to open the files returned by get_all_raw_data_paths(). This is also where we define which split (train/validation/test), if any, the data belongs to.
raw_data_to_graph_to_output_example(): Transforms raw data from raw_data_paths_to_raw_data_do_fn() into GraphToOutputExample.
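The checksum idea mentioned for download_dataset() can be sketched with the standard library alone. This is an illustration of pinning a download to a known SHA-1 digest, not the code behind download_dataset_using_requests(); the helper names sha1sum() and verify_download() are hypothetical.

```python
import hashlib
import os
import tempfile


def sha1sum(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha1):
    """Raise ValueError if the file's checksum differs from the pinned one."""
    actual = sha1sum(path)
    if actual != expected_sha1:
        raise ValueError(
            "sha1 mismatch for {}: expected {}, got {}".format(
                path, expected_sha1, actual))


with tempfile.TemporaryDirectory() as tmp_dir:
    raw_file = os.path.join(tmp_dir, "raw_data.txt")
    with open(raw_file, "wb") as f:
        f.write(b"hello world")
    checksum = sha1sum(raw_file)
    verify_download(raw_file, checksum)  # matching digest: no exception
    try:
        verify_download(raw_file, "0" * 40)  # wrong digest: must raise
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

Pinning a digest next to the download URL means that a silently changed upstream file fails loudly instead of producing a subtly different dataset.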

Then add/change the following lines in plur/plur_data_generation.py:

from plur.stage_1.foo_dataset import FooDataset

flags.DEFINE_enum('dataset_name', 'dummy_dataset',
                  ['code2seq_dataset', 'dummy_dataset',
                   'funcom_dataset', 'great_var_misuse_dataset',
                   'hoppity_single_ast_diff_dataset',
                   'manysstubs4j_dataset', 'foo_dataset'],
                  'Name of the dataset to generate data.')


def get_dataset_class(dataset_name):
  """Get the dataset class based on dataset_name."""
  if dataset_name == 'code2seq_dataset':
    return Code2SeqDataset
  elif dataset_name == 'dummy_dataset':
    return DummyDataset
  elif dataset_name == 'funcom_dataset':
    return FuncomDataset
  elif dataset_name == 'great_var_misuse_dataset':
    return GreatVarMisuseDataset
  elif dataset_name == 'hoppity_single_ast_diff_dataset':
    return HoppitySingleAstDiffDataset
  elif dataset_name == 'manysstubs4j_dataset':
    return ManySStuBs4JDataset
  elif dataset_name == 'foo_dataset':
    return FooDataset
  else:
    raise ValueError('{} is not supported.'.format(dataset_name))
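As the list of datasets grows, the same lookup could also be written as a table. The sketch below is an alternative shown for illustration only, not how plur_data_generation.py is written; DATASET_CLASSES is a hypothetical name, and the two classes are empty stand-ins so that the example runs without PLUR installed.

```python
class Code2SeqDataset:
    """Empty stand-in for the real dataset class."""


class FooDataset:
    """Empty stand-in for the newly added dataset class."""


# Map each dataset name to its class; adding a dataset is one new entry.
DATASET_CLASSES = {
    'code2seq_dataset': Code2SeqDataset,
    'foo_dataset': FooDataset,
}


def get_dataset_class(dataset_name):
    """Get the dataset class based on dataset_name."""
    try:
        return DATASET_CLASSES[dataset_name]
    except KeyError:
        raise ValueError('{} is not supported.'.format(dataset_name)) from None


foo_class = get_dataset_class('foo_dataset')
```

With such a table, the keys could in principle also feed the list of allowed values for the dataset_name flag, keeping the two in sync.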

Evaluation details

The details of how evaluation is performed are in plur/eval/README.md.

License

Licensed under the Apache 2.0 License.

Disclaimer

This is not an officially supported Google product.

Citation

Please cite the PLUR paper, Chen et al. https://proceedings.neurips.cc//paper/2021/hash/c2937f3a1b3a177d2408574da0245a19-Abstract.html

Comments

UnicodeDecodeError Query

While following the installation process, I ran into the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 2625450: character maps to <undefined> [while running 'Read all raw data']

I am attaching the screenshot for convenience. I am not sure whether this is a problem with my device. I am running the code on Windows.

Is this an OS-related problem? Do I have to use a Linux-based OS to run PLUR, or is that recommended?

Thank you.

opened by akibzaman 2

Correct PLUR Evaluation description

Corrects the example command to include the _pattern suffix for both prediction and target files, and adds a brief description of how to use wildcards in the file patterns.

opened by VHellendoorn 0

Fix package versions to be compatible

Installing without freezing package versions leads to compatibility issues between (the latest versions of) the required packages. The version numbers proposed here were arrived at through trial and error, by iteratively raising or lowering version numbers when required, and may not represent the most up-to-date compatible versions for all packages. The proposed configuration successfully installs PLUR following the steps in the README in a clean Python:3.9 Docker container.

opened by VHellendoorn 0

FailedPreconditionError error during evaluation

At the time of running evaluation for the hoppity dataset, we are encountering the following error:

Traceback (most recent call last):
  File "train.py", line 340, in <module>
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "train.py", line 281, in main
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 132, in evaluate
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 245, in generate_predictions
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 160, in evaluate_chunk
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/evaluation.py", line 329, in _evaluate_chunk
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/plur_data_loader.py", line 459, in __next__
  File "/scratch/st-amesbah-1/plur-experiment/src/plur/plur/model_design/data_generation.py", line 33, in __call__
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4635, in __next__
    return nest.map_structure(to_numpy, next(self._iterator))
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 766, in __next__
    return self._next_internal()
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 749, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3017, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/arc/project/st-amesbah-1/conda-envs/plur/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7209, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.FailedPreconditionError: {{function_node __wrapped__IteratorGetNext_output_types_10_device_/job:localhost/replica:0/task:0/device:CPU:0}} /arc/project/st-amesbah-1/plur-data/stage_2/tfrecords/test/hoppity_single_ast_diff_dataset-00761-of-01000.tfrecord; Bad file descriptor [Op:IteratorGetNext]

Could you please assist us in debugging this error @smoitra-g?

opened by nashid 1

Hoppity dataset generation

Hi,

I'm trying to run the data generation script using the cooked graphs from my dataset. However, I see in your code that you use something called hoppity_cg.tar.gz (https://github.com/google-research/plur/blob/main/plur/stage_1/hoppity_single_ast_diff_dataset.py#L61) to get some JSON files. What is this used for? It was not available in the hoppity repo. Is it the result of some pre-processing that you have done on your end?

opened by msintaha 1

Query about evaluation script

The README says to run plur_evaluator.py:

python3 plur_evaluator.py --dataset_name=manysstubs4j_dataset --target_file_pattern=/tmp/manysstubs4j_dataset/targets.txt --prediction_file_pattern=/tmp/manysstubs4j_dataset/predictions.txt

Here the ground truth is targets.txt (--target_file_pattern=/tmp/manysstubs4j_dataset/targets.txt).

I have two queries:

What's the format of this file?

Shouldn't this targets.txt file be created after running the data generation script (plur_data_generation.py)?

But after running plur_data_generation.py, I don't find the targets.txt in the /tmp/manysstubs4j_dataset/ folder.

@smoitra87 @VHellendoorn @dan-zheng can you please help me with this query?

opened by smith-co 3

Output Token Generation Capability | Program Repairing

Hello. I am working on a program repair project using your platform. The paper states the following:

The TOCOPO output can be intuitively viewed as a script that describes the task output in terms of tokens, drawn from the output vocabulary, and pointers pointing to some input node, concluding with a DONE token marking the end of the output. Every task can make its own use of these facilities to express a grammar for its output. For example, a classification task can use token outputs, one per expected class; a sequence-prediction task can produce a sequence of tokens; a repair task can use a pointer to point at a particular input node, and a token output to replace that input node.

My query is regarding the highlighted line. Can the output handle multiple errors in an input node, i.e., can it produce multiple tokens to fix them, or is it limited to a single token per node? Could you clarify this point?

opened by akibzaman 3

Preserving the Characteristics of A Custom Graph

The PLUR framework can accept a custom graph and produce the task output accordingly. The framework is also capable of incorporating relational information that can be used to represent syntax, data flow and control flow.

I had the following query regarding this point while exploring the PLUR framework:

If a graph with various relational edges is given as input, does PLUR preserve the characteristics of the edges when feeding it into the model, or does it create a common representation for all types of graph? In summary, does PLUR preserve the characteristics of a custom graph?

opened by akibzaman 3
