[ICML 2020] DrRepair: Learning to Repair Programs from Error Messages

Michihiro Yasunaga

Last update: Jan 8, 2023

Overview

This repo provides the source code & data of our paper: Graph-based, Self-Supervised Program Repair from Diagnostic Feedback (ICML 2020).

@InProceedings{Yasunaga20DrRepair,
  author =  {Michihiro Yasunaga and Percy Liang},
  title =   {Graph-based, Self-Supervised Program Repair from Diagnostic Feedback},
  year =    {2020},  
  booktitle =   {International Conference on Machine Learning (ICML)},  
}

Dependencies

GCC: Follow the SPoC requirement (https://github.com/Sumith1896/spoc)
Python 3.6.8 (e.g. conda create -n DrRepair python=3.6.8)
Python libraries
- torch==1.0.1, numpy, tqdm, regex, joblib, pyyaml, bottle, cheroot, tensorboardX
- clang==8.0.1 (do the following)
```
conda config --add channels conda-forge
conda install python-clang==8.0.1
```
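
Before preprocessing or training, it can help to confirm that the libraries listed above are importable in the active environment. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the import names (for example `yaml` for pyyaml and `clang` for python-clang) are assumptions about how each package is imported.

```python
import importlib.util

# Import names assumed for the packages listed above
# (pyyaml is imported as "yaml", python-clang as "clang").
REQUIRED = ["torch", "numpy", "tqdm", "regex", "joblib", "yaml",
            "bottle", "cheroot", "tensorboardX", "clang"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that each module can be located; it does not check versions or that libclang itself loads.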

Data

Download all the raw data (DeepFix, SPoC, and the Codeforces programs used for pretraining) by running

./download_raw_data.sh

You can preprocess the raw data to get the program repair data by running the commands in

data/1.run-gen-err-dataset--orig-spoc.sh
data/2.run-gen-err-dataset--auto-corrupt--spoc.sh
data/3.run-gen-err-dataset--auto-corrupt--deepfix.sh

However, this takes a significant amount of time, so for convenience you can instead download all the preprocessed data by running

./download_preprocessed_data.sh

The repo structure looks like the following:

.
└─ raw_data/
   ├── codeforce_data/                  (raw programs from codeforce)
   ├── deepfix_data/                    (raw programs from deepfix)
   └── spoc_data/
       ├── spoc                              (SPoC data release)
       └── translation_preds                 (line-level code predictions from Kulal+19)

└─ data/                             
   ├── *.sh, *.py                       (preprocessing scripts)
   ├── err-data-compiler--orig-spoc/    (preprocessed, program repair data for spoc)
   ├── err-dev-compiler--for-SPoC/      (└─ dev data for spoc)
   ├── err-vocab-compiler--for-SPoC/    (└─ vocab for spoc)
   ...
   ... [similarly for deepfix and pre-training]

└─ utils/                      (utilities for code processing)

└─ model/                      (DrRepair model)

└─ evaluation/                 (to evaluate Repair model on deepfix/spoc test)
   ├── deepfix
   └── spoc
       ├── translation_preds_test/           (line-level code predictions from Kulal+19 for TestP/TestW)
       ...
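
After downloading, a quick check that the top-level directories from the layout above are present can catch an incomplete download early. This is a minimal sketch; `missing_dirs` and `EXPECTED_DIRS` are hypothetical names, and the list covers only the top-level directories shown in the tree.

```python
from pathlib import Path

# Top-level directories taken from the layout shown above.
EXPECTED_DIRS = ["raw_data", "data", "utils", "model", "evaluation"]


def missing_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected directories that are absent under repo_root."""
    root = Path(repo_root)
    return [d for d in expected if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from the repository root.
    print("missing directories:", missing_dirs(".") or "none")
```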

Train models

Let's train program repair models. First, go to the model directory. Then run the commands listed in run_deepfix.sh or run_spoc.sh. For example, to train DrRepair ("base + graph" in the paper) on the DeepFix data, run:

name="code-compiler--2l-graph"
mkdir -p out_deepfix/${name}
python3 -u main_deepfix.py -o ${name} train \
    configs/base.yml  configs/data-deepfix/err-data-orig.yml \
    configs/model-code-compiler/2l-graph--dec-attn-all.yml
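
If you prefer to launch training from a Python script, the same invocation can be assembled as an argument list. The sketch below only builds and prints the command; `train_command` is a hypothetical helper, and its defaults simply mirror the script name and config files shown above.

```python
import shlex


def train_command(name,
                  script="main_deepfix.py",
                  configs=("configs/base.yml",
                           "configs/data-deepfix/err-data-orig.yml",
                           "configs/model-code-compiler/2l-graph--dec-attn-all.yml")):
    """Build the argument list for the training run shown above."""
    return ["python3", "-u", script, "-o", name, "train", *configs]


if __name__ == "__main__":
    cmd = train_command("code-compiler--2l-graph")
    # Print a shell-quoted version of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually start the run, the list could be passed to `subprocess.run(cmd, check=True)` from inside the model directory, after creating `out_deepfix/<name>`.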

Evaluate models

We run the trained program repair model as a server. We then call this model on application tasks (DeepFix and SPoC) to evaluate the usefulness of the model.

DeepFix

1. Start server

First, go to the model directory. Run a trained model (e.g. code-compiler--2l-graph) as a server by

name="SERVER--code-compiler--2l-graph"
mkdir -p out_deepfix/${name}
python3 -u main_deepfix.py -o ${name} server -p <port> \
    -l out_deepfix/code-compiler--2l-graph/<checkpoint> \
    configs/base.yml  configs/data-deepfix/err-data-orig.yml \
    configs/model-code-compiler/2l-graph--dec-attn-all.yml

For <port>, pick a port number (e.g. 8080) for the server. For <checkpoint>, pick a checkpoint (e.g. 150000) of the trained model. Then run ifconfig to get the IP address (e.g. 172.24.67.161) of the machine hosting this model. Concrete examples are provided in the second half of model/run_deepfix.sh.
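
Once the server is running, the evaluation scripts need its address in the form `http://<IP>:<port>/pred`. The following sketch forms that URL and checks that the chosen port accepts TCP connections. Both function names are hypothetical, and the check only confirms that something is listening on the port, not that it is the repair model.

```python
import socket


def repairer_url(ip, port):
    """Form the URL passed to --repairer-server in the evaluation scripts."""
    return "http://{}:{}/pred".format(ip, port)


def port_is_open(ip, port, timeout=2.0):
    """Return True if a TCP connection to (ip, port) succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    ip, port = "172.24.67.161", 8080  # example values from the text above
    print(repairer_url(ip, port), "reachable:", port_is_open(ip, port))
```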

2. Run model on DeepFix test

Go to the evaluation/deepfix directory. First, set the following paths:

repo_root="../../../.."
program_data_root=${repo_root}"/raw_data/deepfix_data"
test_split_root=${repo_root}"/data/err-data-compiler--auto-corrupt--orig-deepfix/bin4"

To run the trained model on the DeepFix test examples, do

name="code-compiler--2l-graph"
mkdir -p out/${name}/log
cd out/${name}

for entry in "${test_split_root}"/*
do
  probid=$(basename "$entry")
  python3 -u ../../test_deepfix.py \
    --input-code-dir "${program_data_root}/${probid}/erroneous" \
    --repairer-server http://<IP>:<port>/pred
done

Here, replace <IP> and <port> with the IP address and port number of the server. After the loop completes, you can get the test accuracy by

python3 -u ../../collate_deepfix.py

Concrete examples are provided in evaluation/run_test_deepfix.sh.
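
The shell loop over the test split can also be expressed in Python, which makes it easier to log or filter problem IDs. This is a minimal sketch that only builds the commands; `deepfix_commands` is a hypothetical helper, and the flags are the ones used in the loop above.

```python
from pathlib import Path


def deepfix_commands(test_split_root, program_data_root, server_url):
    """Yield one test_deepfix.py invocation per problem ID in the test split."""
    for entry in sorted(Path(test_split_root).iterdir()):
        if entry.name.startswith("."):
            continue  # the shell glob `*` skips hidden entries as well
        probid = entry.name  # same as `basename $entry`
        input_dir = Path(program_data_root) / probid / "erroneous"
        yield ["python3", "-u", "../../test_deepfix.py",
               "--input-code-dir", str(input_dir),
               "--repairer-server", server_url]
```

Each yielded list could be passed to `subprocess.run(cmd, check=True)` from inside `out/<name>`, matching the working directory used by the shell version.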

SPoC

1. Start server

First, go to the model directory. Run a trained model (e.g. code-compiler--2l-graph--finetune) as a server by

name="SERVER--code-compiler--2l-graph--finetune"
mkdir -p out_spoc/${name}
python3 -u main_spoc.py -o ${name} server -p <port> \
    -l out_spoc/code-compiler--2l-graph--finetune/<checkpoint> \
    configs/base.yml  configs/data-spoc/err-data-orig.yml \
    configs/model-code-compiler/2l-graph--dec-attn-all.yml

Similar to DeepFix, pick a port number and a checkpoint, and get the IP address. Concrete examples are provided in the second half of model/run_spoc.sh.

2. Run model on SPoC test

Go to the evaluation/spoc directory. First, set the following path:

repo_root="../../../.."

To run the trained model on all the programs in SPoC TestW, do

name="code-compiler--2l-graph--finetune"

INPUT=translation_preds_test/testw    #change to testp if you want to evaluate on testp
N=$(tail -n+2 ${INPUT}.tsv | cut -f 3-6 | uniq | wc -l)  # Count the number of programs
interval=10

mkdir -p out_testw/${name}/log        #change to testp if you want to evaluate on testp
cd out_testw/${name}                  #change to testp if you want to evaluate on testp

i=1
while [[ $i -le $N ]]; do
  python3 -u ../../test_spoc.py -p 100 \
  --compile-budget 100 --n-parallel ${interval} \
  --repairer-server  http://<IP>:<port>/pred \
  ../../${INPUT} $i
  i=$(($i + ${interval}))
done

Here, replace <IP> and <port> with the IP address and port number of the server. After the loop completes, you can get the test accuracy by

python3 -u ../../collate_spoc.py

Concrete examples are provided in evaluation/run_test_spoc.sh.
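
The program count `N` and the batch start indices in the loop above can be reproduced in Python for a sanity check. The sketch below mirrors `tail -n+2 | cut -f 3-6 | uniq | wc -l` and the `i=1; i+=interval` stepping; the function names are hypothetical, and the handling of malformed rows is an approximation of what `cut` does.

```python
from itertools import groupby


def count_programs(tsv_lines):
    """Count runs of identical values in columns 3-6, skipping the header."""
    rows = list(tsv_lines)[1:]  # `tail -n+2` drops the header line
    keys = (tuple(r.rstrip("\n").split("\t")[2:6]) for r in rows)
    # `uniq` collapses only adjacent duplicates, which groupby also does.
    return sum(1 for _ in groupby(keys))


def batch_starts(n_programs, interval):
    """Return the 1-based start indices passed to test_spoc.py, one per batch."""
    return list(range(1, n_programs + 1, interval))


if __name__ == "__main__":
    with open("translation_preds_test/testw.tsv") as f:
        n = count_programs(f)
    print(n, batch_starts(n, 10))
```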

Acknowledgment

The original DeepFix and SPoC data used in this work come from the following papers:

DeepFix: Fixing common C language errors by deep learning. Rahul Gupta, Soham Pal, Aditya Kanade, Shirish Shevade. AAAI 2017.
SPoC: Search-based Pseudocode to Code. Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken and Percy Liang. NeurIPS 2019.

Comments

about dataset

Hi, dear author @michiyasunaga. I successfully downloaded the datasets and tried to understand the contents of the JSON files. If I understand correctly, the code in X['lines'] is the fixed code. Is that right?

I also found that each program (a single JSON file) contains many groups of bug information, each given by ["err_line", "err_msg", "mod_line", "mod_code"]. If my understanding is correct, starting from the working program in X['lines'], we can independently generate multiple buggy programs, one per group of ["err_line", "err_msg", "mod_line", "mod_code"]. Is that the correct usage?

opened by HarrialX 1

What is the word "Vanilla" or "Substitute"?

Hello, can you tell me what "Vanilla" and "Substitute" mean? I can see "SubstituteErrData" and "VanillaErrData". Why were these names chosen, and what is the difference between them? Thanks.

opened by yayahgh 0

clang error

./2.run-gen-err-dataset--auto-corrupt--spoc.sh 

mkdir: cannot create directory ‘err-data-compiler--auto-corrupt--additional-codeforce--spoc-style’: File exists
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):

.local/lib/python3.8/site-packages/clang/cindex.py", line 4178, in get_cindex_library
    raise LibclangError(msg)
clang.cindex.LibclangError: libclang-11.so: cannot open shared object file: No such file or directory. To provide $
 path to libclang use Config.set_library_path() or Config.set_library_file().
"""

This fixed it for me

pip install clang
pip install libclang

opened by rajvijay68 0

Unable to evaluate model (on macOS)

DrRepair/evaluation/deepfix/out/code-compiler--2l-graph [master]
% for entry in ${test_split_root}/*
do
  probid=`basename $entry`
  python3 -u ../../test_deepfix.py \
  --input-code-dir ${program_data_root}/${probid}/erroneous \
  --repairer-server  http://0.0.0.0:8002/pred
done

Traceback (most recent call last):
  File "../../test_deepfix.py", line 331, in <module>
    main()
  File "../../test_deepfix.py", line 325, in main
    stitch()
  File "../../test_deepfix.py", line 290, in stitch
    stitch_helper(prog_fname)
  File "../../test_deepfix.py", line 231, in stitch_helper
    _code_str_tokenized = ' '.join(tokenize_code(_code, mod_brace=False))
DrRepair/utils/code_process.py", line 44, in tokenize_code
    clang.cindex.Config.set_library_path('/usr/local/Cellar/llvm/12.0.0/lib/')
  File "/Library/Python/3.8/lib/python/site-packages/clang/cindex.py", line 4107, in set_library_path
    raise Exception("library path must be set before before using " \
Exception: library path must be set before before using any other functionalities in libclang.

opened by rajvijay68 1
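
The exception above is raised by `Config.set_library_path` when libclang has already been loaded. One possible cause, stated here as an assumption rather than a confirmed diagnosis, is that the path is being set inside a function that runs more than once. A minimal sketch of a guard follows; `set_library_path_once` is a hypothetical helper, and it takes the `Config` class as an argument so that it can be tried without libclang installed.

```python
def set_library_path_once(config_cls, path):
    """Call config_cls.set_library_path(path) unless libclang is already loaded.

    config_cls is expected to be clang.cindex.Config; it is passed in so the
    sketch can be exercised with a stand-in class.
    Returns True if the path was set, False if the call was skipped.
    """
    if getattr(config_cls, "loaded", False):
        return False  # library already loaded; setting the path would raise
    config_cls.set_library_path(path)
    return True
```

With the real bindings this would be called as `set_library_path_once(clang.cindex.Config, '/usr/local/Cellar/llvm/12.0.0/lib/')` before the first use of `clang.cindex`.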

about test

Dear Dr. Yasunaga (@michiyasunaga), I am reproducing your paper (Graph-based, Self-Supervised Program Repair from Diagnostic Feedback). I followed the steps in your GitHub repo. When I evaluated the model (code-compiler--2l-graph), I found that the total number of passed and failed programs is 1,260, which differs from the number in the paper (6,971). Is this path (/data/err-data-compiler--auto-corrupt--orig-deepfix/bin4) the real path of the test set? How can I evaluate on the raw DeepFix test set?

I would be very grateful indeed for any help you could give me.

Best wishes

opened by yayahgh 0

MacOS - TypeError: type torch.cuda.FloatTensor not available. Torch not compiled with CUDA enabled.

Since MacBooks do not have NVIDIA GPUs, CUDA won't work on them.

Is there a workaround for this?

This occurs while running

name="code-compiler--2l-graph"
mkdir -p out_deepfix/${name}
python3 -u main_deepfix.py -o ${name} train \
    configs/base.yml  configs/data-deepfix/err-data-orig.yml \
    configs/model-code-compiler/2l-graph--dec-attn-all.yml

opened by rajvijay68 1

list index out of bounds error

The line tok=ex.src_vocab[idx - len(self.vocab)] in DrRepair/model/repairer/data/err_dataset.py (line 181) throws a list index out of bounds error. How should I go about resolving this issue? I'm using the preprocessed models.

opened by akash-isu 1
