Measuring Coding Challenge Competence With APPS

Overview

This is the repository for Measuring Coding Challenge Competence With APPS by Dan Hendrycks*, Steven Basart*, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt.

Download the APPS dataset here.

This repository contains evaluation code.

For other benchmarks of enormous Transformers, see a dataset which tests ability in competition MATH, a dataset which tests knowledge of ETHICS, and a dataset spanning 50+ academic subjects.

Citation

If you find this useful in your research, please consider citing

@article{hendrycksapps2021,
  title={Measuring Coding Challenge Competence With APPS},
  author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt},
  journal={arXiv preprint arXiv:2105.09938},
  year={2021}
}
Comments
  • evaluation on multiple solutions at once causes memory leak

    Hi @xksteven, I have a question: why do you advise running the evaluation code for one solution at a time instead of for all generations at once? I have added the metric to the Hugging Face Hub (https://huggingface.co/spaces/codeparrot/apps_metric) without changing the core script testing_util.py, with evaluation done for all solutions at once, and I sometimes get a memory leak whose source I can't identify, because when I evaluate the same solutions separately this doesn't happen.

    Below is the code that causes memory saturation:

    from evaluate import load
    
    generations = [["s = input()\nn = len(s)\nm = 0\n\nfor i in range(n):\n\tc = s[i]\n\tif c == '|':\n\t\tif m < 2:\n\t\t\tm = 2\n\t\telse:\n\t\t\tm += 1\n\telif c == '\\n':\n\t\tif m < 2:\n\t\t\tm = 2\n\t\telse:\n\t\t\tm += 1\n\nif m < 2:\n\tprint(-1)\nelse:\n\tprint(m * 2 - 1)\n"], ["\nx = int(input())\n\nl = list(range(x+1))\n\nm = next(l)\n\ns = sum(list([int(i) for i in str(m)]))\n\nif s > sum(list([int(i) for i in str(m)])) :\n\tm = next(l)\n\t\nprint(m)\n"]]
    
    metric = load("codeparrot/apps_metric")
    
    results = metric.compute(predictions=generations, level="all", debug=False)
    

    While this works fine:

    generation_1 = generations[:1]
    generation_2 = generations[1:2]
    results_1 = metric.compute(predictions=generation_1, level="all", debug=False)
    results_2 = metric.compute(predictions=generation_2, level="all", debug=False)
    print(results_1)
    print(results_2)
    
    {'avg_accuracy': 0.23185840707964603, 'strict_accuracy': 0.0, 'pass_at_k': None}
    {'avg_accuracy': 0.0, 'strict_accuracy': 0.0, 'pass_at_k': None}
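
    Until the leak is tracked down, a possible workaround is to loop over the problems and call the metric once per solution, aggregating the per-problem scores manually. A minimal sketch, assuming the codeparrot/apps_metric space and the generations variable defined above:

    from evaluate import load

    metric = load("codeparrot/apps_metric")

    per_problem = []
    for generation in generations:  # one problem's candidate solutions at a time
        res = metric.compute(predictions=[generation], level="all", debug=False)
        per_problem.append(res["avg_accuracy"])

    print(sum(per_problem) / len(per_problem))  # aggregate average accuracy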
    
    opened by loubnabnl 14
  • Computation of the accuracy scores when there are compilation and runtime errors

    Hi, thank you for this great dataset! I have some questions about how you compute the accuracy scores in https://github.com/hendrycks/apps/blob/c55cce35806c14423b41decf7241615261cf9de0/eval/test_one_solution.py#L22-L42. I was curious why you use -2 and -1 for compilation and runtime errors and include them in the average accuracy computation, which can lead to a negative score. It seems more natural to give a False label to code with a syntax or runtime error, just like code that simply doesn't pass the unit tests.

    Also, the expression all_correct.append(np.all(results[index])) treats -2 and -1 as True, since np.all evaluates non-zero numbers to True, which can report an incorrect accuracy.

    Below is an example:

    print_results({0: [[-2]], 1: [[-2]], 2: [[-2]], 3: [[-2]]}, args)
    
    number of compile errors = 1 avg = 0.25
    number of runtime errors = 1 avg = 0.25
    number of test cases run = 4
    Test Case Average (average accuracy over problems) = -2.0
    Strict Accuracy (all test cases passed / total problems) = 1.0
    

    Another thing regarding the expressions:

     compile_errors = len(tmp_results[tmp_results==-2])
     runtiome_errors = len(tmp_results[tmp_results==-1])
    

    If I'm not mistaken, this doesn't work (at least on Python 3.9); another implementation could be:

     compile_errors = len([e for e in tmp_results if -2 in e])
     runtiome_errors = len([e for e in tmp_results if -1 in e])
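
    To make the first point concrete, here is a small hedged sketch (illustrative only, assuming results[index] is a list mixing booleans with the -2/-1 error codes) showing how np.all counts error codes as passes, and one way to exclude them:

    import numpy as np

    outcomes = [-2]                           # a single compilation error
    print(np.all(outcomes))                   # True: np.all treats any non-zero value as truthy

    # Count a problem as solved only if every outcome is exactly True (i.e. > 0 after
    # casting, which excludes False as well as the -1/-2 error codes).
    print(np.all(np.asarray(outcomes) > 0))   # False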
    
    opened by loubnabnl 7
  • Nan test case average

    Hello, I am trying to evaluate my model's generated code using the scripts in eval. However, for a particular problem, results[index] turns out to be an empty array, so computing the mean in print_results() gives NaN. How should I handle this case?
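
    One hedged way to guard against this (illustrative only, not the repository's code) is to treat an empty per-problem result list as a failure before averaging:

    import numpy as np

    results = {0: []}                # placeholder: per-problem lists of test outcomes
    index = 0

    res = results.get(index, [])
    if len(res) == 0:                # e.g. generation failed before any test ran
        res = [-2]                   # count the problem as a compilation error
    problem_acc = np.mean(np.asarray(res) > 0)
    print(problem_acc)               # 0.0 instead of NaN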

    opened by sindhura97 5
  • Missing apps-train-files json file?

    Hi,

    Thank you for releasing this amazing codebase! I found that appsdata needs an apps-train-files JSON file as input, but I couldn't find one in the provided APPS dataset. Am I missing something?

    Thanks!

    opened by ywen666 5
  • Problems With APPS

    Hi. When using the dataset to evaluate the fine-tuned model, I found that some problems in the test set have no solution.json. Could you provide the complete set?

    opened by MT010104 4
  • Running instructions

    Reference: https://github.com/hendrycks/apps#how-to-use

    The files train/README and eval/README are not present in the repository. I would really appreciate it if instructions were added for training and evaluating the models.

    opened by PulkitMadan 4
  • Request for scripts of fine-tuning

    Hi, thanks for the amazing work! I really appreciate that you released the dataset, and now I want to apply it to other models downloaded from Hugging Face. Could you share the fine-tuning scripts?

    opened by MT010104 3
  • Categorization of Problem Difficulty

    Hi, thanks for providing the benchmark dataset. In the paper, problems are categorized into Introductory, Interview, and Competition levels. Could you provide the problem IDs corresponding to each difficulty level? It would help a lot.
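
    In the meantime, a hedged sketch of one way to recover the mapping, assuming the Hugging Face mirror codeparrot/apps exposes a per-problem difficulty field (an assumption about that mirror, not something stated in this repository):

    from datasets import load_dataset

    # Assumed fields of the codeparrot/apps mirror: problem_id and difficulty
    ds = load_dataset("codeparrot/apps", split="test")
    intro_ids = [ex["problem_id"] for ex in ds if ex["difficulty"] == "introductory"]
    print(len(intro_ids))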

    opened by eelxpeng 3
  • Fix memory leak and add reliability guard

    Hi, this fixes the memory leak issue mentioned in #13 and adds HumanEval's reliability guard to limit the harm that can be caused by running untrusted code.

    I also fixed the issue raised in #14 where a NaN score is returned. It happens because, when the exception at https://github.com/hendrycks/apps/blob/1b052764e10804ae79cf12c24801aaa818ea36ab/eval/testing_util.py#L229 is raised, the returned results list is empty, so the mean computed over it is NaN. We can treat this as a compilation error and fill the results list with -2: https://github.com/hendrycks/apps/blob/1b052764e10804ae79cf12c24801aaa818ea36ab/eval/test_one_solution.py#L39

    Please note that I've only run the tests with this new timeout setup on the Hugging Face APPS metric, and it didn't seem to affect performance. I currently have some software issues reading the standard data folder, so it would be great if you could test the changes further with this repo, especially since I changed what's returned when there is a global timeout. I think it should be [-1] * number_test_cases, as in https://huggingface.co/spaces/codeparrot/apps_metric/blob/main/utils.py#L29, but I used 21, the average number of test cases for the test set, since it's not straightforward to access the number of tests in this setting.

    opened by loubnabnl 2
  • DeepSpeed config and TrainingArguments mismatch

    Hi, I'm trying to run finetuning to replicate the results in the paper but am getting an error from a mismatch in hyperparameters between deepspeed_config.json and what's specified in tune_apps_gpt.py (e.g. an LR of 1e-4 in deepspeed_config.json, but 5e-5 in tune_apps_gpt.py).

    Could you give any guidance on which to use?

    The error I'm getting is:

    Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
    - ds train_batch_size=8 vs hf train_batch_size (calculated)=128
    - ds optimizer.params.lr=0.0001 vs hf learning_rate=5e-05
    - ds scheduler.params.warmup_max_lr=0.0001 vs hf learning_rate=5e-05
    - ds scheduler.params.warmup_num_steps=500 vs hf warmup_steps=0
    The easiest method is to set these DeepSpeed config values to 'auto'.  
    

    and the command is

    USE_TF=NO deepspeed tune_apps_gpt.py  \
      --save-dir=${save_dir}  \
      --arch=EleutherAI/gpt-neo-2.7B \
      --apps-train-files ../data/train \
      --apps-dataroot ../data/train/ \
      --grad-acc-steps=8 \
      --epochs=10 \
      --fp16 \
      --deepspeed deepspeed_config.json \
      --batch-size-per-replica=2 \
      | tee ${save_dir}/log.out
    

    Thanks!
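
    A minimal sketch of the workaround the error message itself suggests (deferring the mismatched values to "auto" so the Hugging Face integration fills them in from TrainingArguments); the key names are taken from the error output above:

    import json

    with open("deepspeed_config.json") as f:
        cfg = json.load(f)

    # Let TrainingArguments drive these values instead of hard-coding them.
    cfg["train_batch_size"] = "auto"
    cfg["optimizer"]["params"]["lr"] = "auto"
    cfg["scheduler"]["params"]["warmup_max_lr"] = "auto"
    cfg["scheduler"]["params"]["warmup_num_steps"] = "auto"

    with open("deepspeed_config.json", "w") as f:
        json.dump(cfg, f, indent=2)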

    opened by dpfried 2
  • Request for pretrained models

    Hey there! Congrats and thanks for the amazing work! The APPS dataset would benefit the community greatly. I really appreciate that you released the fine-tuned GPT2-1.5B model, but would it be possible to release the pretrained GPT2-1.5B model as well?

    Thank you in advance and happy new year!

    opened by changranelk 2
  • Steps About Generated Code Solutions Post-processing

    Hi, thanks for the amazing work! I want to ask about the detailed post-processing steps applied to generated code solutions when testing one solution. For example, after a code solution was generated, did you truncate it at stop tokens (e.g. "\nclass", "\ndef", "\n#")? Thanks for your reply!
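
    For clarity, a hedged sketch of the kind of truncation the question refers to (illustrative only; the paper's actual post-processing may differ):

    STOP_TOKENS = ["\nclass", "\ndef", "\n#"]

    def truncate_at_stop_tokens(completion: str) -> str:
        # Cut the generated text at the earliest stop token, if any appears.
        cut = len(completion)
        for tok in STOP_TOKENS:
            idx = completion.find(tok)
            if idx != -1:
                cut = min(cut, idx)
        return completion[:cut]

    print(truncate_at_stop_tokens("print(input())\n# trailing comment"))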

    opened by MrBlack0220 0
  • answer_type calculation is different for train/val and eval

    Not necessarily an issue, but I noticed that for train/val the answer_type is based on whether starter_code exists, while at eval time it's based on fn_name. Is there a reason for this difference?

    • train: https://github.com/hendrycks/apps/blob/main/train/dataset_apps/APPSBaseDataset.py#L67-L70
    • eval: https://github.com/hendrycks/apps/blob/main/eval/generate_gpt_codes.py#L67-L72
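
    A simplified, hedged sketch of the two conditions being contrasted (illustrative only, not verbatim repository code):

    def answer_type_train(starter_code: str) -> str:
        # train/val: decided by whether starter_code is non-empty
        return "call-based" if starter_code else "standard input"

    def answer_type_eval(input_output: dict) -> str:
        # eval: decided by whether the test spec declares an fn_name
        return "call-based" if input_output.get("fn_name") else "standard input"

    print(answer_type_train(""), answer_type_eval({"fn_name": "solve"}))
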
    opened by minimario 1
  • Unable to run pre-trained (1.5B) model on test set

    I'm trying to run the pre-trained 1.5B model linked in the README on the APPS test set. I downloaded the dataset and ran the script train/apps_create_split.py on it, then ran the model with

    python generate_gpt_codes.py -t ~/Code/APPS/test.json --load ~/Code/APPS/models/1.5B --save ~/Code/APPS/output/15B
    

    Note that I didn't do any training beforehand; the directory models/1.5B is as it was when I downloaded it. I assume this is fine since the README says the models are fine-tuned.

    When I look at the contents of all_codes.json, at first it looks okay, but pretty soon all I see are empty entries like this:

    ... "9": "", "10": "", "11": "", "12": "", "13": "", "14": "" ...
    

    I see several messages in the script output that seem like potential errors:

    The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    
    Input length of input_ids is 1052, but `max_length` is set to 1023. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
    
    ../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [207,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    
    A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
    

    Many of those errors are printed over and over, and then the end of the log is just this message thousands of times:

    Unexpected exception in generating solution
    Batch dimension of `input_ids` should be 5, but is 4.
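
    For what it's worth, the warnings above point at tokenizer settings; a hedged sketch of the adjustments they suggest for a decoder-only model (illustrative, not the repository's code):

    from transformers import AutoTokenizer

    # Left padding and an explicit pad token, as the generation warnings recommend.
    tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
    tokenizer.pad_token = tokenizer.eos_token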
    
    opened by geajack 1
  • Too Long Problems

    There are some long problems in APPS, so I truncated them after encoding. But the model's output has the form "problem + answer", so the output is necessarily longer than the input. The output's max_length is set to 1024 minus the input length. So if the output is to fit within that limit, the input must be much shorter than 1024 tokens; otherwise we won't get a complete answer even if no error is reported. Is that right? Also, why is the output's max_length set to "1024 - inputs"?
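
    To illustrate the arithmetic in question (a sketch under the issue's assumption of a 1024-token context window):

    CONTEXT_WINDOW = 1024
    input_len = 800                              # length of the encoded (possibly truncated) problem
    max_new_tokens = CONTEXT_WINDOW - input_len  # room left for the generated answer
    print(max_new_tokens)                        # 224: a long problem leaves little room for a complete answer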

    opened by MT010104 8
Owner
Dan Hendrycks
PhD student at UC Berkeley.