Measuring Coding Challenge Competence With APPS

Overview

This is the repository for Measuring Coding Challenge Competence With APPS by Dan Hendrycks*, Steven Basart*, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt.

Download the APPS dataset here.

This repository contains evaluation code.

For other benchmarks of enormous Transformers, see a dataset which tests ability in competition MATH, a dataset which tests knowledge of ETHICS, and a dataset spanning 50+ academic subjects.

Citation

If you find this useful in your research, please consider citing

@article{hendrycksapps2021,
  title={Measuring Coding Challenge Competence With APPS},
  author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt},
  journal={arXiv preprint arXiv:2105.09938},
  year={2021}
}
Comments
  • evaluation on multiple solutions at once causes memory leak

    Hi @xksteven, I have a question: why do you advise running the evaluation code for one solution at a time instead of for all generations at once? I have added the metric to the Hugging Face Hub (https://huggingface.co/spaces/codeparrot/apps_metric) without changing the core script testing_util.py, with evaluation done for all solutions at once, and I sometimes get a memory leak whose source I can't identify, because when I evaluate the same solutions separately this doesn't happen.

    Below is the code that causes memory saturation:

    from evaluate import load
    
    generations = [["s = input()\nn = len(s)\nm = 0\n\nfor i in range(n):\n\tc = s[i]\n\tif c == '|':\n\t\tif m < 2:\n\t\t\tm = 2\n\t\telse:\n\t\t\tm += 1\n\telif c == '\\n':\n\t\tif m < 2:\n\t\t\tm = 2\n\t\telse:\n\t\t\tm += 1\n\nif m < 2:\n\tprint(-1)\nelse:\n\tprint(m * 2 - 1)\n"], ["\nx = int(input())\n\nl = list(range(x+1))\n\nm = next(l)\n\ns = sum(list([int(i) for i in str(m)]))\n\nif s > sum(list([int(i) for i in str(m)])) :\n\tm = next(l)\n\t\nprint(m)\n"]]
    
    metric = load("codeparrot/apps_metric")
    
    results = metric.compute(predictions=generations, level="all", debug=False)
    

    While this works fine:

    generation_1 = generations[:1]
    generation_2 = generations[1:2]
    results_1 = metric.compute(predictions=generation_1, level="all", debug=False)
    results_2 = metric.compute(predictions=generation_2, level="all", debug=False)
    print(results_1)
    print(results_2)
    
    {'avg_accuracy': 0.23185840707964603, 'strict_accuracy': 0.0, 'pass_at_k': None}
    {'avg_accuracy': 0.0, 'strict_accuracy': 0.0, 'pass_at_k': None}
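
    Until the leak is tracked down, a possible workaround is to loop over the problems and call the metric once per solution, aggregating the per-problem scores manually. A minimal sketch, assuming the codeparrot/apps_metric space and the generations variable defined above:

    from evaluate import load

    metric = load("codeparrot/apps_metric")

    per_problem = []
    for generation in generations:  # one problem's candidate solutions at a time
        res = metric.compute(predictions=[generation], level="all", debug=False)
        per_problem.append(res["avg_accuracy"])

    print(sum(per_problem) / len(per_problem))  # aggregate average accuracy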
    
    opened by loubnabnl 14
  • Computation of the accuracy scores when there are compilation and runtime errors

    Hi, thank you for this great dataset! I have some questions about how you compute the accuracy scores in https://github.com/hendrycks/apps/blob/c55cce35806c14423b41decf7241615261cf9de0/eval/test_one_solution.py#L22-L42. I was curious why you use -2 and -1 for compilation and runtime errors and include them in the average accuracy computation, which can lead to a negative score. It seems more natural to give a False label to code with a syntax or runtime error, just like code that simply doesn't pass the unit tests.

    Also, the expression all_correct.append(np.all(results[index])) treats -2 and -1 as True, since np.all evaluates non-zero numbers to True, which can report an incorrect accuracy.

    Below is an example:

    print_results({0: [[-2]], 1: [[-2]], 2: [[-2]], 3: [[-2]]}, args)
    
    number of compile errors = 1 avg = 0.25
    number of runtime errors = 1 avg = 0.25
    number of test cases run = 4
    Test Case Average (average accuracy over problems) = -2.0
    Strict Accuracy (all test cases passed / total problems) = 1.0
    

    Another thing regarding the expressions:

     compile_errors = len(tmp_results[tmp_results==-2])
     runtiome_errors = len(tmp_results[tmp_results==-1])
    

    If I'm not mistaken, this doesn't work (at least on Python 3.9); another implementation could be:

     compile_errors = len([e for e in tmp_results if -2 in e])
     runtiome_errors = len([e for e in tmp_results if -1 in e])
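
    To make the first point concrete, here is a small hedged sketch (illustrative only, assuming results[index] is a list mixing booleans with the -2/-1 error codes) showing how np.all counts error codes as passes, and one way to exclude them:

    import numpy as np

    outcomes = [-2]                           # a single compilation error
    print(np.all(outcomes))                   # True: np.all treats any non-zero value as truthy

    # Count a problem as solved only if every outcome is exactly True (i.e. > 0 after
    # casting, which excludes False as well as the -1/-2 error codes).
    print(np.all(np.asarray(outcomes) > 0))   # False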
    
    opened by loubnabnl 7
  • Nan test case average

    Hello, I am trying to evaluate my model's generated code using the scripts in eval. However, for a particular problem, results[index] turns out to be an empty array, so computing the mean in print_results() gives NaN. How should I handle this case?
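
    One hedged way to guard against this (illustrative only, not the repository's code) is to treat an empty per-problem result list as a failure before averaging:

    import numpy as np

    results = {0: []}                # placeholder: per-problem lists of test outcomes
    index = 0

    res = results.get(index, [])
    if len(res) == 0:                # e.g. generation failed before any test ran
        res = [-2]                   # count the problem as a compilation error
    problem_acc = np.mean(np.asarray(res) > 0)
    print(problem_acc)               # 0.0 instead of NaN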

    opened by sindhura97 5
  • Missing apps-train-files json file?

    Hi,

    Thank you for releasing this amazing codebase! I found that appsdata needs an apps-train-files JSON file as input, but I couldn't find one in the provided APPS dataset. Am I missing something?

    Thanks!

    opened by ywen666 5
  • Problems With APPS

    Hi. When using the dataset to evaluate the fine-tuned model, I found that some problems in the test set have no solution.json. Could you provide the complete set?

    opened by MT010104 4
  • Running instructions

    Reference: https://github.com/hendrycks/apps#how-to-use

    The files train/README and eval/README are not present in the repository. I would really appreciate it if instructions were added for training and evaluating the models.

    opened by PulkitMadan 4
  • Request for scripts of fine-tuning

    Hi, thanks for the amazing work! I really appreciate that you released the dataset, and now I want to apply it to other models downloaded from Hugging Face. Could you share the fine-tuning scripts?

    opened by MT010104 3
  • Categorization of Problem Difficulty

    Hi, thanks for providing the benchmark dataset. In the paper, problems are categorized into Introductory, Interview, and Competition levels. Could you provide the problem IDs corresponding to each difficulty level? It would help a lot.
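
    In the meantime, a hedged sketch of one way to recover the mapping, assuming the Hugging Face mirror codeparrot/apps exposes a per-problem difficulty field (an assumption about that mirror, not something stated in this repository):

    from datasets import load_dataset

    # Assumed fields of the codeparrot/apps mirror: problem_id and difficulty
    ds = load_dataset("codeparrot/apps", split="test")
    intro_ids = [ex["problem_id"] for ex in ds if ex["difficulty"] == "introductory"]
    print(len(intro_ids))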

    opened by eelxpeng 3
  • Fix memory leak and add reliability guard

    Hi, this fixes the memory leak issue mentioned in #13 and adds HumanEval's reliability guard to limit the harm that can be caused by running untrusted code.

    I also fixed the issue raised in #14 where a NaN score is returned. It happens because, when the exception at https://github.com/hendrycks/apps/blob/1b052764e10804ae79cf12c24801aaa818ea36ab/eval/testing_util.py#L229 is raised, the returned results list is empty, so the mean computed over it is NaN. We can treat this as a compilation error and fill the results list with -2: https://github.com/hendrycks/apps/blob/1b052764e10804ae79cf12c24801aaa818ea36ab/eval/test_one_solution.py#L39

    Please note that I've only run the tests with this new timeout setup on the Hugging Face APPS metric, and it didn't seem to affect performance. I currently have some software issues reading the standard data folder, so it would be great if you could test the changes further with this repo, especially since I changed what's returned when there is a global timeout. I think it should be [-1] * number_test_cases, as in https://huggingface.co/spaces/codeparrot/apps_metric/blob/main/utils.py#L29, but I used 21, the average number of test cases for the test set, since it's not straightforward to access the number of tests in this setting.

    opened by loubnabnl 2
  • DeepSpeed config and TrainingArguments mismatch

    Hi, I'm trying to run finetuning to replicate the results in the paper but am getting an error from a mismatch in hyperparameters between deepspeed_config.json and what's specified in tune_apps_gpt.py (e.g. an LR of 1e-4 in deepspeed_config.json, but 5e-5 in tune_apps_gpt.py).

    Could you give any guidance on which to use?

    The error I'm getting is:

    Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
    - ds train_batch_size=8 vs hf train_batch_size (calculated)=128
    - ds optimizer.params.lr=0.0001 vs hf learning_rate=5e-05
    - ds scheduler.params.warmup_max_lr=0.0001 vs hf learning_rate=5e-05
    - ds scheduler.params.warmup_num_steps=500 vs hf warmup_steps=0
    The easiest method is to set these DeepSpeed config values to 'auto'.  
    

    and the command is

    USE_TF=NO deepspeed tune_apps_gpt.py  \
      --save-dir=${save_dir}  \
      --arch=EleutherAI/gpt-neo-2.7B \
      --apps-train-files ../data/train \
      --apps-dataroot ../data/train/ \
      --grad-acc-steps=8 \
      --epochs=10 \
      --fp16 \
      --deepspeed deepspeed_config.json \
      --batch-size-per-replica=2 \
      | tee ${save_dir}/log.out
    

    Thanks!
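
    A minimal sketch of the workaround the error message itself suggests (deferring the mismatched values to "auto" so the Hugging Face integration fills them in from TrainingArguments); the key names are taken from the error output above:

    import json

    with open("deepspeed_config.json") as f:
        cfg = json.load(f)

    # Let TrainingArguments drive these values instead of hard-coding them.
    cfg["train_batch_size"] = "auto"
    cfg["optimizer"]["params"]["lr"] = "auto"
    cfg["scheduler"]["params"]["warmup_max_lr"] = "auto"
    cfg["scheduler"]["params"]["warmup_num_steps"] = "auto"

    with open("deepspeed_config.json", "w") as f:
        json.dump(cfg, f, indent=2)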

    opened by dpfried 2
  • Request for pretrained models

    Hey there! Congrats and thanks for the amazing work! The APPS dataset would benefit the community greatly. I really appreciate that you released the fine-tuned GPT2-1.5B model, but would it be possible to release the pretrained GPT2-1.5B model as well?

    Thank you in advance and happy new year!

    opened by changranelk 2
  • Steps About Generated Code Solutions Post-processing

    Hi, thanks for the amazing work! I want to ask about the detailed post-processing steps applied to generated code solutions when testing one solution. For example, after a code solution was generated, did you truncate it at stop tokens (e.g. "\nclass", "\ndef", "\n#")? Thanks for your reply!
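
    For clarity, a hedged sketch of the kind of truncation the question refers to (illustrative only; the paper's actual post-processing may differ):

    STOP_TOKENS = ["\nclass", "\ndef", "\n#"]

    def truncate_at_stop_tokens(completion: str) -> str:
        # Cut the generated text at the earliest stop token, if any appears.
        cut = len(completion)
        for tok in STOP_TOKENS:
            idx = completion.find(tok)
            if idx != -1:
                cut = min(cut, idx)
        return completion[:cut]

    print(truncate_at_stop_tokens("print(input())\n# trailing comment"))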

    opened by MrBlack0220 0
  • answer_type calculation is different for train/val and eval

    Not necessarily an issue, but I noticed that for train/val the answer_type is based on whether starter_code exists, while at eval time it's based on fn_name. Is there a reason for this difference?

    • train: https://github.com/hendrycks/apps/blob/main/train/dataset_apps/APPSBaseDataset.py#L67-L70
    • eval: https://github.com/hendrycks/apps/blob/main/eval/generate_gpt_codes.py#L67-L72
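
    A simplified, hedged sketch of the two conditions being contrasted (illustrative only, not verbatim repository code):

    def answer_type_train(starter_code: str) -> str:
        # train/val: decided by whether starter_code is non-empty
        return "call-based" if starter_code else "standard input"

    def answer_type_eval(input_output: dict) -> str:
        # eval: decided by whether the test spec declares an fn_name
        return "call-based" if input_output.get("fn_name") else "standard input"

    print(answer_type_train(""), answer_type_eval({"fn_name": "solve"}))
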
    opened by minimario 1
  • Unable to run pre-trained (1.5B) model on test set

    I'm trying to run the pre-trained 1.5B model linked in the README on the APPS test set. I downloaded the dataset and ran the script train/apps_create_split.py on it, then ran the model with

    python generate_gpt_codes.py -t ~/Code/APPS/test.json --load ~/Code/APPS/models/1.5B --save ~/Code/APPS/output/15B
    

    Note that I didn't do any training beforehand; the directory models/1.5B is as it was when I downloaded it. I assume this is fine since the README says the models are fine-tuned.

    When I look at the contents of all_codes.json, at first it looks okay, but pretty soon all I see are empty entries like this:

    ... "9": "", "10": "", "11": "", "12": "", "13": "", "14": "" ...
    

    I see several messages in the script output that seem like potential errors:

    The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    
    Input length of input_ids is 1052, but `max_length` is set to 1023. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
    
    ../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [207,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    
    A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
    

    Many of those errors are printed over and over, and then the end of the log is just this message thousands of times:

    Unexpected exception in generating solution
    Batch dimension of `input_ids` should be 5, but is 4.
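
    For what it's worth, the warnings above point at tokenizer settings; a hedged sketch of the adjustments they suggest for a decoder-only model (illustrative, not the repository's code):

    from transformers import AutoTokenizer

    # Left padding and an explicit pad token, as the generation warnings recommend.
    tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
    tokenizer.pad_token = tokenizer.eos_token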
    
    opened by geajack 1
  • Too Long Problems

    There are some long problems in APPS, so I truncated them after encoding. But the model's output has the form "problem + answer", so the output is necessarily longer than the input. The output's max_length is set to 1024 minus the input length. So if the output is to fit within that limit, the input must be much shorter than 1024 tokens; otherwise we won't get a complete answer even if no error is reported. Is that right? Also, why is the output's max_length set to "1024 - inputs"?
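
    To illustrate the arithmetic in question (a sketch under the issue's assumption of a 1024-token context window):

    CONTEXT_WINDOW = 1024
    input_len = 800                              # length of the encoded (possibly truncated) problem
    max_new_tokens = CONTEXT_WINDOW - input_len  # room left for the generated answer
    print(max_new_tokens)                        # 224: a long problem leaves little room for a complete answer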

    opened by MT010104 8
Owner
Dan Hendrycks
PhD student at UC Berkeley.