GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot.

Overview

GPT-Code-Clippy (GPT-CC)

Please refer to our new GitHub Wiki, which documents in detail our efforts to create an open source version of GitHub Copilot.



Courtesy of the awesome Aimee Trevett!

Introduction

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called Codex -- that is fine-tuned on publicly available code from GitHub.

Datasets

The dataset used to train GPT-CC is obtained from SEART GitHub Search using the following criteria:

  • >10 GitHub stars
  • >2 commits
  • Must have a licence
  • Exclude forks
  • Size < 70708 bytes

These repositories are then combined with all of the GitHub repositories contained in The Pile.

The repositories are then filtered to remove duplicate files. Filtering is performed by applying a regex to each file in each repository to obtain its list of "variables" (the tokens that contain only alphanumeric characters) and then filtering out any files that contain the same sequence of "variables". The deduplication script is available here.
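
As a rough illustration of the idea (a hypothetical sketch, not the project's actual deduplication script linked above), files whose extracted token sequences match can be treated as duplicates:

```python
import re

def variable_fingerprint(source):
    """Hash the sequence of alphanumeric token runs ("variables") in a file.

    Note: this sketch extracts alphanumeric runs; the real script's regex for
    "tokens which only contain alphanumeric characters" may differ.
    """
    variables = re.findall(r"[A-Za-z0-9]+", source)
    return hash(tuple(variables))

def deduplicate(files):
    """Keep the first file seen for each fingerprint; drop the rest.

    `files` is an iterable of (path, source) pairs.
    """
    seen, kept = set(), []
    for path, source in files:
        fp = variable_fingerprint(source)
        if fp not in seen:
            seen.add(fp)
            kept.append((path, source))
    return kept
```

Because only the token sequence is hashed, files that differ solely in whitespace or formatting collapse to the same fingerprint.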

The final dataset is available here. The dataset without the duplicates filtered out is also available here.

The datasheet discussing in more detail the construction, usage, and limitations of the dataset can be found here. We hope to get it officially into Hugging Face's datasets library soon!

Models

The GPT-CC models are fine-tuned versions of GPT-2 and GPT-Neo.

The available models can be found here.

The following models perform relatively well (none improve on the standard GPT-Neo 125M model except for the APPS-specific models, and then only on the APPS task):

TODO: which is the recommended model?

Training

Training is done using the training scripts available here.

For fine-tuning GPT-Neo-125M on the CodeClippy dataset, we used the AdamW optimizer (beta1=0.9, beta2=0.95) with a GPT-3-like learning rate schedule (4k warmup steps from 0 to 5e-5, followed by 50k cosine decay steps down to 5e-6), weight decay 0.1, batch size 1024, and sequence length 2048. The relatively large batch size and low learning rate with a long warmup were chosen to avoid aggressive updates and to preserve the knowledge contained in the pretrained GPT-Neo weights.
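
The schedule described above can be sketched as a plain Python function (a hypothetical re-implementation for illustration; the actual training scripts are linked above):

```python
import math

def gpt3_like_lr(step, warmup_steps=4_000, decay_steps=50_000,
                 peak_lr=5e-5, final_lr=5e-6):
    """Linear warmup from 0 to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

After `warmup_steps + decay_steps` the rate stays flat at `final_lr`.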

For fine-tuning GPT-Neo-125M on the APPS dataset, we used the AdamW optimizer (beta1=0.9, beta2=0.98) with a linear learning rate schedule (800 warmup steps from 0 to the peak LR, followed by linear decay to 0; peak LR values ranged over [1e-5; 1e-4]), weight decay 0.1, batch size 256, and sequence length 1024. We trained the model for 5 epochs, selecting the best checkpoint by validation loss. The language modelling objective for the APPS dataset is modified to backpropagate loss only for the tokens corresponding to the code solution (refer to Hendrycks et al. for more details).
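
The masked objective amounts to averaging the per-token loss over solution tokens only. A minimal sketch, assuming we already have per-token losses and a hypothetical 0/1 mask marking solution tokens (not the project's code):

```python
def masked_lm_loss(token_losses, solution_mask):
    """Mean per-token LM loss over solution tokens only.

    token_losses: per-token losses for the full question+solution sequence.
    solution_mask: 1 for tokens in the code solution, 0 for prompt tokens.
    """
    total = sum(loss * m for loss, m in zip(token_losses, solution_mask))
    count = sum(solution_mask)
    return total / count
```

Prompt tokens still condition the model's predictions; they simply contribute no gradient.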

For fine-tuning GPT-Neo-1.3B on the APPS dataset, we used the Adafactor optimizer with a linear learning rate schedule (5k warmup steps from 0 to 2e-5, followed by linear decay to 0), weight decay 0.1, batch size 24, and sequence length 1024. The choice of hyperparameters for the 1.3B model was in part determined by hardware limitations. We trained the model for 5 epochs, selecting the best checkpoint by validation loss.

TODO: which is the recommended way to train GPT-CC?

Evaluation

The models are also evaluated on the APPS and HumanEval datasets.

Human Eval Results

| Model                           | pass@1 | pass@2 | pass@5 | pass@10 |
|---------------------------------|--------|--------|--------|---------|
| EleutherAI/gpt-neo              | 0.12%  | 0.24%  | 0.61%  | 1.22%   |
| gpt-neo-125M-apps               | 0.06%  | 0.12%  | 0.30%  | 0.61%   |
| dedup-filtered-no-resize-2048bs | 0.00%  | 0.00%  | 0.00%  | 0.00%   |
| 1024-filtered                   | 0.00%  | 0.00%  | 0.00%  | 0.00%   |
| dedup-2048                      | 0.00%  | 0.00%  | 0.00%  | 0.00%   |
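
pass@k here presumably refers to the unbiased estimator introduced with HumanEval in the Codex paper (Chen et al., 2021): given n generated samples per problem of which c pass the unit tests, the probability that at least one of k samples passes is

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), given c of n samples pass."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

The per-problem estimates are then averaged over the benchmark's problems.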

APPS Eval Results

Coming soon...

Demo

A Visual Studio Code extension that uses the HuggingFace Inference API is available and can be found here.

We also have a Hugging Face Spaces demo where you can specify a problem in the format of a programming-competition question.

TODO: more information about this when complete.

Further Reading

For more information about GPT-CC, GitHub Copilot, etc., see:

TODO: add more further reading.

Acknowledgements

Special thanks to our contributors!!
