Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.

Facebook Research

Last update: Jan 1, 2023


Overview

This repository is a toolkit to do machine learning for programming languages. It implements tokenization, dataset preprocessing, model training and model evaluation.

We provide reference implementations of the following papers:

We also provide pre-trained models for language modeling, translation and deobfuscation.

Dependencies

Run install_env.sh. We use the black code formatter.

Data

Source code processors

This repository contains programming language processors for C++, Java and Python. These processors include:

tokenization and detokenization
obfuscation
function extraction

These processors are based on TreeSitter parsers. As these parsers are available in more than 30 programming languages, one can easily create a new programming language processor.

Example of code tokenization:

from codegen_sources.preprocessing.lang_processors.java_processor import JavaProcessor

java_code = r"""class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!"); 
    }
}"""
java_processor = JavaProcessor(root_folder="<YOUR_TREESITER_FOLDER>")
tokenized_java_code = java_processor.tokenize_code(java_code)
print(tokenized_java_code)

BPE

This repository provides wrappers for fast BPE and Roberta BPE at file level.

Dataset Preprocessing

This repository contains a pipeline to create programming language datasets. It currently supports four dataset modes:

Monolingual (ex: Java source code)
Monolingual Functions (ex: Java functions)
Monolingual Obfuscated (ex: Obfuscated Java source code. [Details here])
Monolingual Obfuscated Functions (ex: Obfuscated Java functions)

First, download C++ / Java / Python source code from Google BigQuery. To run our preprocessing pipeline, you need to download the raw source code to your machine in JSON format. A sample is given here.

The pipeline does the following:

Source code extraction from json (.json.gz) and tokenization (.tok)
Train BPE codes and vocab
Apply BPE (.bpe)
Binarization (.pth)
Symlink folder with appropriate file names for .pth (XLM-syml). To be given as data_path argument for training.

To run the pipeline:

# <DATA_PATH>: folder containing the json.gz files
# --langs: languages to process
# --mode: dataset mode
# --bpe_mode: BPE mode; fast_bpe by default, can be roberta_bpe
# --local: run on your local machine if True; if False, run on a cluster (requires submitit setup)
# --train_splits: number of training splits
python -m codegen_sources.preprocessing.preprocess \
<DATA_PATH> \
--langs java cpp python \
--mode monolingual_functions \
--bpe_mode=fast_bpe \
--local=True \
--train_splits=1

If you give several languages, the BPE codes and vocabulary are learned jointly on these languages, so that you have a common vocabulary to train one model for several languages. If you do not want that, launch the pipeline on each language separately. These tests run the pipeline in different modes and give an overview of the possible options.

Also, we provide the BPE codes and vocabulary here. These are the codes and vocabulary used for TransCoder and DOBF. They were learned on concatenated C++, Java, and Python data. If you want to use them instead of learning new ones, give the corresponding paths as fastbpe_code_path and fastbpe_vocab_path arguments.

In TransCoder and DOBF readmes, we provide the commands to preprocess the respective datasets.

Model

Overview

In this repository, we provide code to train transformer-based models (code based on XLM repository). The available training tasks are the following:

Masked Language Model (MLM)
Causal Language Model (CLM)
Supervised Machine translation (MT)
Classification
Deobfuscation = DOBF
Unsupervised Machine translation = TransCoder (Denoising auto encoding AE + Back Translation BT)

We evaluate our models with metrics adapted to each task (e.g. computation accuracy and BLEU score for TransCoder, subtoken score for Deobfuscation).

We also provide wrappers to fine-tune and evaluate our models on the CodeXGLUE benchmark.

Download models

You can download the following models:

MLM
TransCoder. Use it to translate some code here.
DOBF. Use it to deobfuscate some code here.

Retrain specific models

For details on how to retrain specific models, please refer to the README specific to each model.

References

TransCoder model (NeurIPS 2020)

[1] B. Roziere*, M.A. Lachaux*, L. Chanussot, G. Lample Unsupervised Translation of Programming Languages.

@article{roziere2020unsupervised,
  title={Unsupervised translation of programming languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Chanussot, Lowik and Lample, Guillaume},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

DOBF

[2] B. Roziere*, M.A. Lachaux*, M. Szafraniec , G. Lample DOBF: A Deobfuscation Pre-Training Objective for Programming Languages.

@article{roziere2021dobf,
  title={DOBF: A Deobfuscation Pre-Training Objective for Programming Languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Szafraniec, Marc and Lample, Guillaume},
  journal={arXiv preprint arXiv:2102.07492},
  year={2021}
}

* Equal Contribution

License

CodeGen is released under the Creative Commons Attribution-NonCommercial 4.0 International license. See LICENSE for more details.

Comments

Fine-tuning TransCoder

Hi,

We recently proposed a small-scale program translation dataset, called AVATAR. We want to fine-tune TransCoder on the translation task but we didn't find any documentation on that. Can you provide some guidelines on fine-tuning TransCoder?

opened by wasiahmad 27
Parallel datasets
Hi, I am trying to create a POC using CodeGen to translate code written in vb to Java and vice-versa. I downloaded the training data for vb and java using Google BigQuery. Also, I have completed the preprocessing step using commands:

python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual_functions --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10

python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10

As a result, the following files were created inside the folder XLM-syml:

test.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa].pth

train.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa [0-9]].pth

valid.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa].pth

Post that, I trained the MLM model using the following command: python codegen_sources/model/train.py --exp_name mlm_vb_java_fast_mono_updated_v0 --dump_path '/content/Facebook_CodeGen/dumpPath_fast_mono_updated' --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' --mlm_steps 'vb_sa,java_sa' --add_eof_to_stream true --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15' --encoder_only true --n_layers 6 --emb_dim 1024 --n_heads 8 --lgs 'vb_sa-java_sa' --max_vocab 64000 --gelu_activation false --roberta_mode false --amp 2 --fp16 true --batch_size 16 --bptt 512 --epoch_size 200 --max_epoch 100000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --save_periodic 0 --validation_metrics _valid_mlm_ppl --stopping_criterion '_valid_mlm_ppl,10'

However, when I am trying to train transcoder model using following command, I am getting AssertionError: /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml/valid.java_sa-vb_sa.java_sa.0.pth error. Command: python codegen_sources/model/train.py --exp_name transcoder_vb_java_updated_v1 --dump_path '/content/drive/MyDrive/dumpPath_updated_transcoder_v0' --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' --split_data_accross_gpu local --bt_steps 'vb_sa-java_sa-vb_sa,java_sa-vb_sa-java_sa' --ae_steps 'vb_sa,java_sa' --lambda_ae '0:1,30000:0.1,100000:0' --word_shuffle 3 --word_dropout '0.1' --word_blank '0.3' --encoder_only False --n_layers 0 --n_layers_encoder 6 --n_layers_decoder 6 --emb_dim 1024 --n_heads 8 --lgs 'java_sa-vb_sa' --max_vocab 64000 --gelu_activation false --roberta_mode false --reload_model '/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth,/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth' --reload_encoder_for_decoder true --amp 2 --fp16 true --tokens_per_batch 3000 --group_by_size true --max_batch_size 128 --epoch_size 100 --max_epoch 10000000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --eval_bleu true --eval_computation true --has_sentences_ids true --generate_hypothesis true --save_periodic 1 --validation_metrics 'valid_vb_-java_mt_comp_acc' --lgs_mapping 'vb_sa:vb,java_sa:java'

Could you please help me as to how do I get these parallel datasets? Also, is there something/some step that I am missing or doing incorrectly?
question
opened by prnk04 23
Preprocess step is completing but pth files are not generating as expected in XLM folder

Hello, After running preprocess steps from below command:

python -m codegen_sources.preprocessing.preprocess /path/data/mydata2 --langs java cpp --mode monolingual_functions --bpe_mode=fast --local=True --train_splits=1 --fastbpe_code_path=/path/data/bpe/cpp-java-python/ --fastbpe_vocab_path=/path/data/bpe/cpp-java-python/

XLM-syml folder is getting generated with file name like: test.cpp_cl.pth test.cpp_sa.pth test.java_cl.pth .... train.cpp_cl.0.pth train.cpp_sa.0.pth ...

But using this folder and these files in the training step (MLM) gives errors like:

XLM-syml/train.java.pth not found
XLM-syml/valid.java.pth not found
XLM-syml/test.java.pth not found
XLM-syml/train.cpp.pth not found
XLM-syml/valid.cpp.pth not found
XLM-syml/test.cpp.pth not found

Train command: python train.py --exp_name mlm --dump_path '/path/CodeGen/data/models' --data_path '/path/data/mydata2/XLM-syml' --split_data_accross_gpu local --mlm_steps 'java,cpp' --add_eof_to_stream true --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15' --encoder_only true --n_layers 12 --emb_dim 768 --n_heads 12 --lgs 'java-cpp' --max_vocab 64000 --gelu_activation true --roberta_mode false --amp 2 --fp16 true --batch_size 8 --bptt 512 --epoch_size 1000 --max_epoch 2000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --save_periodic 0 --validation_metrics _valid_mlm_ppl --stopping_criterion '_valid_mlm_ppl,10'

I think the files are generated with suffixes such as _cl.pth or _sa.pth, which the training step does not look for. Or am I doing something wrong?

Thanks
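One way to investigate the mismatch is to list which language keys the .pth file names in the data folder actually contain, and compare them with the names passed to --mlm_steps and --lgs. In the other report on this page, MLM training was launched with --mlm_steps 'vb_sa,java_sa' and --lgs 'vb_sa-java_sa', which suggests (without confirming) that keys such as java_sa are expected rather than java. The sketch below is a hypothetical helper, not part of the repository; the file name pattern is inferred only from the names quoted in these reports.

```python
import re
import tempfile
from pathlib import Path

# Matches names such as train.java_sa.0.pth, valid.cpp_cl.pth, test.java_sa.pth.
# Pattern inferred from the file names quoted in the reports (an assumption).
PTH_NAME = re.compile(r"^(train|valid|test)\.([A-Za-z0-9_]+)(?:\.\d+)?\.pth$")


def available_language_keys(data_path):
    """Return the sorted language keys found in the .pth file names of a folder."""
    keys = set()
    for entry in Path(data_path).iterdir():
        match = PTH_NAME.match(entry.name)
        if match:
            keys.add(match.group(2))
    return sorted(keys)


# Recreate some of the reported file names in a temporary folder and list the keys.
with tempfile.TemporaryDirectory() as folder:
    for name in ["test.cpp_cl.pth", "test.cpp_sa.pth", "test.java_cl.pth",
                 "train.cpp_cl.0.pth", "train.cpp_sa.0.pth"]:
        (Path(folder) / name).touch()
    print(available_language_keys(folder))
```

If the printed keys are cpp_sa and java_sa while the training command asks for cpp and java, the "not found" messages above are what one would expect to see.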

opened by Prathameshwar 12
Memory Usage Preprocessing

When running preprocessing on all the data, it seems that the job consumes almost all the memory on the system and the swap memory. It makes the whole job very slow (after 1-2 days only 10-100 files processed). I am wondering if there is any way to put a limit on the number of files that are loaded into the memory concurrently.

I see that there is an option job_mem when the job is running on clusters, but not when the job is running locally.
enhancement

opened by yazdanbakhsh 10

Small Training Dataset

Since the tokenization on all the dataset takes a lot of time, I have decided to create a small dataset with only 10-20 of the json.gz files. Once training starts, it gives the following error. Is it because the tokenization/BPE have not seen this character?

  File "/CodeGen/codegen_sources/model/train.py", line 701, in <module>
    main(params)
  File "/CodeGen/codegen_sources/model/train.py", line 609, in main
    trainer.mlm_step(
  File "/CodeGen/codegen_sources/model/src/trainer.py", line 1005, in mlm_step
    show_batch(
  File "/CodeGen/codegen_sources/model/src/utils.py", line 74, in show_batch
    f"{label} sent: {restore_segmentation_sentence(source_sentence, roberta_mode)}"
  File "/CodeGen/codegen_sources/model/src/utils.py", line 563, in restore_segmentation_sentence
    return restore_roberta_segmentation_sentence(sentence)
  File "/CodeGen/codegen_sources/model/src/utils.py", line 601, in restore_roberta_segmentation_sentence
    res = bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors="replace")
  File "/CodeGen/codegen_sources/model/src/utils.py", line 601, in <listcomp>
    res = bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors="replace")
KeyError: '郞'
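The last frame of the traceback indexes byte_decoder with every character of the text, so any character missing from that table raises KeyError. The sketch below assumes byte_decoder is the inverse of a GPT-2 style byte-to-unicode table (an assumption, not checked against the repository); encode_text, decode_text and unknown_characters are hypothetical helpers. Under that assumption, byte-level encoded text only contains the 256 table characters, so a raw '郞' would suggest that the text being displayed was not byte-level encoded.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping each of the 256 byte values to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Bytes without a printable form are shifted to code points above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {char: byte for byte, char in byte_encoder.items()}


def encode_text(text):
    """Map the UTF-8 bytes of text to table characters."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))


def decode_text(text):
    """Inverse of encode_text; raises KeyError on characters outside the table."""
    return bytearray(byte_decoder[c] for c in text).decode("utf-8", errors="replace")


def unknown_characters(text):
    """Characters that would trigger the KeyError in decode_text."""
    return sorted({c for c in text if c not in byte_decoder})


# A byte-level encoded string round-trips, even with non-ASCII input.
print(decode_text(encode_text("héllo 郞")))
# A raw, never-encoded character is not in the table.
print(unknown_characters("abc郞"))
```

Running unknown_characters over a few lines of the training data could help tell whether the data and the roberta_mode setting agree.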

opened by yazdanbakhsh 6

Clarification questions
Hi, I have a few questions regarding TransCoder's training data and optimization setting.

From the paper, it is clear that TransCoder is trained using Standalone functions during the DAE+BT training stage. But is TransCoder only trained using Standalone functions in the MLM stage too?

During the MLM stage, only the encoder part of TransCoder is pre-trained, right?

For the MLM pre-training, max_epoch and epoch_size are set to 100k. If I understand correctly, epoch_size basically refers to the number of instances used in each epoch. Is it correct? Also, for MLM pre-training, the following are set:

--validation_metrics _valid_mlm_ppl \
--stopping_criterion '_valid_mlm_ppl,10'

So, I am assuming TransCoder pre-training is stopped based on the stopping_criterion. Before the MLM pre-training was stopped, how many optimization steps were executed?

Unlike the MLM pre-training stage, no stopping_criterion is set for the DAE+BT training stage. The epoch_size was set to 50000 and the max_epoch was set to 10000000. So when does the training stop? How many optimization steps were executed during this stage?
opened by wasiahmad 5
Question regarding Backtranslation
Hi,

I have a basic question to understand why the backtranslation works in this scenario. Typically in NLP, we collect some parallel data to train Transformer-like models and then use backtranslation (BT) on a large collection of monolingual data.

In contrast, TransCoder first goes through a pre-training stage and is then trained via BT. Since TransCoder does not have any idea about cross-language generation at the beginning of BT, it would presumably generate the sequence in the same language (Java output from Java input, instead of Python output). So, feeding the generated sequence to translate back to the original sequence is not going to help the model learn translation. How, then, does backtranslation provide the learning bias to perform translation?

Recently, I tried to apply BT to our model, called PLBART to teach it to perform translation. However, at the very beginning of BT training, when I checked what PLBART generates for a given Java input, I saw it generates exactly the input sequence although the generation is done based on a prefix token for the target language python. For example,

# input
static public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; } [java]
# output
[python] public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; }

As you can see above, exactly the same sequence is generated. PLBART is pre-trained via Denoising Autoencoding (DAE), thus it doesn't have any clue about cross-language generation. I am curious, how does TransCoder learn from BT?

If I am not wrong, TransCoder uses language embedding with each input token (REF). Do you think that can make a difference? Also, can you shed light on the TransCoder structure? It seems like TransCoder does not have a typical sequence-to-sequence architecture.
opened by wasiahmad 5
Assertion `srcIndex < srcSelectDimSize` failed

I managed to get beyond this issue. Turned out my input file had Windows line endings (can't believe I'm even admitting to that, but whatever, it happens) Also, my source files are C, not CPP, not sure if that was causing CUBLAS_STATUS_NOT_INITIALIZED but I intend to create a new processor for C.

Originally posted by @raffian in https://github.com/facebookresearch/CodeGen/issues/55#issuecomment-994103795

Did it work after you changed the line endings? I have the same error, but it still doesn't work after I change them.

opened by 11asdad 4
Limited local parallelism across file lines

This change allows users to limit the amount of parallelism across file lines. Before, the maximum parallelism (CPU count) was used for tokenization.
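The idea of capping parallelism can be sketched as follows. This is not the code of the change itself: tokenize_line and tokenize_lines are hypothetical stand-ins, and a thread pool is used only so the sketch runs anywhere. A CPU-bound tokenizer would more typically use a process pool, which accepts the same max_workers argument.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def tokenize_line(line):
    """Stand-in for a real tokenizer: split on whitespace."""
    return line.split()


def tokenize_lines(lines, max_workers=None):
    """Tokenize lines with at most max_workers concurrent workers.

    With max_workers=None the CPU count is used, which mirrors the
    previous behaviour described above.
    """
    if max_workers is None:
        max_workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves the input order of the lines.
        return list(pool.map(tokenize_line, lines))


print(tokenize_lines(["int a = 1 ;", "return a ;"], max_workers=2))
```

Lowering max_workers trades speed for a smaller number of lines held in flight at once.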
CLA Signed

opened by yazdanbakhsh 4
How to extract functions?
I have downloaded the raw source code on my machine, for example, python.000000000000.json.gz. But when I run preprocessing pipeline, there is an error: AssertionError: failed to learn bpe on /data2/linjiayi/CodeGen-master/data/python.sa-cl.tok.shuf.50gb, command: /home/linjiayi/CodeGen-master/codegen_sources/model/tools/fastBPE/fast learnbpe 50000 /data2/linjiayi/CodeGen-master/data/python.sa-cl.tok.shuf.50gb > /data2/linjiayi/CodeGen-master/data/python.sa-cl.codes

After I changed the filename to python.000.json.gz, it ran successfully. Actually, I only want to extract functions from the raw source code. So, I have a few questions:

Can data downloaded from BigQuery only be named in python.*[0-4][0-9][0-9].json.gz format?

When I import data from a BigQuery table into Google Storage using wildcards, the file is named python.000000000000.json.gz. How can I save a file name with only three zeros? Did you download the file and rename it?

After I run preprocessing pipeline, only one python.000.json.gz generates a lot of files.

The `.sa.tok` files contain `standalone functions`. The `.cl.tok` files contain `class functions`. What's in the other files?

Can the preprocessing pipeline extract a description for each function? In which file is the description saved?

The format of the extracted function is:

robertglen/flask | def test_explicit_instance_paths ( modules_tmpdir ) : NEW_LINE INDENT with pytest . raises ( ValueError ) as excinfo : NEW_LINE INDENT flask . Flask ( __name__ , instance_path = ' instance ' ) NEW_LINE DEDENT assert ' must ▁ be ▁ absolute ' in str ( excinfo . value ) NEW_LINE app = flask . Flask ( __name__ , instance_path = str ( modules_tmpdir ) ) NEW_LINE assert app . instance_path == str ( modules_tmpdir ) NEW_LINE DEDENT

There are NEW_LINE INDENT, NEW_LINE DEDENT and NEW_LINE tokens in the code. Do they affect the parsing of the code?
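In the sample above, NEW_LINE, INDENT and DEDENT appear to stand in for Python's line breaks and indentation changes, so the layout can be rebuilt from them. The sketch below is a hypothetical helper written for illustration, not the repository's detokenizer; it handles only these three markers and leaves everything else untouched, including the ▁ marks inside string literals.

```python
def restore_python_layout(tokenized, indent="    "):
    """Rebuild line breaks and indentation from NEW_LINE / INDENT / DEDENT markers."""
    lines = []
    current = []
    depth = 0
    for token in tokenized.split():
        if token == "NEW_LINE":
            # End of a logical line: emit it at the current depth.
            if current:
                lines.append(indent * depth + " ".join(current))
            current = []
        elif token == "INDENT":
            depth += 1
        elif token == "DEDENT":
            depth = max(depth - 1, 0)
        else:
            current.append(token)
    if current:
        lines.append(indent * depth + " ".join(current))
    return "\n".join(lines)


sample = "def f ( x ) : NEW_LINE INDENT if x > 0 : NEW_LINE INDENT return x NEW_LINE DEDENT return - x NEW_LINE DEDENT"
print(restore_python_layout(sample))
```

The restored text keeps the spaces between tokens, which Python accepts, so the result can be parsed again as ordinary source code.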
opened by skye95git 4
/usr/bin/ld: cannot find -lc++ - distutils.errors.DistutilsExecError: command '/usr/bin/cc'
I am trying to use the pretrained model to transform from C++ to Java, but with no luck after installations with this error

python -m codegen_sources.model.translate --src_lang cpp --tgt_lang java --model_path models/TransCoder_model_1.pth --beam_size 1 < zcpp_sample.cpp
adding to path <>/workspaces/git_web/facebook_transcoder/CodeGen
INFO - 10/02/21 08:26:08 - 0:00:04 - ============ Model Reloading
INFO - 10/02/21 08:26:08 - 0:00:04 - Reloading encoder from models/TransCoder_model_1.pth ...
WARNING - 10/02/21 08:26:10 - 0:00:06 - Lang cpp_sa matched to pretrained cpp_sa lang embedding.
WARNING - 10/02/21 08:26:10 - 0:00:06 - Lang java_sa matched to pretrained java_sa lang embedding.
WARNING - 10/02/21 08:26:10 - 0:00:06 - Lang python_sa matched to pretrained python_sa lang embedding.
WARNING - 10/02/21 08:26:10 - 0:00:06 - The size of position embeddings in current model is 2048, the size of reloaded is 1024. need to repeat last positions 1024 times.
INFO - 10/02/21 08:26:10 - 0:00:07 - Reloading decoders from models/TransCoder_model_1.pth ...
WARNING - 10/02/21 08:26:11 - 0:00:08 - Lang cpp_sa matched to pretrained cpp_sa lang embedding.
WARNING - 10/02/21 08:26:11 - 0:00:08 - Lang java_sa matched to pretrained java_sa lang embedding.
WARNING - 10/02/21 08:26:11 - 0:00:08 - Lang python_sa matched to pretrained python_sa lang embedding.
WARNING - 10/02/21 08:26:11 - 0:00:08 - The size of position embeddings in current model is 2048, the size of reloaded is 1024. need to repeat last positions 1024 times.
INFO - 10/02/21 08:26:11 - 0:00:08 - Number of parameters (encoder): 143239641
INFO - 10/02/21 08:26:11 - 0:00:08 - Number of parameters (decoders): 168442329
INFO - 10/02/21 08:26:11 - 0:00:08 - Number of decoders: 1

Input cpp function:
#include using namespace std;

int main() { cout<<"My First Program. Helllllllo."<<endl;

return 0;

}
/usr/bin/ld: cannot find -lc++
collect2: error: ld returned 1 exit status
Traceback (most recent call last):
  File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/unixccompiler.py", line 206, in link
    self.spawn(linker + ld_args)
  File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/ccompiler.py", line 910, in spawn
    spawn(cmd, dry_run=self.dry_run)
  File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/spawn.py", line 91, in spawn
    raise DistutilsExecError(
distutils.errors.DistutilsExecError: command '/usr/bin/cc' failed with exit code 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<>/anaconda3/envs/pyt/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "<>/anaconda3/envs/pyt/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/model/translate.py", line 254, in <module>
    output = translator.translate(
  File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/model/translate.py", line 139, in translate
    src_lang_processor = LangProcessor.processors[lang1](
  File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/preprocessing/lang_processors/cpp_processor.py", line 25, in __init__
    super().__init__(
  File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/preprocessing/lang_processors/tree_sitter_processor.py", line 40, in __init__
    self.create_treesiter_parser()
  File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/preprocessing/lang_processors/tree_sitter_processor.py", line 48, in create_treesiter_parser
    Language.build_library(
  File "<>/anaconda3/envs/pyt/lib/python3.9/site-packages/tree_sitter/__init__.py", line 72, in build_library
    compiler.link_shared_object(object_paths, output_path)
  File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/ccompiler.py", line 713, in link_shared_object
    self.link(CCompiler.SHARED_OBJECT, objects,
  File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/unixccompiler.py", line 208, in link
    raise LinkError(msg)
distutils.errors.LinkError: command '/usr/bin/cc' failed with exit code 1

I ran the following command in my environment:

cc --version
cc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
opened by mostafa-saad 3
`fastBPE` fix path
I ran into this assertion error: AssertionError: failed to learn bpe on /media/Z/dungnm31/transcoder/cpp-java-python.monolingual.tok.shuf.50gb, command: /home/dungnm/CodeGen/fastBPE/fast learnbpe 50000 /media/Z/dungnm31/transcoder/cpp-java-python.monolingual.tok.shuf.50gb > /media/Z/dungnm31/transcoder/cpp-java-python.monolingual.codes

It turns out the command itself was not right. According to install_env.sh, fastBPE is located at "codegen_sources/model/tools/fastBPE/fast" instead of "fastBPE/fast".

I suggest changing this line in codegen_sources/preprocessing/bpe_modes/fast_bpe_mode.py:

FAST = str(Path(__file__).parents[3].joinpath("fastBPE/fast"))

to

FAST = str(Path(__file__).parents[3].joinpath("codegen_sources/model/tools/fastBPE/fast"))
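To see why the two expressions differ, it helps to resolve parents[3] by hand. The sketch below uses PurePosixPath so that nothing has to exist on disk; the /home/user/CodeGen checkout location is a made-up example, and only the relative paths come from the report above.

```python
from pathlib import PurePosixPath

# Hypothetical checkout location; the path below the repository root is from the report.
module_file = PurePosixPath(
    "/home/user/CodeGen/codegen_sources/preprocessing/bpe_modes/fast_bpe_mode.py"
)

# parents[0] = .../bpe_modes, parents[1] = .../preprocessing,
# parents[2] = .../codegen_sources, parents[3] = repository root.
repo_root = module_file.parents[3]

current = repo_root.joinpath("fastBPE/fast")
proposed = repo_root.joinpath("codegen_sources/model/tools/fastBPE/fast")

print(repo_root)
print(current)
print(proposed)
```

The first joined path matches the shape of the failing command in the assertion message, and the second matches the location the report attributes to install_env.sh.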
opened by nmd-2000 0

Empty .sa.tok files after select_functions & request to release self_training dataset

I am trying to create the self-training dataset, as per the instructions at https://github.com/facebookresearch/CodeGen/blob/main/docs/TransCoder-ST.md.

From Google BigQuery, I got 500 .json.gz files. Thereafter I preprocessed them and got the following symlinks successfully:

[abc@def CodeGen]$ ls xyz/java-FULL/XLM-syml/
test.java_cl.pth  train.java_cl.0.pth  train.java_cl.2.pth  train.java_sa.1.pth  valid.java_cl.pth
test.java_sa.pth  train.java_cl.1.pth  train.java_sa.0.pth  train.java_sa.2.pth  valid.java_sa.pth
[abc@def CodeGen]$

But now, as part of the final step, I am facing an issue on running create_self_training_dataset.sh. As per the following output that I am getting, all the .sa.tok files in the selected_functions folder are empty.

Repository root: .
python codegen_sources/test_generation/select_java_inputs.py --local True --input_path /home/xyz/CodeGen-data/java-FULL/ --output_path /home/xyz/CodeGen-data/dataset//selected_functions/ --rerun True
adding /project/6001889/xyz/CodeGen to path
adding to path /project/6001884/xyz/CodeGen
########## Selecting input functions ##########
100%|██████████| 500/500 [10:08:19<00:00, 73.00s/it] 
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000000.sa.tok
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000001.sa.tok
...
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000497.sa.tok
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000498.sa.tok
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000499.sa.tok

On debugging, I found that is_simple_standalone_func(func), at line 67 of the linked file, returns False for all the Java functions. As a result, the mask at line 114 in select_functions(funcpath) is an all-False list. Please suggest what to do in this case.

Also, it would be great if the authors can please release the training dataset of 135,000 parallel functions (as mentioned in the paper) between Java, Python, and C++, in the form of a shareable link.

opened by PrithwishJana 0

Could you please build a website to support API to translate the program languages?

It's too hard for me to translate my cpp code to python. I will appreciate it if you would build a website to provide the API to translate the code. Thank you.

opened by Bailey-24 0
Bug in epoch calculation

At line 1483 of codegen_sources/model/src/trainer.py, the code is self.n_sentences += params.batch_size. I think it should be self.n_sentences += len1.size(0). https://github.com/facebookresearch/CodeGen/blob/6e93aca63e7bc77287c9965a5080456326651237/codegen_sources/model/src/trainer.py#L1483

With this bug, the notion of one epoch becomes wrong because of the check at the following line.

https://github.com/facebookresearch/CodeGen/blob/6e93aca63e7bc77287c9965a5080456326651237/codegen_sources/model/train.py#L742
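The effect can be illustrated with a toy counter. The sketch assumes that an epoch ends once the sentence counter reaches epoch_size (an assumption about the linked check, not a copy of it); steps_until_epoch_end and the batch sizes are made up for illustration. Batches built with a token budget, as with --tokens_per_batch, can hold a different number of sentences at each step, so adding a fixed nominal batch size miscounts.

```python
from itertools import cycle


def steps_until_epoch_end(batch_sizes, epoch_size, nominal_batch_size=None):
    """Count training steps until the sentence counter reaches epoch_size.

    If nominal_batch_size is given, the counter grows by that fixed amount
    per step (the reported behaviour); otherwise it grows by the real number
    of sentences in each batch (the proposed behaviour).
    batch_sizes must contain positive integers.
    """
    counted = 0
    steps = 0
    for size in cycle(batch_sizes):
        counted += nominal_batch_size if nominal_batch_size is not None else size
        steps += 1
        if counted >= epoch_size:
            return steps


real_batches = [10, 12, 8]  # sentences actually present in successive batches
print(steps_until_epoch_end(real_batches, epoch_size=100, nominal_batch_size=32))
print(steps_until_epoch_end(real_batches, epoch_size=100))
```

With these numbers the fixed increment ends the epoch after 4 steps, while counting real sentences takes 10 steps, so an epoch defined by the fixed increment covers far fewer sentences than epoch_size.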

opened by dineshkh 0
Pretrain model

Hi, I want to train a model for Pascal-to-Java translation. I have small datasets: about 2 GB of Pascal and 2 GB of Java. I trained an MLM model and then a TransCoder model, but translation doesn't work. When I try to translate Pascal to Java, I get the same Pascal function back or a bad Java translation. Could the problem be the small dataset, or overfitting?

opened by Elvares 1
