Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.

Overview

This repository is a toolkit for applying machine learning to programming languages. It implements tokenization, dataset preprocessing, model training, and model evaluation.

We provide reference implementations of the papers listed in the References section below:

We also provide pre-trained models for language modeling, translation and deobfuscation.

Dependencies

Run install_env.sh to install the dependencies. We use the black code formatter.

Data

Source code processors

This repository contains programming language processors for C++, Java and Python. These processors include:

  • tokenization and detokenization
  • obfuscation
  • function extraction

These processors are based on TreeSitter parsers. Since TreeSitter parsers are available for more than 30 programming languages, one can easily create a new programming language processor.

Example of code tokenization:

from codegen_sources.preprocessing.lang_processors.java_processor import JavaProcessor

java_code = r"""class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!"); 
    }
}"""
java_processor = JavaProcessor(root_folder="<YOUR_TREESITER_FOLDER>")
tokenized_java_code = java_processor.tokenize_code(java_code)
print(tokenized_java_code)
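
The same processor object also exposes the other operations listed above. The snippet below is a minimal sketch: the method names (detokenize_code, extract_functions, obfuscate_code) and their return values are assumptions based on the LangProcessor interface, so check the processor classes if they differ.

# Assumed API: detokenize the token sequence back to source code
detokenized_java_code = java_processor.detokenize_code(tokenized_java_code)

# Assumed API: split the tokenized code into standalone and class functions
standalone_functions, class_functions = java_processor.extract_functions(tokenized_java_code)

# Assumed API: obfuscate identifiers and return the mapping back to the original names
obfuscated_java_code, identifier_dictionary = java_processor.obfuscate_code(java_code)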

BPE

This repository provides file-level wrappers for fastBPE and RoBERTa BPE.
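
For a quick sanity check outside the preprocessing pipeline, the fastBPE Python bindings can be used directly to apply learned codes to tokenized code. This is an illustrative sketch using the standalone fastBPE package rather than this repository's wrappers, and the codes/vocab paths are placeholders.

import fastBPE

# Placeholders: use the codes and vocabulary learned by the pipeline
# (or the pretrained ones provided below).
bpe = fastBPE.fastBPE("<BPE_CODES_PATH>", "<BPE_VOCAB_PATH>")

# Apply BPE to already-tokenized code, one string per line.
print(bpe.apply(["class HelloWorld {", "public static void main ( String [ ] args ) {"]))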

Dataset Preprocessing

This repository contains a pipeline to create programming language datasets. It currently supports four dataset modes:

  • Monolingual (ex: Java source code)
  • Monolingual Functions (ex: Java functions)
  • Monolingual Obfuscated (ex: Obfuscated Java source code. [Details here])
  • Monolingual Obfuscated Functions (ex: Obfuscated Java functions)

First, download C++ / Java / Python source code from Google BigQuery. To run our preprocessing pipeline, you need to download the raw source code to your machine in JSON format. A sample of it is given here.
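
To illustrate the expected input layout, each .json.gz file is a gzipped, newline-delimited JSON file whose records hold the repository name and file content. The snippet below is only a sketch: the file name is a placeholder and the field names (repo_name, content) are assumptions, so check the provided sample for the exact schema.

import gzip
import json

# Placeholder file name; field names are assumed from the BigQuery export.
with gzip.open("java.000.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["repo_name"], len(record["content"]))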

The pipeline does the following:

  • Source code extraction from json (.json.gz) and tokenization (.tok)
  • Train BPE codes and vocab
  • Apply BPE (.bpe)
  • Binarization (.pth)
  • Symlink folder with appropriate file names for .pth (XLM-syml). To be given as data_path argument for training.

To run the pipeline:

python -m codegen_sources.preprocessing.preprocess \
<DATA_PATH> \                            # folder containing json.gz
--langs java cpp python  \               # languages to process
--mode monolingual_functions \           # dataset mode
--bpe_mode=fast_bpe \                    # BPE mode. Defaults to fast_bpe; can be roberta_bpe
--local=True \                           # Run on your local machine if True. If False, run on a cluster (requires submitit setup)
--train_splits=1                         # Number of training splits

If you give several languages, the BPE codes and vocabulary will be learned jointly on these languages, so that you have a common vocabulary to train one model for several languages. If you do not want that, launch the pipeline on each language separately. These tests exercise the pipeline in its different modes and give an overview of the possible options.

Also, we provide the BPE codes and vocabulary here. These are the codes and vocabulary used for TransCoder and DOBF. They were learned on concatenated C++, Java, and Python data. If you want to use them instead of learning new ones, give the corresponding paths as fastbpe_code_path and fastbpe_vocab_path arguments.

In TransCoder and DOBF readmes, we provide the commands to preprocess the respective datasets.

Model

Overview

In this repository, we provide code to train transformer-based models (based on the XLM repository). The available training tasks are the following:

  • Masked Language Model (MLM)
  • Causal Language Model (CLM)
  • Supervised Machine translation (MT)
  • Classification
  • Deobfuscation = DOBF
  • Unsupervised Machine Translation = TransCoder (denoising auto-encoding (AE) + back-translation (BT))

We evaluate our models with metrics adapted to each task (e.g. computation accuracy and BLEU score for TransCoder, subtoken score for Deobfuscation).

We also provide wrappers to fine-tune and evaluate our models on the CodeXGLUE benchmark.

Download models

You can download the following models:
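
Once one of these models (e.g. a TransCoder checkpoint) is downloaded, translation can be run through the codegen_sources.model.translate entry point. The command below is an illustrative sketch that mirrors a user-reported invocation in the issues further down; the model file name and input file are placeholders.

python -m codegen_sources.model.translate \
    --src_lang cpp \
    --tgt_lang java \
    --model_path models/TransCoder_model_1.pth \
    --beam_size 1 < my_function.cpp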

Re-train specific models

To have details on how to retrain specific models, please refer to the README specific to each model.
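
As a rough orientation before reading the model-specific READMEs, MLM pre-training is launched through codegen_sources/model/train.py. The command below is a trimmed sketch adapted from a user-reported invocation in the issues further down; the paths are placeholders, the values are illustrative rather than recommended hyper-parameters, and additional architecture and optimization flags (see the full commands in the issues) are typically required.

python codegen_sources/model/train.py \
    --exp_name mlm_java_cpp \
    --dump_path <DUMP_PATH> \
    --data_path <DATA_PATH>/XLM-syml \
    --mlm_steps 'java,cpp' \
    --encoder_only true \
    --lgs 'java-cpp' \
    --max_vocab 64000 \
    --validation_metrics _valid_mlm_ppl \
    --stopping_criterion '_valid_mlm_ppl,10'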

References

TransCoder model (NeurIPS 2020)

[1] B. Roziere*, M.A. Lachaux*, L. Chanussot, G. Lample. Unsupervised Translation of Programming Languages.

@article{roziere2020unsupervised,
  title={Unsupervised translation of programming languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Chanussot, Lowik and Lample, Guillaume},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

DOBF

[2] B. Roziere*, M.A. Lachaux*, M. Szafraniec, G. Lample. DOBF: A Deobfuscation Pre-Training Objective for Programming Languages.

@article{roziere2021dobf,
  title={DOBF: A Deobfuscation Pre-Training Objective for Programming Languages},
  author={Roziere, Baptiste and Lachaux, Marie-Anne and Szafraniec, Marc and Lample, Guillaume},
  journal={arXiv preprint arXiv:2102.07492},
  year={2021}
}

* Equal Contribution

License

CodeGen is released under the Creative Commons Attribution-NonCommercial 4.0 International license. See LICENSE for more details.

Comments
  • Fine-tuning TransCoder

    Hi,

    We recently proposed a small-scale program translation dataset, called AVATAR. We want to fine-tune TransCoder on the translation task but we didn't find any documentation on that. Can you provide some guidelines on fine-tuning TransCoder?

    opened by wasiahmad 27
  • Parallel datasets

    Hi, I am trying to create a POC using CodeGen to translate code written in vb to Java and vice-versa. I downloaded the training data for vb and java using Google BigQuery. Also, I have completed the preprocessing step using commands:

    1. python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual_functions --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10
    2. python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10

    As a result, the following files were created inside the folder XLM-syml:

    1. test.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa].pth
    2. train.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa [0-9]].pth
    3. valid.[java_cl|java_monolingual|java_sa|vb_cl|vb_monoligual|vb_sa].pth

    Post that, I trained the MLM model using the following command: python codegen_sources/model/train.py --exp_name mlm_vb_java_fast_mono_updated_v0 --dump_path '/content/Facebook_CodeGen/dumpPath_fast_mono_updated' --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' --mlm_steps 'vb_sa,java_sa' --add_eof_to_stream true --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15' --encoder_only true --n_layers 6 --emb_dim 1024 --n_heads 8 --lgs 'vb_sa-java_sa' --max_vocab 64000 --gelu_activation false --roberta_mode false --amp 2 --fp16 true --batch_size 16 --bptt 512 --epoch_size 200 --max_epoch 100000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --save_periodic 0 --validation_metrics _valid_mlm_ppl --stopping_criterion '_valid_mlm_ppl,10'

    However, when I am trying to train transcoder model using following command, I am getting AssertionError: /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml/valid.java_sa-vb_sa.java_sa.0.pth error. Command: python codegen_sources/model/train.py --exp_name transcoder_vb_java_updated_v1 --dump_path '/content/drive/MyDrive/dumpPath_updated_transcoder_v0' --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' --split_data_accross_gpu local --bt_steps 'vb_sa-java_sa-vb_sa,java_sa-vb_sa-java_sa' --ae_steps 'vb_sa,java_sa' --lambda_ae '0:1,30000:0.1,100000:0' --word_shuffle 3 --word_dropout '0.1' --word_blank '0.3' --encoder_only False --n_layers 0 --n_layers_encoder 6 --n_layers_decoder 6 --emb_dim 1024 --n_heads 8 --lgs 'java_sa-vb_sa' --max_vocab 64000 --gelu_activation false --roberta_mode false --reload_model '/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth,/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth' --reload_encoder_for_decoder true --amp 2 --fp16 true --tokens_per_batch 3000 --group_by_size true --max_batch_size 128 --epoch_size 100 --max_epoch 10000000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --eval_bleu true --eval_computation true --has_sentences_ids true --generate_hypothesis true --save_periodic 1 --validation_metrics 'valid_vb_-java_mt_comp_acc' --lgs_mapping 'vb_sa:vb,java_sa:java'

    Could you please help me understand how to get these parallel datasets? Also, is there some step that I am missing or doing incorrectly?

    question 
    opened by prnk04 23
  • Preprocess step is completing but pth files are not generating as expected in XLM folder

    Hello, After running preprocess steps from below command:

    python -m codegen_sources.preprocessing.preprocess /path/data/mydata2 --langs java cpp --mode monolingual_functions --bpe_mode=fast --local=True --train_splits=1 --fastbpe_code_path=/path/data/bpe/cpp-java-python/ --fastbpe_vocab_path=/path/data/bpe/cpp-java-python/

    The XLM-syml folder is generated with file names like: test.cpp_cl.pth, test.cpp_sa.pth, test.java_cl.pth, ..., train.cpp_cl.0.pth, train.cpp_sa.0.pth, ...

    But using these files in the training step (MLM) gives errors like:

    XLM-syml/train.java.pth not found
    XLM-syml/valid.java.pth not found
    XLM-syml/test.java.pth not found
    XLM-syml/train.cpp.pth not found
    XLM-syml/valid.cpp.pth not found
    XLM-syml/test.cpp.pth not found

    Train command: python train.py --exp_name mlm --dump_path '/path/CodeGen/data/models' --data_path '/path/data/mydata2/XLM-syml' --split_data_accross_gpu local --mlm_steps 'java,cpp' --add_eof_to_stream true --word_mask_keep_rand '0.8,0.1,0.1' --word_pred '0.15' --encoder_only true --n_layers 12 --emb_dim 768 --n_heads 12 --lgs 'java-cpp' --max_vocab 64000 --gelu_activation true --roberta_mode false --amp 2 --fp16 true --batch_size 8 --bptt 512 --epoch_size 1000 --max_epoch 2000 --split_data_accross_gpu global --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' --save_periodic 0 --validation_metrics _valid_mlm_ppl --stopping_criterion '_valid_mlm_ppl,10'

    I think the files are generated with suffixes like _cl.pth or _sa.pth, which are not picked up by the training step. Or am I doing something wrong?

    Thanks

    opened by Prathameshwar 12
  • Memory Usage Preprocessing

    When running preprocessing on all the data, it seems that the job consumes almost all the memory on the system and the swap memory. It makes the whole job very slow (after 1-2 days only 10-100 files processed). I am wondering if there is any way to put a limit on the number of files that are loaded into the memory concurrently.

    I see that there is an option job_mem when the job is running on clusters, but not when the job is running locally.

    enhancement 
    opened by yazdanbakhsh 10
  • Small Training Dataset

    Since the tokenization on all the dataset takes a lot of time, I have decided to create a small dataset with only 10-20 of the json.gz files. Once training starts, it gives the following error. Is it because the tokenization/BPE have not seen this character?

    File "/CodeGen/codegen_sources/model/train.py", line 701, in <module> main(params) File "/CodeGen/codegen_sources/model/train.py", line 609, in main trainer.mlm_step( File 
    "/CodeGen/codegen_sources/model/src/trainer.py", line 1005, in mlm_step show_batch( File 
    "/CodeGen/codegen_sources/model/src/utils.py", line 74, in show_batch f"{label} sent: 
    {restore_segmentation_sentence(source_sentence, roberta_mode)}" File "/CodeGen/codegen_sources/model/src/utils.py", 
    line 563, in restore_segmentation_sentence return restore_roberta_segmentation_sentence(sentence) File 
    "/CodeGen/codegen_sources/model/src/utils.py", line 601, in restore_roberta_segmentation_sentence res = 
    bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors="replace") File 
    "/CodeGen/codegen_sources/model/src/utils.py", line 601, in <listcomp> res = bytearray([byte_decoder[c] for c in 
    text]).decode("utf-8", errors="replace") KeyError: '郞'
    
    opened by yazdanbakhsh 6
  • Clarification questions

    Hi, I have a few questions regarding TransCoder's training data and optimization setting.

    1. From the paper, it is clear that TransCoder is trained using Standalone functions during the DAE+BT training stage. But is TransCoder only trained using Standalone functions in the MLM stage too?
    2. During the MLM stage, only the encoder part of TransCoder is pre-trained, right?
    3. For the MLM pre-training, max_epoch and epoch_size are set to 100k. If I understand correctly, epoch_size basically refers to the number of instances used in each epoch. Is it correct? Also, for MLM pre-training, the following are set:
    --validation_metrics _valid_mlm_ppl \
    --stopping_criterion '_valid_mlm_ppl,10' 
    

    So, I am assuming TransCoder pre-training is stopped based on the stopping_criterion. Before the MLM pre-training was stopped, how many optimization steps were executed?

    4. Unlike the MLM pre-training stage, no stopping_criterion is set for the DAE+BT training stage, while epoch_size was set to 50000 and max_epoch to 10000000. So when does the training stop? How many optimization steps were executed during this stage?
    opened by wasiahmad 5
  • Question regarding Backtranslation

    Hi,

    I have a basic question to understand why the backtranslation works in this scenario. Typically in NLP, we collect some parallel data to train Transformer-like models and then use backtranslation (BT) on a large collection of monolingual data.

    In contrast, TransCoder first goes through a pre-training stage and is then trained via BT. Since TransCoder does not have any idea about cross-language generation, at the beginning of BT it would presumably generate a sequence in the same language (Java output for Java input, instead of Python output). So feeding the generated sequence back to translate into the original sequence is not going to help the model learn translation. How, then, does back-translation provide the learning bias to perform translation?

    Recently, I tried to apply BT to our model, PLBART, to teach it to perform translation. However, at the very beginning of BT training, when I checked what PLBART generates for a given Java input, I saw that it generates exactly the input sequence, although the generation is conditioned on a prefix token for the target language Python. For example,

    # input
    static public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; } [java] 
    
    # output
    [python] public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; }
    

    As you can see above, exactly the same sequence is generated. PLBART is pre-trained via Denoising Autoencoding (DAE), thus it doesn't have any clue about cross-language generation. I am curious, how does TransCoder learn from BT?

    If I am not wrong, TransCoder uses language embedding with each input token (REF). Do you think that can make a difference? Also, can you shed light on the TransCoder structure? It seems like TransCoder does not have a typical sequence-to-sequence architecture.

    opened by wasiahmad 5
  • Assertion `srcIndex < srcSelectDimSize` failed

    I managed to get beyond this issue. Turned out my input file had Windows line endings (can't believe I'm even admitting to that, but whatever, it happens). Also, my source files are C, not CPP; not sure if that was causing CUBLAS_STATUS_NOT_INITIALIZED, but I intend to create a new processor for C.

    Originally posted by @raffian in https://github.com/facebookresearch/CodeGen/issues/55#issuecomment-994103795

    Does it work after you changed the line endings? I have the same error, but it doesn't work after I changed them.

    opened by 11asdad 4
  • Limited local parallelism across file lines

    This change allows users to limit the amount of parallelism across file lines. Before, the maximum parallelism (CPU count) was used for tokenization.

    CLA Signed 
    opened by yazdanbakhsh 4
  • How to extract functions?

    I have downloaded the raw source code on my machine, for example, python.000000000000.json.gz. But when I run preprocessing pipeline, there is an error: AssertionError: failed to learn bpe on /data2/linjiayi/CodeGen-master/data/python.sa-cl.tok.shuf.50gb, command: /home/linjiayi/CodeGen-master/codegen_sources/model/tools/fastBPE/fast learnbpe 50000 /data2/linjiayi/CodeGen-master/data/python.sa-cl.tok.shuf.50gb > /data2/linjiayi/CodeGen-master/data/python.sa-cl.codes

    After I changed the filename to python.000.json.gz, it ran successfully. Actually, I only want to extract functions from the raw source code, so I have a few questions:

    1. Can data downloaded from BigQuery only be named in python.*[0-4][0-9][0-9].json.gz format?

    2. When I export data from a BigQuery table into Google Storage using wildcards, the files are named like python.000000000000.json.gz. How can I save file names with only three zeros? Did you download the files and rename them?

    3. After I run the preprocessing pipeline, a single python.000.json.gz generates a lot of files.

    The .sa.tok files contain standalone functions and the .cl.tok files contain class functions. What's in the other files?
    4. Can the preprocessing pipeline extract a description for each function? Which file is the description saved in?

    5. The format of the extracted function is robertglen/flask | def test_explicit_instance_paths ( modules_tmpdir ) : NEW_LINE INDENT with pytest . raises ( ValueError ) as excinfo : NEW_LINE INDENT flask . Flask ( __name__ , instance_path = ' instance ' ) NEW_LINE DEDENT assert ' must ▁ be ▁ absolute ' in str ( excinfo . value ) NEW_LINE app = flask . Flask ( __name__ , instance_path = str ( modules_tmpdir ) ) NEW_LINE assert app . instance_path == str ( modules_tmpdir ) NEW_LINE DEDENT There are NEW_LINE INDENT, NEW_LINE DEDENT and NEW_LINE tokens in the code. Do they affect the parsing of the code?

    opened by skye95git 4
  • /usr/bin/ld: cannot find -lc++   -   distutils.errors.DistutilsExecError: command '/usr/bin/cc'

    I am trying to use the pretrained model to transform from C++ to Java, but with no luck after installations with this error

    python -m codegen_sources.model.translate --src_lang cpp --tgt_lang java --model_path models/TransCoder_model_1.pth --beam_size 1 < zcpp_sample.cpp
    adding to path <>/workspaces/git_web/facebook_transcoder/CodeGen
    INFO - 10/02/21 08:26:08 - 0:00:04 - ============ Model Reloading
    INFO - 10/02/21 08:26:08 - 0:00:04 - Reloading encoder from models/TransCoder_model_1.pth ...
    WARNING - 10/02/21 08:26:10 - 0:00:06 - Lang cpp_sa matched to pretrained cpp_sa lang embedding.
    WARNING - 10/02/21 08:26:10 - 0:00:06 - Lang java_sa matched to pretrained java_sa lang embedding.
    WARNING - 10/02/21 08:26:10 - 0:00:06 - Lang python_sa matched to pretrained python_sa lang embedding.
    WARNING - 10/02/21 08:26:10 - 0:00:06 - The size of position embeddings in current model is 2048, the size of reloaded is 1024. need to repeat last positions 1024 times.
    INFO - 10/02/21 08:26:10 - 0:00:07 - Reloading decoders from models/TransCoder_model_1.pth ...
    WARNING - 10/02/21 08:26:11 - 0:00:08 - Lang cpp_sa matched to pretrained cpp_sa lang embedding.
    WARNING - 10/02/21 08:26:11 - 0:00:08 - Lang java_sa matched to pretrained java_sa lang embedding.
    WARNING - 10/02/21 08:26:11 - 0:00:08 - Lang python_sa matched to pretrained python_sa lang embedding.
    WARNING - 10/02/21 08:26:11 - 0:00:08 - The size of position embeddings in current model is 2048, the size of reloaded is 1024. need to repeat last positions 1024 times.
    INFO - 10/02/21 08:26:11 - 0:00:08 - Number of parameters (encoder): 143239641
    INFO - 10/02/21 08:26:11 - 0:00:08 - Number of parameters (decoders): 168442329
    INFO - 10/02/21 08:26:11 - 0:00:08 - Number of decoders: 1

    Input cpp function:

    #include
    using namespace std;

    int main() {
        cout<<"My First Program. Helllllllo."<<endl;
        return 0;
    }

    /usr/bin/ld: cannot find -lc++
    collect2: error: ld returned 1 exit status
    Traceback (most recent call last):
      File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/unixccompiler.py", line 206, in link
        self.spawn(linker + ld_args)
      File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/ccompiler.py", line 910, in spawn
        spawn(cmd, dry_run=self.dry_run)
      File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/spawn.py", line 91, in spawn
        raise DistutilsExecError(
    distutils.errors.DistutilsExecError: command '/usr/bin/cc' failed with exit code 1

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<>/anaconda3/envs/pyt/lib/python3.9/runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "<>/anaconda3/envs/pyt/lib/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/model/translate.py", line 254
        output = translator.translate(
      File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/model/translate.py", line 139, in translate
        src_lang_processor = LangProcessor.processors[lang1](
      File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/preprocessing/lang_processors/cpp_processor.py", line 25, in __init__
        super().__init__(
      File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/preprocessing/lang_processors/tree_sitter_processor.py", line 40, in __init__
        self.create_treesiter_parser()
      File "<>/workspaces/git_web/facebook_transcoder/CodeGen/codegen_sources/preprocessing/lang_processors/tree_sitter_processor.py", line 48, in create_treesiter_parser
        Language.build_library(
      File "<>/anaconda3/envs/pyt/lib/python3.9/site-packages/tree_sitter/__init__.py", line 72, in build_library
        compiler.link_shared_object(object_paths, output_path)
      File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/ccompiler.py", line 713, in link_shared_object
        self.link(CCompiler.SHARED_OBJECT, objects,
      File "<>/anaconda3/envs/pyt/lib/python3.9/distutils/unixccompiler.py", line 208, in link
        raise LinkError(msg)
    distutils.errors.LinkError: command '/usr/bin/cc' failed with exit code 1

    Running cc --version in my environment gives: cc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

    opened by mostafa-saad 3
  • `fastBPE` fix path

    I ran into the assertion AssertionError: failed to learn bpe on /media/Z/dungnm31/transcoder/cpp-java-python.monolingual.tok.shuf.50gb, command: /home/dungnm/CodeGen/fastBPE/fast learnbpe 50000 /media/Z/dungnm31/transcoder/cpp-java-python.monolingual.tok.shuf.50gb > /media/Z/dungnm31/transcoder/cpp-java-python.monolingual.codes

    It turns out the command itself was not right. According to install_env.sh, the fastBPE binary is located at "codegen_sources/model/tools/fastBPE/fast" instead of "fastBPE/fast".

    I suggest changing, in codegen_sources/preprocessing/bpe_modes/fast_bpe_mode.py,

    FAST = str(Path(__file__).parents[3].joinpath("fastBPE/fast"))
    

    to

    FAST = str(Path(__file__).parents[3].joinpath("codegen_sources/model/tools/fastBPE/fast"))
    
    opened by nmd-2000 0
  • Empty .sa.tok files after select_functions & request to release self_training dataset

    I am trying to create the self-training dataset, as per the instructions at https://github.com/facebookresearch/CodeGen/blob/main/docs/TransCoder-ST.md.

    From Google BigQuery, I got 500 .json.gz files. Thereafter I preprocessed them and got the following symlinks successfully:

    [abc@def CodeGen]$ ls xyz/java-FULL/XLM-syml/
    test.java_cl.pth  train.java_cl.0.pth  train.java_cl.2.pth  train.java_sa.1.pth  valid.java_cl.pth
    test.java_sa.pth  train.java_cl.1.pth  train.java_sa.0.pth  train.java_sa.2.pth  valid.java_sa.pth
    [abc@def CodeGen]$
    

    But now, as part of the final step, I am facing an issue on running create_self_training_dataset.sh. As per the following output that I am getting, all the .sa.tok files in the selected_functions folder are empty.

    Repository root: .
    python codegen_sources/test_generation/select_java_inputs.py --local True --input_path /home/xyz/CodeGen-data/java-FULL/ --output_path /home/xyz/CodeGen-data/dataset//selected_functions/ --rerun True
    adding /project/6001889/xyz/CodeGen to path
    adding to path /project/6001884/xyz/CodeGen
    ########## Selecting input functions ##########
    100%|██████████| 500/500 [10:08:19<00:00, 73.00s/it] 
    Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000000.sa.tok
    Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000001.sa.tok
    ...
    Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000497.sa.tok
    Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000498.sa.tok
    Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000499.sa.tok
    

    On debugging, I found that is_simple_standalone_func(func) at line 67 (Link) returns False for all the Java functions. As such, the mask at line 114 in select_functions(funcpath) is an all-False list. Please suggest what to do in this case.

    Also, it would be great if the authors could release the training dataset of 135,000 parallel functions between Java, Python, and C++ (as mentioned in the paper) in the form of a shareable link.

    opened by PrithwishJana 0
  • Could you please build a website to support API to translate the program languages?

    It's too hard for me to translate my cpp code to python. I will appreciate it if you would build a website to provide the API to translate the code. Thank you.

    opened by Bailey-24 0
  • Bug in epoch calculation

    At line 1483 in codegen_sources/model/src/trainer.py, the code is self.n_sentences += params.batch_size. I think it should be self.n_sentences += len1.size(0). https://github.com/facebookresearch/CodeGen/blob/6e93aca63e7bc77287c9965a5080456326651237/codegen_sources/model/src/trainer.py#L1483

    With the above bug, the notion of one epoch becomes wrong because of the check at the following line.

    https://github.com/facebookresearch/CodeGen/blob/6e93aca63e7bc77287c9965a5080456326651237/codegen_sources/model/train.py#L742

    opened by dineshkh 0
  • Pretrain model

    Hi, I want to train a model for pascal2java translation. I have small datasets, about 2 GB of Pascal and 2 GB of Java. I trained the MLM model and then the TransCoder model, but the translation doesn't work. When I try to translate pascal2java I get the same Pascal function back or a bad Java translation. Could the problem be the small dataset or overfitting?

    opened by Elvares 1