Code Transformer

This is an official PyTorch implementation of the CodeTransformer model proposed in:

D. Zügner, T. Kirschstein, M. Catasta, J. Leskovec, and S. Günnemann, “Language-agnostic representation learning of source code from structure and context”

which appeared at ICLR'2021.
An online demo is available at https://code-transformer.org.

[Paper (PDF) | Poster | Slides | Online Demo]

The CodeTransformer is a Transformer based architecture that jointly learns from source code (Context) and parsed abstract syntax trees (AST; Structure). It does so by linking source code tokens to AST nodes and using pairwise distances (e.g., Shortest Paths, PPR) between the nodes to represent the AST. This combined representation is processed in the model by adding the contributions of each distance type to the raw self-attention score between two input tokens (See the paper for more details).

Strengths of the CodeTransformer:

Outperforms other approaches on the source code summarization task.
Effectively leverages similarities among different programming languages when trained in a multi-lingual setting.
Produces useful embeddings that may be employed for other down-stream tasks such as finding similar code snippets across languages.

Cite

Please cite our paper if you use the model, experimental results, or our code in your own work:

@inproceedings{zuegner_code_transformer_2021,
title = {Language-Agnostic Representation Learning of Source Code from Structure and Context},
author = {Z{\"u}gner, Daniel and Kirschstein, Tobias and Catasta, Michele and Leskovec, Jure and G{\"u}nnemann, Stephan},
booktitle={International Conference on Learning Representations (ICLR)},
year = {2021} }

1. Repository Setup
2. Preprocessing
3. Training
4. Evaluation
- 4.1. Single Language Evaluation
- 4.2. Multi-language Evaluation
5. Models from the paper

1. Repository Setup

1.1. Data

To run experiments with the CodeTransformer you can either:

Create a new dataset from raw code snippets
or
Download the already preprocessed datasets we conducted our experiments on

1.1.1. Raw Data

To use our pipeline to generate a new dataset for code summarization, a collection of methods in the target language is needed. In our experiments, we use the following unprocessed datasets:

Name	Description	Obtain from
code2seq	We use `java-small` for Code Summarization as well as `java-medium` and `java-large` for Pretraining	code2seq repository
CodeSearchNet (CSN)	For our (multilingual) experiments on Python, JavaScript, Ruby and Go, we employ the raw data from the CSN challenge	CodeSearchNet repository
`java-pretrain`	For our Pretraining experiments, we compiled and deduplicated a large code method dataset based on `java-small`, `java-medium` and `java-large`.	java-pretrain-raw.tar.gz: .txt files with paths to the .java files from the code2seq datasets for each partition java-pretrain-raw-methods.tar.gz: Extracted Java methods stored in .json files. Can be directly fed into stage 1 Preprocessing

1.1.2. Preprocessed Data

We make our preprocessed datasets available for a quick setup and easy reproducibility of our results:

Name	Language(s)	Based on	Download
Python	Python	CSN	python.tar.gz
JavaScript	JavaScript	CSN	javascript.tar.gz
Ruby	Ruby	CSN	ruby.tar.gz
Go	Go	CSN	go.tar.gz
Multi-language	Python, JavaScript, Ruby, Go	CSN	multi-language.tar.gz
java-small	Java	code2seq	java-small.tar.gz java-small-pretrain.tar.gz: For fine-tuning the pretrained model on `java-small`
java-pretrain	Java	code2seq	Full dataset available on request due its enormous size java-pretrain-vocabularies.tar.gz: Contains only the vocabularies from pretraining and can be used for fine-tuning the pretrained `CT-LM-1` model on any other Java dataset

1.2. Example notebooks

The notebooks/ folder contains two example notebooks that showcase the CodeTransformer:

interactive_prediction.ipynb: Lets you load any of the models and specify an arbitrary code snippet to get a real-time prediction for its method name. Also showcases stage 1 and stage 2 preprocessing.
deduplicate_java_pretrain.ipynb: Explains how we deduplicated the large java-pretrain dataset that we created

1.3. Environment Variables

All environment variables (and thus external dependencies on the host machine) used in the project have to be specified in an .env configuration file. These have to be set to suit your local setup before anything can be run.
The .env.example file gives an example configuration. The actual configuration has to be put into ${HOME}/.config/code_transformer/.env.
Alternatively, you can also directly specify the paths as environment variables, e.g., by sourcing the .env file.

Variable (+ `CODE_TRANSFORMER_` prefix)	Description	Preprocessing	Training	Inference/Evaluation
Mandatory
`DATA_PATH`	Location for storing datasets	X	X	X
`BINARY_PATH`	Location for storing executables	X	-	-
`MODELS_PATH`	Location for storing model configs, snapshots and predictions	-	X	X
`LOGS_PATH`	Location for logging train metrics	-	X	-
Optional
`CSN_RAW_DATA_PATH`	Location of the downloaded raw CSN dataset files	X	-	-
`CODE2SEQ_RAW_DATA_PATH`	Location of the downloaded raw code2seq dataset files (Java classes)	X	-	-
`CODE2SEQ_EXTRACTED_METHODS_DATA_PATH`	Location of the code snippets extracted from the raw code2seq dataset with the JavaMethodExtractor	X	-	-
`DATA_PATH_STAGE_1`	Location of stage 1 preprocessing result (parsed ASTs)	X	-	-
`DATA_PATH_STAGE_2`	Location of stage 2 preprocessing result (computed distances)	X	X	X
`JAVA_EXECUTABLE`	Command for executing java on the machine	X	-	-
`JAVA_METHOD_EXTRACTOR_EXECUTABLE`	Path to the built .jar from the `java-method-extractor` submodule used for extracting methods from raw .java files	X	-	-
`JAVA_PARSER_EXECUTABLE`	Path to the built .jar from the `java-parser` submodule used for parsing Java ASTs	X	-	-
`SEMANTIC_EXECUTABLE`	Path to the built semantic executable used for parsing Python, JavaScript, Ruby and Go ASTs	X	-	-

1.4. Repository Structure

├── notebooks               # Example notebooks that showcase the CodeTransformer
├── code_transformer        # Main python package containing most functionality
│   ├── configuration         # Dict-style Configurations of ML models
│   ├── experiments           # Experiment setups for running preprocessing or training
│   │   ├── mixins              # Lightweight experiment modules that can be easily 
│   │   │                       #   combined for different datasets and models
│   │   ├── code_transformer    # Experiment configs for training the CodeTransformer
│   │   ├── great               # Experiment configs for training GREAT
│   │   ├── xl_net              # Experiment configs for training XLNet
│   │   ├── preprocessing       # Implementation scripts for stage 1 and stage 2 preprocessing
│   │   │   ├── preprocess-1.py   # Parallelized stage 1 preprocessing (Generating of ASTs from methods + word counts)
│   │   │   └── preprocess-2.py   # Parallelized stage 2 preprocessing (Calculating of distances in AST + vocabulary)
│   │   ├── paper               # Train configs for reproducing results of all models mentioned in the paper
│   │   └── experiment.py       # Main experiment setup containing training loop, evaluation, 
│   │                           #   metrics, logging and loading of pretrained models
│   ├── modeling              # PyTorch implementations of the Code Transformer, 
│   │   │                     #   GREAT and XLNet with different heads
│   │   ├── code_transformer    # CodeTransformer implementation
│   │   ├── decoder             # Transformer Decoder implementation with Pointer Network
│   │   ├── great_transformer   # Adapted implementation of GREAT for code summarization
│   │   ├── xl_net              # Adapted implementation of XLNet for code summarization
│   │   ├── modelmanager        # Easy loading of stored model parameters
│   │   └── constants.py        # Several constants affecting preprocessing and vocabularies
│   ├── preprocessing         # Implementation of preprocessing pipeline + data loading
│   │   │                     #   modeling, e.g., special tokens or number of method name tokens
│   │   ├── pipeline            # Stage 1 and Stage 2 preprocessing of CodeSearchNet code snippets
│   │   │   ├── code2seq.py       # Adaptation of code2seq AST path walks for CSN datasets and languages
│   │   │   ├── filter.py         # Low-level textual code snippet processing used during stage 1 preprocessing
│   │   │   ├── stage1.py         # Applies filters to code snippets and calls semantic to generate ASTs
│   │   │   └── stage2.py         # Contains only definitions of training samples, actual
│   │   │                         #   graph distance calculation is contained in preprocessing/graph/distances.py
│   │   ├── datamanager         # Easy loading and storing of preprocessed code snippets
│   │   │   ├── c2s               # Loading of raw code2seq dataset files
│   │   │   ├── csn               # Loading and storing of CSN dataset files
│   │   │   └── preprocessed.py   # Dataset-agnostic loading of stage 1 and stage 2 preprocessed samples
│   │   ├── dataset             # Final task-specific preprocessing before data is fed into model, i.e.,
│   │   │   │                   #   python modules to be used with torch.utils.data.DataLoader
│   │   │   ├── base.py           # task-agnostic preprocessing such as mapping sequence tokens to graph nodes
│   │   │   ├── ablation.py       # only-AST ablation
│   │   │   ├── code_summarization.py # Main preprocessing for the Code Summarization task.
│   │   │   │                         #   Masking the function name in input, drop punctuation tokens
│   │   │   └── lm.py             # Language Modeling pretraining task. Generate permutations
│   │   ├── graph               # Algorithms on ASTs
│   │   │   ├── alg.py            # Graph distance metrics such as next siblings
│   │   │   ├── ast.py            # Generalized AST as graph that handles semantic and Java-parser ASTs.
│   │   │   │                     #   Allows assigning tokens that have no corresponding AST node
│   │   │   ├── binning.py        # Graph distance binning (equal and exponential)
│   │   │   ├── distances.py      # Higher level modularized distance and binning wrappers for use in preprocessing
│   │   │   └── transform.py      # Core of stage2 preprocessing that calculates general graph distances  
│   │   └── nlp                 # Algorithms on text
│   │       ├── javaparser.py     # Python wrapper of java-parser to generate ASTs from Java methods
│   │       ├── semantic.py       # Python wrapper of semantic parser to generate ASTs
│   │       │                     #   from languages supported by semantic
│   │       ├── text.py           # Simple text handling such as positions of words in documents
│   │       ├── tokenization.py   # Mostly wrapper around Pygments Tokenizer to tokenize (and sub-tokenize) code snippets
│   │       └── vocab.py          # Mapping of the most frequent tokens to IDs understandable for ML models
│   ├── utils
│   │   └── metrics.py        # Evaluation metrics. Mostly, different F1-scores
│   └── env.py               # Defines several environment variables such as paths to executables 
├── scripts               # (Python) scripts intended to be run from the command line "python -m scripts/{SCRIPT_NAME}"
│   ├── code2seq
│   │   ├── combine-vocabs-code2seq.sh    # Creates code2seq vocabularies for multi-language setting 
│   │   ├── preprocess-code2seq.py        # Preprocessing for code2seq (Generating of tree paths).
│   │   │                                 #   Works with any datasets created by preprocess-1.py
│   │   ├── preprocess-code2seq.sh        # Calls preprocess-code2seq.py and preprocess-code2seq-helper.py.
│   │   │                                 #   Generates everything else code2seq needs, such as vocabularies
│   │   └── preprocess-code2seq-helper.py # Copied from code2seq. Performs vocabulary generation and normalization of snippets
│   ├── evaluate.py                   # Loads a trained model and evaluates it on a specified dataset
│   ├── evaluate-multilanguage.py     # Loads a trained multi-language model and evaluates it on a multi-language database  
│   ├── deduplicate-java-pretrain.py  # De-duplicates a directory of .java files (used for java-pretrain)
│   ├── extract-java-methods.py       # Extracts Java methods from raw .java files to feed into stage 1 preprocessing
│   ├── run-experiment.py             # Parses a .yaml config file and starts training of a CodeTransformer/GREAT/XLNet model
│   └── run-preprocessing.py          # Parses a .yaml config file and starts stage 1 or stage 2 preprocessing
├── sub_modules             # Separate modules
│   ├── code2seq               # code2seq adaptation: Mainly modifies code2seq for multi-language setting
│   ├── java-method-extractor  # code2seq adaptation: Extracts Java methods from .java files as JSON 
│   └── java-parser            # Simple parser wrapper for generating Java ASTs
├── tests                   # Unit Tests for parts of the project
└── .env.example            # Example environment variables configuration file

2. Preprocessing

2.1. semantic

The first stage of our preprocessing pipeline makes use of semantic to generate ASTs from code snippet that are written in Python, JavaScript, Ruby or Go.
semantic is a command line tool written in Haskell that is capable of parsing source code in a variety of languages. The generated ASTs mostly share a common set of node types which is important for multi-lingual experiments. For Java, we employ a separate AST parser, as the language currently is not supported by semantic.
To obtain the ASTs, we rely on the --json-graph option that has been dropped temporarily from semantic. As such, the stage 1 preprocessing requires a semantic executable built from a revision before Mar 27, 2020. E.g., the revision 34ea0d1dd6.

To enable stage 1 preprocessing, you can either:

Build semantic on your machine using a revision with the --json-graph option. We refer to the semantic documentation for build instructions.
or
Use the statically linked semantic executable that we built for our experiments: semantic.tar.gz

2.2. CSN Dataset

Download the raw CSN dataset files as described in the raw data section.

Compute ASTs (stage 1)

 python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-1-csn.yaml {python|javascript|ruby|go} {train|valid|test}

Compute graph distances (stage 2)

python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-2.yaml {python|javascript|ruby|go} {train|valid|test}

The .yaml files contain configurations for preprocessing (e.g., which distance metrics to use and how the vocabularies are generated).
It is important to run the preprocessing for the train partition first as statistics for generating the vocabulary that are needed for the other partitions are computed there.

2.3. code2seq Dataset

Download the raw code2seq dataset files (Java classes) as described in the raw data section.

Extract methods from the raw Java class files via

python -m scripts.extract-java-methods {java-small|java-medium|java-large}

Compute ASTs (stage 1)

python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-1.yaml {java-small|java-medium|java-large} {train|valid|test}

Compute graph distances (stage 2)

python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-2.yaml {java-small|java-medium|java-large} {train|valid|test}

The .yaml files contain configurations for preprocessing (e.g., which distance metrics to use and how the vocabularies are generated).
Ensure to run the preprocessing for the train partition first as statistics for generating the vocabulary that are needed for the other partitions are computed there.

2.4. Multilanguage Dataset

Builds upon the stage 1 CSN datasets computed as shown above. Datasets are then combined by simply running the stage 2 preprocessing with a comma-separated string containing the desired languages. For the experiments in our paper we combined Python, JavaScript, Ruby and Go:

python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-2.yaml python,javascript,ruby,go {train|valid|test}

3. Training

Similar to how preprocessing works, training is configured via .yaml files that describe the data representation that is used, the model hyperparameters and how training should go about.
Train metrics are logged to a tensorboard in LOGS_PATH.

3.1. Code Transformer

python -m scripts.run-experiment code_transformer/experiments/code_transformer/code_summarization.yaml

This will start training of a CodeTransformer model. The .yaml file contains model and training configurations (e.g., which dataset to use, model hyperparameters or when to store checkpoints).

3.2. GREAT

We adapted the Graph Relational Embedding Attention Transformer (GREAT) for comparison.

python -m scripts.run-experiment code_transformer/experiments/great/code_summarization.yaml

This will start training of a GREAT model. The .yaml file contains model and training configurations (e.g., which dataset to use, model hyperparameters or when to store checkpoints).

3.3. XLNet (No structure)

For comparison, we also employed an XLNet architecture that only learns from the source code token sequence.

python -m scripts.run-experiment code_transformer/experiments/xl_net/code_summarization.yaml

This will start training of a XLNet model. The .yaml file contains model and training configurations (e.g., which dataset to use, model hyperparameters or when to store checkpoints).

3.4. Language Modeling Pretraining

The performance of Transformer architectures can often be further improved by first pretraining the model on a language modeling task.
In our case, we make use of XL-Nets permutation based masked language modeling.
Language Modelling Pretraining can be run via:

python -m scripts.run-experiment code_transformer/experiments/code_transformer/language_modeling.yaml

The pretrained model can then be finetuned on the Code Summarization task using the regular training script as described above. The transfer_learning section in the .yaml configuration file is used to define the model and snapshot to be finetuned.

4. Evaluation

4.1. Single Language

python -m scripts.evaluate {code_transformer|great|xl_net} {run_id} {snapshot} {valid|test}

where run_id is the unique name of the run as printed during training. This also corresponds to the folder name that contains the respective stored snapshots of the model.
snapshot it the training iteration in which the snapshot was stored, e.g., 50000.

4.2. Multilanguage

python -m scripts.evaluate-multilanguage {code_transformer|great|xl_net} {run_id} {snapshot} {valid|test} [--filter-language {language}]

The --filter-language option can be used to run the evaluation only on one of the single languages that the respective multilanguage dataset is comprised of (used for CT-[11-14]).

5. Models from the Paper

5.1. Models trained on CSN Dataset

Name in Paper	Run ID	Snapshot	Language	Hyperparameters
Single Language
GREAT (Python)	GT-1	350000	Python	great_python.yaml
GREAT (Javascript)	GT-2	60000	JavaScript	great_javascript.yaml
GREAT (Ruby)	GT-3	30000	Ruby	great_ruby.yaml
GREAT (Go)	GT-4	220000	Go	great_go.yaml
Ours w/o structure (Python)	XL-1	400000	Python	xl_net_python.yaml
Ours w/o structure (Javascript)	XL-2	260000	JavaScript	xl_net_javascript.yaml
Ours w/o structure (Ruby)	XL-3	60000	Ruby	xl_net_ruby.yaml
Ours w/o structure (Go)	XL-4	200000	Go	xl_net_go.yaml
Ours w/o pointer net (Python)	CT-1	280000	Python	ct_no_pointer_python.yaml
Ours w/o pointer net (Javascript)	CT-2	120000	JavaScript	ct_no_pointer_javascript.yaml
Ours w/o pointer net (Ruby)	CT-3	520000	Ruby	ct_no_pointer_ruby.yaml
Ours w/o pointer net (Go)	CT-4	320000	Go	ct_no_pointer_go.yaml
Ours (Python)	CT-5	500000	Python	ct_python.yaml
Ours (Javascript)	CT-6	90000	JavaScript	ct_javascript.yaml
Ours (Ruby)	CT-7	40000	Ruby	ct_ruby.yaml
Ours (Go)	CT-8	120000	Go	ct_go.yaml
Multilanguage Models
Great (Multilang.)	GT-5	320000	Python, JavaScript, Ruby and Go	great_multilang.yaml
Ours w/o structure (Mult.)	XL-5	420000	Python, JavaScript, Ruby and Go	xl_net_multilang.yaml
Ours w/o pointer (Mult.)	CT-9	570000	Python, JavaScript, Ruby and Go	ct_no_pointer_multilang.yaml
Ours (Multilanguage)	CT-10	650000	Python, JavaScript, Ruby and Go	ct_multilang.yaml
Mult. Pretraining
Ours (Mult. + Finetune Python)	CT-11	120000	Python, JavaScript, Ruby and Go	ct_multilang.yaml → ct_multilang_python.yaml
Ours (Mult. + Finetune Javascript)	CT-12	20000	Python, JavaScript, Ruby and Go	ct_multilang.yaml → ct_multilang_javascript.yaml
Ours (Mult. + Finetune Ruby)	CT-13	10000	Python, JavaScript, Ruby and Go	ct_multilang.yaml → ct_multilang_ruby.yaml
Ours (Mult. + Finetune Go)	CT-14	60000	Python, JavaScript, Ruby and Go	ct_multilang.yaml → ct_multilang_go.yaml
Ours (Mult. + LM Pretrain)	CT-15	280000	Python, JavaScript, Ruby and Go	ct_multilang_lm.yaml → ct_multilang_lm_pretrain.yaml

5.2. Models trained on code2seq Dataset

Name in Paper	Run ID	Snapshot	Language	Hyperparameters
Without Pointer Net
Ours w/o structure	XL-6	400000	Java	xl_net_no_pointer_java_small.yaml
Ours w/o context	CT-16	150000	Java	ct_no_pointer_java_small_only_ast.yaml
Ours	CT-17	410000	Java	ct_no_pointer_java_small.yaml
With Pointer Net
GREAT	GT-6	170000	Java	great_java_small.yaml
Ours w/o structure	XL-7	170000	Java	xl_net_java_small.yaml
Ours w/o context	CT-18	90000	Java	ct_java_small_only_ast.yaml
Ours	CT-19	250000	Java	ct_java_small.yaml
Ours + Pretrain	CT-20	30000	Java	ct_java_pretrain_lm.yaml → ct_java_small_pretrain.yaml

5.3. Models from Ablation Studies

Name in Paper	Run ID	Snapshot	Language	Hyperparameters
Sibling Shortest Paths	CT-21	310000	Java	ct_java_small_ablation_only_sibling_sp.yaml
Ancestor Shortest Paths	CT-22	250000	Java	ct_java_small_ablation_only_ancestor_sp.yaml
Shortest Paths	CT-23	190000	Java	ct_java_small_ablation_only_shortest_paths.yaml
Personalized Page Rank	CT-24	210000	Java	ct_java_small_ablation_only_ppr.yaml

5.4. Download Models

We also make all our trained models that are mentioned in the paper available for easy reproducibility of our results:

Name	Description	Models	Download
CSN Single Language	All models trained on one of the Python, JavaScript, Ruby or Go datasets	`GT-[1-4]`, `XL-[1-4]`, `CT-[1-8]`	csn-single-language-models.tar.gz
CSN Multi-Language	All models trained on the multi-language dataset + Pretraining	`GT-5`, `XL-5`, `CT-[9-15]`, `CT-LM-2`	csn-multi-language-models.tar.gz
code2seq	All models trained on the code2seq `java-small` dataset + Pretraining	`XL-[6+7]`, `GT-6`, `CT-[16-20]`, `CT-LM-1`	code2seq-models.tar.gz
Ablation	The models trained for ablation purposes on `java-small`	`CT-[21-24]`	ablation-models.tar.gz

Once downloaded, you can test any of the above models in the interactive_prediction.ipynb notebook.

Dear Authors,

Thank you very much for your work! I used the scripts for preprocessing the java code2seq-type datasets to test the performance of the method name prediction model on the java-medium dataset later but was faced with a strange error. Everything was fine with the stage1 scripts, but after I ran the stage2 script (with a command like python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-2.yaml java-medium train, in particular), I have found the process failed (after a couple of hours, with a setting batch_size: 1000, num_processes: 8) with such an error:

struct.error: 'i' format requires -2147483648 <= number <= 2147483647

The detailed traceback follows:

Spoiler

Traceback (most recent calls WITHOUT Sacred internals):
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 344, in _sendback_result
    exception=exception))
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/externals/loky/backend/queues.py", line 240, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

Traceback (most recent calls WITHOUT Sacred internals):
  File "code_transformer/experiments/preprocessing/preprocess-2.py", line 340, in main
    Preprocess2Container().run()
  File "code_transformer/experiments/preprocessing/preprocess-2.py", line 312, in run
    for batch in dataset_slice)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/parallel.py", line 1017, in __call__
    self.retrieve()
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/parallel.py", line 909, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Unfortunately, I didn't manage to specify the procedure causing this error. So, my questions are what is the reason why this error appears in java-medium (and not appears on smaller datasets), and how can I resolve this problem (at least, what particular line may produce it – maybe I can catch an exception somewhere on some poor data)?

Can i build function classifier using code-transformer

Hi, I would like to build a method/function classifier using code-transformer, the labels are going to be 0,1,2,..9,

Could i get some suggestions on how to build it?

Any guidelines would be hugely appreciated! :D

opened by nbdfls 4

Error when preprocessing java-medium

opened by dmivilensky 3

How to just get the code embedding of entire code_snippet, without having to predict any function name

Hi, I want to get code embedding of entire code_snippet. How to get it? As per Code Snippet embedding section in interactive_prediction.ipynb, it gives the embedding of only masked method. I don't have any masked method or func name to predict. I just want the embedding of entire code. How can we do it?

opened by TejaswiniiB 2
Code embeddings

First of all thanks for publishing such great work.

In the example notebook you are showing how to use "Query Stream Embedding of the masked method name token in the final encoder layer" as "meaningful embedding of the provided AST/Source code pair". Typically I would use the embedding of [CLS] token. Is the [CLS] token added underneath when I use the processing from the notebook or should I add it myself?

opened by emsi 2
Number of training steps needed?

Thanks for releasing this amazing repo! The documentation is thorough and extremely helpful!

I didn't find the number of training steps or epochs needed in Appendix A.6 in the paper. I am running python -m scripts.run-experiment code_transformer/experiments/code_transformer/code_summarization.yaml (I changed the #layers in yaml file from 1 to 3 according to the appendix in the paper) over 2 days on a single GPU.

I have run for 600k steps and F-1 score in the tensorboard (I guess this is average F-1 score over 4 coding languages?) is around 0.27 (the micro F-1 is 0.33). The number is still a bit off from table 2. I wonder should I just train longer or something is wrong with my training.

opened by ywen666 2
Missing 'Experiment' Class Definition

Hi Authors,

Thanks for open source the awesome code!

In experiment.py, it tries to "from sacred import Experiment", but I couldn't really find the its definition anywhere, inside or outside 'code_transformer.utils.sacred'. Afterwards the code defines "ex = Experiment(base_dir='../../', interactive=False)", and use 'ex' in quite some places too.

May I enquire if it is something critical, and where to find it?

Thank you so much!

Regards,

opened by zhaoyuehrb 2
Where to obtain and how to use the binary files for javaparser and javamethodextractor

Dear Authors,

Thank you so much for open source the great work. I am very new to this, may I ask how to get the java-parser-1.0-SNAPSHOT.jar and JavaMethodExtractor-1.0.0-SNAPSHOT.jar? They are listed as required files under CODE_TRANSFORMER_BINARY_PATH, in parallel with semantic.

The java_to_ast method requires the file. Without it, the notebook interactive_prediction cannot run too.

I tried to download and use https://jar-download.com/artifacts/com.google.code.javaparser/javaparser/1.0.8/source-code but couldn't make it work. I got error of 'no main manifest attribute, in ./javaparser-1.0.8.jar\n'.

Help will be greatly appreciated!

opened by zhaoyuehrb 2
Using code-transformer on c++ code

Hello, I'm currently working on clustering some competitive programming code based on the algorithm used and I would like to know if I could use your model to predict method's names. Couldn't find anything cpp related.

Thanks ! Have a good day !

opened by bfdyoy 1

TypeError: 'float' object cannot be interpreted as an integer on modern pytroch

When run on pytorch '1.10.2+cu102' I get following error:

Traceback (most recent call last):
  File "/code_transformer/preprocessing/graph/transform.py", line 30, in __call__
    distance_matrix = distance_metric(adj)
  File "/code_transformer/preprocessing/graph/distances.py", line 76, in __call__
    sp_length = all_pairs_shortest_paths(G=G)
  File "/code_transformer/preprocessing/graph/alg.py", line 45, in all_pairs_shortest_paths
    values = torch.tensor([(dct[0], key, value) for dct in sps for key, value in dct[1].items()],
TypeError: 'float' object cannot be interpreted as an integer

opened by emsi 1

PreprocessingException: Error processing batch 0: No snippets left after filtering step. Skipping batch
Hi, Preprocessing is running fine without errors for few codes, but it is throwing the exception "PreprocessingException: Error processing batch 0: No snippets left after filtering step. Skipping batch" for few other codes. Can anyone tell how to overcome this error?

The code being used for preprocessing is from the interactive_prediction.ipynb notebook provided.

preprocessor = CTStage1Preprocessor(code_snippet_language, allow_empty_methods=True) stage1_sample = preprocessor.process([("", "", code_snippet)], 0)
opened by TejaswiniiB 1

Implementation of the paper "Language-agnostic representation learning of source code from structure and context".

Related tags

Overview

Code Transformer

Cite

Table of Contents

1. Repository Setup

1.1. Data

1.1.1. Raw Data

1.1.2. Preprocessed Data

1.2. Example notebooks

1.3. Environment Variables

1.4. Repository Structure

2. Preprocessing

2.1. semantic

2.2. CSN Dataset

2.3. code2seq Dataset

2.4. Multilanguage Dataset

3. Training

3.1. Code Transformer

3.2. GREAT

3.3. XLNet (No structure)

3.4. Language Modeling Pretraining

4. Evaluation

4.1. Single Language

4.2. Multilanguage

5. Models from the Paper

5.1. Models trained on CSN Dataset

5.2. Models trained on code2seq Dataset

5.3. Models from Ablation Studies

5.4. Download Models

Comments

Owner

Daniel Zügner

Inference code for "StylePeople: A Generative Model of Fullbody Human Avatars" paper. This code is for the part of the paper describing video-based avatars.

Code for paper ECCV 2020 paper: Who Left the Dogs Out? 3D Animal Reconstruction with Expectation Maximization in the Loop.

The project is an official implementation of our CVPR2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"

Home repository for the Regularized Greedy Forest (RGF) library. It includes original implementation from the paper and multithreaded one written in C++, along with various language-specific wrappers.

Official implementation of AAAI-21 paper "Label Confusion Learning to Enhance Text Classification Models"

Official PyTorch implementation for paper Context Matters: Graph-based Self-supervised Representation Learning for Medical Images

A PyTorch re-implementation of the paper 'Exploring Simple Siamese Representation Learning'. Reproduced the 67.8% Top1 Acc on ImageNet.

Implementation of the paper NAST: Non-Autoregressive Spatial-Temporal Transformer for Time Series Forecasting.

This is a Pytorch implementation of the paper: Self-Supervised Graph Transformer on Large-Scale Molecular Data.

Official implementation of the ICLR 2021 paper

Implementation of Nyström Self-attention, from the paper Nyströmformer

Implementation of SETR model, Original paper: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers.

Official implementation of the paper Image Generators with Conditionally-Independent Pixel Synthesis https://arxiv.org/abs/2011.13775

Official pytorch implementation of paper "Image-to-image Translation via Hierarchical Style Disentanglement".

PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes", CVPR 2021

Implementation of Barlow Twins paper

Official pytorch implementation of paper "Inception Convolution with Efficient Dilation Search" (CVPR 2021 Oral).

Official implementation of our paper "LLA: Loss-aware Label Assignment for Dense Pedestrian Detection" in Pytorch.

Functional TensorFlow Implementation of Singular Value Decomposition for paper Fast Graph Learning