TensorFlow code for the neural network presented in the paper: "Structural Language Models of Code" (ICML'2020)

Last update: Nov 6, 2022

Overview

SLM: Structural Language Models of Code

This is an official implementation of the model described in:

"Structural Language Models of Code" [PDF]

To appear in ICML'2020.

An online demo is available at https://AnyCodeGen.org.

This repository currently contains the dataset and the data extractor that we used to create the Java dataset in the paper. The TensorFlow code will be released soon.

Feel free to open a new issue for any question. We always respond quickly.

Requirements
Download our preprocessed Java-small dataset
Creating and preprocessing a new Java dataset
Datasets
Querying the Trained Model
Citation

Requirements

python3
TensorFlow 1.13 or newer (install). To check TensorFlow version:

python3 -c 'import tensorflow as tf; print(tf.__version__)'

For creating a new Java dataset: JDK 12

Download our preprocessed Java-small dataset

This dataset contains ~1.3M examples (1.1GB).

mkdir data
cd data
wget https://codegen-slm.s3.us-east-2.amazonaws.com/data/java-small-preprocessed.tar.gz
tar -xvzf java-small-preprocessed.tar.gz

This will create a data/java-small/ sub-directory, containing the files that hold training, test and validation sets, a dict file for various dataset properties and histograms, and a grammar file that is used during beam search to distinguish between terminal and non-terminal nodes.

Creating and preprocessing a new Java dataset

To create and preprocess a new dataset (for example, to compare SLM to a new model on another dataset):

Edit the file preprocess.sh using the instructions there, pointing it to the correct training, validation and test directories.
Run the preprocess.sh file:

bash preprocess.sh

Datasets

Java

To download the Java-small as raw *.java files, use:

Java-small

To download the preprocessed dataset, use:

Java-small-preprocessed

To download the dataset in a tokenized format that can be used in seq2seq models (for example, with OpenNMT-py), use:

Java-small-seq2seq

The following JSON files are the files that are created by the JavaExtractor. The preprocessed and the seq2seq files are created from these JSON files:

Java-small-json

Every line is a JSON object that contains the following fields: num_targets, num_nodes, targets, is_token, target_child_id, internal_paths, relative_paths, head_paths, head_root_path, head_child_id, linearized_tree, filepath, left_context, right_context, target_seq, line.
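
As a rough illustration of working with this one-object-per-line format, the sketch below parses the lines and prints a short summary of each example. The field names are the ones listed above; the helper names `iter_examples` and `summarize`, and the assumption that `filepath`, `num_targets` and `target_seq` hold simple scalar values, are ours and are not part of the JavaExtractor.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_examples(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines instead of failing on them
            yield json.loads(line)


def summarize(example: Dict) -> str:
    """Build a one-line summary from three of the documented fields."""
    return "{}: {} targets -> {}".format(
        example.get("filepath"),
        example.get("num_targets"),
        example.get("target_seq"),
    )
```

Since an open file object is an iterable of lines, it can be passed directly, for example `for ex in iter_examples(open(path, encoding="utf-8")): print(summarize(ex))`, where `path` points to one of the extracted JSON files.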

C#

The C# dataset that we used in the paper was created from the raw (*.cs files) dataset of Allamanis et al., 2018, which can be found here: https://aka.ms/iclr18-prog-graphs-dataset.

To extract examples from the C# files, we modified the data extraction code of Brockschmidt et al., 2019: https://github.com/microsoft/graph-based-code-modelling/.

Querying the Trained Model

To query the trained model, use the following API, where MYCODE is the given code snippet, which includes two question marks (??) to mark the "hole" that should be completed:

curl -X POST https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict -d '{"code": "MYCODE"}'

For example:

curl -X POST https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict -d '{"code": "public static Path[] stat2Paths(FileStatus[] stats) {  if (stats == null) return null;  Path[] ret = new Path[stats.length]; for (int i = 0; i < stats.length; ++i) { ret[i] = ??; } return ret; }"}'
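
The same request can be sketched in Python using only the standard library. This is a minimal sketch rather than an official client: the endpoint URL and the `{"code": ...}` payload are taken from the curl commands above, while the function names `build_request` and `query`, the `??` sanity check, and the 30-second timeout are our own assumptions. The response format is not documented here, so the sketch returns the raw response text, and the endpoint may no longer be reachable.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
API_URL = "https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict"


def build_request(code: str, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request whose body is {"code": <snippet>} (hypothetical helper)."""
    if "??" not in code:
        # Assumption: refuse snippets without the hole marker described above.
        raise ValueError('the snippet must contain "??" to mark the hole')
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def query(code: str, timeout: float = 30.0) -> str:
    """Send the snippet to the service and return the raw response text."""
    with urllib.request.urlopen(build_request(code), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Using `json.dumps` to build the body avoids the manual quote escaping that is needed when the snippet is embedded in a shell command. A call would look like `print(query("int x = ??;"))`.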

Citation

Structural Language Models of Code

@article{alon2019structural,
  title={Structural Language Models of Code},
  author={Alon, Uri and Sadaka, Roy and Levy, Omer and Yahav, Eran},
  journal={arXiv preprint arXiv:1910.00577},
  year={2019}
}

Comments

Use code2seq to complete expression generation
Hello Uri Alon, after reading your paper and seeing that you use code2seq for expression generation, I have some questions that I need your help with:

How can an expression be represented in a camel-case form similar to the one used by code2seq?

How can the code of code2seq be modified so that it generates the top-5 candidates (the acc5 mentioned in your paper)? Thank you.
opened by CplandS 15
TensorFlow code

Dear authors, Thanks for your outstanding work. I am very interested in the implementation details of your model after reading your paper. Can you share your training scripts and model code? Thanks.

This repository currently contains the dataset and the data extractor that we used to create the Java dataset in the paper. The TensorFlow code will be released soon.

opened by PikachuHy 3
About the baseline of Java any-code completion

Dear authors, thanks for your work! It is really great, since there was no any-code completion tool before. I wonder whether you would release the baselines you built to compare with AnyCodeGen on Java (e.g., the retrained code2seq). Building and retraining them would cost a lot of time and energy. Thanks.

opened by ShangwenWang 3
ASTs for incomplete code

Hi @urialon!

I read the paper and have some questions and would be really grateful if you could answer. You mention your model generates likely completions given a piece of code with missing holes. Can your proposed model be used when the hole appears not in between some source code lines but at the end (like an autocomplete model)?

In your evaluation do you compute ASTs for complete code and then mask subtrees and their corresponding tokens for prediction in your experiments? or are there ways to compute ASTs for incomplete code? A previous work https://arxiv.org/pdf/2005.08025.pdf (section 5) mentions that ASTs can only be retrieved on complete code snippets that are syntactically correct, which is often not available for a code completion system. I wanted to understand how the setup you study is different from this one.

Thanks!

opened by akhileshgotmare 2
Embeddings
Hello Uri Alon, after reading through the paper, I have a question.

Are the embeddings used for encoding AST paths learned during training, or are they pre-trained matrices?
opened by MadRajib 2
The (sub)tokenizer logic used to produce the seq2seq dataset?

Hi Uri Alon,

Thanks for the impressive work, and especially thank you for releasing the data, which is kind of hard to collect for various previous publications because there are so many variants and versions.

I am interested in the Java seq2seq dataset you presented, and I am wondering what tokenization logic is used? Is it BPE or some Java-specific heuristics? Thank you!

opened by frankxu2004 2
Train own models?

Hi,

Thanks for releasing the amazing repo! I wonder is it possible to train the model by ourselves instead of querying it? I want to make some small modifications on top of the transformer aggregation.

Thanks!

opened by ywen666 1
Bump gson from 2.8.5 to 2.8.9 in /JavaExtractor/JPredict
Bumps gson from 2.8.5 to 2.8.9.

Release notes

Sourced from gson's releases.

Gson 2.8.9

Make OSGi bundle's dependency on sun.misc optional (#1993).

Deprecate Gson.excluder() exposing internal Excluder class (#1986).

Prevent Java deserialization of internal classes (#1991).

Improve number strategy implementation (#1987).

Fix LongSerializationPolicy null handling being inconsistent with Gson (#1990).

Support arbitrary Number implementation for Object and Number deserialization (#1290).

Bump proguard-maven-plugin from 2.4.0 to 2.5.1 (#1980).

Don't exclude static local classes (#1969).

Fix RuntimeTypeAdapterFactory depending on internal Streams class (#1959).

Improve Maven build (#1964).

Make dependency on java.sql optional (#1707).

Gson 2.8.8

Fixed issue with recursive types (#1390).

Better behaviour with Java 9+ and Unsafe if there is a security manager (#1712).

EnumTypeAdapter now works better when ProGuard has obfuscated enum fields (#1495).

Changelog

Sourced from gson's changelog.

Version 2.8.7

Fixed ISO8601UtilsTest failing on systems with UTC+X.

Improved javadoc for JsonStreamParser.

Updated proguard.cfg (#1693).

Fixed IllegalStateException in JsonTreeWriter (#1592).

Added JsonArray.isEmpty() (#1640).

Added new test cases (#1638).

Fixed OSGi metadata generation to work on JavaSE < 9 (#1603).

Version 2.8.6

2019-10-04 GitHub Diff

Added static methods JsonParser.parseString and JsonParser.parseReader and deprecated instance method JsonParser.parse

Java 9 module-info support

Commits

6a368d8 [maven-release-plugin] prepare release gson-parent-2.8.9

ba96d53 Fix missing bounds checks for JsonTreeReader.getPath() (#2001)

ca1df7f #1981: Optional OSGi bundle's dependency on sun.misc package (#1993)

c54caf3 Deprecate Gson.excluder() exposing internal Excluder class (#1986)

e6fae59 Prevent Java deserialization of internal classes (#1991)

bda2e3d Improve number strategy implementation (#1987)

cd748df Fix LongSerializationPolicy null handling being inconsistent with Gson (#1990)

fe30b85 Support arbitrary Number implementation for Object and Number deserialization...

1cc1627 Fix incorrect feature request template label (#1982)

7b9a283 Bump bnd-maven-plugin from 5.3.0 to 6.0.0 (#1985)

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
How to distinguish between terminal nodes and non terminal nodes

Dear authors, thanks for your outstanding work. I have a question for you. When you want to predict a node, you don't know in advance whether it is a terminal node or a non-terminal node, and these two kinds of nodes are predicted in different ways (described in the article as two methods: Predicting AST Nodes and Predicting Subtokens). So, how do you distinguish between these two kinds of nodes in the code implementation, in order to use the right prediction method?

opened by liu1234567yi 1
Reason for not releasing the source code?

Hello authors,

I am very interested in this project and would like to have the actual implementation of the model to see if I can somehow improve it. However, as can be seen from issues #11 and #8, it seems that the implementation will not be publicly released any time soon, although you promised to do so in #4. Can I ask why this is the case? If it is because of a shortage of human resources, I am willing to help you out.

Thank you for creating this wonderful project and I look forward to hearing from you soon.

Best Regards, Son.

opened by xuansontrinh 5
Missing EOS?
Hi, I'm looking through your training data (the JSON representation). I found an instance where tokens are not followed by an EOS node.

"targets": [ "Nm", "idle,sources", "Cal", "Nm", "get", "EOS", "Nm", "i", "EOS", "EOS" ], "target_seq": "idleSources.get(i)",

Could you please elaborate why there is no EOS after "idle,sources" in this case?
opened by Zadagu 12
Bump commons-io from 1.3.2 to 2.7 in /JavaExtractor/JPredict
Bumps commons-io from 1.3.2 to 2.7.

dependencies
opened by dependabot[bot] 0
Bump junit from 4.12 to 4.13.1 in /JavaExtractor/JPredict
Bumps junit from 4.12 to 4.13.1.

Release notes

Sourced from junit's releases.

JUnit 4.13.1

Please refer to the release notes for details.

JUnit 4.13

Please refer to the release notes for details.

JUnit 4.13 RC 2

Please refer to the release notes for details.

JUnit 4.13 RC 1

Please refer to the release notes for details.

JUnit 4.13 Beta 3

Please refer to the release notes for details.

JUnit 4.13 Beta 2

Please refer to the release notes for details.

JUnit 4.13 Beta 1

Please refer to the release notes for details.

Commits

1b683f4 [maven-release-plugin] prepare release r4.13.1

ce6ce3a Draft 4.13.1 release notes

c29dd82 Change version to 4.13.1-SNAPSHOT

1d17486 Add a link to assertThrows in exception testing

543905d Use separate line for annotation in Javadoc

510e906 Add sub headlines to class Javadoc

610155b Merge pull request from GHSA-269g-pwp5-87pp

b6cfd1e Explicitly wrap float parameter for consistency (#1671)

a5d205c Fix GitHub link in FAQ (#1672)

3a5c6b4 Deprecated since jdk9 replacing constructor instance of Double and Float (#1660)

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
