TensorFlow code for the neural network presented in the paper: "Structural Language Models of Code" (ICML'2020)

Overview

SLM: Structural Language Models of Code

This is an official implementation of the model described in:

"Structural Language Models of Code" [PDF]

To appear in ICML'2020.

An online demo is available at https://AnyCodeGen.org.

This repository currently contains the dataset and the data extractor that we used to create the Java dataset in the paper. The TensorFlow code will be released soon.

Feel free to open a new issue for any question. We always respond quickly.

Table of Contents

Requirements

  • python3
  • TensorFlow 1.13 or newer (install). To check TensorFlow version:

python3 -c 'import tensorflow as tf; print(tf.__version__)'

Download our preprocessed Java-small dataset

This dataset contains ~1.3M examples (1.1GB).

mkdir data
cd data
wget https://codegen-slm.s3.us-east-2.amazonaws.com/data/java-small-preprocessed.tar.gz
tar -xvzf java-small-preprocessed.tar.gz

This will create a data/java-small/ sub-directory, containing the files that hold training, test and validation sets, a dict file for various dataset properties and histograms, and a grammar file that is used during beam search to distinguish between terminal and non-terminal nodes.

Creating and preprocessing a new Java dataset

To create and preprocess a new dataset (for example, to compare SLM to a new model on another dataset):

  • Edit the file preprocess.sh using the instructions there, pointing it to the correct training, validation and test directories.
  • Run the preprocess.sh file:

bash preprocess.sh

Datasets

Java

To download the Java-small as raw *.java files, use:

To download the preprocessed dataset, use:

To download the dataset in a tokenized format that can be used in seq2seq models (for example, with OpenNMT-py), use:

The following JSON files are the files that are created by the JavaExtractor. The preprocessed and the seq2seq files are created from these JSON files:

Every line is a JSON object that contains the following fields: num_targets, num_nodes, targets, is_token, target_child_id, internal_paths, relative_paths, head_paths, head_root_path, head_child_id, linearized_tree, filepath, left_context, right_context, target_seq, line.

C#

The C# dataset that we used in the paper was created using the raw (*.cs files) dataset of Allamanis et al., 2018, (https://aka.ms/iclr18-prog-graphs-dataset) and can be found here: https://aka.ms/iclr18-prog-graphs-dataset.

To extract examples from the C# files, we modified the data extraction code of Brockschmidt et al., 2019: https://github.com/microsoft/graph-based-code-modelling/.

Querying the Trained Model

To query the trained model, use the following API, where MYCODE is the given code snippet, that includes two question marks (??) to mark the "hole" that should be completed:

curl -X POST https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict -d '{"code": "MYCODE"}'

For example:

curl -X POST https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict -d '{"code": "public static Path[] stat2Paths(FileStatus[] stats) {  if (stats == null) return null;  Path[] ret = new Path[stats.length]; for (int i = 0; i < stats.length; ++i) { ret[i] = ??; } return ret; }"}'

Citation

Structural Language Models of Code

@article{alon2019structural,
  title={Structural Language Models of Code},
  author={Alon, Uri and Sadaka, Roy and Levy, Omer and Yahav, Eran},
  journal={arXiv preprint arXiv:1910.00577},
  year={2019}
}
Comments
  • Use code2seq to complete expression generation

    Use code2seq to complete expression generation

    Hello Uri Alon, After reading your paper and seeing that you use code2seq to achieve expression generation, there are some questions that need your help:

    1. How to express the expression as a camel case representation similar to that used by code2seq.
    2. How to modify the code of code2seq to make it possible to generate top5 candidates (acc5 mentioned in your paper) Thank you.
    opened by CplandS 15
  • TensorFlow code

    TensorFlow code

    Dear authors, Thanks for your outstanding work. I am very interested in the implementation details of your model after reading your paper. Can you share your training scripts and model code? Thanks.

    This repository currently contains the dataset and the data extractor that we used to create the Java dataset in the paper. The TensorFlow code will be released soon.

    opened by PikachuHy 3
  • About the baseline of Java any-code completion

    About the baseline of Java any-code completion

    Dear authors, Thanks for your work! Your work is really great since there is no any-code completion tool before. I wonder would you release the baseline you designed to compare with AnyCodeGen on Java language (e.g., the retrained code2seq). I find building and retraining them would cause lots of time and energy. Thanks.

    opened by ShangwenWang 3
  • ASTs for incomplete code

    ASTs for incomplete code

    Hi @urialon!

    I read the paper and have some questions and would be really grateful if you could answer. You mention your model generates likely completions given a piece of code with missing holes. Can your proposed model be used when the hole appears not in between some source code lines but at the end (like an autocomplete model)?

    In your evaluation do you compute ASTs for complete code and then mask subtrees and their corresponding tokens for prediction in your experiments? or are there ways to compute ASTs for incomplete code? A previous work https://arxiv.org/pdf/2005.08025.pdf (section 5) mentions that ASTs can only be retrieved on complete code snippets that are syntactically correct, which is often not available for a code completion system. I wanted to understand how the setup you study is different from this one.

    Thanks!

    opened by akhileshgotmare 2
  • Embeddings

    Embeddings

    Hello Uri Alon, After reading through the paper, I had some queries.

    1. Embeddings that are using for Encoding AST paths are they learned while training or they are pre-learned matrixes?
    opened by MadRajib 2
  • The (sub)tokenizer logic used to produce the seq2seq dataset?

    The (sub)tokenizer logic used to produce the seq2seq dataset?

    Hi Uri Alon,

    Thanks for the impressive work, and especially thank you for releasing the data which is kinda hard to collect for various previous publications as there are so many variants and version.

    I am interested in the Java seq2seq dataset you presented, and I am wondering what tokenization logic is used? Is it BPE or some Java-specific heuristics? Thank you!

    opened by frankxu2004 2
  • Train own models?

    Train own models?

    Hi,

    Thanks for releasing the amazing repo! I wonder is it possible to train the model by ourselves instead of querying it? I want to make some small modifications on top of the transformer aggregation.

    Thanks!

    opened by ywen666 1
  • Bump gson from 2.8.5 to 2.8.9 in /JavaExtractor/JPredict

    Bump gson from 2.8.5 to 2.8.9 in /JavaExtractor/JPredict

    Bumps gson from 2.8.5 to 2.8.9.

    Release notes

    Sourced from gson's releases.

    Gson 2.8.9

    • Make OSGi bundle's dependency on sun.misc optional (#1993).
    • Deprecate Gson.excluder() exposing internal Excluder class (#1986).
    • Prevent Java deserialization of internal classes (#1991).
    • Improve number strategy implementation (#1987).
    • Fix LongSerializationPolicy null handling being inconsistent with Gson (#1990).
    • Support arbitrary Number implementation for Object and Number deserialization (#1290).
    • Bump proguard-maven-plugin from 2.4.0 to 2.5.1 (#1980).
    • Don't exclude static local classes (#1969).
    • Fix RuntimeTypeAdapterFactory depending on internal Streams class (#1959).
    • Improve Maven build (#1964).
    • Make dependency on java.sql optional (#1707).

    Gson 2.8.8

    • Fixed issue with recursive types (#1390).
    • Better behaviour with Java 9+ and Unsafe if there is a security manager (#1712).
    • EnumTypeAdapter now works better when ProGuard has obfuscated enum fields (#1495).
    Changelog

    Sourced from gson's changelog.

    Version 2.8.9

    • Make OSGi bundle's dependency on sun.misc optional (#1993).
    • Deprecate Gson.excluder() exposing internal Excluder class (#1986).
    • Prevent Java deserialization of internal classes (#1991).
    • Improve number strategy implementation (#1987).
    • Fix LongSerializationPolicy null handling being inconsistent with Gson (#1990).
    • Support arbitrary Number implementation for Object and Number deserialization (#1290).
    • Bump proguard-maven-plugin from 2.4.0 to 2.5.1 (#1980).
    • Don't exclude static local classes (#1969).
    • Fix RuntimeTypeAdapterFactory depending on internal Streams class (#1959).
    • Improve Maven build (#1964).
    • Make dependency on java.sql optional (#1707).

    Version 2.8.8

    • Fixed issue with recursive types (#1390).
    • Better behaviour with Java 9+ and Unsafe if there is a security manager (#1712).
    • EnumTypeAdapter now works better when ProGuard has obfuscated enum fields (#1495).

    Version 2.8.7

    • Fixed ISO8601UtilsTest failing on systems with UTC+X.
    • Improved javadoc for JsonStreamParser.
    • Updated proguard.cfg (#1693).
    • Fixed IllegalStateException in JsonTreeWriter (#1592).
    • Added JsonArray.isEmpty() (#1640).
    • Added new test cases (#1638).
    • Fixed OSGi metadata generation to work on JavaSE < 9 (#1603).

    Version 2.8.6

    2019-10-04 GitHub Diff

    • Added static methods JsonParser.parseString and JsonParser.parseReader and deprecated instance method JsonParser.parse
    • Java 9 module-info support
    Commits
    • 6a368d8 [maven-release-plugin] prepare release gson-parent-2.8.9
    • ba96d53 Fix missing bounds checks for JsonTreeReader.getPath() (#2001)
    • ca1df7f #1981: Optional OSGi bundle's dependency on sun.misc package (#1993)
    • c54caf3 Deprecate Gson.excluder() exposing internal Excluder class (#1986)
    • e6fae59 Prevent Java deserialization of internal classes (#1991)
    • bda2e3d Improve number strategy implementation (#1987)
    • cd748df Fix LongSerializationPolicy null handling being inconsistent with Gson (#1990)
    • fe30b85 Support arbitrary Number implementation for Object and Number deserialization...
    • 1cc1627 Fix incorrect feature request template label (#1982)
    • 7b9a283 Bump bnd-maven-plugin from 5.3.0 to 6.0.0 (#1985)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • How to distinguish between terminal nodes and non terminal nodes

    How to distinguish between terminal nodes and non terminal nodes

    Dear authors, Thanks for your outstanding work. I have a question for you. When you want to predict a node, you don't know whether it is a terminal node or a non terminal node in advance,and this two kinds of nodes are predicted in different ways(described in the article as two methods:Predicting AST Nodes and Predicting Subtokens). So, how to distinguish these two nodes in order to use different prediction methods in code implementation?

    opened by liu1234567yi 1
  • Reason for not releasing the source code?

    Reason for not releasing the source code?

    Hello authors,

    I am very interested in this project and would like to have the true implementation of the model to see if I can somehow improve it. However, as can be seen from issue #11 and #8 , it is believed that the implementation will not see the public release date any time soon, although you promised to do it in #4. Can I ask why is it the case? If it is because of human resource shortage, I am willing to help you out.

    Thank you for creating this wonderful project and I look forward to hearing from you soon.

    Best Regards, Son.

    opened by xuansontrinh 5
  • Missing EOS?

    Missing EOS?

    Hi, I'm looking through your training data (the json representation). I found a instance where tokens are not followed by a EOS node.

      "targets": [
        "Nm",
        "idle,sources",
        "Cal",
        "Nm",
        "get",
        "EOS",
        "Nm",
        "i",
        "EOS",
        "EOS"
      ],
      "target_seq": "idleSources.get(i)",
    

    Could you please elaborate why there is no EOS after "idle,sources" in this case?

    opened by Zadagu 12
  • Bump commons-io from 1.3.2 to 2.7 in /JavaExtractor/JPredict

    Bump commons-io from 1.3.2 to 2.7 in /JavaExtractor/JPredict

    Bumps commons-io from 1.3.2 to 2.7.

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump junit from 4.12 to 4.13.1 in /JavaExtractor/JPredict

    Bump junit from 4.12 to 4.13.1 in /JavaExtractor/JPredict

    Bumps junit from 4.12 to 4.13.1.

    Release notes

    Sourced from junit's releases.

    JUnit 4.13.1

    Please refer to the release notes for details.

    JUnit 4.13

    Please refer to the release notes for details.

    JUnit 4.13 RC 2

    Please refer to the release notes for details.

    JUnit 4.13 RC 1

    Please refer to the release notes for details.

    JUnit 4.13 Beta 3

    Please refer to the release notes for details.

    JUnit 4.13 Beta 2

    Please refer to the release notes for details.

    JUnit 4.13 Beta 1

    Please refer to the release notes for details.

    Commits
    • 1b683f4 [maven-release-plugin] prepare release r4.13.1
    • ce6ce3a Draft 4.13.1 release notes
    • c29dd82 Change version to 4.13.1-SNAPSHOT
    • 1d17486 Add a link to assertThrows in exception testing
    • 543905d Use separate line for annotation in Javadoc
    • 510e906 Add sub headlines to class Javadoc
    • 610155b Merge pull request from GHSA-269g-pwp5-87pp
    • b6cfd1e Explicitly wrap float parameter for consistency (#1671)
    • a5d205c Fix GitHub link in FAQ (#1672)
    • 3a5c6b4 Deprecated since jdk9 replacing constructor instance of Double and Float (#1660)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
Official implementation of GraphMask as presented in our paper Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking.

GraphMask This repository contains an implementation of GraphMask, the interpretability technique for graph neural networks presented in our ICLR 2021

Michael Schlichtkrull 29 Sep 2, 2022
Collection of TensorFlow2 implementations of Generative Adversarial Network varieties presented in research papers.

TensorFlow2-GAN Collection of tf2.0 implementations of Generative Adversarial Network varieties presented in research papers. Model architectures will

null 41 Apr 28, 2022
Provided is code that demonstrates the training and evaluation of the work presented in the paper: "On the Detection of Digital Face Manipulation" published in CVPR 2020.

FFD Source Code Provided is code that demonstrates the training and evaluation of the work presented in the paper: "On the Detection of Digital Face M

null 88 Nov 22, 2022
Code for the Population-Based Bandits Algorithm, presented at NeurIPS 2020.

Population-Based Bandits (PB2) Code for the Population-Based Bandits (PB2) Algorithm, from the paper Provably Efficient Online Hyperparameter Optimiza

Jack Parker-Holder 22 Nov 16, 2022
The materials used in the SaxonJS tutorial presented at Declarative Amsterdam, 2021

SaxonJS-Tutorial-2021, version 1.0.4 Last updated on 4 November, 2021. Table of contents Background Prerequisites Starting a web server Running a Java

Saxonica 11 Oct 23, 2022
Projects for AI/ML and IoT integration for games and other presented at re:Invent 2021.

Playground4AWS Projects for AI/ML and IoT integration for games and other presented at re:Invent 2021. Architecture Minecraft and Lamps This project i

Vinicius Senger 5 Nov 30, 2022
Prototypical python implementation of the trust-region algorithm presented in Sequential Linearization Method for Bound-Constrained Mathematical Programs with Complementarity Constraints by Larson, Leyffer, Kirches, and Manns.

Prototypical python implementation of the trust-region algorithm presented in Sequential Linearization Method for Bound-Constrained Mathematical Programs with Complementarity Constraints by Larson, Leyffer, Kirches, and Manns.

null 3 Dec 2, 2022
This is a model made out of Neural Network specifically a Convolutional Neural Network model

This is a model made out of Neural Network specifically a Convolutional Neural Network model. This was done with a pre-built dataset from the tensorflow and keras packages. There are other alternative libraries that can be used for this purpose, one of which is the PyTorch library.

null 9 Oct 18, 2022
This repository contains notebook implementations of the following Neural Process variants: Conditional Neural Processes (CNPs), Neural Processes (NPs), Attentive Neural Processes (ANPs).

The Neural Process Family This repository contains notebook implementations of the following Neural Process variants: Conditional Neural Processes (CN

DeepMind 892 Dec 28, 2022
TensorFlow-based neural network library

Sonnet Documentation | Examples Sonnet is a library built on top of TensorFlow 2 designed to provide simple, composable abstractions for machine learn

DeepMind 9.5k Jan 7, 2023
Use tensorflow to implement a Deep Neural Network for real time lane detection

LaneNet-Lane-Detection Use tensorflow to implement a Deep Neural Network for real time lane detection mainly based on the IEEE IV conference paper "To

MaybeShewill-CV 1.9k Jan 8, 2023
Tensorflow Implementation for "Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition"

Tensorflow Implementation for "Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition" Pre-trained Deep Convo

Ankush Malaker 5 Nov 11, 2022
Code to use Augmented Shapiro Wilks Stopping, as well as code for the paper "Statistically Signifigant Stopping of Neural Network Training"

This codebase is being actively maintained, please create and issue if you have issues using it Basics All data files are included under losses and ea

J K Terry 32 Nov 9, 2021
Unofficial Tensorflow 2 implementation of the paper Implicit Neural Representations with Periodic Activation Functions

Siren: Implicit Neural Representations with Periodic Activation Functions The unofficial Tensorflow 2 implementation of the paper Implicit Neural Repr

Seyma Yucer 2 Jun 27, 2022
Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy.

Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy. Now with tensorflow 1.0 support. Evaluation usa

Marcel R. 349 Aug 6, 2022
TensorFlow Ranking is a library for Learning-to-Rank (LTR) techniques on the TensorFlow platform

TensorFlow Ranking is a library for Learning-to-Rank (LTR) techniques on the TensorFlow platform

null 2.6k Jan 4, 2023
Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!

Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!

Peter Lin 6.5k Jan 4, 2023