NaturalCC
NaturalCC is a sequence modeling toolkit that allows researchers and developers to train custom models for many software engineering tasks, e.g., code summarization, code retrieval, code completion, code clone detection, and type inference. Our vision is to bridge the gap between programming languages and natural languages through machine learning techniques.
⭐ Features
- A collection of code corpora with data preprocessing
- Performance benchmarks
- Mixed precision training
  - NVIDIA Apex
  - Automatic Mixed Precision (AMP)
- Multi-GPU training
- Better logging output
- Various implementations:
  - TensorFlow-style gradient clipping
  - optimizers and learning-rate schedulers
  - baseline models
  - binary data formats
🚀 Installation
Requirements
- PyTorch version >= 1.6.0
- Python version >= 3.6
- GCC/G++ > 5.0
- For training new models, you'll also need an NVIDIA GPU and NCCL
- (optional) For faster training, install NVIDIA's apex library
1. Install prerequisite libraries
```bash
git clone https://github.com/xcodemind/naturalcc && cd naturalcc
pip install -r requirements.txt
```
Once you have installed the prerequisite libraries, you can check them via `python -m env_test`.
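If `env_test` is unavailable in your setup, a quick manual check of the key prerequisites looks like this (a minimal sketch; the bundled `env_test` may verify more):

```python
# Minimal manual check of the prerequisites listed above (a sketch;
# python -m env_test is the toolkit's own, more complete check).
import subprocess
import sys

import torch

assert sys.version_info >= (3, 6), "Python >= 3.6 required"
major, minor = (int(v) for v in torch.__version__.split("+")[0].split(".")[:2])
assert (major, minor) >= (1, 6), "PyTorch >= 1.6.0 required"
print("CUDA available:", torch.cuda.is_available())  # needed for training
print(subprocess.check_output(["gcc", "--version"]).decode().splitlines()[0])
```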
2. Build or install NaturalCC
Export your NaturalCC cache directory (data and models will be saved in this directory) as an environment variable in your shell profile (`~/.bashrc` or `~/.zshrc`):

```bash
echo "export NCC=/data/ncc_data" >> ~/.bashrc
```
Note: PyCharm does not pick up shell environment variables, so we recommend also registering your NCC variable in `ncc/__init__.py`, for example:
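A minimal way to do this (the exact variable handling in `ncc/__init__.py` may differ; this is a sketch):

```python
# At the top of ncc/__init__.py: fall back to a hard-coded cache directory
# when the NCC environment variable is not inherited (e.g., under PyCharm).
import os

os.environ.setdefault("NCC", "/data/ncc_data")
```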
Compile the Cython files to accelerate the programs and register NaturalCC in your pip list:

```bash
# compile for debug
# python setup.py build_ext --inplace
# install
pip install --editable ./
```
3. Half precision computation (optional)
NaturalCC supports half precision training.
- If your PyTorch version is below 1.6.0 and `nvcc -V` is runnable, please install apex.
- Otherwise, use Automatic Mixed Precision (AMP), which is available now: set `amp: 1` in your YAML file (an example), as sketched below.
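For intuition, the following is a minimal sketch of what `amp: 1` enables, written in plain PyTorch (>= 1.6) with a placeholder model rather than NaturalCC internals:

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid fp16 underflow

for step in range(100):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass runs in mixed precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)            # unscales gradients, then steps
    scaler.update()
```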
4. Install GCC/G++ with conda (if you do not have permission)
Since NCC is built with Cython, your GCC/G++ version should be greater than 4.9. If you have root permission, update GCC/G++; otherwise, install GCC/G++ with conda.
```bash
# install GCC/G++ with conda
conda install -c anaconda gxx_linux-64
conda install -c conda-forge gcc_linux-64
cd ~/anaconda/envs/XXX/bin
ln -s x86_64-conda_cos6-linux-gnu-gcc gcc
ln -s x86_64-conda_cos6-linux-gnu-g++ g++
# check the versions
conda deactivate
conda activate XXX
gcc -v && g++ -v
```
📚 Dataset
Currently, we have processed the following datasets:
- Python (Wan et al.)
- CodeSearchNet (Husain et al.)
- CodeXGlue (Feng et al.)
- Py150 (official processed) (raw)
- OpenCL (Grewe et al.)
- Java (Hu et al.)
- Stack Overflow
- DeepCS (Gu et al.)
- AVATAR (Ahmad et al.)
- StackOverflow (Iyer et al.)
🤖 Implementations
- Code retrieval (search)
- Code completion
- Heterogeneous mapping
- Code summarization
  - Naive Copy
  - CodeNN
  - DeepCom
  - Seq2Seq + Attention
  - Nary-/ChildSum-Tree2Seq
  - Code2Seq
  - Transformer + (Sinusoidal/Relative/Learned Position Encoding) (see the sketch below)
  - CodeBERT
  - GraphCodeBERT
  - PLBART
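As one concrete flavor of the Transformer variants above, the sinusoidal position encoding can be written in a few lines of PyTorch (a generic sketch of Vaswani et al.'s formulation, not NaturalCC's exact module):

```python
import math

import torch

def sinusoidal_position_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build the (max_len, d_model) sinusoidal table from Vaswani et al. (2017)."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe  # added to token embeddings before the first encoder layer
```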
📋 Experiments
Code Summarization
Dataset: Python (Wan et al.)
| Model | BLEU-4 | METEOR | ROUGE-L | Cost | Logs |
|---|---|---|---|---|---|
| Seq2Seq+Attn | 25.57 | 14.40 | 39.41 | 0.09s/b | click here |
| Tree2Seq+Attn | 23.35 | 12.59 | 36.49 | 0.48s/b | click here |
| Transformer | 30.64 | 17.65 | 44.59 | 0.26s/b | click here |
| Transformer+RPE | 31.57 | 17.74 | 45.18 | 0.27s/b | click here |
| PLBART | 32.71 | 18.13 | 46.05 | 0.80s/b | TBC |
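The numbers above come from the toolkit's evaluation scripts; to sanity-check BLEU-4 on your own predictions, NLTK's `corpus_bleu` gives a quick approximation (tokenization details can shift the score slightly):

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

# one list of reference token lists per hypothesis
references = [[["return", "the", "sum", "of", "two", "numbers"]]]
hypotheses = [["returns", "the", "sum", "of", "two", "numbers"]]
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method4)
print("BLEU-4: {:.2f}".format(100 * bleu4))
```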
Code Retrieval
Dataset: CodeSearchNet (Husain et al.)
| MRR | Go | Java | JS | PHP | Python | Ruby | Cost | Logs |
|---|---|---|---|---|---|---|---|---|
| NBOW | 66.59 | 59.92 | 47.15 | 54.75 | 63.33 | 42.86 | 0.16s/b | click here |
| Conv1d | 70.87 | 60.49 | 38.81 | 61.92 | 67.29 | 36.53 | 0.30s/b | click here |
| BiRNN | 65.80 | 48.60 | 23.23 | 51.36 | 48.28 | 19.35 | 0.74s/b | click here |
| SelfAttn | 78.45 | 66.55 | 50.38 | 65.78 | 79.09 | 47.96 | 0.25s/b | click here |
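For reference, MRR over a batch of paired code/query embeddings can be computed as below (a self-contained sketch with in-batch negatives, not the toolkit's evaluation script):

```python
import numpy as np

def batch_mrr(code_emb: np.ndarray, query_emb: np.ndarray) -> float:
    """MRR where the correct code for query i is code i (in-batch negatives)."""
    scores = query_emb @ code_emb.T                # (N, N) similarity matrix
    gold = np.diag(scores)                         # score of each true pair
    ranks = (scores >= gold[:, None]).sum(axis=1)  # 1 = best possible rank
    return float((1.0 / ranks).mean())

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 128))
codes = queries + 0.1 * rng.normal(size=(8, 128))  # noisy positives
print(batch_mrr(codes, queries))                   # close to 1.0
```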
Code Completion
Dataset: Py150 (official processed) (raw)
| MRR | Attr | Num | Name | Param | Tokens | Cost | Logs |
|---|---|---|---|---|---|---|---|
| LSTM | 51.67 | 47.45 | 46.52 | 66.06 | 73.73 | 0.31s/b | click here |
| GPT-2 | 70.37 | 62.20 | 63.84 | 73.54 | 82.17 | 0.43s/b | click here |
| TravTrans | 72.08 | 68.55 | 76.33 | 71.08 | 83.17 | 0.43s/b | click here |
Type Inference
Dataset: CodeSearchNet-Java (Husain et al.)
| Model | Acc@1 (All types) | Acc@5 (All types) | Acc@1 (Any types) | Acc@5 (Any types) | Cost | Logs |
|---|---|---|---|---|---|---|
| DeepTyper | 0.52 | 0.67 | 0.43 | 0.67 | 0.42s/b | TBC |
| Transformer | 0.32 | 0.64 | 0.37 | 0.75 | 0.85s/b | TBC |
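Acc@k here is the fraction of positions whose ground-truth type appears among the top-k predictions; a minimal sketch (not the toolkit's evaluation script):

```python
import numpy as np

def acc_at_k(logits: np.ndarray, targets: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest logits."""
    topk = np.argsort(-logits, axis=1)[:, :k]  # indices of top-k classes
    return float((topk == targets[:, None]).any(axis=1).mean())
```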
Heterogeneous Mapping
Dataset: OpenCL (Grewe et al.)
| Accuracy | AMD | NVIDIA |
|---|---|---|
| Static mapping | 58.82 | 56.91 |
| Decision tree | 70.29 | 74.56 |
| Inst2vec | 82.79 | 81.76 |
| DeepTune | 83.24 | 80.15 |
🏫 Examples & Tutorials
All the commands here should be executed from the root of the project folder (the path of your `naturalcc` checkout). For example, in our environment we stay at `/data/wanyao/Dropbox/ghproj-v100/naturalcc`. We also provide more detailed READMEs to start your NaturalCC tutorial.

Step 1: Download and process a dataset from `datasets`, and follow the instructions from its README.md file.

```bash
# ref: dataset/python_wan/README.md
# download dataset
bash dataset/python_wan/download.sh
# clean data
python -m dataset.python_wan.clean
# cast data attributes into different files
python -m dataset.python_wan.attributes_cast

# ref: dataset/python_wan/summarization/README.md
# save code tokens and docstring tokens into MMAP format
python -m dataset.python_wan.summarization.preprocess
```
Step 2 (optional): Register your self-defined models

- If you want to create a new model, please add it at `ncc/models` and `ncc/modules`.
- If your training policy is more complex than what we provide, update your criterions and training procedure at `ncc/criterions` and `ncc/trainers`, respectively.

Do not forget to register your self-defined modules in the corresponding `ncc/XX/__init__.py`. A registration sketch follows below.
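Since NaturalCC follows a Fairseq-style design, registration typically looks like the sketch below. The decorator and base-class names are assumptions for illustration; check `ncc/models` for the actual registry.

```python
# Hypothetical sketch of Step 2, assuming a Fairseq-style model registry.
# register_model / NccEncoderDecoderModel are illustrative names, not verified API.
from ncc.models import register_model                     # assumed registry location
from ncc.models.ncc_model import NccEncoderDecoderModel   # assumed base class

@register_model('my_summarizer')
class MySummarizer(NccEncoderDecoderModel):
    @classmethod
    def build_model(cls, args, config, task):
        # assemble the encoder/decoder from building blocks in ncc/modules
        ...
```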
Step 3: Training and inference.
- Select a task and a model from the task list and follow the instructions in its README.md to start your learning.
```bash
# ref: run/summarization/transformer/README.md
# train
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python -m run.summarization.transformer.train -f config/python_wan/python > run/summarization/transformer/config/python_wan/python.log 2>&1 &
# inference
CUDA_VISIBLE_DEVICES=0 python -m run.summarization.transformer.eval -f config/python_wan/python -o run/summarization/transformer/config/python_wan/python.txt
```
❓ FAQ
Please feel free to contact us if you have any trouble.
😘 License and Acknowledgement
NaturalCC is MIT-licensed. The license applies to the pre-trained models as well. This project is also highly inspired by Fairseq and AllenNLP.
🔗 Related Links
NaturalCC-demo
About us: XCodeMind
❤️ Citation
Please cite as:
The paper is currently under review.