Semantic Code Search
Semantic code search implementation using the TensorFlow framework and source code data from the CodeSearchNet project. The model training pipeline is based on the implementation in the CodeSearchNet repository. The Python, Java, Go, PHP, JavaScript, and Ruby programming languages are supported.
Model Description
A BPE tokenizer is used to encode both code strings and query strings (docstrings serve as queries during training). Code strings are encoded and padded to a length of 200 tokens; query strings to a length of 30 tokens. Both the code and the query embedding sizes are 256. Token embeddings are masked, and an unweighted mean over them yields a single 256-dimensional vector for each code string and query string. Cosine similarity is then computed between the code and query representations. Further details can be found in the WandB run.
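The following is a minimal sketch of that encoder in TensorFlow, not the repository's actual code; the vocabulary size and the use of 0 as the padding id are assumptions:

```python
import tensorflow as tf

# Sizes from the description above; the BPE vocabulary size is an
# assumption, since it is not stated here.
VOCAB_SIZE = 10000
EMBED_DIM = 256
CODE_LEN, QUERY_LEN = 200, 30

embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)

def encode(token_ids):
    """Masked unweighted mean over token embeddings -> one 256-d vector."""
    emb = embedding(token_ids)                               # (batch, len, 256)
    mask = tf.cast(tf.not_equal(token_ids, 0), tf.float32)   # assumes pad id 0
    mask = tf.expand_dims(mask, -1)                          # (batch, len, 1)
    summed = tf.reduce_sum(emb * mask, axis=1)
    counts = tf.maximum(tf.reduce_sum(mask, axis=1), 1.0)    # avoid divide-by-zero
    return summed / counts                                   # (batch, 256)

# Cosine similarity between every query vector and every code vector.
code_ids = tf.random.uniform((8, CODE_LEN), 0, VOCAB_SIZE, dtype=tf.int32)
query_ids = tf.random.uniform((8, QUERY_LEN), 0, VOCAB_SIZE, dtype=tf.int32)
code_vecs = tf.nn.l2_normalize(encode(code_ids), axis=-1)
query_vecs = tf.nn.l2_normalize(encode(query_ids), axis=-1)
similarities = tf.matmul(query_vecs, code_vecs, transpose_b=True)  # (8, 8)
```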
Model Structure
- Deep Structured Semantic Model
- Wide & Deep Learning
Project Structure
A Python package with scripts to prepare the data, train/test the model, and predict.
Data
We use the data from the CodeSearchNet project. The downloaded data is around 20 GB.
Training the model
To install the required dependencies:
pip3 install -r requirements.txt
Preparing data
The data preparation step is separated from the training step because of its computing time and memory consumption.
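As a minimal sketch of what this step produces per example (the pad id of 0 is an assumption), each BPE token sequence is padded or truncated to the fixed lengths described above:

```python
CODE_LEN, QUERY_LEN, PAD_ID = 200, 30, 0

def pad_or_truncate(token_ids, max_len, pad_id=PAD_ID):
    """Pad (or truncate) a list of BPE token ids to a fixed length."""
    return (token_ids + [pad_id] * max_len)[:max_len]

code_ids = pad_or_truncate([17, 4, 256], CODE_LEN)   # -> length 200
query_ids = pad_or_truncate([9, 2], QUERY_LEN)       # -> length 30
```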
Training and evaluation
Start the training:
python3 -m train --model neuralbow_v1
The model is trained separately for each language. The evaluation metric is MRR on the validation and test sets; the final prediction output, however, is evaluated by GitHub using nDCG.
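As a rough illustration of the MRR metric (assuming the usual in-batch setup, where the i-th query's correct snippet is the i-th code vector):

```python
import numpy as np

def mrr(similarities):
    """MRR for a batch where the i-th query's correct snippet is the
    i-th code vector (in-batch distractors)."""
    correct = np.diag(similarities)                      # score of each true pair
    # Rank = how many snippets score at least as high as the true one.
    ranks = np.sum(similarities >= correct[:, None], axis=1)
    return float(np.mean(1.0 / ranks))

scores = np.array([[0.9, 0.1], [0.3, 0.7]])  # toy 2-query batch
print(mrr(scores))                           # 1.0: both true pairs rank first
```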
Model Downloads
Query the trained model
Predict
python3 predict.py -r wuchen/SemanticCodeSearch/1fpfl6dq
Online Semantic Code Search website
- Requirements: Flask
- Import the source code files
- Running the dev server
Model Server
python3 -m server.main
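The contents of `server.main` are not shown here; as a rough sketch, a Flask endpoint serving search results might look like the following (the route, the helper function, and the response shape are all assumptions):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def rank_code_snippets(query: str):
    """Hypothetical helper: embed the query with the trained encoder and
    rank pre-computed code vectors by cosine similarity."""
    return [{"code": "...", "score": 0.0}]  # placeholder result

@app.route("/search")
def search():
    query = request.args.get("q", "")
    return jsonify(rank_code_snippets(query)[:10])

if __name__ == "__main__":
    app.run(debug=True)
```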
Web Front-End
cd react-code-search
npm install
npm start