Type4Py: Deep Similarity Learning-Based Type Inference for Python

Software Analytics Lab

Last update: Dec 15, 2022

Overview

This repository contains the implementation of Type4Py and instructions for reproducing the results of the paper.

Dataset
Installation Guide
Usage Guide
Converting Type4Py to ONNX
VSCode Extension
Type4Py Server
Citing Type4Py

Dataset

For Type4Py, we use the ManyTypes4Py dataset. You can download the latest version of the dataset here. Also, note that the dataset is already de-duplicated.

Code De-duplication

If you want to use your own dataset, it is essential to de-duplicate the dataset by using a tool like CD4Py.

Installation Guide

Requirements

Linux-based OS
Python 3.5 or newer
An NVIDIA GPU with CUDA support

Quick Install

git clone https://github.com/saltudelft/type4py.git && cd type4py
pip install .

Usage Guide

Follow the steps below to train and evaluate the Type4Py model.

1. Extraction

NOTE: Skip this step if you're using the ManyTypes4Py dataset.

$ type4py extract --c $DATA_PATH --o $OUTPUT_DIR --d $DUP_FILES --w $CORES

Description:

$DATA_PATH: The path to the Python corpus or dataset.
$OUTPUT_DIR: The path to store processed projects.
$DUP_FILES: The path to the duplicate files, i.e., the *.jsonl.gz file produced by CD4Py. [Optional]
$CORES: Number of CPU cores to use for processing projects.

2. Preprocessing

$ type4py preprocess --o $OUTPUT_DIR --l $LIMIT

Description:

$OUTPUT_DIR: The path that was used in the first step to store processed projects. For the MT4Py dataset, use the directory in which the dataset is extracted.
$LIMIT: The number of projects to be processed. [Optional]

3. Vectorizing

$ type4py vectorize --o $OUTPUT_DIR

Description:

$OUTPUT_DIR: The path that was used in the previous step to store processed projects.

4. Learning

$ type4py learn --o $OUTPUT_DIR --c --p $PARAM_FILE

Description:

$OUTPUT_DIR: The path that was used in the previous step to store processed projects.
--c: Trains the complete model. Use type4py learn -h to see other configurations.
--p $PARAM_FILE: The path to user-provided hyper-parameters for the model. See this file as an example. [Optional]

5. Testing

$ type4py predict --o $OUTPUT_DIR --c

Description:

$OUTPUT_DIR: The path that was used in the first step to store processed projects.
--c: Predicts using the complete model. Use type4py predict -h to see other configurations.

6. Evaluating

$ type4py eval --o $OUTPUT_DIR --t c --tp 10

Description:

$OUTPUT_DIR: The path that was used in the first step to store processed projects.
--t: Evaluates the model considering different prediction tasks. E.g., --t c considers all prediction tasks, i.e., parameters, return, and variables. [Default: c]
--tp 10: Considers Top-10 predictions for evaluation. For this argument, you can choose a positive integer between 1 and 10. [Default: 10]
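
To make the Top-n idea concrete, here is a minimal sketch of how a Top-n match rate and a mean reciprocal rank (MRR) could be computed from ranked predictions. It assumes plain exact string matching between predicted and ground-truth types; the function names and the data layout are illustrative and are not taken from Type4Py's eval module, which may apply different matching criteria.

```python
from typing import Dict, Sequence


def top_n_match(predictions: Sequence[str], truth: str, n: int = 10) -> bool:
    """True if the ground-truth type is among the first n ranked predictions."""
    return truth in predictions[:n]


def reciprocal_rank(predictions: Sequence[str], truth: str, n: int = 10) -> float:
    """1/rank of the first correct prediction within the top n, else 0.0."""
    for rank, pred in enumerate(predictions[:n], start=1):
        if pred == truth:
            return 1.0 / rank
    return 0.0


def evaluate(all_predictions: Sequence[Sequence[str]],
             all_truths: Sequence[str],
             n: int = 10) -> Dict[str, float]:
    """Aggregate Top-n match rate and MRR over all samples (hypothetical helper)."""
    if len(all_predictions) != len(all_truths):
        raise ValueError("predictions and truths must have the same length")
    total = len(all_truths)
    if total == 0:
        return {"top_n_acc": 0.0, "mrr": 0.0}
    hits = sum(top_n_match(p, t, n) for p, t in zip(all_predictions, all_truths))
    rr = sum(reciprocal_rank(p, t, n) for p, t in zip(all_predictions, all_truths))
    return {"top_n_acc": hits / total, "mrr": rr / total}
```

With n set to 1, the match rate only credits the highest-ranked prediction, which is why a smaller --tp value gives a stricter evaluation.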

Use type4py eval -h to see other options.
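
Steps 2 to 6 can also be chained from a small driver script. The sketch below only assembles the commands documented above, using the same flags; the helper names build_type4py_commands and run_pipeline are hypothetical and not part of Type4Py. By default it performs a dry run that prints the commands without executing them.

```python
import shlex
import subprocess
from typing import List, Optional


def build_type4py_commands(output_dir: str,
                           limit: Optional[int] = None,
                           param_file: Optional[str] = None,
                           top_n: int = 10) -> List[List[str]]:
    """Return the argv lists for preprocess, vectorize, learn, predict, eval."""
    if not 1 <= top_n <= 10:
        raise ValueError("top_n must be between 1 and 10")

    preprocess = ["type4py", "preprocess", "--o", output_dir]
    if limit is not None:  # --l is optional
        preprocess += ["--l", str(limit)]

    learn = ["type4py", "learn", "--o", output_dir, "--c"]
    if param_file is not None:  # --p is optional
        learn += ["--p", param_file]

    return [
        preprocess,
        ["type4py", "vectorize", "--o", output_dir],
        learn,
        ["type4py", "predict", "--o", output_dir, "--c"],
        ["type4py", "eval", "--o", output_dir, "--t", "c", "--tp", str(top_n)],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$ " + " ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_pipeline(build_type4py_commands("/path/to/output_dir"))
```

Passing dry_run=False runs the steps in order and requires type4py to be installed and on the PATH.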

Converting Type4Py to ONNX

To convert the pre-trained Type4Py model to the ONNX format, use the following command:

$ type4py to_onnx --o $OUTPUT_DIR

Description:

$OUTPUT_DIR: The path that was used in the usage section to store processed projects and the model.

VSCode Extension

Type4Py can be used in VSCode through an extension that provides ML-based type auto-completion for Python files. The extension can be installed from the VS Marketplace here.

Type4Py Server

The Type4Py server is deployed on our own infrastructure; it exposes a public API and powers the VSCode extension. If you would like to deploy the Type4Py server on your own machine, you can adapt the server code here. For help with deployment, using the pre-trained Type4Py model, or training your own model, please feel free to reach out to us by creating an issue.

Citing Type4Py

@article{mir2021type4py,
  title={Type4Py: Deep Similarity Learning-Based Type Inference for Python},
  author={Mir, Amir M and Latoskinas, Evaldas and Proksch, Sebastian and Gousios, Georgios},
  journal={arXiv preprint arXiv:2101.04470},
  year={2021}
}

Comments

Crash when trying to infer single file with freshly trained model using ManyTypes4Py
Hello, thank you for creating and providing this great project! I plan to use this project for my bachelor thesis. Therefore, I am mainly interested in the inference functionality provided with infer.py on branch server (branch infer seems to be outdated). I am aware of the VS Code extension and the public JSON API. I, however, prefer to use this project locally.

Since infer.py takes a pre-trained model as a program argument, I followed all the steps in the README to train such a model. Unfortunately, the script crashes with the following message (excerpt):

onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: tok for the following indices index: 0 Got: 7 Expected: 1 Please fix either the inputs or the model.

Below you can find a link to a Google Colab notebook with all the steps from start (downloading the ManyTypes4Py dataset, pip-installing type4py, preprocessing) to finish (training a model, trying to infer the types of a single file) and the corresponding output from when I ran it the last time (including the full error backtrace on the bottom):

https://colab.research.google.com/drive/1kRIffMlgGCeW55wXelksGrXfSd0WjhKQ?usp=sharing

It should be relatively self-explanatory. Evidently, I use a fork of this project and not the project itself. The differences are minor though: In learn.py, I just re-uncommented the .to(DEVICE)-calls (c42144d) as otherwise it would lead to a crash in the notebook (vectors are on different devices). The remaining changes don't affect Python files and are not relevant to this issue. Further, I am using venv, although I doubt this has any negative influence on the execution of this project.

My question is, how can I successfully use infer.py? How can I obtain a proper compatible model for it? Are any of those steps in the linked notebook incorrect?
opened by fmease 4
Error in variable initialisation
When using the preprocess command with only the -o argument, the code crashes with the following error:

UnboundLocalError: local variable 'train_files_vars' referenced before assignment

This is because in the following extract

https://github.com/saltudelft/type4py/blob/93828c3d1a3460dc29cace398ca0dcb10ea14daf/type4py/preprocess.py#L302-L319

the train_files_vars variable is only initialised in the if branch
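
The failure mode can be reproduced in isolation. The sketch below uses hypothetical function names and data and is not the actual preprocess.py code; it only illustrates why a variable assigned in one branch raises UnboundLocalError on the other path, and how initialising it before the branch avoids that.

```python
from typing import List, Optional


def count_var_files_buggy(split_files: Optional[List[str]]) -> int:
    """Mirrors the bug pattern: the variable is only bound in the if branch."""
    if split_files is not None:
        train_files_vars = [f for f in split_files if f.endswith("_vars.csv")]
    # Raises UnboundLocalError when split_files is None.
    return len(train_files_vars)


def count_var_files_fixed(split_files: Optional[List[str]]) -> int:
    """Same logic, with a default bound before the branch."""
    train_files_vars: List[str] = []
    if split_files is not None:
        train_files_vars = [f for f in split_files if f.endswith("_vars.csv")]
    return len(train_files_vars)
```

Whether an empty default is the right behaviour for the real module, or whether the else branch should compute its own split, depends on the surrounding code.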
bug
opened by gousiosg 2
Using Docker images for both production and development environments
In #8, I published Docker images for using Type4Py locally on users' machines. This PR also creates Docker images to deploy Type4Py in both production and development environments. Specifically, it makes the following changes:

The Type4Py server detects whether the Docker image is running in local mode or production mode based on given ENV vars.

Creates a separate Docker file to build images that allows performing the model inference on GPUs.

Adds GH workflows to build Docker images for production and dev. environments and also supporting CPU/GPU.

Add a unit test to test the local model before publishing its Docker image in GH Action.

A bash script to test all the Docker images for production, dev, local environments.

Use one config file for both production and dev. envs.

enhancement
opened by mir-am 0
Publishing Docker images for Type4Py to run the model locally
This PR makes the following changes:

A new --rvth CLI argument that applies VTHs with a probability (default 0.5) [ONLY FOR PRODUCTION]

Improves memory consumption when loading datasets

Add reduce command, which uses PCA to reduce the dimension of type clusters and the size of the whole model.

Disable telemetry data in the server module when running the Docker version.

For inference, drops dependency on PyTorch to reduce the size of the Docker image

Adding Docker file and config file for building Docker images of Type4Py

enhancement
opened by mir-am 0
New improvements and the Type4Py server
Type checking the prediction of Type4Py (experimental)

Converting the Type4Py's PyTorch model to the ONNX format

Adding a web server to query the model and support the VSCode extension.

Computing MRR for evaluation

enhancement
opened by mir-am 0
Type4Py supports prediction for the type of variables
Extend preprocess and vectorize modules for variables prediction

Extend data_loaders module for loading variables' tensors

Separating ubiquitous and common types when evaluating

Considering Visible Type Hints in preprocess module

Evaluating different tasks in the eval module i.e., parameters, return, and variable

Adding the Type4Py model with different configurations

Improving the performance of data loaders

In eval, consider MRR instead of weighted metrics

predict now creates a JSON file for predictions

Improve the memory consumption of predict

Improvements to predict: better type aliasing & reducing the depth of parametric types

Determining no. of batches when predicting different from the training phase

enhancement
opened by mir-am 0
Migration to the LibSA4Py pipeline & using ManyTypes4Py Dataset
Minor fix to the preprocess module for training the ManyTypes4Py Dataset

Merging the JSON file of processed projects by the LibSA4Py pipeline in the preprocess module.

Using the LibSA4Py pipeline when running the extract CLI command.

Removed legacy pipeline and unused code

enhancement
opened by mir-am 0
Add logger to the modules
Add Python logging instead of print in the modules

Write logs of each module to its own log file. E.g. predict.py, learn.py

Improve log messages in the modules

enhancement
opened by mir-am 0
Return fixed amount of type predictions

I experimented with the type prediction (http://localhost:5001/api/predict?tc=0) using the provided docker image. I noticed that depending on the analysed source code, I get different amounts of type predictions per parameter/return/variable type. Is it possible to retrieve a fixed number of predicted types? For example, I would like to retrieve the Top-10 type predictions for each parameter and return type.

Best regards Florian

opened by Wooza 1
Pre-computing triplets in `TripletDataset`

Before this PR, we had to compute triplets on the fly to obtain a training sample, slowing down training and making the GPU wait for data. This PR makes an improvement to TripletDataset by pre-computing triplets before training the model and giving up to 6 times faster training speed. This way, obtaining a training sample to create batches is almost instant. However, with this improvement, each anchor has only one corresponding positive and negative example in every epoch.
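
The idea can be illustrated with a minimal sketch in plain Python. It assumes each sample has a type label and builds one (anchor, positive, negative) index triplet per anchor ahead of training; the function name and sampling strategy are assumptions for illustration and do not reproduce the actual TripletDataset code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def precompute_triplets(labels: Sequence[str], seed: int = 0) -> List[Tuple[int, int, int]]:
    """Build one (anchor, positive, negative) index triplet per eligible anchor."""
    rng = random.Random(seed)

    # Group sample indices by their type label.
    by_label: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    all_labels = list(by_label)

    triplets = []
    for anchor, label in enumerate(labels):
        positives = [i for i in by_label[label] if i != anchor]
        negative_labels = [other for other in all_labels if other != label]
        if not positives or not negative_labels:
            continue  # no valid triplet exists for this anchor
        positive = rng.choice(positives)
        negative = rng.choice(by_label[rng.choice(negative_labels)])
        triplets.append((anchor, positive, negative))
    return triplets
```

A dataset class would then return the sample at a given position by indexing into this list, which is a constant-time lookup. Re-running the function with a different seed at the start of each epoch would be one way to vary the positive and negative examples, at the cost of repeating the pre-computation.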
enhancement

opened by mir-am 0
Significant speed improvements to `preprocess`

The speed of preprocess is significantly improved by using parallel_apply() from pandarallel to process functions' arguments and by reducing the depth of parametric types.

opened by mir-am 0
JSON output file not JSON conformant
The JSON output file is not JSON conformant in two aspects:

Single quotes (') are used instead of double quotes(")

Some words such as None, True or False are not wrapped in any quotes at all

This may affect some simpler JSON parsers; better JSON parsers can handle these minor errors just fine.

'error': None #should be "error": "None"
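
As a workaround on the consumer side, output of this shape can be converted to strict JSON. The sketch below assumes the file content is the repr of a Python dict, as the single quotes and bare None suggest; the helper name and the sample keys are hypothetical. Note that strict JSON represents Python's None, True, and False as null, true, and false rather than as quoted strings.

```python
import ast
import json


def python_literal_to_json(text: str) -> str:
    """Parse a Python-literal string safely and re-serialise it as strict JSON."""
    # ast.literal_eval only accepts literals (dicts, lists, strings, numbers,
    # None, True, False), so it does not execute arbitrary code.
    value = ast.literal_eval(text)
    return json.dumps(value)


if __name__ == "__main__":
    print(python_literal_to_json("{'error': None, 'ok': True}"))
```

On the producing side, writing the result with json.dump instead of str() would avoid the problem at the source.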
opened by CoderUndefined 0
Integrate with pyre incremental and adapt the TypeWriter search strategy
It would be interesting to see how well the TypeWriter algorithm (https://software-lab.org/publications/TypeWriter_arXiv_1912.03768.pdf) for searching type annotation suggestions works against type4py. We might get dramatically better results for two reasons:

type4py's ML model seems to perform quite a bit better

today's pyre incremental is orders of magnitude faster than pyre was when the TypeWriter paper was written, so we may be able to try many more combinations and get correspondingly better results

At one point we'd considered hacking this very quickly as an internal project in my company, but we ran out of time. I think it would be better done open-source anyway because then

it would be easier to try out against external projects

we could publish our results with code if they are interesting enough to be worth a paper

the entire OSS community could benefit

I'm unsure if I can find time to prioritize this in the next 6 months at work but it's a little more likely if I treat it as a side project, which would also open the door to an informal weekend hackathon as a way to kick it off :)

I could do this in a separate repository or inside of type4py. What do you think @mir-am ? And does this sound interesting to you?
opened by stroxler 1
