Type4Py: Deep Similarity Learning-Based Type Inference for Python

Overview

This repository contains the implementation of Type4Py and instructions for reproducing the results of the paper.

Dataset

For Type4Py, we use the ManyTypes4Py dataset. You can download the latest version of the dataset here. Also, note that the dataset is already de-duplicated.
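As a rough sketch, the archive can be fetched and unpacked as follows; <dataset-url> is a placeholder for the download link above, and the gzipped-tarball format is an assumption:

$ wget <dataset-url> -O ManyTypes4Py.tar.gz            # <dataset-url>: placeholder for the link above
$ mkdir -p ManyTypes4Py && tar -xzf ManyTypes4Py.tar.gz -C ManyTypes4Py

The directory into which the dataset is extracted can then be passed as $OUTPUT_DIR in the preprocessing step below.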

Code De-duplication

If you want to use your own dataset, it is essential to de-duplicate it first, using a tool such as CD4Py.

Installation Guide

Requirements

  • Linux-based OS
  • Python 3.5 or newer
  • An NVIDIA GPU with CUDA support

Quick Install

$ git clone https://github.com/saltudelft/type4py.git && cd type4py
$ pip install .
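
Optionally, to keep dependencies isolated, you can create a virtual environment first and run the two commands above inside it (a standard venv workflow, not specific to Type4Py):

$ python3 -m venv type4py-env && source type4py-env/bin/activate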

Usage Guide

Follow the steps below to train and evaluate the Type4Py model.

1. Extraction

NOTE: Skip this step if you're using the ManyTypes4Py dataset.

$ type4py extract --c $DATA_PATH --o $OUTPUT_DIR --d $DUP_FILES --w $CORES

Description:

  • $DATA_PATH: The path to the Python corpus or dataset.
  • $OUTPUT_DIR: The path to store processed projects.
  • $DUP_FILES: The path to the duplicate files, i.e., the *.jsonl.gz file produced by CD4Py. [Optional]
  • $CORES: Number of CPU cores to use for processing projects.
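
For illustration, a hypothetical invocation could look as follows (all paths and the core count below are placeholders):

$ type4py extract --c ./python_corpus --o ./processed_projects --d ./duplicate_files.jsonl.gz --w 4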

2. Preprocessing

$ type4py preprocess --o $OUTPUT_DIR --l $LIMIT

Description:

  • $OUTPUT_DIR: The path that was used in the first step to store processed projects. For the MT4Py dataset, use the directory in which the dataset is extracted.
  • $LIMIT: The number of projects to be processed. [Optional]
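
For example, when training on the ManyTypes4Py dataset (for which the extraction step is skipped), the invocation might look like this, assuming the dataset was extracted to ./ManyTypes4Py (a placeholder path):

$ type4py preprocess --o ./ManyTypes4Py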

3. Vectorizing

$ type4py vectorize --o $OUTPUT_DIR

Description:

  • $OUTPUT_DIR: The path that was used in the previous step to store processed projects.

4. Learning

$ type4py learn --o $OUTPUT_DIR --c --p $PARAM_FILE

Description:

  • $OUTPUT_DIR: The path that was used in the previous step to store processed projects.

  • --c: Trains the complete model. Use type4py learn -h to see other configurations.

  • --p $PARAM_FILE: The path to user-provided hyper-parameters for the model. See this file as an example. [Optional]

5. Testing

$ type4py predict --o $OUTPUT_DIR --c

Description:

  • $OUTPUT_DIR: The path that was used in the first step to store processed projects.
  • --c: Predicts using the complete model. Use type4py predict -h to see other configurations.

6. Evaluating

$ type4py eval --o $OUTPUT_DIR --t c --tp 10

Description:

  • $OUTPUT_DIR: The path that was used in the first step to store processed projects.
  • --t: Evaluates the model considering different prediction tasks. E.g., --t c considers all prediction tasks, i.e., parameters, return, and variables. [Default: c]
  • --tp 10: Considers Top-10 predictions for evaluation. For this argument, you can choose a positive integer between 1 and 10. [Default: 10]

Use type4py eval -h to see other options.
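
Putting the steps together, a complete run on your own corpus might look like the following sketch (the paths and core count are placeholders; skip the extract step when using the ManyTypes4Py dataset):

$ export DATA_PATH=./python_corpus OUTPUT_DIR=./processed_projects
$ type4py extract --c $DATA_PATH --o $OUTPUT_DIR --d ./duplicate_files.jsonl.gz --w 4
$ type4py preprocess --o $OUTPUT_DIR
$ type4py vectorize --o $OUTPUT_DIR
$ type4py learn --o $OUTPUT_DIR --c
$ type4py predict --o $OUTPUT_DIR --c
$ type4py eval --o $OUTPUT_DIR --t c --tp 10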

Converting Type4Py to ONNX

To convert the pre-trained Type4Py model to the ONNX format, use the following command:

$ type4py to_onnx --o $OUTPUT_DIR

Description:

  • $OUTPUT_DIR: The path that was used in the usage section to store processed projects and the model.

VSCode Extension

Type4Py can be used in VSCode through an extension that provides ML-based type auto-completion for Python files. The Type4Py VSCode extension can be installed from the VS Marketplace here.

Type4Py Server

The Type4Py model is deployed on our server, which exposes a public API and powers the VSCode extension. If you would like to run the Type4Py server on your own machine, you can adapt the server code here. Also, feel free to reach out by creating an issue if you have questions about deployment, using the pre-trained Type4Py model, or training your own model.
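
Once a local instance is running, you can query it over HTTP. The sketch below assumes the server listens on port 5001 and accepts a Python source file as the request body; the exact request format is an assumption, so consult the server code for the authoritative API:

$ # Hypothetical query: POST the contents of example.py to the predict endpoint
$ curl -X POST "http://localhost:5001/api/predict?tc=0" --data-binary @example.py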

Citing Type4Py

@article{mir2021type4py,
  title={Type4Py: Deep Similarity Learning-Based Type Inference for Python},
  author={Mir, Amir M and Latoskinas, Evaldas and Proksch, Sebastian and Gousios, Georgios},
  journal={arXiv preprint arXiv:2101.04470},
  year={2021}
}
Comments
  • Crash when trying to infer single file with freshly trained model using ManyTypes4Py

    Hello, thank you for creating and providing this great project! I plan to use this project for my bachelor thesis, so I am mainly interested in the inference functionality provided by infer.py on the server branch (the infer branch seems to be outdated). I am aware of the VS Code extension and the public JSON API; however, I prefer to use this project locally.

    Since infer.py takes a pre-trained model as a program argument, I followed all the steps in the README to train such a model. Unfortunately, the script crashes with the following message (excerpt):

    onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: tok for the following indices
     index: 0 Got: 7 Expected: 1
     Please fix either the inputs or the model.
    

    Below you can find a link to a Google Colab notebook with all the steps from start (downloading the ManyTypes4Py dataset, pip-installing type4py, preprocessing) to finish (training a model, trying to infer the types of a single file) and the corresponding output from the last time I ran it (including the full error backtrace at the bottom):

    https://colab.research.google.com/drive/1kRIffMlgGCeW55wXelksGrXfSd0WjhKQ?usp=sharing

    It should be relatively self-explanatory. Evidently, I use a fork of this project and not the project itself. The differences are minor, though: in learn.py, I just un-commented the .to(DEVICE) calls (c42144d), as they would otherwise lead to a crash in the notebook (vectors being on different devices). The remaining changes don't affect Python files and are not relevant to this issue. Further, I am using venv, although I doubt this has any negative influence on the execution of this project.


    My question is, how can I successfully use infer.py? How can I obtain a proper compatible model for it? Are any of those steps in the linked notebook incorrect?

    opened by fmease 4
  • Error in variable initialisation

    When using the preprocess command with only the --o argument, the code crashes with the following error:

    UnboundLocalError: local variable 'train_files_vars' referenced before assignment
    

    This is because, in the following excerpt,

    https://github.com/saltudelft/type4py/blob/93828c3d1a3460dc29cace398ca0dcb10ea14daf/type4py/preprocess.py#L302-L319

    the train_files_vars variable is only initialised in the if branch.

    bug 
    opened by gousiosg 2
  • Using Docker images for both production and development environments

    In #8, I published Docker images for using Type4Py locally on users' machines. This PR also creates Docker images to deploy Type4Py in both production and development environments. Specifically, it makes the following changes:

    • The Type4Py server detects whether the Docker image is running in local mode or production mode based on the given ENV vars.
    • Creates a separate Dockerfile to build images that allow performing model inference on GPUs.
    • Adds GH workflows to build Docker images for the production and dev environments, supporting both CPU and GPU.
    • Adds a unit test to test the local model before publishing its Docker image in GH Actions.
    • Adds a bash script to test all the Docker images for the production, dev, and local environments.
    • Uses one config file for both the production and dev environments.
    enhancement 
    opened by mir-am 0
  • Publishing Docker images for Type4Py to run the model locally

    This PR makes the following changes:

    • Adds a new --rvth CLI argument that applies VTHs with a probability (default 0.5) [ONLY FOR PRODUCTION]
    • Improves memory consumption when loading datasets
    • Adds a reduce command, which uses PCA to reduce the dimension of type clusters and the size of the whole model
    • Disables telemetry data in the server module when running the Docker version
    • Drops the dependency on PyTorch for inference to reduce the size of the Docker image
    • Adds a Dockerfile and a config file for building Docker images of Type4Py
    enhancement 
    opened by mir-am 0
  • New improvements and the Type4Py server

    • Type-checking Type4Py's predictions (experimental)
    • Converting Type4Py's PyTorch model to the ONNX format
    • Adding a web server to query the model and support the VSCode extension
    • Computing MRR for evaluation
    enhancement 
    opened by mir-am 0
  • Type4Py supports prediction for the type of variables

    • Extending the preprocess and vectorize modules for variable prediction
    • Extending the data_loaders module for loading variables' tensors
    • Separating ubiquitous and common types when evaluating
    • Considering Visible Type Hints in the preprocess module
    • Evaluating different tasks in the eval module, i.e., parameters, return types, and variables
    • Adding the Type4Py model with different configurations
    • Improving the performance of data loaders
    • Considering MRR instead of weighted metrics in eval
    • Making predict create a JSON file of predictions
    • Improving the memory consumption of predict
    • Improving predict: better type aliasing and reduced depth of parametric types
    • Determining the number of batches differently when predicting than in the training phase
    enhancement 
    opened by mir-am 0
  • Migration to the LibSA4Py pipeline & using ManyTypes4Py Dataset

    • A minor fix to the preprocess module for training on the ManyTypes4Py dataset
    • Merging the JSON files of projects processed by the LibSA4Py pipeline in the preprocess module
    • Using the LibSA4Py pipeline when running the extract CLI command
    • Removing the legacy pipeline and unused code
    enhancement 
    opened by mir-am 0
  • Add logger to the modules

    • Add Python logging instead of print in the modules
    • Write the logs of each module (e.g., predict.py, learn.py) to its own log file
    • Improve the log messages in the modules
    enhancement 
    opened by mir-am 0
  • Return fixed amount of type predictions

    I experimented with the type prediction endpoint (http://localhost:5001/api/predict?tc=0) using the provided Docker image. I noticed that, depending on the analysed source code, I get different numbers of type predictions per parameter/return/variable type. Is it possible to retrieve a fixed number of predicted types? For example, I would like to retrieve the Top-10 type predictions for each parameter and return type.

    Best regards, Florian

    opened by Wooza 1
  • Pre-computing triplets in `TripletDataset`

    Before this PR, triplets had to be computed on the fly to obtain a training sample, which slowed down training and made the GPU wait for data. This PR improves TripletDataset by pre-computing triplets before training the model, giving up to 6 times faster training. This way, obtaining a training sample to create batches is almost instant. However, with this improvement, each anchor has only one corresponding positive and negative example in every epoch.

    enhancement 
    opened by mir-am 0
  • Significant speed improvements to `preprocess`

    The speed of preprocess is significantly improved by using parallel_apply() from pandarallel to process functions' arguments and by reducing the depth of parametric types.

    opened by mir-am 0
  • JSON output file not JSON conformant

    The JSON output file is not JSON-conformant in two respects:

    1. Single quotes (') are used instead of double quotes (")
    2. Some words such as None, True, or False are not wrapped in any quotes at all

    This may affect some simpler JSON parsers; more robust JSON parsers can handle these minor errors just fine.

    'error': None
    #should be
    "error": "None"
    
    opened by CoderUndefined 0
  • Integrate with pyre incremental and adapt the TypeWriter search strategy

    It would be interesting to see how well the TypeWriter algorithm (https://software-lab.org/publications/TypeWriter_arXiv_1912.03768.pdf) for searching type annotation suggestions works against type4py. We might get dramatically better results for two reasons:

    • type4py's ML model seems to perform quite a bit better
    • today's pyre incremental is orders of magnitude faster than pyre was when the TypeWriter paper was written, so we may be able to try many more combinations and get correspondingly better results

    At one point we'd considered hacking this together very quickly as an internal project at my company, but we ran out of time. I think it would be better done open-source anyway, because then

    • it would be easier to try out against external projects
    • we could publish our results with code if they are interesting enough to be worth a paper
    • the entire OSS community could benefit

    I'm unsure if I can find time to prioritize this in the next 6 months at work but it's a little more likely if I treat it as a side project, which would also open the door to an informal weekend hackathon as a way to kick it off :)

    I could do this in a separate repository or inside of type4py. What do you think @mir-am ? And does this sound interesting to you?

    opened by stroxler 1
Owner

Software Analytics Lab @ TU Delft
Sharpened cosine similarity torch - A Sharpened Cosine Similarity layer for PyTorch

Sharpened Cosine Similarity A layer implementation for PyTorch Install At your c

Brandon Rohrer 203 Nov 30, 2022
Official implementation of NeurIPS 2021 paper "One Loss for All: Deep Hashing with a Single Cosine Similarity based Learning Objective"

Ng Kam Woh 71 Dec 22, 2022
A deep learning based semantic search platform that computes similarity scores between provided query and documents

semanticsearch This is a deep learning based semantic search platform that computes similarity scores between provided query and documents. Documents

null 1 Nov 30, 2021
Data-depth-inference - Data depth inference with python

Welcome! This readme will guide you through the use of the code in this reposito

Marco 3 Feb 8, 2022
Torchserve server using a YoloV5 model running on docker with GPU and static batch inference to perform production ready inference.

Yolov5 running on TorchServe (GPU compatible) ! This is a dockerfile to run TorchServe for Yolo v5 object detection model. (TorchServe (PyTorch librar

null 82 Nov 29, 2022
Monocular 3D pose estimation. OpenVINO. CPU inference or iGPU (OpenCL) inference.

human-pose-estimation-3d-python-cpp RealSenseD435 (RGB) 480x640 + CPU Corei9 45 FPS (Depth is not used) 1. Run 1-1. RealSenseD435 (RGB) 480x640 + CPU

Katsuya Hyodo 8 Oct 3, 2022
PyTorch-LIT is the Lite Inference Toolkit (LIT) for PyTorch which focuses on easy and fast inference of large models on end-devices.

PyTorch-LIT PyTorch-LIT is the Lite Inference Toolkit (LIT) for PyTorch which focuses on easy and fast inference of large models on end-devices. With

Amin Rezaei 157 Dec 11, 2022
KSAI Lite is a deep learning inference framework of kingsoft, based on tensorflow lite

null 80 Dec 27, 2022
Local Similarity Pattern and Cost Self-Reassembling for Deep Stereo Matching Networks

Local Similarity Pattern and Cost Self-Reassembling for Deep Stereo Matching Networks Contributions A novel pairwise feature LSP to extract structural

null 31 Dec 6, 2022
Official implementation of the paper "Lightweight Deep CNN for Natural Image Matting via Similarity Preserving Knowledge Distillation"

Lightweight-Deep-CNN-for-Natural-Image-Matting-via-Similarity-Preserving-Knowledge-Distillation Introduction Accepted at IEEE Signal Processing Letter

DongGeun-Yoon 19 Jun 7, 2022
Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-local Spatial-Temporal Similarity

This repository is the official PyTorch implementation of Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-local Spatial-Temporal Similarity

hippopmonkey 4 Dec 11, 2022
Python package facilitating the use of Bayesian Deep Learning methods with Variational Inference for PyTorch

PyVarInf PyVarInf provides facilities to easily train your PyTorch neural network models using variational inference. Bayesian Deep Learning with Vari

null 342 Dec 2, 2022
MACE is a deep learning inference framework optimized for mobile heterogeneous computing platforms.

Documentation | FAQ | Release Notes | Roadmap | MACE Model Zoo | Demo | Join Us | 中文 Mobile AI Compute Engine (or MACE for short) is a deep learning i

Xiaomi 4.7k Dec 29, 2022
Deep Learning Models for Causal Inference

Extensive tutorials for learning how to build deep learning models for causal inference using selection on observables in Tensorflow 2.

Bernard  J Koch 151 Dec 31, 2022
PPLNN is a Primitive Library for Neural Network is a high-performance deep-learning inference engine for efficient AI inferencing

null 943 Jan 7, 2023
NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

NVIDIA Merlin NVIDIA Merlin is an open source library designed to accelerate recommender systems on NVIDIA’s GPUs. It enables data scientists, machine

null 419 Jan 3, 2023
Low-code/No-code approach for deep learning inference on devices

EzEdgeAI A concept project that uses a low-code/no-code approach to implement deep learning inference on devices. It provides a componentized framewor

On-Device AI Co., Ltd. 7 Apr 5, 2022
PyTorch Implementation of Region Similarity Representation Learning (ReSim)

ReSim This repository provides the PyTorch implementation of Region Similarity Representation Learning (ReSim) described in this paper: @Article{xiao2

Tete Xiao 74 Jan 3, 2023