
Introduction

Trex is a tool to match semantically similar functions based on transfer learning.

Installation

We recommend using conda to set up the environment and install the required packages.

First, create the conda environment,

conda create -n trex python=3.8 numpy scipy scikit-learn requests

and activate the conda environment:

conda activate trex

Then, install the latest PyTorch (assuming you have a GPU):

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

Enter the trex root directory (e.g., path/to/trex) and install trex:

pip install --editable .

For large datasets, install PyArrow:

pip install pyarrow

For faster training, install NVIDIA's apex library:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
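
As an optional sanity check (the snippet below is just a sketch; nothing in it is required by trex), you can confirm that PyTorch sees the GPU and whether the optional apex build is importable:

# Optional environment check after the installation steps above.
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import apex  # only importable if the optional apex install above succeeded
    print("apex is importable")
except ImportError:
    print("apex not installed; the optional fused CUDA extensions will not be used")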

Preparation

Pretrained models:

Create the checkpoints directory and the checkpoints/pretrain subdirectory in path/to/trex:

mkdir -p checkpoints/pretrain

Download our pretrained weight parameters and put them in checkpoints/pretrain.
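
To check that the download is intact, you can try deserializing the checkpoint from Python. The file name below is an assumption; use whatever file you actually downloaded:

# Optional: confirm the downloaded pretrained checkpoint deserializes.
import torch

state = torch.load("checkpoints/pretrain/checkpoint_best.pt", map_location="cpu")
# A fairseq-style checkpoint is a dict; typical top-level keys include 'model' and 'args' or 'cfg'.
print(sorted(state.keys()))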

Sample data for finetuning similarity

We provide sample training/testing files for finetuning in data-src/similarity. If you want to prepare the finetuning data yourself, make sure you follow the format shown in data-src/similarity (tokenization script coming soon).
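
As a rough sanity check on data you prepare yourself, the files belonging to one split are expected to be aligned line by line (an assumption based on the fairseq-style preprocessing used here). A small sketch to compare line counts:

# Hypothetical helper: print the line count of every file in data-src/similarity,
# so you can verify that paired files stay aligned line by line.
from pathlib import Path

for path in sorted(Path("data-src/similarity").iterdir()):
    if path.is_file():
        num_lines = sum(1 for _ in path.open())
        print(f"{path.name}: {num_lines} lines")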

The data has to be binarized before it can be used for training. To binarize the training data for finetuning, run:

python command/finetune/preprocess.py

The binarized training data ready for finetuning (for detecting similarity) will be stored at data-bin/similarity
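
If you want to see what the preprocessing step wrote, a short listing is enough (the nested layout is not documented here and may differ between versions, so treat this purely as an inspection aid):

# Optional: list the files produced under data-bin/similarity.
from pathlib import Path

for path in sorted(Path("data-bin/similarity").rglob("*")):
    print(path)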

Training

To finetune the model, run:

./command/finetune/finetune.sh

The script loads the pretrained weight parameters from checkpoints/pretrain/ and finetunes the model.
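
After finetuning, the resulting checkpoint can be loaded for inference the same way the repo's command/inference/get_embedding.py does it (see the issues below). The checkpoint file name and paths in this sketch are assumptions; adjust them to your run:

# Minimal sketch: load a finetuned checkpoint for computing embeddings.
from fairseq.models.trex import TrexModel  # import path assumed from the repo layout

trex = TrexModel.from_pretrained('checkpoints/similarity',
                                 checkpoint_file='checkpoint_best.pt',
                                 data_name_or_path='data-bin/similarity')
trex.eval()  # switch to inference mode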

Sample data for pretraining on micro-traces

We also provide 10K samples and scripts to demonstrate how to pretrain the model. To binarize the training data for pretraining, run:

python command/pretrain/preprocess_pretrain_10k.py

The binarized training data ready for pretraining will be stored at data-bin/pretrain_10k

To pretrain the model, run:

./command/pretrain/pretrain_10k.sh

The pretrained model will be checkpointed at checkpoints/pretrain_10k

Dataset

We put our dataset here.

Comments
  • torch.jit error in get_embedding.py

    There seems to be an error with annotations when using the command/inference/get_embedding.py script.
    See the error message:

    Traceback (most recent call last):
      File "command/inference/get_embedding.py", line 53, in <module>
        emb0_rep = loaded(sample0_emb, features_only=True, classification_head_name='similarity')[0]['features']
      File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
    torch.jit.Error: The following operation failed in the TorchScript interpreter.
    Traceback of TorchScript, serialized code (most recent call last):
      File "code/__torch__/fairseq/modules/trex_encoder.py", line 252, in forward
        else:
          pass
        ops.prim.RaiseException(_44)
        ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return _45
    class ByteCombineCNN(Module):
    
    Traceback of TorchScript, original code (most recent call last):
      File "/home/user/trex/fairseq/modules/trex_encoder.py", line 166, in forward
    
            if self.layernorm_embedding is not None:
                x = self.layernorm_embedding(x)
                    ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            x = self.dropout_module(x)
            if self.quant_noise is not None:
    RuntimeError: This Python function is annotated to be ignored and cannot be run
    
    opened by wideglide 5
  • Cannot load model parameters from checkpoint

    When running the script ./command/finetune/finetune.sh, an error occurred.

    Traceback (most recent call last):
      File "train.py", line 14, in <module>
        cli_main()
      File "/data/binVul/trex-main/fairseq_cli/train.py", line 496, in cli_main
        distributed_utils.call_main(cfg, main)
      File "/data/binVul/trex-main/fairseq/distributed/utils.py", line 369, in call_main
        main(cfg, **kwargs)
      File "/data/binVul/trex-main/fairseq_cli/train.py", line 149, in main
        extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
      File "/data/binVul/trex-main/fairseq/checkpoint_utils.py", line 213, in load_checkpoint
        extra_state = trainer.load_checkpoint(
      File "/data/binVul/trex-main/fairseq/trainer.py", line 472, in load_checkpoint
        raise Exception(
    Exception: Cannot load model parameters from checkpoint checkpoints/similarity/checkpoint_best.pt; please ensure that the architectures match.

    How can I solve it? Thanks!

    opened by qiyea 5
  • ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

    Following the finetuning steps in the README, this ValueError arises:

    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

    opened by ice-tong 5
  • Functions longer than 512 tokens

    Hi,

    For pretraining, handling functions longer than 512 tokens seems trivial, as they can just be split.

    However, for similarity the paper states: "We average the subsequences' embeddings during finetuning if the function is split to more than one subsequences." How exactly does this work? I could not find the code for the averaging, and all input data appears to be < 512 tokens, which makes it seem like the functions have been split beforehand; if the functions have been split before finetuning, how are the pairs matched in the dataset?

    opened by cluosh 4
  • RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

    run ./command/finetune/finetune.sh

    Traceback (most recent call last):
      File "/home/thinktwice/aixin/test/trex/train.py", line 14, in <module>
        cli_main()
      File "/home/thinktwice/aixin/test/trex/fairseq_cli/train.py", line 496, in cli_main
        distributed_utils.call_main(cfg, main)
      File "/home/thinktwice/aixin/test/trex/fairseq/distributed/utils.py", line 369, in call_main
        main(cfg, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq_cli/train.py", line 173, in main
        valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/contextlib.py", line 79, in inner
        return func(*args, **kwds)
      File "/home/thinktwice/aixin/test/trex/fairseq_cli/train.py", line 284, in train
        log_output = trainer.train_step(samples)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/contextlib.py", line 79, in inner
        return func(*args, **kwds)
      File "/home/thinktwice/aixin/test/trex/fairseq/trainer.py", line 701, in train_step
        raise e
      File "/home/thinktwice/aixin/test/trex/fairseq/trainer.py", line 669, in train_step
        loss, sample_size_i, logging_output = self.task.train_step(
      File "/home/thinktwice/aixin/test/trex/fairseq/tasks/fairseq_task.py", line 475, in train_step
        loss, sample_size, logging_output = criterion(model, sample)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq/criterions/trex.py", line 62, in forward
        output = model(**sample["net_input"], masked_code=masked_code, masked_value=masked_value)[0]
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq/models/trex/model.py", line 233, in forward
        x, extra = self.encoder(src_tokens, src_lengths, features_only, return_all_hiddens, masked_code, masked_value)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq/models/trex/model.py", line 593, in forward
        x, extra = self.extract_features(src_tokens, return_all_hiddens=return_all_hiddens)
      File "/home/thinktwice/aixin/test/trex/fairseq/models/trex/model.py", line 599, in extract_features
        encoder_out = self.sentence_encoder(
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq/modules/trex_encoder.py", line 199, in forward
        return self.forward_scriptable(src_tokens,
      File "/home/thinktwice/aixin/test/trex/fairseq/modules/trex_encoder.py", line 238, in forward_scriptable
        x, encoder_embedding = self.forward_embedding(src_tokens)
      File "/home/thinktwice/aixin/test/trex/fairseq/modules/trex_encoder.py", line 160, in forward_embedding
        byte_embedding = self.byte_combine(torch.stack(byte_embedding_stack, dim=2))
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq/modules/trex_encoder.py", line 391, in forward
        x = conv(features)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 302, in forward
        return self._conv_forward(input, self.weight, self.bias)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 298, in _conv_forward
        return F.conv1d(input, weight, bias, self.stride,

    RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

    opened by ziqiangbuxi0416 3
  • torch.jit.Error: The following operation failed in the TorchScript interpreter

    Hi @peikexin9, I have finetuned a model by using the script ./command/finetune/finetune.sh.

    get_embedding.py is modified as follows:

    trex = TrexModel.from_pretrained(f'checkpoints/similarity',
                                     checkpoint_file='checkpoint_last.pt',
                                     data_name_or_path=f'data-bin/similarity')
    

    When running python command/inference/get_embedding.py, I got an error.

    Traceback (most recent call last):
      File "command/inference/get_embedding.py", line 52, in <module>
        emb0 = loaded(sample0_emb, features_only=True)[0]['features']
      File "/usr/local/miniconda3/envs/trex/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
    torch.jit.Error: The following operation failed in the TorchScript interpreter.
    Traceback of TorchScript, serialized code (most recent call last):
      File "code/__torch__/fairseq/modules/trex_encoder.py", line 252, in forward
        else:
          pass
        ops.prim.RaiseException(_44)
        ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return _45
    class ByteCombineCNN(Module):
    
    Traceback of TorchScript, original code (most recent call last):
      File "/data/binVul/trex-main/fairseq/modules/trex_encoder.py", line 166, in forward
    
            if self.layernorm_embedding is not None:
                x = self.layernorm_embedding(x)
                    ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            x = self.dropout_module(x)
            if self.quant_noise is not None:
    RuntimeError: This Python function is annotated to be ignored and cannot be run
    
    opened by qiyea 2
  • finetune: Cannot load model parameters from checkpoint

    I was trying to finetune your pretrained model. However, when I launch ./command/finetune/finetune.sh I get the following error:

    2022-05-24 12:51:00 | INFO | fairseq_cli.train | task: SimilarityTask
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | model: TrexModel
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | criterion: SimilarityCriterion
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | num. shared model params: 61,787,413 (num. trained: 61,787,413)
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0)
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/static/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/static/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/inst_pos_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/inst_pos_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/op_pos_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/op_pos_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/arch_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/arch_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/byte1/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/byte1/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/byte2/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/byte2/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/byte3/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/byte3/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/byte4/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/byte4/valid
    2022-05-24 12:51:00 | INFO | fairseq.tasks.similarity | Loaded valid with #samples: 2005
    2022-05-24 12:51:00 | INFO | fairseq.trainer | detected shared parameter: encoder.sentence_encoder.embed_tokens.static.weight <- encoder.lm_code_head.weight
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | max tokens per device = None and max sentences per device = 16
    2022-05-24 12:51:00 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/similarity/checkpoint_best.pt
    2022-05-24 12:51:01 | INFO | fairseq.models.trex.model | Overwriting classification_heads.similarity.dense.weight
    2022-05-24 12:51:01 | INFO | fairseq.models.trex.model | Overwriting classification_heads.similarity.dense.bias
    2022-05-24 12:51:01 | INFO | fairseq.models.trex.model | Overwriting classification_heads.similarity.out_proj.weight
    2022-05-24 12:51:01 | INFO | fairseq.models.trex.model | Overwriting classification_heads.similarity.out_proj.bias
    Traceback (most recent call last):
      File "/home/trex/fairseq/trainer.py", line 460, in load_checkpoint
        self.model.load_state_dict(
      File "/home/trex/fairseq/models/fairseq_model.py", line 125, in load_state_dict
        return super().load_state_dict(new_state_dict, strict)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for TrexModel:
            Missing key(s) in state_dict: "encoder.sentence_encoder.embed_bytes.weight".
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.0.weight: copying a param with shape torch.Size([4, 1, 1]) from checkpoint, the shape in current model is torch.Size([64, 768, 1]).
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.0.bias: copying a param with shape torch.Size([4]) from checkpoint, the shape in current model is torch.Size([64]).
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.1.weight: copying a param with shape torch.Size([8, 1, 2]) from checkpoint, the shape in current model is torch.Size([128, 768, 2]).
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.1.bias: copying a param with shape torch.Size([8]) from checkpoint, the shape in current model is torch.Size([128]).
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.2.weight: copying a param with shape torch.Size([12, 1, 3]) from checkpoint, the shape in current model is torch.Size([192, 768, 3]).
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.2.bias: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([192]).
            size mismatch for encoder.sentence_encoder.byte_combine.highway.layers.0.weight: copying a param with shape torch.Size([48, 24]) from checkpoint, the shape in current model is torch.Size([768, 384]).
            size mismatch for encoder.sentence_encoder.byte_combine.highway.layers.0.bias: copying a param with shape torch.Size([48]) from checkpoint, the shape in current model is torch.Size([768]).
            size mismatch for encoder.sentence_encoder.byte_combine.highway.layers.1.weight: copying a param with shape torch.Size([48, 24]) from checkpoint, the shape in current model is torch.Size([768, 384]).
            size mismatch for encoder.sentence_encoder.byte_combine.highway.layers.1.bias: copying a param with shape torch.Size([48]) from checkpoint, the shape in current model is torch.Size([768]).
            size mismatch for encoder.sentence_encoder.byte_combine.projection.weight: copying a param with shape torch.Size([768, 24]) from checkpoint, the shape in current model is torch.Size([768, 384]).
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "train.py", line 14, in <module>
        cli_main()
      File "/home/trex/fairseq_cli/train.py", line 496, in cli_main
        distributed_utils.call_main(cfg, main)
      File "/home/trex/fairseq/distributed/utils.py", line 369, in call_main
        main(cfg, **kwargs)
      File "/home/trex/fairseq_cli/train.py", line 149, in main
        extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
      File "/home/trex/fairseq/checkpoint_utils.py", line 213, in load_checkpoint
        extra_state = trainer.load_checkpoint(
      File "/home/trex/fairseq/trainer.py", line 472, in load_checkpoint
        raise Exception(
    Exception: Cannot load model parameters from checkpoint checkpoints/similarity/checkpoint_best.pt; please ensure that the architectures match.
    

    How can I solve this?

    opened by FiorellaArtuso 1
  • batch inputs during inference

    How can we batch multiple inputs when computing the embeddings using a trained model? The script command/inference/get_embedding.py demonstrates how to compute a single function's embedding, but it has slow throughput.

    I tried to concatenate the tensors for each respective field in the dicts produced from sample_emb = trex.process_token_dict(sample_tokens), but ran into a problem where the inputs needed to first be padded to the proper length, and I could not determine how to pad them or determine what the proper padding value is for each field.

    Could you please give an example of computing embeddings for multiple functions at once?

    opened by the-entire-country-of-ireland 1
  • The error about prepare_code_trace.py

    Hi, @peikexin9. I just ran 'python micro_trace/prepare_code_trace.py', but I met this error: "micro_trace/prepare_code_trace.py", line 67, in hex2str, assert len(num) <= 8, AssertionError. First I ran 'python command/pretrain/prepare_json.py' to generate the data in 'data-raw/funcbytes/', then ran 'python micro_trace/prepare_code_trace.py' to generate the data in 'data-raw/functraces', but the error occurred. Could you please help me? Thank you very much.

    opened by RobinHan24 1
  • Usage of prepare_code_trace.py?

    Hello,

    In the Trex paper, it's described in Section V that the code base implements microtracing through emulation, which appears to be done in 'micro_trace/prepare_code_trace.py'. However, the TREX github README doesn't mention calling prepare_code_trace.py in the data processing pipeline. Would it be possible to get clarification as to how the traces from emulation get created in either preprocessing or finetuning?

    opened by DanielKotroco 4
  • How to generate our own pretrain dataset?

    As mentioned in the README, I ran the script preprocess_pretrain_10k.py to generate the data in data-bin/pretrain_10k, but how can I generate my own data in the format of data-src/pretrain_10k? Thanks a lot.

    opened by RobinHan24 4
  • TypeError: forward() got an unexpected keyword argument 'src_lengths'

    Hi @peikexin9

    When I pretrain the model by using the script ./command/pretrain/pretrain_10k.sh, I got an error.

    2022-03-08 16:53:39 | INFO | fairseq.trainer | begin training epoch 1
    2022-03-08 16:53:39 | INFO | fairseq_cli.train | Start iterating over samples
    Traceback (most recent call last):
      File "train.py", line 14, in <module>
        cli_main()
      File "/data/binVul/trex-main/fairseq_cli/train.py", line 496, in cli_main
        distributed_utils.call_main(cfg, main)
      File "/data/binVul/trex-main/fairseq/distributed/utils.py", line 369, in call_main
        main(cfg, **kwargs)
      File "/data/binVul/trex-main/fairseq_cli/train.py", line 173, in main
        valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
      File "/usr/local/miniconda3/envs/trex2/lib/python3.8/contextlib.py", line 75, in inner
        return func(*args, **kwds)
      File "/data/binVul/trex-main/fairseq_cli/train.py", line 284, in train
        log_output = trainer.train_step(samples)
      File "/usr/local/miniconda3/envs/trex2/lib/python3.8/contextlib.py", line 75, in inner
        return func(*args, **kwds)
      File "/data/binVul/trex-main/fairseq/trainer.py", line 669, in train_step
        loss, sample_size_i, logging_output = self.task.train_step(
      File "/data/binVul/trex-main/fairseq/tasks/fairseq_task.py", line 475, in train_step
        loss, sample_size, logging_output = criterion(model, sample)
      File "/usr/local/miniconda3/envs/trex2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/data/binVul/trex-main/fairseq/criterions/trex.py", line 62, in forward
        output = model(**sample["net_input"], masked_code=masked_code, masked_value=masked_value)[0]
      File "/usr/local/miniconda3/envs/trex2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
    TypeError: forward() got an unexpected keyword argument 'src_lengths'
    
    opened by qiyea 6
  • What is log_output used for?

    I see that you've defined many new variables in the reduce_metrics function, such as AUC and ncorrec_pred, but I don't know what these variables are used for, apart from 'loss', which is used in backward: https://github.com/CUMLSec/trex/blob/7b2cabaecdaeb043da48d85a9016fed391ea75a5/fairseq/tasks/fairseq_task.py#L479

    Could you tell me why you defined AUC in https://github.com/CUMLSec/trex/blob/7b2cabaecdaeb043da48d85a9016fed391ea75a5/fairseq/criterions/similarity.py#L138? And would you consider publishing a newer version that includes an end-to-end script? I can't find the script that changes the format of the data that has already been converted to the data-raw/functraces format. We are looking forward to your reply. Thank you very much!

    opened by iamawhalez 1