
Introduction

Trex is a tool to match semantically similar functions based on transfer learning.

Installation

We recommend using conda to set up the environment and install the required packages.

First, create the conda environment,

conda create -n trex python=3.8 numpy scipy scikit-learn requests

and activate the conda environment:

conda activate trex

Then, install the latest PyTorch (assuming you have a GPU):

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

Enter the trex root directory (e.g., path/to/trex) and install trex:

pip install --editable .

For large datasets, install PyArrow:

pip install pyarrow

For faster training, install NVIDIA's apex library:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
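
As an optional sanity check (the snippet below is just a sketch; nothing in it is required by trex), you can confirm that PyTorch sees the GPU and whether the optional apex build is importable:

# Optional environment check after the installation steps above.
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import apex  # only importable if the optional apex install above succeeded
    print("apex is importable")
except ImportError:
    print("apex not installed; the optional fused CUDA extensions will not be used")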

Preparation

Pretrained models:

Create the checkpoints directory and the checkpoints/pretrain subdirectory in path/to/trex:

mkdir -p checkpoints/pretrain

Download our pretrained weight parameters and put them in checkpoints/pretrain.
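
To check that the download is intact, you can try deserializing the checkpoint from Python. The file name below is an assumption; use whatever file you actually downloaded:

# Optional: confirm the downloaded pretrained checkpoint deserializes.
import torch

state = torch.load("checkpoints/pretrain/checkpoint_best.pt", map_location="cpu")
# A fairseq-style checkpoint is a dict; typical top-level keys include 'model' and 'args' or 'cfg'.
print(sorted(state.keys()))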

Sample data for finetuning similarity

We provide sample training/testing files for finetuning in data-src/similarity. If you want to prepare the finetuning data yourself, make sure you follow the format shown in data-src/similarity (tokenization script coming soon).
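
As a rough sanity check on data you prepare yourself, the files belonging to one split are expected to be aligned line by line (an assumption based on the fairseq-style preprocessing used here). A small sketch to compare line counts:

# Hypothetical helper: print the line count of every file in data-src/similarity,
# so you can verify that paired files stay aligned line by line.
from pathlib import Path

for path in sorted(Path("data-src/similarity").iterdir()):
    if path.is_file():
        num_lines = sum(1 for _ in path.open())
        print(f"{path.name}: {num_lines} lines")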

The data has to be binarized before it can be used for training. To binarize the training data for finetuning, run:

python command/finetune/preprocess.py

The binarized training data ready for finetuning (for detecting similarity) will be stored at data-bin/similarity
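
If you want to see what the preprocessing step wrote, a short listing is enough (the nested layout is not documented here and may differ between versions, so treat this purely as an inspection aid):

# Optional: list the files produced under data-bin/similarity.
from pathlib import Path

for path in sorted(Path("data-bin/similarity").rglob("*")):
    print(path)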

Training

To finetune the model, run:

./command/finetune/finetune.sh

The script loads the pretrained weight parameters from checkpoints/pretrain/ and finetunes the model.
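
After finetuning, the resulting checkpoint can be loaded for inference the same way the repo's command/inference/get_embedding.py does it (see the issues below). The checkpoint file name and paths in this sketch are assumptions; adjust them to your run:

# Minimal sketch: load a finetuned checkpoint for computing embeddings.
from fairseq.models.trex import TrexModel  # import path assumed from the repo layout

trex = TrexModel.from_pretrained('checkpoints/similarity',
                                 checkpoint_file='checkpoint_best.pt',
                                 data_name_or_path='data-bin/similarity')
trex.eval()  # switch to inference mode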

Sample data for pretraining on micro-traces

We also provide 10K samples and scripts to demonstrate how to pretrain the model. To binarize the training data for pretraining, run:

python command/pretrain/preprocess_pretrain_10k.py

The binarized training data ready for pretraining will be stored at data-bin/pretrain_10k

To pretrain the model, run:

./command/pretrain/pretrain_10k.sh

The pretrained model will be checkpointed at checkpoints/pretrain_10k

Dataset

We put our dataset here.

Comments
  • torch.jit error in get_embedding.py

    There seems to be an error with annotations when using the command/inference/get_embedding.py script.
    See the error message:

    Traceback (most recent call last):
      File "command/inference/get_embedding.py", line 53, in <module>
        emb0_rep = loaded(sample0_emb, features_only=True, classification_head_name='similarity')[0]['features']
      File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
    torch.jit.Error: The following operation failed in the TorchScript interpreter.
    Traceback of TorchScript, serialized code (most recent call last):
      File "code/__torch__/fairseq/modules/trex_encoder.py", line 252, in forward
        else:
          pass
        ops.prim.RaiseException(_44)
        ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return _45
    class ByteCombineCNN(Module):
    
    Traceback of TorchScript, original code (most recent call last):
      File "/home/user/trex/fairseq/modules/trex_encoder.py", line 166, in forward
    
            if self.layernorm_embedding is not None:
                x = self.layernorm_embedding(x)
                    ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            x = self.dropout_module(x)
            if self.quant_noise is not None:
    RuntimeError: This Python function is annotated to be ignored and cannot be run
    
    opened by wideglide 5
  • Cannot load model parameters from checkpoint

    When running the script ./command/finetune/finetune.sh, an error occurred.

    Traceback (most recent call last):
      File "train.py", line 14, in <module>
        cli_main()
      File "/data/binVul/trex-main/fairseq_cli/train.py", line 496, in cli_main
        distributed_utils.call_main(cfg, main)
      File "/data/binVul/trex-main/fairseq/distributed/utils.py", line 369, in call_main
        main(cfg, **kwargs)
      File "/data/binVul/trex-main/fairseq_cli/train.py", line 149, in main
        extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
      File "/data/binVul/trex-main/fairseq/checkpoint_utils.py", line 213, in load_checkpoint
        extra_state = trainer.load_checkpoint(
      File "/data/binVul/trex-main/fairseq/trainer.py", line 472, in load_checkpoint
        raise Exception(
    Exception: Cannot load model parameters from checkpoint checkpoints/similarity/checkpoint_best.pt; please ensure that the architectures match.

    How can I solve it? Thanks!

    opened by qiyea 5
  • ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

    Following the finetuning steps in the README, this ValueError arises:

    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

    opened by ice-tong 5
  • Functions longer than 512 tokens

    Hi,

    For pretraining, handling functions longer than 512 tokens seems trivial, as they can just be split.

    However, for similarity the paper states: "We average the subsequences' embeddings during finetuning if the function is split to more than one subsequences." How exactly does this work? I could not find the code for the averaging, and all input data appears to be < 512 tokens, which makes it seem like the functions have been split beforehand; if the functions have been split before finetuning, how are the pairs matched in the dataset?

    opened by cluosh 4
  • RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

    run ./command/finetune/finetune.sh

    Traceback (most recent call last):
      File "/home/thinktwice/aixin/test/trex/train.py", line 14, in <module>
        cli_main()
      File "/home/thinktwice/aixin/test/trex/fairseq_cli/train.py", line 496, in cli_main
        distributed_utils.call_main(cfg, main)
      File "/home/thinktwice/aixin/test/trex/fairseq/distributed/utils.py", line 369, in call_main
        main(cfg, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq_cli/train.py", line 173, in main
        valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/contextlib.py", line 79, in inner
        return func(*args, **kwds)
      File "/home/thinktwice/aixin/test/trex/fairseq_cli/train.py", line 284, in train
        log_output = trainer.train_step(samples)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/contextlib.py", line 79, in inner
        return func(*args, **kwds)
      File "/home/thinktwice/aixin/test/trex/fairseq/trainer.py", line 701, in train_step
        raise e
      File "/home/thinktwice/aixin/test/trex/fairseq/trainer.py", line 669, in train_step
        loss, sample_size_i, logging_output = self.task.train_step(
      File "/home/thinktwice/aixin/test/trex/fairseq/tasks/fairseq_task.py", line 475, in train_step
        loss, sample_size, logging_output = criterion(model, sample)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq/criterions/trex.py", line 62, in forward
        output = model(**sample["net_input"], masked_code=masked_code, masked_value=masked_value)[0]
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq/models/trex/model.py", line 233, in forward
        x, extra = self.encoder(src_tokens, src_lengths, features_only, return_all_hiddens, masked_code, masked_value)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq/models/trex/model.py", line 593, in forward
        x, extra = self.extract_features(src_tokens, return_all_hiddens=return_all_hiddens)
      File "/home/thinktwice/aixin/test/trex/fairseq/models/trex/model.py", line 599, in extract_features
        encoder_out = self.sentence_encoder(
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq/modules/trex_encoder.py", line 199, in forward
        return self.forward_scriptable(src_tokens,
      File "/home/thinktwice/aixin/test/trex/fairseq/modules/trex_encoder.py", line 238, in forward_scriptable
        x, encoder_embedding = self.forward_embedding(src_tokens)
      File "/home/thinktwice/aixin/test/trex/fairseq/modules/trex_encoder.py", line 160, in forward_embedding
        byte_embedding = self.byte_combine(torch.stack(byte_embedding_stack, dim=2))
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/aixin/test/trex/fairseq/modules/trex_encoder.py", line 391, in forward
        x = conv(features)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 302, in forward
        return self._conv_forward(input, self.weight, self.bias)
      File "/home/thinktwice/anaconda3/envs/trex/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 298, in _conv_forward
        return F.conv1d(input, weight, bias, self.stride,

    RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

    opened by ziqiangbuxi0416 3
  • torch.jit.Error: The following operation failed in the TorchScript interpreter

    Hi @peikexin9, I have finetuned a model by using the script ./command/finetune/finetune.sh.

    get_embedding.py is modified as follows:

    trex = TrexModel.from_pretrained(f'checkpoints/similarity',
                                     checkpoint_file='checkpoint_last.pt',
                                     data_name_or_path=f'data-bin/similarity')
    

    When running python command/inference/get_embedding.py, I got an error.

    Traceback (most recent call last):
      File "command/inference/get_embedding.py", line 52, in <module>
        emb0 = loaded(sample0_emb, features_only=True)[0]['features']
      File "/usr/local/miniconda3/envs/trex/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
    torch.jit.Error: The following operation failed in the TorchScript interpreter.
    Traceback of TorchScript, serialized code (most recent call last):
      File "code/__torch__/fairseq/modules/trex_encoder.py", line 252, in forward
        else:
          pass
        ops.prim.RaiseException(_44)
        ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return _45
    class ByteCombineCNN(Module):
    
    Traceback of TorchScript, original code (most recent call last):
      File "/data/binVul/trex-main/fairseq/modules/trex_encoder.py", line 166, in forward
    
            if self.layernorm_embedding is not None:
                x = self.layernorm_embedding(x)
                    ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            x = self.dropout_module(x)
            if self.quant_noise is not None:
    RuntimeError: This Python function is annotated to be ignored and cannot be run
    
    opened by qiyea 2
  • finetune: Cannot load model parameters from checkpoint

    I was trying to finetune your pretrained model. However, when I launch ./command/finetune/finetune.sh I get the following error:

    2022-05-24 12:51:00 | INFO | fairseq_cli.train | task: SimilarityTask
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | model: TrexModel
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | criterion: SimilarityCriterion
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | num. shared model params: 61,787,413 (num. trained: 61,787,413)
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0)
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/static/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/static/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/inst_pos_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/inst_pos_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/op_pos_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/op_pos_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/arch_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/arch_emb/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/byte1/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/byte1/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/byte2/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/byte2/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/byte3/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/byte3/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input0/byte4/valid
    2022-05-24 12:51:00 | INFO | fairseq.data.data_utils | loaded 2,005 examples from: data-bin/comp_similarity/input1/byte4/valid
    2022-05-24 12:51:00 | INFO | fairseq.tasks.similarity | Loaded valid with #samples: 2005
    2022-05-24 12:51:00 | INFO | fairseq.trainer | detected shared parameter: encoder.sentence_encoder.embed_tokens.static.weight <- encoder.lm_code_head.weight
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
    2022-05-24 12:51:00 | INFO | fairseq_cli.train | max tokens per device = None and max sentences per device = 16
    2022-05-24 12:51:00 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/similarity/checkpoint_best.pt
    2022-05-24 12:51:01 | INFO | fairseq.models.trex.model | Overwriting classification_heads.similarity.dense.weight
    2022-05-24 12:51:01 | INFO | fairseq.models.trex.model | Overwriting classification_heads.similarity.dense.bias
    2022-05-24 12:51:01 | INFO | fairseq.models.trex.model | Overwriting classification_heads.similarity.out_proj.weight
    2022-05-24 12:51:01 | INFO | fairseq.models.trex.model | Overwriting classification_heads.similarity.out_proj.bias
    Traceback (most recent call last):
      File "/home/trex/fairseq/trainer.py", line 460, in load_checkpoint
        self.model.load_state_dict(
      File "/home/trex/fairseq/models/fairseq_model.py", line 125, in load_state_dict
        return super().load_state_dict(new_state_dict, strict)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for TrexModel:
            Missing key(s) in state_dict: "encoder.sentence_encoder.embed_bytes.weight".
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.0.weight: copying a param with shape torch.Size([4, 1, 1]) from checkpoint, the shape in current model is torch.Size([64, 768, 1]).
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.0.bias: copying a param with shape torch.Size([4]) from checkpoint, the shape in current model is torch.Size([64]).
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.1.weight: copying a param with shape torch.Size([8, 1, 2]) from checkpoint, the shape in current model is torch.Size([128, 768, 2]).
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.1.bias: copying a param with shape torch.Size([8]) from checkpoint, the shape in current model is torch.Size([128]).
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.2.weight: copying a param with shape torch.Size([12, 1, 3]) from checkpoint, the shape in current model is torch.Size([192, 768, 3]).
            size mismatch for encoder.sentence_encoder.byte_combine.convolutions.2.bias: copying a param with shape torch.Size([12]) from checkpoint, the shape in current model is torch.Size([192]).
            size mismatch for encoder.sentence_encoder.byte_combine.highway.layers.0.weight: copying a param with shape torch.Size([48, 24]) from checkpoint, the shape in current model is torch.Size([768, 384]).
            size mismatch for encoder.sentence_encoder.byte_combine.highway.layers.0.bias: copying a param with shape torch.Size([48]) from checkpoint, the shape in current model is torch.Size([768]).
            size mismatch for encoder.sentence_encoder.byte_combine.highway.layers.1.weight: copying a param with shape torch.Size([48, 24]) from checkpoint, the shape in current model is torch.Size([768, 384]).
            size mismatch for encoder.sentence_encoder.byte_combine.highway.layers.1.bias: copying a param with shape torch.Size([48]) from checkpoint, the shape in current model is torch.Size([768]).
            size mismatch for encoder.sentence_encoder.byte_combine.projection.weight: copying a param with shape torch.Size([768, 24]) from checkpoint, the shape in current model is torch.Size([768, 384]).
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "train.py", line 14, in <module>
        cli_main()
      File "/home/trex/fairseq_cli/train.py", line 496, in cli_main
        distributed_utils.call_main(cfg, main)
      File "/home/trex/fairseq/distributed/utils.py", line 369, in call_main
        main(cfg, **kwargs)
      File "/home/trex/fairseq_cli/train.py", line 149, in main
        extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
      File "/home/trex/fairseq/checkpoint_utils.py", line 213, in load_checkpoint
        extra_state = trainer.load_checkpoint(
      File "/home/trex/fairseq/trainer.py", line 472, in load_checkpoint
        raise Exception(
    Exception: Cannot load model parameters from checkpoint checkpoints/similarity/checkpoint_best.pt; please ensure that the architectures match.
    

    How can I solve this?

    opened by FiorellaArtuso 1
  • batch inputs during inference

    How can we batch multiple inputs when computing the embeddings using a trained model? The script command/inference/get_embedding.py demonstrates how to compute a single function's embedding, but it has slow throughput.

    I tried to concatenate the tensors for each respective field in the dicts produced from sample_emb = trex.process_token_dict(sample_tokens), but ran into a problem where the inputs needed to first be padded to the proper length, and I could not determine how to pad them or determine what the proper padding value is for each field.

    Could you please give an example of computing embeddings for multiple functions at once?

    opened by the-entire-country-of-ireland 1
  • The error about prepare_code_trace.py

    Hi, @peikexin9. I just ran 'python micro_trace/prepare_code_trace.py', but I met this error: "micro_trace/prepare_code_trace.py", line 67, in hex2str, assert len(num) <= 8, AssertionError. First I ran 'python command/pretrain/prepare_json.py' to generate the data in 'data-raw/funcbytes/', then ran 'python micro_trace/prepare_code_trace.py' to generate the data in 'data-raw/functraces', but the error occurred. Could you please help me? Thank you very much.

    opened by RobinHan24 1
  • Usage of prepare_code_trace.py?

    Hello,

    In the Trex paper, it's described in Section V that the code base implements microtracing through emulation, which appears to be done in 'micro_trace/prepare_code_trace.py'. However, the TREX github README doesn't mention calling prepare_code_trace.py in the data processing pipeline. Would it be possible to get clarification as to how the traces from emulation get created in either preprocessing or finetuning?

    opened by DanielKotroco 4
  • How to generate our own pretrain dataset?

    As mentioned in the README, I ran the script preprocess_pretrain_10k.py to generate the data in data-bin/pretrain_10k, but how can I generate my own data in the format of data-src/pretrain_10k? Thanks a lot.

    opened by RobinHan24 4
  • TypeError: forward() got an unexpected keyword argument 'src_lengths'

    Hi @peikexin9

    When I pretrain the model by using the script ./command/pretrain/pretrain_10k.sh, I got an error.

    2022-03-08 16:53:39 | INFO | fairseq.trainer | begin training epoch 1
    2022-03-08 16:53:39 | INFO | fairseq_cli.train | Start iterating over samples
    Traceback (most recent call last):
      File "train.py", line 14, in <module>
        cli_main()
      File "/data/binVul/trex-main/fairseq_cli/train.py", line 496, in cli_main
        distributed_utils.call_main(cfg, main)
      File "/data/binVul/trex-main/fairseq/distributed/utils.py", line 369, in call_main
        main(cfg, **kwargs)
      File "/data/binVul/trex-main/fairseq_cli/train.py", line 173, in main
        valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
      File "/usr/local/miniconda3/envs/trex2/lib/python3.8/contextlib.py", line 75, in inner
        return func(*args, **kwds)
      File "/data/binVul/trex-main/fairseq_cli/train.py", line 284, in train
        log_output = trainer.train_step(samples)
      File "/usr/local/miniconda3/envs/trex2/lib/python3.8/contextlib.py", line 75, in inner
        return func(*args, **kwds)
      File "/data/binVul/trex-main/fairseq/trainer.py", line 669, in train_step
        loss, sample_size_i, logging_output = self.task.train_step(
      File "/data/binVul/trex-main/fairseq/tasks/fairseq_task.py", line 475, in train_step
        loss, sample_size, logging_output = criterion(model, sample)
      File "/usr/local/miniconda3/envs/trex2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/data/binVul/trex-main/fairseq/criterions/trex.py", line 62, in forward
        output = model(**sample["net_input"], masked_code=masked_code, masked_value=masked_value)[0]
      File "/usr/local/miniconda3/envs/trex2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
    TypeError: forward() got an unexpected keyword argument 'src_lengths'
    
    opened by qiyea 6
  • What is log_output used for?

    I see that you've defined many new variables in the reduce_metrics function, such as AUC and ncorrec_pred, but I don't know what these variables are used for, apart from 'loss', which is used in backward: https://github.com/CUMLSec/trex/blob/7b2cabaecdaeb043da48d85a9016fed391ea75a5/fairseq/tasks/fairseq_task.py#L479

    Could you tell me why you defined AUC in https://github.com/CUMLSec/trex/blob/7b2cabaecdaeb043da48d85a9016fed391ea75a5/fairseq/criterions/similarity.py#L138? And would you consider publishing a newer version that includes an end-to-end script? I can't find the script that changes the format of the data that has already been converted to the data-raw/functraces format. We are looking forward to your reply. Thank you very much!

    opened by iamawhalez 1