Paraphrastic Representations at Scale

Overview

Code to train models from "Paraphrastic Representations at Scale".

The code is written in Python 3.7 and requires the H5py, jieba, numpy, scipy, sentencepiece, sacremoses, and PyTorch (>= 1.0) libraries. These can be installed with the following command:

pip install -r requirements.txt

To get started, download the data files used for training from http://www.cs.cmu.edu/~jwieting and the STS evaluation data:

wget http://phontron.com/data/paraphrase-at-scale.zip
unzip paraphrase-at-scale.zip
rm paraphrase-at-scale.zip
wget http://www.cs.cmu.edu/~jwieting/STS.zip
unzip STS.zip
rm STS.zip

If you use our code, models, or data for your work, please cite:

@article{wieting2021paraphrastic,
    title={Paraphrastic Representations at Scale},
    author={Wieting, John and Gimpel, Kevin and Neubig, Graham and Berg-Kirkpatrick, Taylor},
    journal={arXiv preprint arXiv:2104.15114},
    year={2021}
}

@inproceedings{wieting19simple,
    title={Simple and Effective Paraphrastic Similarity from Parallel Translations},
    author={Wieting, John and Gimpel, Kevin and Neubig, Graham and Berg-Kirkpatrick, Taylor},
    booktitle={Proceedings of the Association for Computational Linguistics},
    url={https://arxiv.org/abs/1909.13872},
    year={2019}
}

To embed a list of sentences:

python -u embed_sentences.py --sentence-file paraphrase-at-scale/example-sentences.txt --load-file paraphrase-at-scale/model.para.lc.100.pt  --sp-model paraphrase-at-scale/paranmt.model --output-file sentence_embeds.np --gpu 0
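
If the embeddings are needed downstream, they can be read back in Python. A minimal sketch, assuming embed_sentences.py serializes a single NumPy array with one 1024-dimensional row per input sentence (adjust the loader if the script uses a different format):

import numpy as np

# Assumption: sentence_embeds.np holds a plain NumPy array written by the
# embedding script, one row per line of example-sentences.txt.
embeddings = np.load("sentence_embeds.np", allow_pickle=True)
print(embeddings.shape)  # expected: (num_sentences, 1024)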

To score a list of sentence pairs:

python -u score_sentence_pairs.py --sentence-pair-file paraphrase-at-scale/example-sentences-pairs.txt --load-file paraphrase-at-scale/model.para.lc.100.pt  --sp-model paraphrase-at-scale/paranmt.model --gpu 0
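
The pair scores are cosine similarities between the two sentence embeddings, as described in the paper. For reference, a minimal sketch of that comparison on two precomputed vectors (the names u and v are placeholders, not from the repo):

import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity between two sentence embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in 1024-dimensional vectors in place of real sentence embeddings.
u, v = np.random.randn(1024), np.random.randn(1024)
print(cosine_similarity(u, v))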

To train a model (for example, on ParaNMT):

python -u main.py --outfile model.para.out --lower-case 1 --tokenize 0 --data-file paraphrase-at-scale/paranmt.sim-low=0.4-sim-high=1.0-ovl=0.7.final.h5 \
       --model avg --dim 1024 --epochs 25 --dropout 0.0 --sp-model paraphrase-at-scale/paranmt.model --megabatch-size 100 --save-every-epoch 1 --gpu 0 --vocab-file paraphrase-at-scale/paranmt.sim-low=0.4-sim-high=1.0-ovl=0.7.final.vocab
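
The saved checkpoints appear to bundle the training arguments under an 'args' key (see the report_interval issue in the comments below). A hedged sketch for inspecting a checkpoint before reusing it; the file name is the --outfile value from the command above and may differ if per-epoch checkpoints are suffixed:

import torch

# Assumption: checkpoints are dicts containing the training Namespace under
# 'args', as suggested by models.py's load_model and the issues below.
checkpoint = torch.load("model.para.out", map_location="cpu")
print(list(checkpoint.keys()))
print(vars(checkpoint["args"]))  # hyperparameters the model was trained with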

To download and preprocess raw data for training models (both bilingual and ParaNMT), see preprocess/bilingual and preprocess/paranmt.

Comments
  • `report_interval` missing in `model_args`?

    I just tried out the example calls and got an error:

    python -u score_sentence_pairs.py --sentence-pair-file paraphrase-at-scale/example-sentences-pairs.txt --load-file paraphrase-at-scale/model.para.lc.100.pt  --sp-model paraphrase-at-scale/paranmt.model --gpu 0
    
    Traceback (most recent call last):
      File "score_sentence_pairs.py", line 88, in <module>
        model, _ = load_model(None, args)
      File "/content/paraphrastic-representations-at-scale/models.py", line 35, in load_model
        model = Averaging(data, model_args, vocab, vocab_fr)
      File "/content/paraphrastic-representations-at-scale/models.py", line 205, in __init__
        super(Averaging, self).__init__(data, args, vocab, vocab_fr)
      File "/content/paraphrastic-representations-at-scale/models.py", line 52, in __init__
        self.report_interval = args.report_interval
    AttributeError: 'Namespace' object has no attribute 'report_interval'
    

    https://github.com/jwieting/paraphrastic-representations-at-scale/blob/1e49a4a7de1619a189e63cce7cd4f3138b53e7a8/models.py#L51-L52

    Replacing args.report_interval with args.save_interval lets me run the example, but that is not a real solution...

    It seems the code was changed later, but the saved models do not contain the changes:

    model = torch.load(args.load_file)
    model_args = model['args']
    model_args.report_interval  # ?
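
    Until the checkpoints are regenerated, a workaround (not an official fix) is to give the loaded Namespace the missing attribute before the model is constructed, e.g. inside load_model, defaulting report_interval to save_interval as suggested above:

    import torch

    # Hypothetical patch: older checkpoints predate report_interval, so fall
    # back to save_interval (or any placeholder) before building the model.
    checkpoint = torch.load("paraphrase-at-scale/model.para.lc.100.pt", map_location="cpu")
    model_args = checkpoint["args"]
    if not hasattr(model_args, "report_interval"):
        model_args.report_interval = getattr(model_args, "save_interval", 1000)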

    opened by Querela 2
  • Question: How is the loss being computed?

    Hi @jwieting, I am confused about how this particular line implements the loss mentioned in the paper: https://github.com/jwieting/paraphrastic-representations-at-scale/blob/20c66dd01002da5eb9e239e613afc598ecc71b06/models.py#L139

    What are g1, g2, p1, p2?
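
    For context, the papers describe a margin-based loss over mined negative examples. A hedged reading of that line is that g1/g2 are the embeddings of the two sentences in a paraphrase pair and p1/p2 the embeddings of their selected negatives, i.e. roughly:

    import torch
    import torch.nn.functional as F

    def margin_loss(g1, g2, p1, p2, delta=0.4):
        # Sketch of the paper's margin loss, not the repo's exact code: push the
        # similarity of the pair above that of each mined negative by a margin delta.
        pos = F.cosine_similarity(g1, g2)
        neg1 = F.cosine_similarity(g1, p1)
        neg2 = F.cosine_similarity(g2, p2)
        return (torch.clamp(delta - pos + neg1, min=0) +
                torch.clamp(delta - pos + neg2, min=0)).mean()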

    opened by tanmaylaud 1
  • Did you train a Chinese language model?

    First, thank you for your work on "Paraphrastic Representations at Scale" 😝. In the abstract, you say you released trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese. Where can I download the Chinese model? If you can provide it, thank you very much.

    opened by aixuedegege 0
  • ParaNMT: sentencepiece model and .pt model vocab and embedding size mismatch?

    First of all, thank you for your contribution to the STS task; we're very excited to get our hands on the models you provided! 😊

    Problem

    According to the README, when we want to use the pre-trained ParaNMT model model.para.lc.100.pt for scoring sentence pairs (we also tested fine-tuning), it should be used with the sentencepiece model paranmt.model.

    # README.md
    
    python -u score_sentence_pairs.py \
      --sentence-pair-file paraphrase-at-scale/example-sentences-pairs.txt \
      --load-file paraphrase-at-scale/model.para.lc.100.pt \
      --sp-model paraphrase-at-scale/paranmt.model \
      --gpu 0
    

    However, we observe a potential mismatch between the size of the model.para.lc.100.pt embedding layer and the vocabulary size of the paranmt.model sentencepiece model.

    • model.para.lc.100.pt has an embedding layer of shape Embedding(82983, 1024).
    • paranmt.model sentencepiece model has a vocab size of 50000.

    We analyzed the situation further by printing various tokens from the paranmt.model sentencepiece model, and they are identical to the tokens in paranmt.vocab. However, model.para.lc.100.pt was most probably trained using paranmt.sim-low=0.4-sim-high=1.0-ovl=0.7.final.vocab, whose size is exactly 82982 tokens (we don't know why the off-by-one is there 😄). We thus believe that a different sentencepiece model (one with 82983 tokens) should be used with model.para.lc.100.pt in order to get correct results.

    # score_sentence_pairs.py
    
    model, _ = load_model(None, args)
    print(model.embedding)  # Embedding(82983, 1024)
    print(model.sp.vocab_size())  # 50000
    
    # All of the following tokens agree with the order in the `paranmt.vocab`
    print(model.sp.id_to_piece(0))  # <unk>
    print(model.sp.id_to_piece(4))  # ▁the
    print(model.sp.id_to_piece(10))  # ▁i
    print(model.sp.piece_to_id('▁i'))  # 10
    

    Possible solutions

    We see two possible solutions:

    1. Publish the sentencepiece model that was used to train the model.para.lc.100.pt model.
    2. Publish a model trained on ParaNMT that used the paranmt.model sentencepiece model during training. A sanity check is that such a model should have an embedding layer of size 50000.

    If we missed something, could you please explain how the models were meant to be used? If we're right, would it please be possible to share the remaining resources?

    opened by sweco 1
  • Not all of the args specified for main.py are used

    Hi, I noticed that not all of the args the user specifies for main.py are fed into the load_model function, so some of the args remain the same as in the pre-training setup. Maybe model_args on lines 35 and 37 of main.py should be changed to args?

    opened by JYWangGeneCosmo 0
  • SentencePiece memory requirements for paranmt preprocessing

    The SentencePiece trainer in the paranmt preprocessing script preprocess_data.py fails for me on a machine with 16GB of RAM; the memory requirement appears to be around 17GB. It would be helpful to make a note of this in a readme file, as the error message wasn't very informative.

    do_all.sh: line 32: 23983 Killed                  python preprocess_data.py --lower-case 1 --paranmt-file scratch/paranmt.sim-low=$1-sim-high=$2-ovl=$3.txt --name "sim-low=$1-sim-high=$2-ovl=$3"
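
    Not part of the repo, but if the corpus does not fit in memory, the standard SentencePiece trainer can subsample its input. A hedged sketch (file names, vocab size, and sample size are placeholders):

    import sentencepiece as spm

    # input_sentence_size / shuffle_input_sentence are standard SentencePiece
    # trainer options that cap how many sentences are loaded for training.
    spm.SentencePieceTrainer.train(
        input="scratch/paranmt.txt",
        model_prefix="paranmt",
        vocab_size=50000,
        input_sentence_size=10_000_000,
        shuffle_input_sentence=True,
    )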
    
    opened by jtbates 0
Owner
John Wieting