
SIF

This is the code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

The code is written in Python and requires numpy, scipy, pickle, sklearn, Theano, and the Lasagne library. Some functions/classes are based on John Wieting's code for the paper "Towards Universal Paraphrastic Sentence Embeddings" (thanks, John!). The example datasets are also preprocessed using the code there.

Install

To install all dependencies, using a virtualenv is suggested:

$ virtualenv .env
$ . .env/bin/activate
$ pip install -r requirements.txt 

Get started

To get started, cd into the directory examples/ and run demo.sh. It downloads the pretrained GloVe word embeddings and then runs the following scripts:

  • sif_embedding.py is a demo of how to generate sentence embeddings using the SIF weighting scheme,
  • sim_sif.py and sim_tfidf.py are for the textual similarity tasks in the paper,
  • supervised_sif_proj.sh is for the supervised tasks in the paper.

Check these files to see the options.

Source code

The code is separated into the following parts:

  • SIF embedding: involves SIF_embedding.py. The SIF weighting scheme is very simple and is implemented in a few lines (see the sketch after this list).
  • textual similarity tasks: involves data_io.py, eval.py, and sim_algo.py. data_io provides the code for reading the data, eval is for evaluating the performance, and sim_algo provides the code for our sentence embedding algorithm.
  • supervised tasks: involves data_io.py, eval.py, train.py, proj_model_sim.py, and proj_model_sentiment.py. train provides the entry for training the models (proj_model_sim is for the similarity and entailment tasks, and proj_model_sentiment is for the sentiment task). Check train.py to see the options.
  • utilities: includes lasagne_average_layer.py, params.py, and tree.py. These provide utility functions/classes for the above two parts.
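
For reference, here is a minimal sketch of the SIF scheme (illustrative only; the actual implementation is in SIF_embedding.py, and the word_vec / word_prob dictionaries here are assumptions):

import numpy as np
from sklearn.decomposition import TruncatedSVD

def sif_embeddings(sentences, word_vec, word_prob, a=1e-3):
    # Weighted average: each word w gets weight a / (a + p(w)).
    X = np.array([
        np.mean([a / (a + word_prob[w]) * word_vec[w] for w in s], axis=0)
        for s in sentences
    ])
    # Remove the projection onto the first principal component
    # of the stacked sentence embeddings.
    svd = TruncatedSVD(n_components=1, n_iter=7)
    svd.fit(X)
    u = svd.components_           # shape (1, dim)
    return X - X.dot(u.T).dot(u)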

References

For technical details and full experimental results, see the paper.

@inproceedings{arora2017asimple,
	author = {Sanjeev Arora and Yingyu Liang and Tengyu Ma},
	title = {A Simple but Tough-to-Beat Baseline for Sentence Embeddings},
	booktitle = {International Conference on Learning Representations},
	year = {2017}
}
Comments
  • No such file Error: ../data/MSRvid2012

    When I run the demo.sh in examples directory, this error occurred:

    word vectors loaded from ../data/glove.840B.300d.txt
    word weights computed from ../auxiliary_data/enwiki_vocab_min200.txt using parameter a=-1.000000
    remove the first 0 principal components
    Traceback (most recent call last):
      File "sim_sif.py", line 28, in <module>
        parr, sarr = eval.sim_evaluate_all(We, words, weight4ind, sim_algo.weighted_average_sim_rmpc, params)
      File "../src/eval.py", line 64, in sim_evaluate_all
        p,s = sim_getCorrelation(We, words, prefix+i, weight4ind, scoring_function, params)
      File "../src/eval.py", line 13, in sim_getCorrelation
        f = open(f,'r')
    IOError: [Errno 2] No such file or directory: '../data/MSRvid2012'
    

    Would you please tell me:

    1. Is this error detrimental to the model training?
    2. Where can I download the missing data file?

    Thanks a lot!

    opened by huache 11
  • Example demo hanging

    I have installed all dependencies, and

    cd examples/
    ./demo.sh
    

    The script has downloaded GloVe glove.840B.300d.txt, but then it seems to hang. No files are written in the log folder and no Python process is running. Any hint?

    opened by loretoparisi 8
  • OOV tokens

    Hi,

    Thanks for sharing the code! I have some questions about the handling of OOV tokens:

    1. How is the sentence embedding computed if any of its words does not have a pretrained vector in GloVe?
    2. Say we encounter a word at test time which was not present in the corpus used for estimating word frequencies. How is the weight for this word computed?
    3. Do you remove infrequent words when computing the word frequencies from the corpus? I ask because the vocab file name (enwiki_vocab_min200.txt) seems to suggest so. If yes, how are the removed tokens weighted?

    Thanks in advance for your time :)

    Best, Bhuwan

    opened by bdhingra 4
  • About GPU and BLAS

    I would like to run this against an nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 Docker image. I have

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
    | N/A   38C    P8    17W / 125W |      0MiB /  4036MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    

    and I could eventually install a BLAS such as Intel MKL (which could speed up numpy). Does the current implementation use any low-level GPU routines?

    Thank you

    opened by loretoparisi 3
  • Encoding Error

    I get this error when I try to run sif_embedding.py, but I think the issue is with data_io. How are the supplied files meant to be used, besides running the demo? I'd like to use SIF for evaluating the similarity of sentences I supply. There is no training needed if I just use the GloVe embeddings, correct? What are the neural nets in src used for, then?

    Thanks!

    File "C:\Users\gdev\git\SIF\examples\sif_embedding.py", line 13, in <module>
        (words, We) = data_io.getWordmap(wordfile)
      File "../src\data_io.py", line 12, in getWordmap
        lines = f.readlines()
      File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 962: character maps to <undefined>
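
    A plausible workaround (an assumption on my part, since the traceback shows Python 3 falling back to Windows' default cp1252 codec): open the GloVe file with an explicit UTF-8 encoding where data_io reads it, e.g.:

    # Forcing UTF-8 avoids the platform-default cp1252 decoder on Windows
    # (path shown for illustration only).
    with open('../data/glove.840B.300d.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()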
    
    opened by ProxyCausal 2
  • PSL?

    Hi, thanks for sharing the code.

    Would you please point out which part of the code corresponds to the "PSL" weighting in your paper? I only managed to find two weighting functions:

    def getWeight(words, word2weight):
    ...
    def getIDFWeight(wordfile, save_file=''):
    
    opened by StevenLOL 2
  • MemoryError in sif_embedding.py

    Hi,

    I downloaded and unzipped the glove.840B.300d.zip file and then ran python sif_embedding.py. After 30 minutes I received this error:

    Traceback (most recent call last):
      File "sif_embedding.py", line 15, in <module>
        (words, We) = data_io.getWordmap(wordfile)
      File "../src/data_io.py", line 22, in getWordmap
        return (words, np.array(We))
    MemoryError
    

    Have you ever seen this error?

    NB: I killed the process after the terminal showed this error because it goes into a deadlock/loop (I'm sure it doesn't go forward).

    opened by silvioOlivastri 1
  • AttributeError: 'params' object has no attribute 'nonlinearity'

    Using cuDNN version 6021 on context None
    Mapped name None to device cuda: Tesla K80 (1489:00:00.0)
    ['train.py', '-wordfile', '../data/glove.840B.300d.txt', '-npc', '1', '-dim', '300', '-traindata', '../data/sentiment-train', '-devdata', '../data/sentiment-dev', '-testdata', '../data/sentiment-test', '-layersize', '300', '-nntype', 'proj_sentiment', '-epochs', '10', '-batchsize', '25', '-LW', '1e-06', '-LC', '1e-06', '-memsize', '300', '-learner', 'adam', '-eta', '0.001', '-task', 'sentiment']
    Traceback (most recent call last):
      File "train.py", line 237, in <module>
        model = proj_model_sentiment(We, params)
      File "~iclr2017/SIF/src/proj_model_sentiment.py", line 37, in __init__
        l_out = lasagne.layers.DenseLayer(l_average, params.layersize, nonlinearity=params.nonlinearity)
    AttributeError: 'params' object has no attribute 'nonlinearity'

    opened by prakhar-agarwal 0
  • Paper and Code disparity. Columns or rows for SVD?

    In the algorithm of the paper, we can see on line 4 that each v_s should be a column vector of the matrix X (algo). In the code, we can see on line 22 that each data point is a row vector of the matrix X (code).
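
    A quick numerical check (a sketch written for illustration, not code from the repo) shows the two conventions pick out the same principal direction, up to sign: the first right singular vector of the row-stacked matrix equals the first left singular vector of its transpose.

    import numpy as np

    X = np.random.randn(100, 300)        # code convention: one sentence per row
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    u_code = Vt[0]                       # first principal direction (code)
    U, _, _ = np.linalg.svd(X.T, full_matrices=False)
    u_paper = U[:, 0]                    # paper convention: sentences as columns
    print(np.allclose(abs(u_code @ u_paper), 1.0))  # True: same up to sign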

    opened by municola 0
  • Dataset glove.840B.300d.txt character issue

    The dataset, at line 52343, contains what looks like ". . .", but it is not. On this line the example sif_embedding.py breaks, because the split() at line 15 of auxiliary_data/data_io.py wrongly separates the word from its embedding. Debugging that line showed that the dots of ". . ." are actual dots, while the "spaces" are character code 160 (a non-breaking space). The file is probably encoded in Unicode rather than ASCII; for practical reasons the test was made with ord(), so the output is an ASCII code, but the problem is the same.
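
    A defensive parse (a sketch; parse_glove_line is a hypothetical helper, not the repo's data_io) splits only on the ASCII space, so tokens glued together with U+00A0 non-breaking spaces survive intact:

    def parse_glove_line(line, dim=300):
        # Split on the ASCII space only; str.split() with no argument would
        # also split on U+00A0 and corrupt tokens like ". . .".
        parts = line.rstrip('\n').split(' ')
        word = ' '.join(parts[:-dim])          # everything before the floats
        vec = [float(x) for x in parts[-dim:]]
        return word, vec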

    opened by AleMuzzi 0
  • Simple re-implementation of SIF

    As this code is not maintained anymore, I have re-implemented SIF (heavily reusing this implementation), as I needed it for a project. I focused on generating embeddings using SIF. I thank the authors of SIF for such a wonderful paper.

    I am sharing my code here.

    opened by smujjiga 1
  • A potential information reveal problem

    I am following your framework and extending your work by adding "attention". However, when reviewing your code, I am confused about whether you calculated the first principal component on the training data or not. If you compute the first principal component on the current dataset (the test data), it seems that you introduce information from the test set into its sentence embeddings. In that case, the results may not be acceptable.

    opened by JacksonWuxs 0