SIF
This is the code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".
The code is written in Python and requires numpy, scipy, pickle, sklearn, theano, and the lasagne library. Some functions/classes are based on John Wieting's code for the paper "Towards Universal Paraphrastic Sentence Embeddings" (thanks, John!). The example data sets are also preprocessed using the code there.
Install
To install all dependencies, using virtualenv is suggested:
$ virtualenv .env
$ . .env/bin/activate
$ pip install -r requirements.txt
Get started
To get started, cd into the directory examples/ and run demo.sh. It downloads the pretrained GloVe word embeddings and then runs the following scripts:
- sif_embedding.py is a demo showing how to generate sentence embeddings using the SIF weighting scheme,
- sim_sif.py and sim_tfidf.py are for the textual similarity tasks in the paper,
- supervised_sif_proj.sh is for the supervised tasks in the paper.
Check these files to see the options.
Source code
The code is separated into the following parts:
- SIF embedding: involves SIF_embedding.py. The SIF weighting scheme is very simple and is implemented in a few lines.
- textual similarity tasks: involves data_io.py, eval.py, and sim_algo.py. data_io provides the code for reading the data, eval is for evaluating the performance, and sim_algo provides the code for our sentence embedding algorithm.
- supervised tasks: involves data_io.py, eval.py, train.py, proj_model_sim.py, and proj_model_sentiment.py. train provides the entry for training the models (proj_model_sim is for the similarity and entailment tasks, and proj_model_sentiment is for the sentiment task). Check train.py to see the options.
- utilities: includes lasagne_average_layer.py, params.py, and tree.py. These provide utility functions/classes for the two parts above.
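For reference, the core of the SIF scheme (a weighted average of word vectors with weight a/(a + p(w)), followed by removal of the projection onto the first principal component) fits in a short numpy sketch. This is an illustrative reimplementation, not the repository's SIF_embedding.py; the function name and inputs (raw word counts standing in for estimated word frequencies) are assumptions for the example:

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_counts, a=1e-3):
    """Average each sentence's word vectors weighted by a / (a + p(w)),
    then remove every sentence vector's projection onto the corpus's
    first principal component (the "common component")."""
    total = float(sum(word_counts.values()))
    rows = []
    for sent in sentences:
        words = [w for w in sent if w in word_vecs]
        vecs = np.array([word_vecs[w] for w in words])
        # p(w) estimated from counts; rarer words get weight closer to 1
        weights = np.array([a / (a + word_counts[w] / total) for w in words])
        rows.append(weights.dot(vecs) / len(words))
    X = np.vstack(rows)
    # first right singular vector of X is the first principal component direction
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X.dot(u), u)
```

The default a = 1e-3 is in the range explored in the paper; in practice the weights come from a large corpus's word frequencies (e.g. the enwiki frequency file used by the demo), not from the sentences being embedded.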
References
For technical details and full experimental results, see the paper.
@inproceedings{arora2017asimple,
author = {Sanjeev Arora and Yingyu Liang and Tengyu Ma},
title = {A Simple but Tough-to-Beat Baseline for Sentence Embeddings},
booktitle = {International Conference on Learning Representations},
year = {2017}
}