Skipgram Negative Sampling in PyTorch

Overview

PyTorch SGNS

Word2Vec's SkipGramNegativeSampling in Python.

Yet another but quite general negative sampling loss implemented in PyTorch.

It can be used with ANY embedding scheme! Pretty fast, I bet.

# Word2Vec and SGNS are the classes defined in this repo's model.py;
# dataloader is any iterable yielding batches of (input word, context words) indices.
from torch.optim import Adam
from model import Word2Vec, SGNS

vocab_size = 20000
word2vec = Word2Vec(vocab_size=vocab_size, embedding_size=300)
sgns = SGNS(embedding=word2vec, vocab_size=vocab_size, n_negs=20)
optim = Adam(sgns.parameters())
for batch, (iword, owords) in enumerate(dataloader):
    loss = sgns(iword, owords)
    optim.zero_grad()
    loss.backward()
    optim.step()
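
To make the returned loss concrete, here is a minimal, self-contained sketch of a skip-gram negative sampling module in the same spirit (an illustration, not this repo's exact model.py; uniform negative sampling is used for brevity, and the -(oloss + nloss).mean() form matches the loss discussed in the comments below):

import torch as t
import torch.nn as nn
import torch.nn.functional as F

class ToySGNS(nn.Module):
    """Illustrative skip-gram negative sampling loss with separate input/output embeddings."""

    def __init__(self, vocab_size, embedding_size, n_negs=20):
        super().__init__()
        self.ivectors = nn.Embedding(vocab_size, embedding_size)  # center ("input") word vectors
        self.ovectors = nn.Embedding(vocab_size, embedding_size)  # context ("output") word vectors
        self.vocab_size = vocab_size
        self.n_negs = n_negs

    def forward(self, iword, owords):
        batch_size, context_size = owords.size()
        # Draw negatives uniformly here; a frequency-based distribution can be plugged in instead.
        nwords = t.randint(self.vocab_size, (batch_size, context_size * self.n_negs))
        ivectors = self.ivectors(iword).unsqueeze(2)   # B x D x 1
        ovectors = self.ovectors(owords)               # B x C x D
        nvectors = self.ovectors(nwords).neg()         # B x (C * n_negs) x D
        # Log-probabilities of the true contexts (oloss) and of rejecting the sampled negatives (nloss).
        oloss = F.logsigmoid(t.bmm(ovectors, ivectors).squeeze(2)).mean(1)
        nloss = (F.logsigmoid(t.bmm(nvectors, ivectors).squeeze(2))
                 .view(batch_size, context_size, self.n_negs).sum(2).mean(1))
        # Both terms are to be maximized, so their negated mean is minimized.
        return -(oloss + nloss).mean()

# Hypothetical shapes: a batch of 8 center words, each with 4 context words.
toy = ToySGNS(vocab_size=100, embedding_size=16, n_negs=5)
loss = toy(t.randint(100, (8,)), t.randint(100, (8, 4)))

With two separate embedding tables, the dot product between a center vector and a true context vector is pushed up through oloss, while dot products with sampled negatives are pushed down through nloss.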

New: supports negative sampling based on the word frequency distribution (raised to the 0.75 power) and subsampling of frequent words (to counter word frequency imbalance).
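
As a rough sketch of how such a sampling distribution and subsampling probabilities can be built (the array names below are illustrative and not taken from this repo's preprocess.py):

import numpy as np
import torch as t

# Hypothetical raw corpus counts, indexed by word id.
word_counts = np.array([5230.0, 2120.0, 870.0, 400.0, 120.0])

# Negative sampling weights: the unigram distribution raised to the 0.75 power, renormalized.
# This damps very frequent words relative to their plain frequency.
weights = word_counts ** 0.75
weights /= weights.sum()
negatives = t.multinomial(t.tensor(weights), num_samples=20, replacement=True)

# Subsampling (one common form): keep a word with probability sqrt(threshold / frequency),
# so very common words are randomly dropped from the training pairs.
threshold = 1e-5
frequencies = word_counts / word_counts.sum()
keep_probability = np.minimum(1.0, np.sqrt(threshold / frequencies))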

To test this repo, place a space-delimited corpus at data/corpus.txt, then run python preprocess.py and python train.py --weights --cuda (use the -h option for help).

Comments
  • An error occurred when testing the repo

    Hi, thank you for sharing the code. However, when I tried to test the repo with "python preprocess.py" and "python train.py --weights --cuda", the first one worked well and generated the processed data, whereas the second reported the following error:

    [Epoch 1]: 0%| | 0/1 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "train.py", line 93, in <module>
        train(parse_args())
      File "train.py", line 81, in train
        loss = sgns(iword, owords)
      File "/home/weixin/anaconda2/envs/p3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/weixin/Downloads/pytorch-sgns-master/model.py", line 70, in forward
        ivectors = self.embedding.forward_i(iword).unsqueeze(2)
      File "/home/weixin/Downloads/pytorch-sgns-master/model.py", line 42, in forward_i
        return self.ivectors(v)
      File "/home/weixin/anaconda2/envs/p3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/weixin/anaconda2/envs/p3/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 103, in forward
        self.scale_grad_by_freq, self.sparse
    RuntimeError: save_for_backward can only save input or output tensors, but argument 0 doesn't satisfy this condition

    I am quite new to PyTorch, so any idea what might be going wrong? Many thanks.

    opened by DexterZeng 4
  • Bug in the Loss Function

    The loss function currently implemented is -(oloss + nloss).mean()

    It should be (-oloss + nloss).mean()

    You want to minimize the distance between "positive samples" and maximize the distance between "negative samples".

    opened by saketguru 2
  • Different embeddings for input/output words?

    Hey there, great skipgram example, so thank you for that.

    I have a question about why you decided to use different embeddings for the "input" words and the "output"/"negative" words. See the lines below: https://github.com/theeluwin/pytorch-sgns/blob/master/model.py#L29:L30

    I imagine this could give better performance on some problems, but I haven't been able to test this myself yet. Thanks for the help!

    opened by phillynch7 2
  • Fix misspelling when checking cuda availability

    Hi, while trying to understand the model part, I think I found a misspelling: I think we should check the cuda availability for ovectors.weight.

    opened by heartcored98 2
  • How to ensure that the negative sampled words are not the target word?

    First, thanks for your excellent code :)

    In model.py, the following piece of code suggests that we may get a positive word when we do negative sampling, though the probability is very small:

        nwords = t.multinomial(self.weights, batch_size * context_size * self.n_negs, replacement=True).view(batch_size, -1)

    I'm wondering why you didn't perform an equality check. Is that because it doesn't affect the quality of the trained word vectors but would slow down the training speed? Are there other reasons? (One possible equality check is sketched at the end of the comments.)

    opened by jeffchy 1
  • applying regularisation

    Hi theeluwin!

    First of all, thanks for the code; it's well written and helped me a ton in building my own word2vec model.

    This is not an issue per se, but something I'm potentially adding to the word2vec model using your code: the main idea is to use regularisation on embeddings in a temporal setting. I've run into trouble with the code and I'm wondering if you'd be so kind as to help out!

    The main idea is that I'm training 2 models (model 0 and model 1) consecutively on 2 sets of corpora that are temporally adjacent (say, news articles of 01/Jan and 02/Jan). During the training of model 1, I'd like to add a penalty term to the loss/cost function: for all the words in set(vocab_0) & set(vocab_1), I'd like to minimise the distance between the same word's embeddings from periods 0 and 1.

    I'm not sure if it makes sense!

    So far I'm testing on embeddings of rather small dimension (~20), so I'm using the Euclidean distance as a measure.

    Based on your code, I added a forward_r function to the Word2Vec class:

        def forward_r(self, data):
            if data is not None:
                v = LT(data)
                v = v.cuda() if self.ivectors.weight.is_cuda else v
                return self.ivectors(v)
            else:
                return None

    This function simply extracts the relevant embeddings (words from the intersection of the 2 vocabs)

    Then, in the SGNS module (for now I'm only testing on 1 particular embedding), I added the following loss calculation, which looks like this:

        rvectors = self.embedding.forward_r(rwords)
        rloss = 3 * ((rvectors.squeeze() - self.vector3) ** 2).sum()

    Finally, it would return the following total loss: return -(oloss + nloss).mean() + rloss

    However, the problem is that the loss gets stuck and never updates, and it appears that backpropagation is not working properly.

    As you can probably tell, I'm rather new to PyTorch, and I'm wondering if you could lend me a hand on what's happening!

    Thank you so much in advance!

    opened by ruoyzhang 1
  • Purpose of unks in skipgram function

    Hi,

    Can you please explain what the purpose is of including the <UNK> tokens in the owords vector produced by the skipgram function? What should the model learn by using these as training examples?

    Also, what is the purpose of the variable ws in the train function, if it's not used anywhere after its definition?

    opened by mmlynarik 2
  • Where is Expectation

    [Screenshot of the negative sampling objective: $\log \sigma({v'_{w_O}}^\top v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^\top v_{w_I}) \right]$]

    In this formula we have an expectation over $w_i$. That means for each pair $(w_I, w_O)$ we should estimate this expectation. But as I can see in your code, you sample n_negs negative samples for each pair $(w_I, w_O)$. Wouldn't it be more correct to sample n_negs times $N$ draws of $w_i$, obtain an empirical mean of the expression in square brackets, and after that accumulate the n_negs means?

    opened by zetyquickly 5
  • Confused by the loss function.

    In your code, you minimized -(oloss + nloss).mean(),

    which means (oloss + nloss) should be large. So "oloss becomes large and nloss becomes small" is expected.

    Although -(oloss + nloss) decreases, I see oloss becoming small and nloss becoming large. How so?

    opened by JinYang88 2
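
As a footnote to the negative sampling collision question above, here is a minimal sketch of one way to redraw negatives that collide with their target word (an illustrative helper; this repo itself omits the check, as discussed in that comment):

import torch as t

def sample_negatives(weights, iword, n_samples):
    """Draw n_samples negatives per target word, redrawing any draw that equals its target."""
    batch_size = iword.size(0)
    nwords = t.multinomial(weights, batch_size * n_samples, replacement=True).view(batch_size, n_samples)
    collisions = nwords.eq(iword.unsqueeze(1))
    while collisions.any():
        # Redraw only the colliding positions, then check again.
        nwords[collisions] = t.multinomial(weights, int(collisions.sum()), replacement=True)
        collisions = nwords.eq(iword.unsqueeze(1))
    return nwords

# Example: uniform weights over a vocabulary of 10 words, a batch of 4 target words, 5 negatives each.
weights = t.ones(10)
negatives = sample_negatives(weights, t.tensor([1, 2, 3, 4]), n_samples=5)

The loop trades a little sampling speed for the guarantee that no negative equals its target word; since the collision probability is small, skipping the check mainly buys speed, which is presumably why implementations often omit it.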
Owner
Jamie J. Seol
@theeluwin