Hi theeluwin!
First of all thanks for the code, it was well written and helped me a ton in building my own word2vec model.
This is not an issue per se, but rather something I'm trying to add on top of your word2vec code: the main idea is to apply regularisation to the embeddings in a temporal setting. I've run into trouble with my implementation and I'm wondering if you'd be so kind as to help out!
The idea is that I'm training two models (model 0 and model 1) consecutively on two corpora that are temporally adjacent (say, news articles from 01/Jan and 02/Jan). During the training of model 1, I'd like to add a penalty term to the loss/cost function:
for every word in set(vocab_0) & set(vocab_1), I'd like to minimise the distance between that word's embeddings from periods 0 and 1.
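In symbols, what I'm after is roughly this (my own notation; $\lambda$ is just a weight I pick by hand, 3 in the test further down):

$$
\mathcal{L}^{(1)} \;=\; \mathcal{L}_{\text{SGNS}}^{(1)} \;+\; \lambda \sum_{w \,\in\, V_0 \cap V_1} \left\lVert v_w^{(1)} - v_w^{(0)} \right\rVert_2^2
$$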
I'm not sure if it makes sense!
So far I'm testing with embeddings of rather small dimension (~20), so I'm using plain Euclidean distance as the measure.
Based on your code, I added a forward_r function to the Word2Vec class:
```python
def forward_r(self, data):
    # look up the input embeddings for the shared-vocabulary word indices
    if data is None:
        return None
    v = LT(data)
    v = v.cuda() if self.ivectors.weight.is_cuda else v
    return self.ivectors(v)
```
This function simply extracts the relevant embeddings (for words in the intersection of the two vocabs).
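For context, this is roughly how it gets called inside the SGNS forward pass (the indices here are made up for illustration):

```python
# rwords: vocab indices of words shared between the two periods
# (these particular numbers are just an illustration)
rwords = [4, 17, 23]
rvectors = self.embedding.forward_r(rwords)  # shape: (len(rwords), embedding_dim)
```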
Then, in the SGNS class (for now I'm testing with just one particular embedding), I added the following loss calculation:
```python
rvectors = self.embedding.forward_r(rwords)
# 3 is my hand-picked regularisation weight; self.vector3 holds the
# period-0 embedding I want this word to stay close to
rloss = 3 * ((rvectors.squeeze() - self.vector3) ** 2).sum()
```
Finally, it would return the following total loss:
```python
return -(oloss + nloss).mean() + rloss
```
However, the problem is that the loss gets stuck and never updates, so it appears that backpropagation isn't working properly on the rloss term.
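For what it's worth, this is the kind of sanity check I tried (not sure it's the right diagnostic):

```python
# rvectors must be on the autograd graph for rloss to update the
# embeddings; self.vector3 is the fixed period-0 target, so it
# doesn't need gradients itself
print(rvectors.requires_grad)  # I'd expect True
print(rloss.grad_fn)           # I'd expect something other than None
```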
As you can probably tell, I'm rather new to PyTorch; could you lend me a hand in figuring out what's happening?
Thank you so much in advance!