News-Recommendation-system-using-Bert4Rec-model
BERT4Rec for news recommendation
Dataset used:
The Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of the Microsoft News website, and it is intended to serve as a benchmark for news recommendation and to facilitate research on news recommendation and recommender systems. MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. According to the dataset authors, these users were randomly sampled from those who had at least 5 news click records during the 6 weeks from October 12 to November 22, 2019. Every news article contains textual content, including a title, abstract, body, category, and entities. Each impression log contains the click events, non-click events, and the user's historical news click behavior before that impression. There are 2,186,683 samples in the training set, 365,200 samples in the validation set, and 2,341,619 samples in the test set, which is enough to train data-intensive news recommendation models.
[MIND Dataset] https://msnews.github.io/assets/doc/ACL2020_MIND.pdf
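As a rough illustration of the impression-log structure described above, the following sketch parses one line of behaviors.tsv. It assumes the tab-separated layout of the public MIND release (impression ID, user ID, time, click history, impressions, with each labeled impression written as `newsID-label`). The function name `parse_behavior_line` and the sample line are made up for illustration and are not part of this repository.

```python
def parse_behavior_line(line):
    """Split one behaviors.tsv line into its five fields (assumed layout)."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")

    # The history is a space-separated list of previously clicked news IDs.
    # It can be empty for users without recorded clicks.
    clicked_history = history.split()

    candidates, labels = [], []
    for item in impressions.split():
        # Labeled items look like "N123-1"; test-set items carry no label.
        news_id, sep, label = item.rpartition("-")
        if sep:
            candidates.append(news_id)
            labels.append(int(label))
        else:
            candidates.append(item)
            labels.append(None)

    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time,
        "history": clicked_history,
        "candidates": candidates,
        "labels": labels,
    }


# Hypothetical sample line, not taken from the real dataset.
sample = "1\tU1\t11/11/2019 9:05:58 AM\tN1 N2\tN3-1 N4-0"
record = parse_behavior_line(sample)
print(record["history"], record["candidates"], record["labels"])
```

The click history is what the sequence model consumes, while the candidates and labels are what the ranking stage is trained and evaluated on.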
Model Description:
BERT4Rec is a sequential recommendation model originally proposed for product recommendation. In this project we apply the same model to sequences of news articles. BERT4Rec uses a bidirectional Transformer to learn representations of the items in a sequence. We assume that the news articles in each user's click history are arranged in chronological order. During pretraining, which is done by the script pretrain_Bert4Rec_Model.py, we mask some items in each sequence and train the model to predict the masked items. We then obtain a user representation by summing the outputs of the pretrained BERT4Rec model, and later use this user representation to rank the candidate news.
[BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer] https://arxiv.org/pdf/1904.06690.pdf
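The masked-sequence objective can be sketched as follows. This is only an illustration under assumed settings: the `[MASK]` token string, the masking probability of 0.2, and the function name `mask_sequence` are hypothetical and may differ from what pretrain_Bert4Rec_Model.py actually does.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder token


def mask_sequence(news_ids, mask_prob=0.2, seed=None):
    """Randomly replace items of a chronological click sequence with a mask token.

    Returns the masked sequence and, per position, the original ID the model
    should predict (None where the position was left unmasked).
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for news_id in news_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(news_id)  # the model must recover this ID
        else:
            masked.append(news_id)
            targets.append(None)     # position is ignored by the loss
    return masked, targets


history = ["N10", "N42", "N7", "N99", "N3"]  # toy click history
masked, targets = mask_sequence(history, mask_prob=0.4, seed=1)
print(masked)
print(targets)
```

Because a masked position can appear anywhere in the sequence, the model sees context on both sides of it during training.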
Implementation:
We take the news titles in each user's history, arranged in chronological order, and mask some news IDs in the sequence at random. We then train the BERT4Rec model to identify the representations of the masked items. After changing the paths in the script so that they point to your copy of the dataset, run:
python pretrain_Bert4Rec_Model.py
Next, we fine-tune a CNN model for news representation. The dot product of the CNN representation of a candidate news article and the mean of the BERT4Rec outputs is passed to a sigmoid layer. This is done with:
python main.py
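The scoring step described above can be sketched in plain Python. The vectors are toy values, and `mean_pool` and `click_probability` are hypothetical names; in the actual project the per-position outputs would come from the pretrained BERT4Rec model and the candidate vector from the CNN news encoder.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def click_probability(user_outputs, candidate_vec):
    """Dot product of the pooled user vector and a candidate vector, then sigmoid."""
    user_vec = mean_pool(user_outputs)
    logit = sum(u * c for u, c in zip(user_vec, candidate_vec))
    return 1.0 / (1.0 + math.exp(-logit))


# Toy stand-ins for BERT4Rec outputs (one vector per history position).
user_outputs = [[1.0, 0.0], [3.0, 2.0]]
# Toy stand-ins for CNN representations of two candidate news articles.
candidate_a = [1.0, 1.0]
candidate_b = [-1.0, -1.0]

print(click_probability(user_outputs, candidate_a))
print(click_probability(user_outputs, candidate_b))
```

Candidates within one impression can then be ordered by this probability, with the highest-scoring article ranked first.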
Testing
python test.py
Before submission, convert the result.txt file into prediction.txt so that it is properly formatted:
python final_submission.py
cleaner(".../MIND_dataset/result.txt",".../MINDlarge_test/behaviors.tsv","..../MIND_dataset/prediction.txt")
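For reference, the public MIND submission guidelines, as far as we understand them, expect each line of prediction.txt to hold an impression ID followed by the rank of every candidate, listed in the order the candidates appear in behaviors.tsv, for example `1 [3,1,2]`. The sketch below shows that conversion from raw scores. It is not the repository's `cleaner` function, whose expected result.txt layout is not documented here, and `format_prediction_line` is a hypothetical name.

```python
def format_prediction_line(impression_id, scores):
    """Turn per-candidate scores into a rank list (rank 1 = highest score)."""
    # Candidate positions sorted by descending score; ties keep original order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    ranks = [0] * len(scores)
    for rank, position in enumerate(order, start=1):
        ranks[position] = rank

    return f"{impression_id} [{','.join(map(str, ranks))}]"


# Toy scores for three candidates of impression 1.
print(format_prediction_line(1, [0.1, 0.9, 0.5]))  # prints "1 [3,1,2]"
```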
Reference: [BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer] https://github.com/FeiSun/BERT4Rec