BiNE: Bipartite Network Embedding

This repository contains the demo code of the paper:

BiNE: Bipartite Network Embedding. Ming Gao, Leihui Chen, Xiangnan He & Aoying Zhou

which has been accepted by SIGIR 2018.

Note: If you run into any problems, you can contact me at [email protected]; you will get a prompt response by email.

Environment settings

  • python==2.7.11
  • numpy==1.13.3
  • sklearn==0.17.1
  • networkx==1.11
  • datasketch==1.2.5
  • scipy==0.17.0
  • six==1.10.0

Basic Usage

Main Parameters:

Input graph path. Default is '../data/rating_train.dat' (--train-data)
Test dataset path. Default is '../data/rating_test.dat' (--test-data)
Name of model. Default is 'default' (--model-name)
Number of dimensions. Default is 128 (--d)
Number of negative samples. Default is 4 (--ns)
Size of window. Default is 5 (--ws)
Trade-off parameter $\alpha$. Default is 0.01 (--alpha)
Trade-off parameter $\beta$. Default is 0.01 (--beta)
Trade-off parameter $\gamma$. Default is 0.1 (--gamma)
Learning rate $\lambda$. Default is 0.01 (--lam)
Maximal iterations. Default is 50 (--max-iter)
Maximal walks per vertex. Default is 32 (--maxT)
Minimal walks per vertex. Default is 1 (--minT)
Walk stopping probability. Default is 0.15 (--p)
Calculate the recommendation metrics. Default is 0 (--rec)
Calculate the link prediction. Default is 0 (--lip)
File of training data for LR. Default is '../data/wiki/case_train.dat' (--case-train)
File of testing data for LR. Default is '../data/wiki/case_test.dat' (--case-test)
File of embedding vectors of U. Default is '../data/vectors_u.dat' (--vectors-u)
File of embedding vectors of V. Default is '../data/vectors_v.dat' (--vectors-v)
For large bipartite graphs: 1 = do not generate the homogeneous graph file; 2 = do not generate the homogeneous graph. Default is 0 (--large)
Metric of centrality. Default is 'hits'; options: 'hits' and 'degree_centrality' (--mode)

Usage

We provide two processed datasets:

  • DBLP (for recommendation). It contains:

    • A training dataset ./data/dblp/rating_train.dat
    • A testing dataset ./data/dblp/rating_test.dat
  • Wikipedia (for link prediction). It contains:

    • A training dataset ./data/wiki/rating_train.dat
    • A testing dataset ./data/wiki/rating_test.dat
  • Each line is an instance: userID (beginning with 'u') \t itemID (beginning with 'i') \t weight \n

    For example: u0\ti0\t1
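
A minimal parsing sketch for this format (not part of the repository; the DBLP training-file path is used purely as an illustration):

    # Hedged sketch: read "userID \t itemID \t weight" lines into a list of edges.
    edges = []
    with open('../data/dblp/rating_train.dat') as f:
        for line in f:
            if not line.strip():
                continue
            user, item, weight = line.strip().split('\t')  # e.g. 'u0', 'i0', '1'
            edges.append((user, item, float(weight)))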

Please run './model/train.py':

cd model
python train.py --train-data ../data/dblp/rating_train.dat --test-data ../data/dblp/rating_test.dat --lam 0.025 --max-iter 100 --model-name dblp --rec 1 --large 2 --vectors-u ../data/dblp/vectors_u.dat --vectors-v ../data/dblp/vectors_v.dat

The embedding vectors of the nodes are saved in the files '/model-name/vectors_u.dat' and '/model-name/vectors_v.dat', respectively.
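
To reuse the learned embeddings elsewhere, a hedged loading sketch is shown below. It assumes each line of the vectors file is a node ID followed by whitespace-separated floats; verify this against the actual output file before relying on it.

    # Hedged sketch (assumed line format: "<nodeID> <v1> <v2> ... <vd>").
    import numpy as np

    def load_vectors(path):
        vectors = {}
        with open(path) as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) < 2:
                    continue
                vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
        return vectors

    user_vectors = load_vectors('../data/dblp/vectors_u.dat')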

Example

Recommendation

Run

cd model
python train.py --train-data ../data/dblp/rating_train.dat --test-data ../data/dblp/rating_test.dat --lam 0.025 --max-iter 100 --model-name dblp --rec 1 --large 2 --vectors-u ../data/dblp/vectors_u.dat --vectors-v ../data/dblp/vectors_v.dat

Output (training process)

======== experiment settings =========
alpha : 0.0100, beta : 0.0100, gamma : 0.1000, lam : 0.0250, p : 0.1500, ws : 5, ns : 4, maxT :  32, minT : 1, max_iter : 100
========== processing data ===========
constructing graph....
number of nodes: 6001
walking...
walking...ok
number of nodes: 1177
walking...
walking...ok
getting context and negative samples....
negative samples is ok.....
context...
context...ok
context...
context...ok
============== training ==============
[*************************************************************************************************** ]100.00%

Output (testing process)

============== testing ===============
recommendation metrics: F1 : 0.1132, MAP : 0.2041, MRR : 0.3331, NDCG : 0.2609
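
The script computes F1, MAP, MRR and NDCG itself; the toy sketch below is not the repository's evaluation code and only illustrates how MRR and NDCG@k are defined for a single ranked list.

    # Toy illustration of MRR and NDCG@k for one ranked list of item IDs.
    import math

    def mrr(ranked, relevant):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                return 1.0 / rank
        return 0.0

    def ndcg_at_k(ranked, relevant, k):
        dcg = sum(1.0 / math.log(i + 2, 2)
                  for i, item in enumerate(ranked[:k]) if item in relevant)
        idcg = sum(1.0 / math.log(i + 2, 2) for i in range(min(len(relevant), k)))
        return dcg / idcg if idcg > 0 else 0.0

    print(mrr(['i3', 'i1', 'i7'], {'i1'}))           # 0.5
    print(ndcg_at_k(['i3', 'i1', 'i7'], {'i1'}, 3))  # about 0.63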

Link Prediction

Run

cd model
python train.py --train-data ../data/wiki/rating_train.dat --test-data ../data/wiki/rating_test.dat --lam 0.01 --max-iter 100 --model-name wiki --lip 1 --large 2 --gamma 1 --vectors-u ../data/wiki/vectors_u.dat --vectors-v ../data/wiki/vectors_v.dat --case-train ../data/wiki/case_train.dat --case-test ../data/wiki/case_test.dat

Output (training process)

======== experiment settings =========
alpha : 0.0100, beta : 0.0100, gamma : 1.0000, lam : 0.0100, p : 0.1500, ws : 5, ns : 4, maxT :  32, minT : 1, max_iter : 100, d : 128
========== processing data ===========
constructing graph....
number of nodes: 15000
walking...
walking...ok
number of nodes: 2529
walking...
walking...ok
getting context and negative samples....
negative samples is ok.....
context...
context...ok
context...
context...ok
============== training ==============
[*************************************************************************************************** ]100.00%

Output (testing process)

============== testing ===============
link prediction metrics: AUC_ROC : 0.9468, AUC_PR : 0.9614
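
For reference, AUC-ROC and AUC-PR can be computed from predicted link scores with scikit-learn as in the small sketch below (not the repository's evaluation code; the labels and scores are made up).

    # Toy sketch: AUC_ROC and AUC_PR (average precision) from scores via scikit-learn.
    from sklearn.metrics import roc_auc_score, average_precision_score

    y_true = [1, 0, 1, 1, 0, 0]                # 1 = edge exists, 0 = non-edge (made up)
    y_score = [0.9, 0.3, 0.8, 0.4, 0.2, 0.5]   # made-up predicted scores
    print(roc_auc_score(y_true, y_score))            # AUC_ROC
    print(average_precision_score(y_true, y_score))  # AUC_PR
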
Comments
  • networkx.exception.NetworkXError: HITS: power iteration failed to converge in 102 iterations.

    Is this the reason? And how can it be solved in this case? It seems it cannot simply be solved by replacing a function. (A hedged workaround sketch is given after this list of comments.)

    Traceback (most recent call last):
      File "D:/programming/BiNE/model/train.py", line 572, in <module>
        sys.exit(main())
      File "D:/programming/BiNE/model/train.py", line 569, in main
        train_by_sampling(args)
      File "D:/programming/BiNE/model/train.py", line 321, in train_by_sampling
        walk_generator(gul,args)
      File "D:/programming/BiNE/model/train.py", line 55, in walk_generator
        gul.calculate_centrality()
      File "D:\programming\BiNE\model\graph_utils.py", line 61, in calculate_centrality
        h, a = nx.hits(self.G)
      File "D:\Anaconda3.5\envs\BiNE\lib\site-packages\networkx\algorithms\link_analysis\hits_alg.py", line 111, in hits
        "HITS: power iteration failed to converge in %d iterations."%(i+1))
    networkx.exception.NetworkXError: HITS: power iteration failed to converge in 102 iterations.
    
    
    opened by WangHexie 5
  • the problem in the top_n function

    I find that in line 180 of train.py, the code tmp_t = sorted(test_rate[u].items(), lambda x, y: cmp(x[1], y[1]), reverse=True)[0:min(len(test_rate[u]),len(test_rate[u]))] uses min(len(test_rate[u]), len(test_rate[u])), but the two arguments are the same.

    opened by justicevita 2
  • biadjacency_matrix issue

    Any idea how to solve this? (A hedged adaptation sketch is given after this list of comments.)

      File "train.py", line 53, in walk_generator
        gul.homogeneous_graph_random_walks_for_large_bipartite_graph(datafile=args.train_data, percentage=args.p, maxT=args.maxT, minT=args.minT)
      File "/tmp2/cmchen/proRec/BiNE/model/graph_utils.py", line 111, in homogeneous_graph_random_walks_for_large_bipartite_graph
        A,row_index,item_index= bi.biadjacency_matrix(self.G, self.node_u, self.node_v, dtype=np.float,weight='weight', format='csr')
    ValueError: too many values to unpack
    
    opened by chihming 2
  • About node visualization

    Hi Leihui, I am quite interested in the node visualization results in your paper. However, I cannot reproduce the t-SNE plots shown there. Could you please share the code for the node visualization? Thanks.

    opened by fukien 1
  • Low speed

    I used a dataset containing 0.2 million links to compute the embeddings, but after running for 8 hours the program was still stuck in graph construction.

    Are there some ways to speed up the program?

    opened by WangHexie 1
  • Actual python dependencies

    I just had to spend a while getting versions to line up to run this code. For anyone else searching for this, here is a more accurate list of the required Python modules and versions; some were missing from the README.

    python                    2.7.14
    datasketch                1.4.1
    futures                   3.2.0
    networkx                  2.2
    numpy                     1.16.0
    pandas                    0.23.4
    scikit-learn              0.20.0
    scipy                     1.1.0
    six                       1.11.0
    
    opened by JSybrandt 0
  • skip-gram center and context word?

    Thank you very much for sharing the code. I have a few questions about the skip-gram part: shouldn't I_z = {center: 1} be computed for the context node, and shouldn't V = np.array(node_list[contexts]['embedding_vectors']) be the embedding of the center node? The final update is for z in context_u: tmp_z, tmp_loss = skip_gram(u, z, neg_u, node_list_u, lam, alpha); node_list_u[z]['embedding_vectors'] += tmp_z. Isn't this updating the embedding of the center node?

    Looking forward to your answer!

    opened by 529261027 1
  • Dataset

    I need a little information on how to prepare my dataset. I have a .dat file containing all the ratings, placed in the model folder so that it can be read, renamed, and split into train and test sets, but it seems it is not reading anything. (A hedged splitting sketch is given after this list of comments.)

    opened by neenerrh 2
  • Why context vector will be updated by SGD?

    Hi, I want to understand why we need to update the context vectors of the user nodes and item nodes when running the skip-gram model.

    This is not the case in node2vec, as far as I know (and node2vec uses one-hot vectors). May I ask if there is a reason behind this?

    opened by ltwedgar 0
  • not suited for large datasets...

    I tried to use BiNE on a user-item interaction network with 1 million users and a similar number of items. The implementation is now stuck at graph construction. I set the "large" option to 2, and it did not help. Is there any way to speed up the training? Also, there is still around 100 GB of memory unused on my machine and only 1 CPU is fully used. Hoping for an answer.

    opened by Yindong-Zhang 6
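
The following sketches relate to some of the comments above. They are editorial illustrations under stated assumptions, not fixes taken from the repository.

For the HITS convergence error, networkx's nx.hits accepts max_iter and tol arguments, so one thing to try is relaxing them where graph_utils.py calls nx.hits(self.G); whether a looser tolerance is acceptable for a given graph is not guaranteed. A runnable toy version:

    # Hedged workaround sketch for "power iteration failed to converge":
    # allow more iterations and a looser tolerance than the defaults.
    import networkx as nx

    G = nx.path_graph(4)                         # toy stand-in for self.G in graph_utils.py
    h, a = nx.hits(G, max_iter=1000, tol=1e-4)   # original call: h, a = nx.hits(self.G)
    print(sorted(h.items()))

For the biadjacency_matrix error, recent networkx versions return only the sparse matrix from bipartite.biadjacency_matrix, so the row and column index mappings have to be rebuilt from the node orderings passed in. The names node_u and node_v below stand in for self.node_u and self.node_v, and whether the downstream code expects node-to-index or index-to-node mappings should be checked against graph_utils.py:

    # Hedged adaptation sketch: unpack a single return value and rebuild the index maps.
    import numpy as np
    import networkx as nx
    from networkx.algorithms import bipartite as bi

    G = nx.Graph()
    G.add_edge('u0', 'i0', weight=1.0)           # toy stand-in for self.G
    node_u, node_v = ['u0'], ['i0']

    A = bi.biadjacency_matrix(G, node_u, node_v, dtype=np.float64,
                              weight='weight', format='csr')
    row_index = {u: idx for idx, u in enumerate(node_u)}
    item_index = {v: idx for idx, v in enumerate(node_v)}

For the dataset-preparation question, train.py reads separate --train-data and --test-data files (see the parameter list above), so the input ratings have to be split beforehand. A hedged sketch with a hypothetical input file name:

    # Hedged sketch: shuffle "userID \t itemID \t weight" lines from ratings.dat
    # (hypothetical file name) and write an 80/20 train/test split.
    import random

    random.seed(42)
    with open('ratings.dat') as f:
        lines = [l for l in f if l.strip()]
    random.shuffle(lines)
    cut = int(0.8 * len(lines))
    with open('rating_train.dat', 'w') as f:
        f.writelines(lines[:cut])
    with open('rating_test.dat', 'w') as f:
        f.writelines(lines[cut:])
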
Owner

leihuichen (student)

Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Kashgari Overview | Performance | Installation | Documentation | Contributing We released the 2.0.0 version with TF2 Support. If you

Eliyar Eziz 2k Feb 9, 2021
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

pyannote 2.2k Jan 9, 2023
source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

WhiteningBERT Source code and data for paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. Preparation git clone https://github.com

null 49 Dec 17, 2022
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

What is MUSE? MUSE stands for Multilingual Universal Sentence Encoder - multilingual extension (16 languages) of Universal Sentence Encoder (USE). MUS

Dani El-Ayyass 47 Sep 5, 2022
A Structured Self-attentive Sentence Embedding

Structured Self-attentive sentence embeddings Implementation for the paper A Structured Self-Attentive Sentence Embedding, which was published in ICLR

Kaushal Shetty 488 Nov 28, 2022
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 37 Nov 6, 2022
🍊 PAUSE (Positive and Annealed Unlabeled Sentence Embedding), accepted by EMNLP'2021 🌴

PAUSE: Positive and Annealed Unlabeled Sentence Embedding Sentence embedding refers to a set of effective and versatile techniques for converting raw

EQT 21 Dec 15, 2022
A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

NEC Laboratories Europe 13 Sep 8, 2022
Korean Sentence Embedding Repository

Korean-Sentence-Embedding Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides

null 80 Jan 2, 2023
nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

Bernhard Liebl 2 Jun 10, 2022
Some embedding layer implementation using ivy library

ivy-manual-embeddings Some embedding layer implementation using ivy library. Just for fun. It is based on NYCTaxiFare dataset from kaggle (cut down to

Ishtiaq Hussain 2 Feb 10, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
:hot_pepper: R²SQL: "Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing." (AAAI 2021)

R²SQL The PyTorch implementation of paper Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing. (AAAI 2021) Requirement

huybery 60 Dec 31, 2022
Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

Google 6.4k Jan 1, 2023
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code, or quickly tr

Max Woolf 4.8k Dec 30, 2022
An assignment on creating a minimalist neural network toolkit for CS11-747

minnn by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik This is an exercise in developing a minimalist neural network toolkit for NLP, part of Car

Graham Neubig 63 Dec 29, 2022