BiNE: Bipartite Network Embedding

This repository contains the demo code of the paper:

BiNE: Bipartite Network Embedding. Ming Gao, Leihui Chen, Xiangnan He & Aoying Zhou

which has been accepted by SIGIR 2018.

Note: If you run into any problems, you can contact me at [email protected]; you will get a prompt response by email.

Environment settings

  • python==2.7.11
  • numpy==1.13.3
  • sklearn==0.17.1
  • networkx==1.11
  • datasketch==1.2.5
  • scipy==0.17.0
  • six==1.10.0

Basic Usage

Main Parameters:

Input graph path. Default is '../data/rating_train.dat' (--train-data)
Test dataset path. Default is '../data/rating_test.dat' (--test-data)
Name of model. Default is 'default' (--model-name)
Number of dimensions. Default is 128 (--d)
Number of negative samples. Default is 4 (--ns)
Size of window. Default is 5 (--ws)
Trade-off parameter $\alpha$. Default is 0.01 (--alpha)
Trade-off parameter $\beta$. Default is 0.01 (--beta)
Trade-off parameter $\gamma$. Default is 0.1 (--gamma)
Learning rate $\lambda$. Default is 0.01 (--lam)
Maximal iterations. Default is 50 (--max-iter)
Maximal walks per vertex. Default is 32 (--maxT)
Minimal walks per vertex. Default is 1 (--minT)
Walk stopping probability. Default is 0.15 (--p)
Calculate the recommendation metrics. Default is 0 (--rec)
Calculate the link prediction. Default is 0 (--lip)
File of training data for LR. Default is '../data/wiki/case_train.dat' (--case-train)
File of testing data for LR. Default is '../data/wiki/case_test.dat' (--case-test)
File of embedding vectors of U. Default is '../data/vectors_u.dat' (--vectors-u)
File of embedding vectors of V. Default is '../data/vectors_v.dat' (--vectors-v)
For large bipartite graphs: 1 = do not generate the homogeneous graph file; 2 = do not generate the homogeneous graph. Default is 0 (--large)
Metric of centrality. Default is 'hits'; options: 'hits' and 'degree_centrality' (--mode)

Usage

We provide two processed datasets:

  • DBLP (for recommendation). It contains:

    • A training dataset ./data/dblp/rating_train.dat
    • A testing dataset ./data/dblp/rating_test.dat
  • Wikipedia (for link prediction). It contains:

    • A training dataset ./data/wiki/rating_train.dat
    • A testing dataset ./data/wiki/rating_test.dat
  • Each line is an instance: userID (beginning with 'u') \t itemID (beginning with 'i') \t weight \n

    For example: u0\ti0\t1
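
A minimal parsing sketch for this format (not part of the repository; the DBLP training-file path is used purely as an illustration):

    # Hedged sketch: read "userID \t itemID \t weight" lines into a list of edges.
    edges = []
    with open('../data/dblp/rating_train.dat') as f:
        for line in f:
            if not line.strip():
                continue
            user, item, weight = line.strip().split('\t')  # e.g. 'u0', 'i0', '1'
            edges.append((user, item, float(weight)))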

Please run './model/train.py':

cd model
python train.py --train-data ../data/dblp/rating_train.dat --test-data ../data/dblp/rating_test.dat --lam 0.025 --max-iter 100 --model-name dblp --rec 1 --large 2 --vectors-u ../data/dblp/vectors_u.dat --vectors-v ../data/dblp/vectors_v.dat

The embedding vectors of the nodes are saved in the files '/model-name/vectors_u.dat' and '/model-name/vectors_v.dat', respectively.
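
To reuse the learned embeddings elsewhere, a hedged loading sketch is shown below. It assumes each line of the vectors file is a node ID followed by whitespace-separated floats; verify this against the actual output file before relying on it.

    # Hedged sketch (assumed line format: "<nodeID> <v1> <v2> ... <vd>").
    import numpy as np

    def load_vectors(path):
        vectors = {}
        with open(path) as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) < 2:
                    continue
                vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
        return vectors

    user_vectors = load_vectors('../data/dblp/vectors_u.dat')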

Example

Recommendation

Run

cd model
python train.py --train-data ../data/dblp/rating_train.dat --test-data ../data/dblp/rating_test.dat --lam 0.025 --max-iter 100 --model-name dblp --rec 1 --large 2 --vectors-u ../data/dblp/vectors_u.dat --vectors-v ../data/dblp/vectors_v.dat

Output (training process)

======== experiment settings =========
alpha : 0.0100, beta : 0.0100, gamma : 0.1000, lam : 0.0250, p : 0.1500, ws : 5, ns : 4, maxT :  32, minT : 1, max_iter : 100
========== processing data ===========
constructing graph....
number of nodes: 6001
walking...
walking...ok
number of nodes: 1177
walking...
walking...ok
getting context and negative samples....
negative samples is ok.....
context...
context...ok
context...
context...ok
============== training ==============
[*************************************************************************************************** ]100.00%

Output (testing process)

============== testing ===============
recommendation metrics: F1 : 0.1132, MAP : 0.2041, MRR : 0.3331, NDCG : 0.2609
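
The script computes F1, MAP, MRR and NDCG itself; the toy sketch below is not the repository's evaluation code and only illustrates how MRR and NDCG@k are defined for a single ranked list.

    # Toy illustration of MRR and NDCG@k for one ranked list of item IDs.
    import math

    def mrr(ranked, relevant):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                return 1.0 / rank
        return 0.0

    def ndcg_at_k(ranked, relevant, k):
        dcg = sum(1.0 / math.log(i + 2, 2)
                  for i, item in enumerate(ranked[:k]) if item in relevant)
        idcg = sum(1.0 / math.log(i + 2, 2) for i in range(min(len(relevant), k)))
        return dcg / idcg if idcg > 0 else 0.0

    print(mrr(['i3', 'i1', 'i7'], {'i1'}))           # 0.5
    print(ndcg_at_k(['i3', 'i1', 'i7'], {'i1'}, 3))  # about 0.63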

Link Prediction

Run

cd model
python train.py --train-data ../data/wiki/rating_train.dat --test-data ../data/wiki/rating_test.dat --lam 0.01 --max-iter 100 --model-name wiki --lip 1 --large 2 --gamma 1 --vectors-u ../data/wiki/vectors_u.dat --vectors-v ../data/wiki/vectors_v.dat --case-train ../data/wiki/case_train.dat --case-test ../data/wiki/case_test.dat

Output (training process)

======== experiment settings =========
alpha : 0.0100, beta : 0.0100, gamma : 1.0000, lam : 0.0100, p : 0.1500, ws : 5, ns : 4, maxT :  32, minT : 1, max_iter : 100, d : 128
========== processing data ===========
constructing graph....
number of nodes: 15000
walking...
walking...ok
number of nodes: 2529
walking...
walking...ok
getting context and negative samples....
negative samples is ok.....
context...
context...ok
context...
context...ok
============== training ==============
[*************************************************************************************************** ]100.00%

Output (testing process)

============== testing ===============
link prediction metrics: AUC_ROC : 0.9468, AUC_PR : 0.9614
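
For reference, AUC-ROC and AUC-PR can be computed from predicted link scores with scikit-learn as in the small sketch below (not the repository's evaluation code; the labels and scores are made up).

    # Toy sketch: AUC_ROC and AUC_PR (average precision) from scores via scikit-learn.
    from sklearn.metrics import roc_auc_score, average_precision_score

    y_true = [1, 0, 1, 1, 0, 0]                # 1 = edge exists, 0 = non-edge (made up)
    y_score = [0.9, 0.3, 0.8, 0.4, 0.2, 0.5]   # made-up predicted scores
    print(roc_auc_score(y_true, y_score))            # AUC_ROC
    print(average_precision_score(y_true, y_score))  # AUC_PR
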
Comments
  • networkx.exception.NetworkXError: HITS: power iteration failed to converge in 102 iterations.

    Is this the reason? And how can it be solved in this case? It seems it cannot simply be solved by replacing a function. (A hedged workaround sketch is given after this list of comments.)

    Traceback (most recent call last):
      File "D:/programming/BiNE/model/train.py", line 572, in <module>
        sys.exit(main())
      File "D:/programming/BiNE/model/train.py", line 569, in main
        train_by_sampling(args)
      File "D:/programming/BiNE/model/train.py", line 321, in train_by_sampling
        walk_generator(gul,args)
      File "D:/programming/BiNE/model/train.py", line 55, in walk_generator
        gul.calculate_centrality()
      File "D:\programming\BiNE\model\graph_utils.py", line 61, in calculate_centrality
        h, a = nx.hits(self.G)
      File "D:\Anaconda3.5\envs\BiNE\lib\site-packages\networkx\algorithms\link_analysis\hits_alg.py", line 111, in hits
        "HITS: power iteration failed to converge in %d iterations."%(i+1))
    networkx.exception.NetworkXError: HITS: power iteration failed to converge in 102 iterations.
    
    
    opened by WangHexie 5
  • the problem in the top_n function

    I find that in line 180 of train.py, the code tmp_t = sorted(test_rate[u].items(), lambda x, y: cmp(x[1], y[1]), reverse=True)[0:min(len(test_rate[u]),len(test_rate[u]))] uses min(len(test_rate[u]), len(test_rate[u])), but the two arguments are the same.

    opened by justicevita 2
  • biadjacency_matrix issue

    Any idea how to solve this? (A hedged adaptation sketch is given after this list of comments.)

      File "train.py", line 53, in walk_generator
        gul.homogeneous_graph_random_walks_for_large_bipartite_graph(datafile=args.train_data, percentage=args.p, maxT=args.maxT, minT=args.minT)
      File "/tmp2/cmchen/proRec/BiNE/model/graph_utils.py", line 111, in homogeneous_graph_random_walks_for_large_bipartite_graph
        A,row_index,item_index= bi.biadjacency_matrix(self.G, self.node_u, self.node_v, dtype=np.float,weight='weight', format='csr')
    ValueError: too many values to unpack
    
    opened by chihming 2
  • About node visualization

    Hi Leihui, I am quite interested in the node visualization results in your paper. However, I cannot reproduce the t-SNE plots shown there. Could you please share the code for the node visualization? Thanks.

    opened by fukien 1
  • Low speed

    I used a dataset containing 0.2 million links to compute the embeddings, but after running for 8 hours the program was still stuck in graph construction.

    Are there some ways to speed up the program?

    opened by WangHexie 1
  • Actual python dependencies

    I just had to spend a while getting versions to line up to run this code. For anyone else searching for this, here is a more accurate list of the required Python modules and versions; some were missing from the README.

    python                    2.7.14
    datasketch                1.4.1
    futures                   3.2.0
    networkx                  2.2
    numpy                     1.16.0
    pandas                    0.23.4
    scikit-learn              0.20.0
    scipy                     1.1.0
    six                       1.11.0
    
    opened by JSybrandt 0
  • skip-gram center and context word?

    Thank you very much for sharing the code. I have a few questions about the skip-gram part: shouldn't I_z = {center: 1} be computed for the context node, and shouldn't V = np.array(node_list[contexts]['embedding_vectors']) be the embedding of the center node? The final update is for z in context_u: tmp_z, tmp_loss = skip_gram(u, z, neg_u, node_list_u, lam, alpha); node_list_u[z]['embedding_vectors'] += tmp_z. Isn't this updating the embedding of the center node?

    Looking forward to your answer!

    opened by 529261027 1
  • Dataset

    I need a little information on how to prepare my dataset. I have a .dat file containing all the ratings, placed in the model folder so that it can be read, renamed, and split into train and test sets, but it seems it is not reading anything. (A hedged splitting sketch is given after this list of comments.)

    opened by neenerrh 2
  • Why context vector will be updated by SGD?

    Hi, I want to understand why we need to update the context vectors of the user nodes and item nodes when running the skip-gram model.

    This is not the case in node2vec, as far as I know (and node2vec uses one-hot vectors). May I ask if there is a reason behind this?

    opened by ltwedgar 0
  • not suited for large datasets...

    I tried to use BiNE on a user-item interaction network with 1 million users and a similar number of items. The implementation is now stuck at graph construction. I set the "large" option to 2, and it did not help. Is there any way to speed up the training? Also, there is still around 100 GB of memory unused on my machine and only 1 CPU is fully used. Hoping for an answer.

    opened by Yindong-Zhang 6
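
The following sketches relate to some of the comments above. They are editorial illustrations under stated assumptions, not fixes taken from the repository.

For the HITS convergence error, networkx's nx.hits accepts max_iter and tol arguments, so one thing to try is relaxing them where graph_utils.py calls nx.hits(self.G); whether a looser tolerance is acceptable for a given graph is not guaranteed. A runnable toy version:

    # Hedged workaround sketch for "power iteration failed to converge":
    # allow more iterations and a looser tolerance than the defaults.
    import networkx as nx

    G = nx.path_graph(4)                         # toy stand-in for self.G in graph_utils.py
    h, a = nx.hits(G, max_iter=1000, tol=1e-4)   # original call: h, a = nx.hits(self.G)
    print(sorted(h.items()))

For the biadjacency_matrix error, recent networkx versions return only the sparse matrix from bipartite.biadjacency_matrix, so the row and column index mappings have to be rebuilt from the node orderings passed in. The names node_u and node_v below stand in for self.node_u and self.node_v, and whether the downstream code expects node-to-index or index-to-node mappings should be checked against graph_utils.py:

    # Hedged adaptation sketch: unpack a single return value and rebuild the index maps.
    import numpy as np
    import networkx as nx
    from networkx.algorithms import bipartite as bi

    G = nx.Graph()
    G.add_edge('u0', 'i0', weight=1.0)           # toy stand-in for self.G
    node_u, node_v = ['u0'], ['i0']

    A = bi.biadjacency_matrix(G, node_u, node_v, dtype=np.float64,
                              weight='weight', format='csr')
    row_index = {u: idx for idx, u in enumerate(node_u)}
    item_index = {v: idx for idx, v in enumerate(node_v)}

For the dataset-preparation question, train.py reads separate --train-data and --test-data files (see the parameter list above), so the input ratings have to be split beforehand. A hedged sketch with a hypothetical input file name:

    # Hedged sketch: shuffle "userID \t itemID \t weight" lines from ratings.dat
    # (hypothetical file name) and write an 80/20 train/test split.
    import random

    random.seed(42)
    with open('ratings.dat') as f:
        lines = [l for l in f if l.strip()]
    random.shuffle(lines)
    cut = int(0.8 * len(lines))
    with open('rating_train.dat', 'w') as f:
        f.writelines(lines[:cut])
    with open('rating_test.dat', 'w') as f:
        f.writelines(lines[cut:])
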
Owner

leihuichen (student)

Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Kashgari Overview | Performance | Installation | Documentation | Contributing We released the 2.0.0 version with TF2 Support. If you

Eliyar Eziz 2k Feb 9, 2021
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

pyannote 2.2k Jan 9, 2023
source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

WhiteningBERT Source code and data for paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. Preparation git clone https://github.com

null 49 Dec 17, 2022
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

What is MUSE? MUSE stands for Multilingual Universal Sentence Encoder - multilingual extension (16 languages) of Universal Sentence Encoder (USE). MUS

Dani El-Ayyass 47 Sep 5, 2022
A Structured Self-attentive Sentence Embedding

Structured Self-attentive sentence embeddings Implementation for the paper A Structured Self-Attentive Sentence Embedding, which was published in ICLR

Kaushal Shetty 488 Nov 28, 2022
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 37 Nov 6, 2022
🍊 PAUSE (Positive and Annealed Unlabeled Sentence Embedding), accepted by EMNLP'2021 🌴

PAUSE: Positive and Annealed Unlabeled Sentence Embedding Sentence embedding refers to a set of effective and versatile techniques for converting raw

EQT 21 Dec 15, 2022
A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

NEC Laboratories Europe 13 Sep 8, 2022
Korean Sentence Embedding Repository

Korean-Sentence-Embedding Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides

null 80 Jan 2, 2023
nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

Bernhard Liebl 2 Jun 10, 2022
Some embedding layer implementation using ivy library

ivy-manual-embeddings Some embedding layer implementation using ivy library. Just for fun. It is based on NYCTaxiFare dataset from kaggle (cut down to

Ishtiaq Hussain 2 Feb 10, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
:hot_pepper: R²SQL: "Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing." (AAAI 2021)

R²SQL The PyTorch implementation of paper Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing. (AAAI 2021) Requirement

huybery 60 Dec 31, 2022
Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

Google 6.4k Jan 1, 2023
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code, or quickly tr

Max Woolf 4.8k Dec 30, 2022
An assignment on creating a minimalist neural network toolkit for CS11-747

minnn by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik This is an exercise in developing a minimalist neural network toolkit for NLP, part of Car

Graham Neubig 63 Dec 29, 2022