Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Overview


This is the official repository for the EMNLP 2021 long paper Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration. We provide code for training and evaluating Phrase-BERT in addition to the datasets used in the paper.

Update: the model is now also available on Huggingface, thanks to help from whaleloops and nreimers!
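
With a recent sentence-transformers release, the Huggingface-hosted model can be loaded directly by name instead of from a local download. A minimal sketch, assuming the hub model id is whaleloops/phrase-bert (verify this on the Huggingface model hub):

from sentence_transformers import SentenceTransformer

# Assumed Huggingface model id; check the hub listing if loading fails
hf_model = SentenceTransformer('whaleloops/phrase-bert')
print(hf_model.encode(['play an active role'])[0][:5])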

Setup

This repository depends on sentence-BERT version 0.3.3, which you can install from source:

>>> git clone https://github.com/UKPLab/sentence-transformers.git --branch v0.3.3
>>> cd sentence-transformers/
>>> pip install -e .

Alternatively, you can install sentence-BERT with pip:

>>> pip install sentence-transformers==0.3.3
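
Either way, you can confirm that the installed version matches the one this repository expects:

>>> python -c "import sentence_transformers; print(sentence_transformers.__version__)"
0.3.3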

Quick Start

The following example shows how to use a trained Phrase-BERT model to embed phrases into dense vectors.

First, download and unzip our model:

>>> cd <path-to-your-model-directory>
>>> wget https://storage.googleapis.com/phrase-bert/phrase-bert/phrase-bert-model.zip
>>> unzip phrase-bert-model.zip -d phrase-bert-model/
>>> rm phrase-bert-model.zip

Then load the Phrase-BERT model through the sentence-BERT interface:

from sentence_transformers import SentenceTransformer
model_path = '<path-to-your-model-directory>/phrase-bert-model'
model = SentenceTransformer(model_path)

You can compute phrase embeddings using Phrase-BERT as follows:

phrase_list = [ 'play an active role', 'participate actively', 'active lifestyle']
phrase_embs = model.encode( phrase_list )
[p1, p2, p3] = phrase_embs

As in sentence-BERT, the default output is a list of numpy arrays:

for phrase, embedding in zip(phrase_list, phrase_embs):
    print("Phrase:", phrase)
    print("Embedding:", embedding)
    print("")

An example of computing the dot product of phrase embeddings:

import numpy as np
print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')

An example of computing cosine similarity of phrase embeddings:

import torch 
from torch import nn
cos_sim = nn.CosineSimilarity(dim=0)
print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim( torch.tensor(p1), torch.tensor(p2))}')
print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim( torch.tensor(p1), torch.tensor(p3))}')
print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim( torch.tensor(p2), torch.tensor(p3))}')

The output should look like:

The dot product between phrase 1 and 2 is: 218.43600463867188
The dot product between phrase 1 and 3 is: 165.48483276367188
The dot product between phrase 2 and 3 is: 160.51708984375
The cosine similarity between phrase 1 and 2 is: 0.8142536282539368
The cosine similarity between phrase 1 and 3 is: 0.6130303144454956
The cosine similarity between phrase 2 and 3 is: 0.584893524646759
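
Note that the dot products above are much larger than 1 because Phrase-BERT embeddings are not unit-normalized by default. A minimal numpy sketch (reusing p1 and p2 from above) that normalizes the vectors so their dot product matches the cosine similarity:

import numpy as np

# Unit-normalize the embeddings; the dot product of unit vectors equals cosine similarity
p1_unit = p1 / np.linalg.norm(p1)
p2_unit = p2 / np.linalg.norm(p2)
print(f'Cosine similarity via normalized dot product: {np.dot(p1_unit, p2_unit)}')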

Evaluation

Given the lack of a unified phrase embedding evaluation benchmark, we collect the following five phrase semantics evaluation tasks, which are described further in our paper:

Update config/model_path.py with the model path according to your directories, then:

  • For evaluation on Turney, run python eval_turney.py

  • For evaluation on BiRD, run python eval_bird.py

  • For evaluation on PPDB / PPDB-filtered / PAWS-short, run eval_ppdb_paws.py with:

    nohup python -u eval_ppdb_paws.py \
        --full_run_mode \
        --task <task-name> \
        --data_dir <data-dir> \
        --result_dir <result-dir> \
        >./output.txt 2>&1 &

Train your own Phrase-BERT

If you would like to go beyond using the pre-trained Phrase-BERT model, you may train your own Phrase-BERT using data from the domain you are interested in. Please refer to phrase-bert/phrase_bert_finetune.py.

The datasets we used to fine-tune Phrase-BERT are here: training data csv file and validation data csv file.

To reproduce the trained Phrase-BERT, run:

export INPUT_DATA_PATH=<input-data-path>
export TRAIN_DATA_FILE=<train-data-file>
export VALID_DATA_FILE=<valid-data-file>
export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens
export OUTPUT_MODEL_PATH=<output-model-path>


python -u phrase_bert_finetune.py \
    --input_data_path $INPUT_DATA_PATH \
    --train_data_file $TRAIN_DATA_FILE \
    --valid_data_file $VALID_DATA_FILE \
    --input_model_path $INPUT_MODEL_PATH \
    --output_model_path $OUTPUT_MODEL_PATH
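
After fine-tuning finishes, the resulting model can be loaded back through the same sentence-BERT interface. A minimal sketch, assuming the script saves a sentence-BERT-format model at the directory you set as OUTPUT_MODEL_PATH:

import os
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the output directory used above
finetuned_model = SentenceTransformer(os.environ['OUTPUT_MODEL_PATH'])
print(finetuned_model.encode(['your domain-specific phrase'])[0][:5])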

Citation

Please cite us if you find this useful:

@inproceedings{phrasebertwang2021,
    author = {Shufan Wang and Laure Thompson and Mohit Iyyer},
    booktitle = {Empirical Methods in Natural Language Processing},
    year = {2021},
    title = {Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration}
}