This project provides an unsupervised framework for mining and tagging quality phrases on text corpora with pretrained language models (KDD'21).

Xiaotao Gu

Last update: Dec 22, 2022

Overview

UCPhrase: Unsupervised Context-aware Quality Phrase Tagging

To appear at KDD'21...[pdf]

This project provides an unsupervised framework for mining and tagging quality phrases on text corpora. In this work, we recognize the power of pretrained language models in identifying the structure of a sentence. The attention matrices generated by a Transformer model are informative for distinguishing quality phrases from ordinary spans, as illustrated in the following example.

With a lightweight CNN model that captures inter-word relationships at various ranges, we can effectively tackle the task of phrase tagging as a multi-channel image classification problem.
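As a rough illustration of the multi-channel image view, the sketch below crops every attention matrix to a candidate span and zero-pads it to a fixed size, giving one channel per (layer, head) pair. The function name span_attention_image, the nested-list layout attentions[layer][head][i][j], and the zero-padding scheme are assumptions made for illustration; they are not taken from the UCPhrase code.

```python
def span_attention_image(attentions, start, end, size=None):
    """Turn the attention inside a span into a multi-channel "image".

    attentions[l][h][i][j] is assumed to hold the attention weight from
    token i to token j in head h of layer l (hypothetical layout).
    The span covers tokens start .. end-1. Each channel is a
    size x size matrix; cells outside the span are zero-padded.
    """
    width = end - start
    size = width if size is None else size
    if size < width:
        raise ValueError("size must be at least the span length")

    channels = []
    for layer in attentions:
        for head in layer:
            # Keep only attention among the tokens inside the span.
            crop = [list(row[start:end]) for row in head[start:end]]
            # Pad columns, then rows, so every span has the same shape.
            padded = [row + [0.0] * (size - width) for row in crop]
            padded += [[0.0] * size for _ in range(size - width)]
            channels.append(padded)
    return channels
```

A CNN classifier would then take the resulting stack of channels as its input, in the same way an image classifier takes color channels.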

For model training, we seek to alleviate the need for human annotation and external knowledge bases. Instead, we show that sufficient supervision can be mined directly from a large-scale unlabeled corpus. Specifically, we mine frequent max patterns with each document as context, since by definition, high-quality phrases are sequences that are consistently used in context. Compared with labels generated by distant supervision, silver labels mined from the corpus itself preserve better diversity, coverage, and contextual completeness. This advantage is supported by comparisons on two public datasets.
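The idea of mining frequent max patterns within a single document can be sketched as follows. This is a simplified illustration, not the implementation in this repository: it assumes a pattern is a contiguous word n-gram, that "frequent" means occurring at least min_freq times in the document, and that "max" means not contained in any longer frequent n-gram. The function name and the parameters min_freq and max_len are hypothetical.

```python
from collections import Counter


def mine_max_patterns(tokens, min_freq=2, max_len=5):
    """Return the maximal frequent n-grams (n >= 2) of one document."""
    # Count every contiguous n-gram of length 2 .. max_len.
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1

    frequent = {gram for gram, c in counts.items() if c >= min_freq}

    def contained_in(short, long):
        # True if `short` appears as a contiguous sub-sequence of `long`.
        k = len(short)
        return any(long[i:i + k] == short for i in range(len(long) - k + 1))

    # Keep only patterns that no longer frequent pattern contains.
    maximal = [
        g for g in frequent
        if not any(len(h) > len(g) and contained_in(g, h) for h in frequent)
    ]
    return sorted(maximal)


doc = ("we study heat island effect in cities . "
       "the heat island effect is strong . heat island effect").split()
print(mine_max_patterns(doc))
```

In this toy document, "heat island" and "island effect" are frequent but are absorbed by the longer frequent pattern "heat island effect", which is the only one kept as a silver label candidate.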

We compare our method with existing ones on the KP20k dataset (publication data from the CS domain) and the KPTimes dataset (news articles). UCPhrase significantly outperforms prior art without supervision. Compared with off-the-shelf phrase tagging tools, UCPhrase also shows unique advantages, especially in its ability to generalize to specific domains without relying on manually curated labels or KBs. We provide comprehensive case studies that compare the different tagging methods, and we report some interesting findings in the discussion sections.

We aim to build UCPhrase as a practical tool for phrase tagging, though it is certainly far from perfect. Please feel free to try it on your own corpus and give us feedback if you have any ideas that can help build better phrase tagging tools!

Facts: UCPhrase is joint work by researchers from the University of Illinois at Urbana-Champaign and the University of California San Diego.

Quick Start

Step 1: Download and unzip the data folder

wget "https://www.dropbox.com/s/1bv7dnjawykjsji/data.zip?dl=0" -O data.zip
unzip -n data.zip

Step 2: Install and compile dependencies

bash build.sh

Step 3: Run experiments

cd src
python exp.py --gpu 0 --dir_data ../data/devdata

Model checkpoints and output files will be stored under the generated "experiments" folder.

Citation

If you find the implementation useful, please consider citing the following paper:

Xiaotao Gu*, Zihan Wang*, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han, Jingbo Shang, "UCPhrase: Unsupervised Context-aware Quality Phrase Tagging", in Proc. of 2021 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'21), Aug. 2021

Comments

Missing file for '/standard/kp20k.tagging.human_2.json'

When I download the data from https://www.dropbox.com/s/1bv7dnjawykjsji/data.zip?dl=0 and unzip the package, there is no file '/standard/kp20k.tagging.human_2.json'.

opened by possible1402 4

How to extract phrases using the Stanford CoreNLP tool?

Hi,

Thanks for releasing the code for this paper. I read your paper and found that you compared your results with pretrained models such as StanfordNLP. I am curious how to extract phrases/mentions with StanfordNLP. Did you use the "coref" annotator to get coreferent mentions as extracted phrases? Or are there other tools based on Stanford CoreNLP?

Looking forward to your reply.

Thanks in advance.

opened by yangjingyi 3

Results interpretation

This might be a repeat, but I didn't find an answer. Could you please specify where I can find the actual resulting phrases after running the code? I scanned the code and all the configs multiple times, but I don't see them. Unless it's the spans in Attmap.cs_roberta_base.3layers, but in that case I'm lost trying to interpret them. Cheers!

Edit: I checked devdata-cs_roberta_base-core.CNN.3layers/model and these spans make much more sense. Still, could you please confirm those are the right ones and explain how to interpret them?

opened by icanfast 2

-

This PR inserts the list of products as silver labels in the candidate generation step. In addition, the prepare_dataset.py script has been improved to filter out reviews that are in another language and to remove HTML tags. It also removes the different patterns identified for each review_source and adds some logs to debug the training process.

This needs another PR to improve model training and clean up the repository.

opened by alejandrods 0

What are the labels of the CNN?

Hi! Excuse me, what are the labels of the CNN model? Is the input of the CNN the "attention map" and the output of the model the "silver labels"? If so, the silver labels are not all correct, are they?

opened by ZYuliang 0

Getting "Killed" error when running on the kp20k dataset

When I run python exp.py --gpu 0 --dir_data ../data/kp20k, I get the "Killed" error. While debugging, I found that the error happens at the statement return torch.tensor(padded_tensor, dtype=torch.float32) (line 66 of src/model_att/model.py).

Is there a way to fix this?

opened by nanthamanish 5

Learning Dense Representations of Phrases at Scale (Lee et al., 2020)

Text mining project; Using distilBERT to predict authors in the classification task authorship attribution.

PyTorch reimplementation of the paper "A Novel Cascade Binary Tagging Framework for Relational Triple Extraction" (ACL 2020). The original code is written in Keras.

Measuring and Improving Consistency in Pretrained Language Models

Using pretrained language models for biomedical knowledge graph completion.

VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Code and data from the paper BERT Got a Date: Introducing Transformers to Temporal Tagging

Code for the TASLP paper "PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation".

Amazon Forest Computer Vision: Satellite Image tagging code using PyTorch / Keras with lots of PyTorch tricks

Easy to use Audio Tagging in PyTorch

Sequence-tagging using deep learning

Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch

pyhsmm - library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations.

pytorch implementation of "Contrastive Multiview Coding", "Momentum Contrast for Unsupervised Visual Representation Learning", and "Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination"

[ACL 2022] LinkBERT: A Knowledgeable Language Model 😎 Pretrained with Document Links