Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations

Yu Meng

Last update: Dec 18, 2022

Related tags

Overview

TopClus

The source code used for Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations, published in WWW 2022.

Requirements

At least one GPU is required to run the code.

Before running, you need to first install the required packages by typing following commands (Using a virtual environment is recommended):

pip3 install -r requirements.txt

You need to also download the following resources in NLTK:

import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

Overview

TopClus is an unsupervised topic discovery method that jointly models words, documents and topics in a latent spherical space derived from pretrained language model representations.

Running Topic Discovery

The entry script is src/trainer.py and the meanings of the command line arguments will be displayed upon typing

python src/trainer.py -h

The topic discovery results will be written to results_${dataset}.

We provide two example scripts nyt.sh and yelp.sh for running topic discovery on the New York Times and the Yelp Review corpora used in the paper, respectively. You need to first extract the text files from the .tar.gz tarball files under datasets/nyt and datasets/yelp.

You could expect to obtain results like the following (the Topic IDs are random):

On New York Times:
Topic 20: months,weeks,days,decades,years,hours,decade,seconds,moments,minutes
Topic 28: weapons,missiles,missile,nuclear,grenades,explosions,explosives,launcher,bombs,bombing
Topic 30: healthcare,medical,medicine,physicians,patients,health,hospitals,bandages,medication,physician
Topic 41: economic,commercially,economy,business,industrial,industry,market,consumer,trade,commerce
Topic 46: senate,senator,congressional,legislators,legislatures,ministry,legislature,minister,ministerial,parliament
Topic 72: government,administration,governments,administrations,mayor,gubernatorial,mayoral,mayors,public,governor
Topic 77: aircraft,airline,airplane,airlines,voyage,airplanes,aviation,planes,spacecraft,flights
Topic 88: baseman,outfielder,baseball,innings,pitchers,softball,inning,basketball,shortstop,pitcher

On Yelp Review:
Topic 1: steamed,roasted,fried,shredded,seasoned,sliced,frozen,baked,canned,glazed
Topic 15: nice,cozy,elegant,polite,charming,relaxing,enjoyable,pleasant,helpful,luxurious
Topic 16: spicy,fresh,creamy,stale,bland,salty,fluffy,greasy,moist,cold
Topic 17: flavor,texture,flavors,taste,quality,smells,tastes,flavour,scent,ingredients
Topic 20: japanese,german,australian,moroccan,russian,greece,italian,greek,asian,
Topic 40: drinks,beers,beer,wine,beverages,alcohol,beverage,vodka,champagne,wines
Topic 55: horrible,terrible,shitty,awful,dreadful,worst,worse,disgusting,filthy,rotten
Topic 75: strawberry,berry,onion,peppers,tomato,onions,potatoes,vegetable,mustard,garlic

Running Document Clustering

The latent document embeddings will be saved to results_${dataset}/latent_doc_emb.pt which can be used as features to clustering algorithms (e.g., K-Means).

If you have ground truth document labels, you could obtain the document clustering evaluation results by passing the document label file and the saved latent document embedding file to the cluster_eval function in src/utils.py. For example:

from src.utils import TopClusUtils
utils = TopClusUtils()
utils.cluster_eval(label_path="datasets/nyt/label_topic.txt", emb_path="results_nyt/latent_doc_emb.pt")

Running on New Datasets

To execute the code on a new dataset, you need to

Create a directory named your_dataset under datasets.
Prepare a text corpus texts.txt (one document per line) under your_dataset as the target corpus for topic discovery.
Run src/trainer.py with appropriate command line arguments (the default values are usually good start points).

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{meng2022topic,
  title={Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations},
  author={Meng, Yu and Zhang, Yunyi and Huang, Jiaxin and Zhang, Yu and Han, Jiawei},
  booktitle={The Web Conference},
  year={2022},
}

You might also like...

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt. This is done by

135 Dec 30, 2022

This is the implementation of "SELF SUPERVISED REPRESENTATION LEARNING WITH DEEP CLUSTERING FOR ACOUSTIC UNIT DISCOVERY FROM RAW SPEECH" submitted to ICASSP 2022

CPC_DeepCluster This is the implementation of "SELF SUPERVISED REPRESENTATION LEARNING WITH DEEP CLUSTERING FOR ACOUSTIC UNIT DISCOVERY FROM RAW SPEEC

2 Sep 15, 2022

Graph Regularized Residual Subspace Clustering Network for hyperspectral image clustering

Comments

I think one word has multiple contextualized embeddings in the corpus. How do you deal with that?

I like your paper but I think it's confusing that how to tackle multiple embeddings of the same word/token. I wonder is there any chance that different embeddings of the same word are mapped to different clusters and all of them are quite close to the cluster center in the spherical space. How do you deal with that?

opened by namespace-Pt 4
A request for evaluation script

Hello, Could you please share the script used for evaluation? The obtained results may differ with different evaluation scripts. Thanks a lot!

Best wishes, Lei

opened by lei-liu1 5

Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations

Related tags

Overview

TopClus

Requirements

Overview

Running Topic Discovery

Running Document Clustering

Running on New Datasets

Citations

You might also like...

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt

This is the implementation of "SELF SUPERVISED REPRESENTATION LEARNING WITH DEEP CLUSTERING FOR ACOUSTIC UNIT DISCOVERY FROM RAW SPEECH" submitted to ICASSP 2022

Graph Regularized Residual Subspace Clustering Network for hyperspectral image clustering

Awesome Deep Graph Clustering is a collection of SOTA, novel deep graph clustering methods

Systemic Evolutionary Chemical Space Exploration for Drug Discovery

Navigating StyleGAN2 w latent space using CLIP

[CVPR 2020] Interpreting the Latent Space of GANs for Semantic Face Editing

MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space

PyTorch implementation of the WarpedGANSpace: Finding non-linear RBF paths in GAN latent space (ICCV 2021)

Comments

I think one word has multiple contextualized embeddings in the corpus. How do you deal with that?

A request for evaluation script

Owner

Yu Meng

Code for: Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space. Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, Amr Ahmed. KDD 2019.

This respository includes implementations on Manifoldron: Direct Space Partition via Manifold Discovery

PyTorch implementation of: Michieli U. and Zanuttigh P., "Continual Semantic Segmentation via Repulsion-Attraction of Sparse and Disentangled Latent Representations", CVPR 2021.

Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search

Face Identity Disentanglement via Latent Space Mapping [SIGGRAPH ASIA 2020]

Non-Official Pytorch implementation of "Face Identity Disentanglement via Latent Space Mapping" https://arxiv.org/abs/2005.07728 Using StyleGAN2 instead of StyleGAN

Disentangled Face Attribute Editing via Instance-Aware Latent Space Search, accepted by IJCAI 2021.

PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models

Code for "SRHEN: Stepwise-Refining Homography Estimation Network via Parsing Geometric Correspondences in Deep Latent Space"

Implementation based on Paper - Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling