Biterm Topic Model (BTM): modeling topics in short texts

Maksim Terpilowski

Last update: Dec 30, 2022

Related tags

Text Data & NLP visualization python nlp machine-learning natural-language-processing cython topic-modeling nlp-machine-learning btm topic-models biterm-topic-model

Overview

Biterm Topic Model

Bitermplus implements the Biterm Topic Model (BTM) for short texts, introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. It is a cythonized version of BTM. The package can also compute perplexity and semantic coherence metrics.
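For readers new to the model: a biterm is an unordered pair of words that co-occur in the same short text. The snippet below is a minimal sketch of that idea using only the standard library. The function name extract_biterms is hypothetical and this is not how bitermplus generates biterms internally (the package works on vectorized documents via btm.get_biterms).

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs (biterms) from one tokenized short text.

    Hypothetical helper for illustration only, not part of bitermplus.
    """
    # Sort each pair so that ("b", "a") and ("a", "b") count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


doc = ["apple", "store", "iphone"]
biterms = extract_biterms(doc)
print(biterms)
```

A text with n tokens yields n * (n - 1) / 2 biterms, which is why the approach suits short texts.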

Development

Please note that bitermplus is under active development. Refer to the documentation to stay up to date.

Requirements

cython
numpy
pandas
scipy
scikit-learn
tqdm

Setup

Linux and Windows

There should be no issues with installing bitermplus under these OSes. You can install the package directly from PyPI.

pip install bitermplus

Or from this repo:

pip install git+https://github.com/maximtrp/bitermplus.git

Mac OS

First, you need to install the Xcode Command Line Tools and Homebrew. Then install libomp using brew:

xcode-select --install
brew install libomp
pip3 install bitermplus

Example

Model fitting

import bitermplus as btm
import numpy as np
import pandas as pd

# IMPORTING DATA
df = pd.read_csv(
    'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()

# PREPROCESSING
# Obtaining terms frequency in a sparse matrix and corpus vocabulary
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
tf = np.array(X.sum(axis=0)).ravel()
# Vectorizing documents
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
docs_lens = list(map(len, docs_vec))
# Generating biterms
biterms = btm.get_biterms(docs_vec)

# INITIALIZING AND RUNNING MODEL
model = btm.BTM(
    X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)

# METRICS
perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
# or
perplexity = model.perplexity_
coherence = model.coherence_
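
To make the perplexity metric less of a black box, here is a minimal sketch of the textbook definition, written in plain Python on toy data. The function name toy_perplexity and the toy matrices are made up for illustration. Whether bitermplus computes exactly this quantity is an assumption, so treat it as an explanation of the concept rather than a reimplementation.

```python
import math


def toy_perplexity(p_wz, p_zd, counts):
    """Perplexity of documents under a topic model (textbook definition).

    p_wz[t][w]   -- P(word w | topic t)
    p_zd[d][t]   -- P(topic t | document d)
    counts[d][w] -- number of occurrences of word w in document d
    """
    log_likelihood = 0.0
    total_words = 0
    for d, doc_counts in enumerate(counts):
        for w, n in enumerate(doc_counts):
            if n == 0:
                continue
            # P(w | d) = sum over topics of P(w | t) * P(t | d)
            p_wd = sum(p_wz[t][w] * p_zd[d][t] for t in range(len(p_wz)))
            log_likelihood += n * math.log(p_wd)
            total_words += n
    return math.exp(-log_likelihood / total_words)


# Toy data: 2 topics, 2 words, 2 documents.
p_wz = [[0.5, 0.5], [0.5, 0.5]]
p_zd = [[1.0, 0.0], [0.0, 1.0]]
counts = [[1, 1], [2, 0]]
ppl = toy_perplexity(p_wz, p_zd, counts)
print(ppl)
```

With uniform word probabilities over two words the result is 2, i.e. the model is as uncertain as a fair coin flip per word. Lower values mean a better fit to the counts passed in.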

Results visualization

You need to install tmplot first.

import tmplot as tmp
tmp.report(model=model, docs=texts)

Tutorial

There is a tutorial in the documentation that covers the important steps of topic modeling (including stability measures and results visualization).

Comments

The topic distribution for all docs is similar

topic

[9.99998750e-01 3.12592152e-07 3.12592152e-07 3.12592152e-07 3.12592152e-07]
[9.99999903e-01 2.43742411e-08 2.43742411e-08 2.43742411e-08 2.43742411e-08]
[9.99999264e-01 1.83996702e-07 1.83996702e-07 1.83996702e-07 1.83996702e-07]
[9.99998890e-01 2.77376339e-07 2.77376339e-07 2.77376339e-07 2.77376339e-07]
[9.99999998e-01 3.94318712e-10 3.94318712e-10 3.94318712e-10 3.94318712e-10]
[9.99998428e-01 3.92884503e-07 3.92884503e-07 3.92884503e-07 3.92884503e-07]
bug help wanted good first issue

opened by JennieGerhardt 11
ERROR: Failed building wheel for bitermplus

creating build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
#include <omp.h>
         ^~~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit code 1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
bug documentation

opened by QinrenK 9
Got an unexpected result in marked sample

Hi @maximtrp, I am trying to use bitermplus for topic modeling. However, when I train the model on labeled samples, I get unexpected results. First, the labeled samples contain 5 classes, but the trained model yields a huge perplexity when the number of topics is 5. Second, when I vary the number of topics from 1 to 20, the perplexity decreases as the number of topics increases. My code is as follows:

df = pd.read_csv('dataPretreatment/data/corpus.txt', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()
print(df)
stop_words = segmentWord.stopwordslist()
perplexitys = []
coherences = []

for T in range(1,21,1):
    print(T)
    X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
    # Vectorizing documents
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
    # Generating biterms
    biterms = btm.get_biterms(docs_vec)
    # INITIALIZING AND RUNNING MODEL
    model = btm.BTM(X, vocabulary, seed=12321, T=T, M=50, alpha=50/T, beta=0.01)
    model.fit(biterms, iterations=2000)
    p_zd = model.transform(docs_vec)
    perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, T)
    coherence = model.coherence_
    perplexitys.append(perplexity)
    coherences.append(coherence)

opened by Chen-X666 7
Getting the error 'CountVectorizer' object has no attribute 'get_feature_names_out'

Hi @maximtrp, I am trying to use bitermplus for topic modeling. Running the code produces the error mentioned in the title. It seems that something goes wrong in the get_words_freqs function. I would appreciate it if you could advise how to fix that.

opened by Sajad7010 4

Cannot find Closest topics and Stable topics

Hello there, I am able to generate the model and visualize it. But when I tried to find the closest topics and stable topics, I get the error for code line:

closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=139, verbose=True)

The error is:

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

This happens even though I separately checked the array and it is 2-D. I am pasting the code below. Could you please check whether I am doing anything wrong?

Thank you.

X, vocabulary, vocab_dict = btm.get_words_freqs(clean_text, max_df=.85, min_df=15,ngram_range=(1,2))

# Vectorizing documents
docs_vec = btm.get_vectorized_docs(clean_text, vocabulary)

# Generating biterms
Y = X.todense()
biterms = btm.get_biterms(docs_vec, 15)

# INITIALIZING AND RUNNING MODEL
model = btm.BTM(X, vocabulary, T=8, M=10, alpha=500/1000, beta=0.01, win=15, has_background= True)
model.fit(biterms, iterations=500, verbose=True)
p_zd = model.transform(docs_vec,verbose=True)  
print(p_zd) 

# matrix of document-topics; topics vs. documents, topics vs. words probabilities 
matrix_docs_topics = model.matrix_docs_topics_    #Documents vs topics probabilities matrix.
topic_doc_matrix = model.matrix_topics_docs_      #Topics vs documents probabilities matrix.
matrix_topic_words = model.matrix_topics_words_   #Topics vs words probabilities matrix.

# Getting stable topics
print("Array Dimension = ",len(matrix_topic_words.shape))
closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=100, verbose=True)
stable_topics, stable_kl = btm.get_stable_topics(closest_topics, thres=0.7)

# Stable topics indices list
print(stable_topics)

help wanted question

opened by RashmiBatra 4

Questions regarding Perplexity and Model Comparison with C++

I have two questions regarding this model. First, I noticed that the perplexity evaluation metric is implemented. However, perplexity is traditionally computed on a held-out dataset. Does that mean that, when using this model, we should hold out a certain proportion of the data and compute the perplexity on samples that were not used for training?

Second, I was trying to compare this implementation with the C++ version from the original paper. The results (the top words in each topic) are quite different when the same parameters are used on the same corpus. Do you know what might be causing that and which part was implemented differently?
help wanted question

opened by orpheus92 3
How do I get the topic words?

Hi,

Firstly, thanks for sharing your code.

Not an issue, just a question. I'm able to see the relevant words for a topic in the tmplot report. How do I get those words? I need to get at least the most three relevant terms.

Thanks in advance.
question

opened by aguinaldoabbj 3
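
One way to approach this question is to rank the columns of a topics vs words probability matrix and map the best indices back to the vocabulary. The sketch below uses a small hand-made matrix and vocabulary so that it runs on its own. The function name top_words is hypothetical. With a fitted model, the inputs would presumably be model.matrix_topics_words_ and the vocabulary returned by btm.get_words_freqs, as in the model fitting example.

```python
def top_words(topics_words, vocabulary, n=3):
    """Return the n most probable words for each topic.

    topics_words[t][w] -- P(word w | topic t), one row per topic
    vocabulary[w]      -- the word behind column index w
    """
    result = []
    for row in topics_words:
        # Indices of the n largest probabilities in this topic's row.
        best = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:n]
        result.append([vocabulary[i] for i in best])
    return result


# Toy stand-ins for a fitted model's matrix and vocabulary.
vocabulary = ["apple", "goal", "iphone", "match", "team"]
topics_words = [
    [0.40, 0.05, 0.35, 0.12, 0.08],
    [0.05, 0.30, 0.05, 0.25, 0.35],
]
words = top_words(topics_words, vocabulary, n=3)
print(words)
```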

failed building wheels

Hi!

I've got an error when running pip3 install bitermplus on MacOS (intel-based, Ventura), using python 3.10.8 in a separate venv (not anaconda):

Building wheels for collected packages: bitermplus
  Building wheel for bitermplus (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for bitermplus (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [34 lines of output]
      Error in sitecustomize; set PYTHONVERBOSE for traceback:
      AssertionError:
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-12-x86_64-cpython-310
      creating build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/__init__.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_util.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      running egg_info
      writing src/bitermplus.egg-info/PKG-INFO
      writing dependency_links to src/bitermplus.egg-info/dependency_links.txt
      writing requirements to src/bitermplus.egg-info/requires.txt
      writing top-level names to src/bitermplus.egg-info/top_level.txt
      reading manifest file 'src/bitermplus.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      adding license file 'LICENSE'
      writing manifest file 'src/bitermplus.egg-info/SOURCES.txt'
      copying src/bitermplus/_btm.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_btm.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_metrics.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_metrics.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      running build_ext
      building 'bitermplus._btm' extension
      creating build/temp.macosx-12-x86_64-cpython-310
      creating build/temp.macosx-12-x86_64-cpython-310/src
      creating build/temp.macosx-12-x86_64-cpython-310/src/bitermplus
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -I/usr/local/opt/python@3.10/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-12-x86_64-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
      src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
      #include <omp.h>
               ^~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

Could this error be related to #29? I've tested on a PC and it worked though.

bug documentation

opened by alanmaehara 2

Failed building wheel for bitermplus

Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

When I try to install bitermplus with pip install bitermplus, there is an error message like this: note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for bitermplus ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
bug

opened by novra 2
Calculation of nmi,ami,ri

I'm trying to test the model and see if it matches the data labels, but I can't get the topic for each document. I'm trying to get the list of labels to apply nmi, ami and ri so I'm wondering how to get the labels from the model. @maximtrp

opened by gitassia 2
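
A common way to turn a documents vs topics matrix into one label per document is to take the most probable topic in each row. The sketch below shows this on toy data. The function name most_probable_topics is hypothetical, and p_zd here is a hand-made stand-in for the matrix returned by model.transform(docs_vec). According to the release notes further down, v0.6.12 also exposes a labels_ property for the same purpose.

```python
def most_probable_topics(docs_topics):
    """Return the index of the most probable topic for each document.

    docs_topics[d][t] -- P(topic t | document d), one row per document
    """
    return [max(range(len(row)), key=lambda t: row[t]) for row in docs_topics]


# Toy stand-in for a documents vs topics matrix (3 documents, 3 topics).
p_zd = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
]
labels = most_probable_topics(p_zd)
print(labels)
```

The resulting list can then be compared with the true labels using clustering metrics, for example normalized_mutual_info_score from scikit-learn.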
Implementation Guide

I was wondering whether there is any way to print the topics generated by the BTM model, just like I can with Gensim. In addition, I am getting all negative coherence values in the range of -500 or -600. I am not sure if I am doing something wrong. The issue is that I am not able to interpret the results; even plotting gives some strange output.

The following image shows what is held by the variable adobe. Again, I am not sure whether it needs to be in this form or whether each row needs to be a list.

opened by neel6762 2
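
Negative coherence values are not necessarily a sign of an error. Assuming the metric is a UMass-style coherence (Mimno et al., 2011), which is an assumption here and not verified against the bitermplus source, the score is a sum of logarithms of ratios that are usually below 1, so it comes out negative. The sketch below computes such a score on toy documents. The function name umass_coherence and the data are made up for illustration.

```python
import math


def umass_coherence(top_words, docs):
    """UMass-style coherence for one topic's top words.

    top_words -- the topic's words, ordered from most to least probable
    docs      -- tokenized documents (lists of words)
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_freq = doc_freq(top_words[m], top_words[l])
            # Smoothed ratio: (co-document frequency + 1) / document frequency.
            score += math.log((co_freq + 1) / doc_freq(top_words[l]))
    return score


docs = [
    ["apple", "iphone"],
    ["apple", "store"],
    ["iphone", "store"],
    ["apple"],
]
score = umass_coherence(["apple", "iphone", "store"], docs)
print(score)
```

With 20 top words there are 190 word pairs per topic, so sums in the hundreds below zero are plausible under this definition. Values closer to zero indicate more coherent topics.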

Releases(v0.6.12)

v0.6.12(Mar 29, 2022)

This release contains some minor fixes and adds labels_ property to BTM model class (labels for the most probable topics for each of the documents). It also adds get_docs_top_topic method for creating DataFrames with documents and their labels.
v0.6.11(Jan 8, 2022)

This release fixes the incompatibility error between bitermplus and scikit-learn.
v0.6.10(Dec 16, 2021)

This release includes a number of minor fixes. Methods to select stable topics have been moved to tmplot package. Please see the updated tutorial in the documentation.
v0.6.9(Aug 19, 2021)

This release introduces a function for Renyi entropy calculation (bitermplus.entropy) that can be used to estimate the optimal number of topics. For more details, read this paper.
v0.6.8(Jul 23, 2021)

This release is an attempt to fix the issue with perplexity calculation yielding infinity values (#7).
v0.6.7(Jul 1, 2021)
This release drops support for pyLDAvis in favor of tmplot that can be installed with pip (optional):

pip install tmplot
v0.6.6(Jun 16, 2021)

This release exposes new model attributes: matrix_topics_docs_, matrix_words_topics_, and df_words_topics_ (words vs topics probabilities in a DataFrame).
v0.6.5(Jun 11, 2021)

This release fixes a critical bug in the closest topics selection (get_closest_topics method).
v0.6.4(Apr 18, 2021)

This release includes memory optimizations and new metrics for topics distance measuring (see get_closest_topics method).
v0.6.3(Apr 7, 2021)

This release fixes a bug in transform method that occurred when empty documents were passed as inputs.
v0.6.2(Apr 6, 2021)

This release fixes a bug in document vs topics matrix shape (reported in this issue).
v0.6.1(Apr 5, 2021)

This is a minor release that fixes buffer types mismatch on creating biterms (critical bug that appeared under Windows).
v0.6.0(Apr 4, 2021)
This is a major release that fixes critical bugs in arrays initialization. The previous versions of bitermplus are not recommended for use.

Changelog:

Arrays (n_bz, n_wz) are now properly initialized. This procedure was broken in the previous versions, which led to biased results.

Data normalization (via _normalize hidden method) improved.

New NumPy random generators are used to initially assign topics to biterms.

Biterms (biterms_ model attribute) and topics probabilities (theta_ model attribute) are now available.

Biterms are now serialized as well when model is saved.

v0.5.10(Mar 23, 2021)

This release improves model pickling and adds a seed argument to the fit() method of the BTM class.
v0.5.9(Mar 22, 2021)

In this release public extension attributes were converted to properties with comprehensible names and docstrings.
v0.5.8(Mar 21, 2021)

This release fixes numerous bugs in the inference methods, optimizes memory usage, and covers most of the model fitting and inference code with tests.

Owner

Maksim Terpilowski

Research scientist

GitHub https://bitermplus.readthedocs.io/en/stable/

