OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)

MIND

Last update: Jan 1, 2023

Related tags

Deep Learning natural-language-processing hyperparameter-optimization topic-modeling bayesian-optimization hyperparameter-tuning latent-dirichlet-allocation evaluation-metrics neural-topic-models latent-semantic-analysis topic-models hyperparameter-search non-negative-matrix-factorization

Overview

OCTIS : Optimizing and Comparing Topic Models is Simple!

OCTIS (Optimizing and Comparing Topic models Is Simple) aims at training, analyzing and comparing Topic Models, whose optimal hyper-parameters are estimated by means of a Bayesian Optimization approach.

Install

You can install OCTIS with the following command:

pip install octis

You can find the requirements in the requirements.txt file.

Features

Preprocess your own dataset or use one of the already-preprocessed benchmark datasets
Well-known topic models (both classical and neural)
Evaluate your model using different state-of-the-art evaluation metrics
Optimize the models' hyperparameters for a given metric using Bayesian Optimization
Python library for advanced usage or simple web dashboard for starting and controlling the optimization experiments

Examples and Tutorials

To easily understand how to use OCTIS, we invite you to try our tutorials out :)

Name	Link
How to build a topic model and evaluate the results (LDA on 20Newsgroups)
How to optimize the hyperparameters of a neural topic model (CTM on M10)

Load a preprocessed dataset

Load one of the already preprocessed datasets as follows:

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

Just use one of the dataset names listed below. Note: it is case-sensitive!

Available Datasets

Name	Source	# Docs	# Words	# Labels
20NewsGroup	20Newsgroup	16309	1612	20
BBC_News	BBC-News	2225	2949	5
DBLP	DBLP	54595	1513	4
M10	M10	8355	1696	10

Otherwise, you can load a custom preprocessed dataset in the following way:

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("../path/to/the/dataset/folder")

Make sure that the dataset is in the following format:

corpus file: a .tsv file (tab-separated) that contains up to three columns, i.e. the document, the partition, and the label associated with the document (optional).
vocabulary: a .txt file where each line represents a word of the vocabulary

The partition can be "training", "test" or "validation". An example dataset can be found here: sample_dataset.
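The format described above can be produced with the standard library alone. The following is a minimal sketch, not part of OCTIS: the helper name write_sample_dataset and the sample rows are hypothetical, and the file name vocabulary.txt is an assumption (the text above only says the vocabulary is a .txt file). The name corpus.tsv matches the path that load_custom_dataset_from_folder reads in the traceback quoted further down this page.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_dataset(folder, rows, vocabulary):
    """Write a tab-separated corpus file and a one-word-per-line vocabulary file."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # corpus.tsv: document, partition, label (the label column is optional)
    with open(folder / "corpus.tsv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for document, partition, label in rows:
            writer.writerow([document, partition, label])
    # vocabulary.txt (assumed file name): one vocabulary word per line
    with open(folder / "vocabulary.txt", "w", encoding="utf-8") as f:
        for word in vocabulary:
            f.write(word + "\n")
    return folder


# Hypothetical sample data: already preprocessed documents with a partition and a label
rows = [
    ("topic model evaluation", "training", "nlp"),
    ("bayesian optimization search", "training", "ml"),
    ("topic coherence metric", "test", "nlp"),
]
vocabulary = sorted({word for doc, _, _ in rows for word in doc.split()})
dataset_folder = write_sample_dataset(tempfile.mkdtemp(), rows, vocabulary)
```

The resulting folder (not the corpus.tsv file itself) is what you would pass to load_custom_dataset_from_folder.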

Disclaimer

Similarly to TensorFlow Datasets and HuggingFace's nlp library, we just downloaded and prepared public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license and to cite the right owner of the dataset.

If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, please get in touch through a GitHub issue.

If you're a dataset owner and wish to include your dataset in this library, please get in touch through a GitHub issue.

Preprocess

To preprocess a dataset, import the preprocessing class and use the preprocess_dataset method.

import os
import string
from octis.preprocessing.preprocessing import Preprocessing
os.chdir(os.path.pardir)

# Initialize preprocessing
p = Preprocessing(vocabulary=None, max_features=None, remove_punctuation=True, punctuation=string.punctuation,
                  lemmatize=True, remove_stopwords=True, stopword_list=['am', 'are', 'this', 'that'],
                  min_chars=1, min_words_docs=0)
# preprocess
dataset = p.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt')

# save the preprocessed dataset
dataset.save('hello_dataset')

For more details on the preprocessing see the preprocessing demo example in the examples folder.

Train a model

To build a model, load a preprocessed dataset, set the model hyperparameters and use train_model() to train the model.

from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA

# Load a dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("dataset_folder")

model = LDA(num_topics=25)  # Create model
model_output = model.train_model(dataset) # Train the model

If the dataset is partitioned, you can:

Train the model on the training set and test it on the test documents
Train the model with the whole dataset, regardless of any partition.

Evaluate a model

To evaluate a model, choose a metric and use the score() method of the metric class.

from octis.evaluation_metrics.diversity_metrics import TopicDiversity

metric = TopicDiversity(topk=10) # Initialize metric
topic_diversity_score = metric.score(model_output) # Compute score of the metric
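To make the score easier to interpret, here is a minimal sketch of what a topic diversity value measures under its common definition: the fraction of unique words among the top-k words of all topics. This is an illustration in plain Python, not the OCTIS implementation, which may differ in details; the function name topic_diversity and the toy model_output are hypothetical. The only key it relies on is "topics", which is described in the "Implement your own Model" section.

```python
def topic_diversity(topics, topk=10):
    """Fraction of unique words among the top-k words of all topics."""
    top_words = [topic[:topk] for topic in topics]
    total_words = sum(len(words) for words in top_words)
    if total_words == 0:
        return 0.0
    unique_words = {word for words in top_words for word in words}
    return len(unique_words) / total_words


# Hypothetical model output with two topics that share the word "team"
model_output = {
    "topics": [
        ["game", "team", "season"],
        ["team", "player", "coach"],
    ]
}
score = topic_diversity(model_output["topics"], topk=3)
```

A value close to 1 means the topics share few words; a value close to 0 means the topics largely repeat each other.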

Available metrics

Classification Metrics:

F1 measure (F1Score())
Precision (PrecisionScore())
Recall (RecallScore())
Accuracy (AccuracyScore())

Coherence Metrics:

UMass Coherence (Coherence({'measure':'c_umass'}))
C_V Coherence (Coherence({'measure':'c_v'}))
UCI Coherence (Coherence({'measure':'c_uci'}))
NPMI Coherence (Coherence({'measure':'c_npmi'}))
Word Embedding-based Coherence Pairwise (WECoherencePairwise())
Word Embedding-based Coherence Centroid (WECoherenceCentroid())

Diversity Metrics:

Topic Diversity (TopicDiversity())
InvertedRBO (InvertedRBO())
Word Embedding-based InvertedRBO (WordEmbeddingsInvertedRBO())
Word Embedding-based InvertedRBO centroid (WordEmbeddingsInvertedRBOCentroid())

Topic significance Metrics:

KL Uniform (KL_uniform())
KL Vacuous (KL_vacuous())
KL Background (KL_background())

Optimize a model

To optimize a model, you need to select a dataset, a metric and the search space of the hyperparameters to optimize. For the hyperparameter types, we use the scikit-optimize types (https://scikit-optimize.github.io/stable/modules/space.html).

from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real

# Define the search space. To see which hyperparameters to optimize, see the topic model's initialization signature
search_space = {"alpha": Real(low=0.001, high=5.0), "eta": Real(low=0.001, high=5.0)}

# Initialize an optimizer object and start the optimization.
optimizer = Optimizer()
optResult = optimizer.optimize(model, dataset, eval_metric, search_space,
                               save_path="../results",  # path to store the results
                               number_of_call=30,  # number of optimization iterations
                               model_runs=5)  # number of runs of the topic model
# Save the results of the optimization in a csv file
optResult.save_to_csv("results.csv")

The result provides the best-seen value of the metric with the corresponding hyperparameter configuration, as well as the hyperparameters and metric value for each iteration of the optimization. To visualize this information, set the 'plot' attribute of Bayesian_optimization to True.
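The roles of number_of_call and model_runs, and the shape of the result (a best-seen value plus a per-iteration history), can be illustrated without OCTIS. The sketch below deliberately swaps Bayesian Optimization for plain random search to stay dependency-free; random_search and toy_metric are hypothetical names, the search space is written as (low, high) tuples instead of scikit-optimize types, and aggregating the runs of one configuration with the median is an assumption rather than a statement about the OCTIS optimizer.

```python
import random
import statistics


def random_search(objective, search_space, number_of_call=30, model_runs=5, seed=0):
    """Random search stand-in: evaluate sampled configurations and track the best one."""
    rng = random.Random(seed)
    history = []
    for _ in range(number_of_call):
        # Sample one value per hyperparameter from its (low, high) range
        config = {name: rng.uniform(low, high) for name, (low, high) in search_space.items()}
        # Repeat the noisy evaluation and aggregate the runs (median is an assumption)
        runs = [objective(config, rng) for _ in range(model_runs)]
        history.append({"config": config, "value": statistics.median(runs)})
    best = max(history, key=lambda entry: entry["value"])
    return best, history


def toy_metric(config, rng):
    """Hypothetical noisy score that peaks at alpha=1.0, eta=0.5."""
    noise = rng.gauss(0.0, 0.01)
    return -((config["alpha"] - 1.0) ** 2) - (config["eta"] - 0.5) ** 2 + noise


search_space = {"alpha": (0.001, 5.0), "eta": (0.001, 5.0)}
best, history = random_search(toy_metric, search_space)
```

Here best holds the best-seen configuration and its aggregated value, and history holds one entry per optimization iteration, which mirrors the information described above.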

You can find more here: optimizer README

Available Models

Name	Implementation
CTM (Bianchi et al. 2020)	https://github.com/MilaNLProc/contextualized-topic-models
ETM (Dieng et al. 2020)	https://github.com/adjidieng/ETM
HDP (Blei et al. 2004)	https://radimrehurek.com/gensim/
LDA (Blei et al. 2003)	https://radimrehurek.com/gensim/
LSI (Landauer et al. 1998)	https://radimrehurek.com/gensim/
NMF (Lee and Seung 2000)	https://radimrehurek.com/gensim/
NeuralLDA (Srivastava and Sutton 2017)	https://github.com/estebandito22/PyTorchAVITM
ProdLDA (Srivastava and Sutton 2017)	https://github.com/estebandito22/PyTorchAVITM

If you use one of these implementations, make sure to cite the right paper.

If you implemented a model and wish to update any part of it, or do not want your model to be included in this library, please get in touch through a GitHub issue.

If you implemented a model and wish to include your model in this library, please get in touch through a GitHub issue. Otherwise, if you want to include the model by yourself, see the following section.

Implement your own Model

Models inherit from the class AbstractModel defined in octis/models/model.py. To build your own model, your class must override the train_model(self, dataset, hyperparameters) method, which always requires at least a Dataset object and a dictionary of hyperparameters as input and should return a dictionary with the output of the model.

To better understand how a model works, let's have a look at the LDA implementation. The first step in developing a custom model is to define the dictionary of default hyperparameter values:

hyperparameters = {'corpus': None, 'num_topics': 100, 'id2word': None, 'alpha': 'symmetric',
    'eta': None, # ...
    'callbacks': None}

Defining the default hyperparameters values allows users to work on a subset of them without having to assign a value to each parameter.
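As a minimal sketch of this idea, assuming nothing about the OCTIS internals: user-supplied values are layered on top of a copy of the defaults, so any hyperparameter the user omits keeps its default. The names DEFAULT_HYPERPARAMETERS and resolve_hyperparameters are hypothetical, and the defaults shown are only a subset of the dictionary above.

```python
# Subset of the default values shown above, used for illustration only
DEFAULT_HYPERPARAMETERS = {"num_topics": 100, "alpha": "symmetric", "eta": None}


def resolve_hyperparameters(user_hyperparameters=None):
    """Return the defaults overridden by whatever the user supplied."""
    resolved = dict(DEFAULT_HYPERPARAMETERS)  # copy, so the defaults are never mutated
    resolved.update(user_hyperparameters or {})
    return resolved


# The user only sets num_topics; alpha and eta keep their default values
params = resolve_hyperparameters({"num_topics": 25})
```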

The following step is the train_model() override:

def train_model(self, dataset, hyperparameters={}, top_words=10):

The LDA method requires a dataset, the hyperparameters dictionary and an extra (optional) argument that selects how many of the most significant words to track for each topic.

With the hyperparameters defaults, the ones in input and the dataset you should be able to write your own code and return as output a dictionary with at least 3 entries:

topics: the list of the most significant words for each topic (list of lists of strings).
topic-word-matrix: an NxV matrix of weights where N is the number of topics and V is the vocabulary length.
topic-document-matrix: an NxD matrix of weights where N is the number of topics and D is the number of documents in the corpus.

If your model supports the training/test partitioning, it should also return:

test-topic-document-matrix: the document topic matrix of the test set.
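The output contract above can be sketched with a toy model. To stay self-contained, the example does not import OCTIS: a real model would inherit from AbstractModel, and ToyDataset is a hypothetical stand-in for the Dataset class. Its get_vocabulary() method and the assumption that get_corpus() returns tokenized documents are not confirmed by this page. The "model" itself is a placeholder that assigns document i to topic i % num_topics.

```python
from collections import Counter


class ToyDataset:
    """Hypothetical stand-in for a Dataset: a list of tokenized documents."""

    def __init__(self, corpus):
        self._corpus = corpus

    def get_corpus(self):
        return self._corpus

    def get_vocabulary(self):
        return sorted({word for doc in self._corpus for word in doc})


class RoundRobinModel:
    """Toy model (not a real topic model): document i goes to topic i % num_topics."""

    def __init__(self, num_topics=2):
        self.hyperparameters = {"num_topics": num_topics}

    def train_model(self, dataset, hyperparameters=None, top_words=10):
        params = {**self.hyperparameters, **(hyperparameters or {})}
        n_topics = params["num_topics"]
        corpus = dataset.get_corpus()
        vocabulary = dataset.get_vocabulary()

        counts = [Counter() for _ in range(n_topics)]
        # N x D matrix: weight of each topic in each document
        topic_document = [[0.0] * len(corpus) for _ in range(n_topics)]
        for d, doc in enumerate(corpus):
            k = d % n_topics
            counts[k].update(doc)
            topic_document[k][d] = 1.0

        # N x V matrix: normalized word frequencies per topic
        topic_word = []
        for k in range(n_topics):
            total = sum(counts[k].values()) or 1
            topic_word.append([counts[k][word] / total for word in vocabulary])

        topics = [[w for w, _ in counts[k].most_common(top_words)] for k in range(n_topics)]
        return {
            "topics": topics,
            "topic-word-matrix": topic_word,
            "topic-document-matrix": topic_document,
        }


corpus = [["topic", "model", "topic"], ["bayesian", "search"], ["topic", "metric"]]
output = RoundRobinModel(num_topics=2).train_model(ToyDataset(corpus), top_words=2)
```

The three keys and the matrix shapes (N x V and N x D) are the parts that matter; everything inside train_model is interchangeable.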

Dashboard

OCTIS includes a user-friendly graphical interface for creating, monitoring and viewing experiments. If you follow the implementation standards for datasets, models and metrics, the dashboard will automatically update and let you use your own custom implementations.

To run the dashboard, run the following command from the project directory:

python OCTIS/dashboard/server.py

The browser will open and you will be redirected to the dashboard. In the dashboard you can:

Create new experiments organized in batch
Visualize and compare all the experiments
Visualize a custom experiment
Manage the experiment queue

How to cite our work

This work has been accepted at the demo track of EACL 2021! You can find it here: https://www.aclweb.org/anthology/2021.eacl-demos.31/ If you decide to use this resource, please cite:

@inproceedings{terragni2020octis,
    title={{OCTIS}: Comparing and Optimizing Topic Models is Simple!},
    author={Terragni, Silvia and Fersini, Elisabetta and Galuzzi, Bruno Giovanni and Tropeano, Pietro and Candelieri, Antonio},
    booktitle={Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations},
    month = apr,
    year = "2021",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-demos.31",
    pages = "263--270",
}

Team

Project and Development Lead

Silvia Terragni <s.terragni4@campus.unimib.it>
Elisabetta Fersini <elisabetta.fersini@unimib.it>
Antonio Candelieri <antonio.candelieri@unimib.it>

Current Contributors

Pietro Tropeano <p.tropeano1@campus.unimib.it> Framework architecture, Preprocessing, Topic Models, Evaluation metrics and Web Dashboard
Bruno Galuzzi <bruno.galuzzi@unimib.it> Bayesian Optimization
Silvia Terragni <s.terragni4@campus.unimib.it> Overall project

Past Contributors

Lorenzo Famiglini <l.famiglini@campus.unimib.it> Neural models integration
Davide Pietrasanta <d.pietrasanta@campus.unimib.it> Bayesian Optimization

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template. Thanks to all the developers that released their topic models' implementations.

Comments

ETM - Possibility of using KeyedVectors input for pre-trained W2V embeddings
OCTIS version: 1.9.0

Python version: 3.7.6

Operating System: Ubuntu 20.04 LTS

Description

Hi, this is more of a question than anything else. I've seen that for ETM model training, we must pass an embeddings path corresponding to a "pickled" file. However, I need to execute ETM with rather large embeddings. Is there any intent to implement a gensim.models.KeyedVectors-based (or something like that) embeddings input for this model? I've implemented something like that for an ETM package of mine, but yours has all I need to execute model optimization. Would a PR on this matter be accepted?

Anyway, cheers for the nice work, this package is really great!

What I Did

Gave a look at here.
opened by lffloyd 8

How do I load a dataset? How to do multi-label classification with OCTIS?

OCTIS version: any
Python version: any
Operating System: any

Description

I am trying to evaluate topic model algorithms with a provided dataset, without success.

What I Did

I am trying to run the following code:

from octis.evaluation_metrics.classification_metrics import AccuracyScore
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA


dataset = Dataset(corpus=X, labels=y)
model = LDA(num_topics=5, alpha=0.1)

acc = AccuracyScore(dataset)
output = model.train_model(dataset)

Where X is my text data and y is the topics (multilabel) for the given text. The last line returns this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-99a4fc73752b> in <module>
      1 acc = AccuracyScore(dataset)
----> 2 output = model.train_model(dataset)

~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/models/LDA.py in train_model(self, dataset, hyperparams, top_words)
    164 
    165         if self.use_partitions:
--> 166             train_corpus, test_corpus = dataset.get_partitioned_corpus(use_validation=False)
    167         else:
    168             train_corpus = dataset.get_corpus()

~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/dataset/dataset.py in get_partitioned_corpus(self, use_validation)
     41     # Partitioned Corpus getter
     42     def get_partitioned_corpus(self, use_validation=True):
---> 43         last_training_doc = self.__metadata["last-training-doc"]
     44         # gestire l'eccezione se last_validation_doc non è definito, restituire
     45         # il validation vuoto

TypeError: 'NoneType' object is not subscriptable

enhancement

opened by jadermcs 6

Evaluate 3 different topic modeling algorithms
OCTIS version:

Python version:3,7

Operating System: linux

Description

I am a PhD candidate and I need to evaluate the performance of three different topic model algorithms: LDA, LSI and BERTopic (LDA and LSI were trained using the Gensim package). Which relevance metrics should I use apart from the coherence score? I would like to include in my paper a table or graph that shows an evaluation in terms of the accuracy of the model (coherence score) and the relevance of the topics (should I use the topic diversity metric?). Thank you.

What I Did

opened by hajarzankadi 5

CTM training fails.

OCTIS version: 1.8.0
Python version: 3.8.10
Operating System: Ubuntu 20.04.02

Description

CTM training fails.

What I Did

dataset = Dataset()
dataset.load_custom_dataset_from_folder(DATASET_PATH)
model = CTM(num_topics=TOPIC_SIZE)
model_output = model.train_model(dataset)
save_model_output(model_output, MODEL_OUTPUT_PATH)
save_model_output(model, MODEL_PATH)

The following error message was displayed.

Batches:  84%|████████████████████████████████████████████████████████████████████████████████████████▌                 | 21790/26093 [59:43<11:47,  6.08it/s]
Traceback (most recent call last):
  File "train.py", line 62, in <module>
    model = ProdLDA(num_topics=TOPIC_SIZE)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 95, in train_model
    x_train, x_test, x_valid, input_size = self.preprocess(
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 175, in preprocess
    b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 208, in load_bert_data
    bert_ouput = bert_embeddings_from_list(texts, bert_model)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/contextualized_topic_models/utils/data_preparation.py", line 35, in bert_embeddings_from_list
    return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))
  File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/SentenceTransformer.py", line 160, in encode
    out_features = self.forward(features)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/models/Transformer.py", line 51, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 991, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 582, in forward
    layer_outputs = layer_module(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 470, in forward
    self_attention_outputs = self.attention(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 401, in forward
    self_outputs = self.self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 305, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

opened by cidrugHug8 5

OCTIS could not evaluate an external result?
OCTIS version: 1.10.4

Python version: 3.9

Operating System: Windows 11

Description

I got an error: unable to interpret topic as either a list of tokens or a list of ids.

What I Did

I use another method to get the topics of 20newsgroup, and I want to use the metrics provided by octis to test their quality.

So, I have many lists of topics. for example, one list is: ['cheek', 'yep', 'huh', 'ken', 'lets', 'ignore', 'forget', 'art', 'dilemma', 'dilemna']. I need to calculate the topic cohesion between these topics and the document(corpus).

As a topic modeling metrics system, I thought OCTIS may do this for me. However, it is hard.

I got this error because: among my result topics, some of the words are not in the corpus of 20newsgroup provided by OCTIS. I got my data from scikit-learn's 20newsgroup. So I think the only explanation is that the corpus of 20newsgroup from scikit-learn and OCTIS is different.

Therefore, it seems that the only solution is to use OCTIS's dataset to do the training. And then use OCTIS's evaluation system to do the topic cohesion. Does this mean that OCTIS is not accepting external topics?

Not sure if there are any other solutions for this case. I believe OCTIS should be able to work with external topic modeling methods. I just did not find the way. So please tell me if there is any suggestions.
opened by KesselZ 4
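Following the diagnosis in the report above (some topic words do not occur in the reference corpus), a quick check can show which words are affected before any metric is computed. This is a minimal sketch in plain Python, not an OCTIS feature: split_by_vocabulary is a hypothetical helper, and it assumes the corpus is a list of tokenized documents.

```python
def split_by_vocabulary(topics, corpus):
    """Separate topic words found in the corpus from those that are missing."""
    vocabulary = {word for document in corpus for word in document}
    kept = []
    missing = set()
    for topic in topics:
        kept.append([word for word in topic if word in vocabulary])
        missing.update(word for word in topic if word not in vocabulary)
    return kept, sorted(missing)


# Hypothetical external topics and a tiny tokenized reference corpus
topics = [["cheek", "art", "dilemna"], ["art", "forget"]]
corpus = [["art", "forget", "cheek"], ["ignore", "art"]]
kept_topics, missing_words = split_by_vocabulary(topics, corpus)
```

Dropping the missing words changes the length of each topic and therefore the resulting scores, so it may be preferable to evaluate against the same corpus that was used to train the external model.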
load a custom preprocessed dataset Error
OCTIS version: 1.10.4

Python version: 3.9

Operating System: MacOS

Description

I am trying to use evaluation metrics from the OCTIS package on my own dataset. I followed the guide in the main README page on how to load a custom preprocessed dataset. I even used your sample .tsv file, but I got the following error: NotADirectoryError: [Errno 20] Not a directory: '/Users/shabnam.rashtchi/DEB/topicModeling_project_folder/scratches/metadata.json/corpus.tsv'

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder('/Users/../scratches/corpus.tsv')

Traceback (most recent call last):
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/octis/dataset/dataset.py", line 327, in load_custom_dataset_from_folder
    df = pd.read_csv(self.dataset_path + "/corpus.tsv", sep='\t', header=None)
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1218, in _make_engine
    self.handles = get_handle(  # type: ignore[call-overload]
  File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/common.py", line 786, in get_handle
    handle = open(
NotADirectoryError: [Errno 20] Not a directory: '/Users/shabnam.rashtchi/DEB/topicModeling_project_folder/scratches/metadata.json/corpus.tsv'
opened by srashtchi 4
Does contextual information always increase coherence?

Hi, I've been working on topic models for tweets. I trained my corpus on LDA model as well as CTM and ProdLDA. However, the coherence score for LDA is always higher for different number of topics. I was wondering why that would be? Are there any specific cases where LDA might perform better than CTM and ProdLDA? I've read through papers for both CTM and ProdLDA and looks like for different datasets, CTM performed better but for my use case, it isn't the case. I would really appreciate if I could get some help/examples on why contextual information might not help. Thanks

opened by PearlSikka 4
cannot fetch dataset
Hello,

Dataset.fetch_dataset() leads to an error, due to lines 71-73 in octis/dataset/downloader.py returning [404] responses. Root Dataset URL (line 69) seems incorrect. Manual fetching through a browser also leads to 404. downloader.py current version is the same as the local version in octis 1.10.0

thank you, Damir

OCTIS version: 1.10.0

Python version: 3.9.4

Operating System: Ubuntu 20.04
opened by dkorenci 4
Not able to run dashboard
Python version: 3.6

Operating System: Windows

Description

I tried: python octis\dashboard\server.py

and i get this error:

import octis.dashboard.experimentManager as expManager
AttributeError: module 'octis' has no attribute 'dashboard'
opened by rsreetech 4
load_custom_dataset_from_folder
Hi Silvia

I managed to get my code running fine, thanks for your response.

I have another question: I am trying to make the code smoother. Right now, in order to create a dataset object, I have to save my variable to a .tsv file first and then use the load_custom_dataset_from_folder method to load the data from the .tsv file into an empty dataset object. Without this object, the get_corpus() method obviously wouldn't do its magic. See the sample code below.

So basically the question is: is there a way to directly pass my variable to a dataset object without saving and loading?

from octis.dataset.dataset import Dataset
f=Path('/myFolderPath/corpus.tsv')
df.to_csv(f, sep="\t", index=False, header=False, columns = ['document'])
dataset = Dataset()
dataset.load_custom_dataset_from_folder('/myFolderPath/')
texts=dataset.get_corpus()

Originally posted by @srashtchi in https://github.com/MIND-Lab/OCTIS/issues/68#issuecomment-1221302310
opened by srashtchi 3
ETM training leading to NaN loss
OCTIS version: 1.10.4

Python version: 3.7.13

Operating System: Windows

Description

I'm running topic model for tweets using ETM model. While training, it led to NaN loss in the first epoch and hence, the training doesn't go further epochs. The ETM model is being trained with default parameters.

model = ETM(num_topics=10)  # command run
output = model.train_model(dataset)

Output: Epoch: 1 .. batch: 20/25 .. LR: 0.005 .. KL_theta: nan .. Rec_loss: nan .. NELBO: nan

![tm_fail](https://user-images.githubusercontent.com/70057374/177056948-277f8d0f-9b57-4884-ab60-c79827ff5b8b.png)
opened by PearlSikka 3
Adding top_words parameter to CTM model
The train_model function in CTM has the top_words parameter but it doesn't get passed to the ctm class as it doesn't have any argument corresponding to the same. This results in CTM returning topics of length 10 regardless of the values of top_words in the train_model function. For example the below code will return topics of length 10 even though we've set the value of top_words to 5.

from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.evaluation_metrics.coherence_metrics import Coherence  # import missing from the original snippet
dataset = Dataset()
dataset.fetch_dataset("M10")
model = CTM(num_topics=10)
output = model.train_model(dataset, top_words=5)
npmi = Coherence(texts=dataset.get_corpus(), topk=5)

This PR adds the top_word parameter to the ctm class and modifies the get_topics function to use the value of top_words passed by the user (default value is still 10)
opened by arijitgupta42 0
Bump wheel from 0.33.6 to 0.38.1
Bumps wheel from 0.33.6 to 0.38.1.

Changelog

Sourced from wheel's changelog.

Release Notes

UNRELEASED

Updated vendored packaging to 22.0

0.38.4 (2022-11-09)

Fixed PKG-INFO conversion in bdist_wheel mangling UTF-8 header values in METADATA (PR by Anderson Bravalheri)

0.38.3 (2022-11-08)

Fixed install failure when used with --no-binary, reported on Ubuntu 20.04, by removing setup_requires from setup.cfg

0.38.2 (2022-11-05)

Fixed regression introduced in v0.38.1 which broke parsing of wheel file names with multiple platform tags

0.38.1 (2022-11-04)

Removed install dependency on setuptools

The future-proof fix in 0.36.0 for converting PyPy's SOABI into a abi tag was faulty. Fixed so that future changes in the SOABI will not change the tag.

0.38.0 (2022-10-21)

Dropped support for Python < 3.7

Updated vendored packaging to 21.3

Replaced all uses of distutils with setuptools

The handling of license_files (including glob patterns and default values) is now delegated to setuptools>=57.0.0 (#466). The package dependencies were updated to reflect this change.

Fixed potential DoS attack via the WHEEL_INFO_RE regular expression

Fixed ValueError: ZIP does not support timestamps before 1980 when using SOURCE_DATE_EPOCH=0 or when on-disk timestamps are earlier than 1980-01-01. Such timestamps are now changed to the minimum value before packaging.

0.37.1 (2021-12-22)

Fixed wheel pack duplicating the WHEEL contents when the build number has changed (#415)

Fixed parsing of file names containing commas in RECORD (PR by Hood Chatham)

0.37.0 (2021-08-09)

Added official Python 3.10 support

Updated vendored packaging library to v20.9

... (truncated)

Commits

6f1608d Created a new release

cf8f5ef Moved news item from PR #484 to its proper place

9ec2016 Removed install dependency on setuptools (#483)

747e1f6 Fixed PyPy SOABI parsing (#484)

7627548 [pre-commit.ci] pre-commit autoupdate (#480)

7b9e8e1 Test on Python 3.11 final

a04dfef Updated the pypi-publish action

94bb62c Fixed docs not building due to code style changes

d635664 Updated the codecov action to the latest version

fcb94cd Updated version to match the release

Additional commits viewable in compare view

dependencies python
opened by dependabot[bot] 0

python error

OCTIS version:
Python version:

Description

runtime error

What I Did

from octis.models.LDA import LDA
causes this error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-54-5a8917c9d1ca>](https://localhost:8080/#) in <module>
----> 1 import octis.models.LDA

6 frames
/usr/local/lib/python3.8/dist-packages/gensim/_matutils.pyx in init gensim._matutils()

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

opened by fatmas1982 0

Evaluating a custom topic model

Is there any way to use your package for evaluating a custom topic model? Which outputs, and in which formats, should I get from my model in order to compute the metrics?

opened by nfsedaghat 0
No parameter for topk words per topic
OCTIS version: 1.83

Python version: 3.8

Operating System: Windows

Description

When training a model, I want to adjust the number of words per topic. However, I did not see this option in the source code of various models. Can you add this option for the different models?
opened by ERijck 1
Octis startup argparse issue
OCTIS version: latest

Python version: 3.8

Operating System: linux

Description

supplying --host as an argparse input does not change the host when starting up the server

What I Did

python octis/dashboard/server.py --host 0.0.0.0

Still starts up on 127.0.0.1:

* Running on http://127.0.0.1:5000 Press CTRL+C to quit

Error here
opened by Alig1493 0
