OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)

Overview

OCTIS: Optimizing and Comparing Topic Models is Simple!

OCTIS (Optimizing and Comparing Topic models Is Simple) aims at training, analyzing and comparing Topic Models, whose optimal hyper-parameters are estimated by means of a Bayesian Optimization approach.

Install

You can install OCTIS with the following command:

pip install octis

You can find the requirements in the requirements.txt file.

Features

  • Preprocess your own dataset or use one of the already-preprocessed benchmark datasets
  • Well-known topic models (both classical and neural)
  • Evaluate your model using different state-of-the-art evaluation metrics
  • Optimize the models' hyperparameters for a given metric using Bayesian Optimization
  • Python library for advanced usage or simple web dashboard for starting and controlling the optimization experiments

Examples and Tutorials

To easily understand how to use OCTIS, we invite you to try our tutorials out :)

Name Link
How to build a topic model and evaluate the results (LDA on 20Newsgroups) Open In Colab
How to optimize the hyperparameters of a neural topic model (CTM on M10) Open In Colab

Load a preprocessed dataset

Load one of the already-preprocessed datasets as follows:

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

Just use one of the dataset names listed below. Note: it is case-sensitive!

Available Datasets

Name         Source       # Docs  # Words  # Labels
20NewsGroup  20Newsgroup  16309   1612     20
BBC_News     BBC-News     2225    2949     5
DBLP         DBLP         54595   1513     4
M10          M10          8355    1696     10

Otherwise, you can load a custom preprocessed dataset in the following way:

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("../path/to/the/dataset/folder")

Make sure that the dataset is in the following format:
  • corpus file: a .tsv file (tab-separated) that contains up to three columns, i.e. the document, the partition, and the label associated with the document (optional).
  • vocabulary: a .txt file where each line represents a word of the vocabulary

The partition can be "training", "test" or "validation". An example dataset can be found here: sample_dataset.
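
As a minimal sketch of the format described above, the following snippet writes a tiny dataset folder using only the Python standard library. The helper name write_dataset_folder, the sample documents and the vocabulary file name vocabulary.txt are assumptions made for illustration; check the sample dataset for the exact layout expected by your OCTIS version.

```python
import csv
import tempfile
from pathlib import Path


def write_dataset_folder(folder, rows):
    """Write corpus.tsv and vocabulary.txt for (document, partition, label) rows.

    Each document is assumed to be already preprocessed, i.e. a string of
    space-separated tokens.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)

    # corpus file: one tab-separated row per document, without a header
    with open(folder / "corpus.tsv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)

    # vocabulary file: one word per line
    vocabulary = sorted({word for doc, _, _ in rows for word in doc.split()})
    (folder / "vocabulary.txt").write_text("\n".join(vocabulary) + "\n", encoding="utf-8")
    return vocabulary


rows = [
    ("topic model evaluation", "training", "nlp"),
    ("neural topic model", "training", "nlp"),
    ("bayesian optimization search", "test", "ml"),
]
dataset_folder = Path(tempfile.mkdtemp()) / "my_dataset"
vocabulary = write_dataset_folder(dataset_folder, rows)
```

The resulting folder path is what you would then pass to load_custom_dataset_from_folder().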

Disclaimer

Similarly to TensorFlow Datasets and HuggingFace's nlp library, we just downloaded and prepared public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license and to cite the right owner of the dataset.

If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, please get in touch through a GitHub issue.

If you're a dataset owner and wish to include your dataset in this library, please get in touch through a GitHub issue.

Preprocess

To preprocess a dataset, import the preprocessing class and use the preprocess_dataset method.

import os
import string
from octis.preprocessing.preprocessing import Preprocessing
os.chdir(os.path.pardir)

# Initialize preprocessing
p = Preprocessing(vocabulary=None, max_features=None, remove_punctuation=True, punctuation=string.punctuation,
                  lemmatize=True, remove_stopwords=True, stopword_list=['am', 'are', 'this', 'that'],
                  min_chars=1, min_words_docs=0)
# preprocess
dataset = p.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt')

# save the preprocessed dataset
dataset.save('hello_dataset')

For more details on the preprocessing see the preprocessing demo example in the examples folder.

Train a model

To build a model, load a preprocessed dataset, set the model hyperparameters and use train_model() to train the model.

from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA

# Load a dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("dataset_folder")

model = LDA(num_topics=25)  # Create model
model_output = model.train_model(dataset) # Train the model

If the dataset is partitioned, you can:

  • Train the model on the training set and test it on the test documents
  • Train the model with the whole dataset, regardless of any partition.

Evaluate a model

To evaluate a model, choose a metric and use the score() method of the metric class.

from octis.evaluation_metrics.diversity_metrics import TopicDiversity

metric = TopicDiversity(topk=10) # Initialize metric
topic_diversity_score = metric.score(model_output) # Compute score of the metric
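
To make the metric more concrete, here is a minimal sketch of the usual definition of topic diversity (the fraction of unique words among the top-k words of all topics), applied to a hand-written output dictionary. The function name topic_diversity and the toy topics are assumptions for illustration; this is not the OCTIS implementation, which may differ in its details.

```python
def topic_diversity(model_output, topk=10):
    """Fraction of unique words among the top-k words of all topics."""
    topics = model_output["topics"]
    if not topics:
        raise ValueError("model_output['topics'] is empty")
    top_words = [word for topic in topics for word in topic[:topk]]
    return len(set(top_words)) / len(top_words)


# Toy model output: only the "topics" entry is needed by this metric
toy_output = {
    "topics": [
        ["game", "team", "player", "season"],
        ["space", "nasa", "orbit", "launch"],
        ["game", "team", "score", "league"],
    ]
}
score = topic_diversity(toy_output, topk=4)
print(round(score, 3))  # 10 unique words out of 12 -> 0.833
```

A score close to 1 means the topics share few top words, while a score close to 0 means they are largely redundant.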

Available metrics

Classification Metrics:

  • F1 measure (F1Score())
  • Precision (PrecisionScore())
  • Recall (RecallScore())
  • Accuracy (AccuracyScore())

Coherence Metrics:

  • UMass Coherence (Coherence({'measure':'c_umass'}))
  • C_V Coherence (Coherence({'measure':'c_v'}))
  • UCI Coherence (Coherence({'measure':'c_uci'}))
  • NPMI Coherence (Coherence({'measure':'c_npmi'}))
  • Word Embedding-based Coherence Pairwise (WECoherencePairwise())
  • Word Embedding-based Coherence Centroid (WECoherenceCentroid())

Diversity Metrics:

  • Topic Diversity (TopicDiversity())
  • InvertedRBO (InvertedRBO())
  • Word Embedding-based InvertedRBO (WordEmbeddingsInvertedRBO())
  • Word Embedding-based InvertedRBO centroid (WordEmbeddingsInvertedRBOCentroid())

Topic Significance Metrics:

  • KL Uniform (KL_uniform())
  • KL Vacuous (KL_vacuous())
  • KL Background (KL_background())

Optimize a model

To optimize a model, you need to select a dataset, a metric and the search space of the hyperparameters to optimize. For the types of the hyperparameters, we use scikit-optimize types (https://scikit-optimize.github.io/stable/modules/space.html).

from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real

# Define the search space. To see which hyperparameters to optimize, see the topic model's initialization signature
search_space = {"alpha": Real(low=0.001, high=5.0), "eta": Real(low=0.001, high=5.0)}

# Initialize an optimizer object and start the optimization.
# model, dataset and eval_metric are assumed to be defined already (see the sections above)
optimizer = Optimizer()
optResult = optimizer.optimize(model, dataset, eval_metric, search_space,
                               save_path="../results",  # path to store the results
                               number_of_call=30,  # number of optimization iterations
                               model_runs=5)  # number of runs of the topic model
# Save the results of the optimization in a csv file
optResult.save_to_csv("results.csv")

The result provides the best-seen value of the metric with the corresponding hyperparameter configuration, as well as the hyperparameters and metric value for each iteration of the optimization. To visualize this information, set the 'plot' attribute of Bayesian_optimization to True.
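
The loop behind such an optimization can be sketched in a few lines. The example below deliberately swaps Bayesian Optimization for plain random search so that it stays self-contained, and it uses a made-up objective in place of training a topic model. The names random_search and toy_objective are hypothetical, and aggregating the runs of each configuration with the median is an assumption of this sketch. It only illustrates how a search space, a number of calls and repeated model runs combine into a best-seen configuration.

```python
import random
import statistics


def toy_objective(params, rng):
    """Made-up noisy score that peaks at alpha=1.0 and eta=0.5 (not a real metric)."""
    noise = rng.gauss(0, 0.01)
    return -((params["alpha"] - 1.0) ** 2) - (params["eta"] - 0.5) ** 2 + noise


def random_search(search_space, objective, number_of_call=30, model_runs=5, seed=0):
    """Sample configurations uniformly at random and keep the best-seen one."""
    rng = random.Random(seed)
    history = []
    for _ in range(number_of_call):
        params = {name: rng.uniform(low, high) for name, (low, high) in search_space.items()}
        # evaluate each configuration several times and aggregate with the median
        value = statistics.median(objective(params, rng) for _ in range(model_runs))
        history.append((params, value))
    best_params, best_value = max(history, key=lambda item: item[1])
    return best_params, best_value, history


# Search space given as (low, high) bounds per hyperparameter
search_space = {"alpha": (0.001, 5.0), "eta": (0.001, 5.0)}
best_params, best_value, history = random_search(search_space, toy_objective)
```

Here history plays the role of the per-iteration record, and best_params and best_value correspond to the best-seen configuration and its metric value.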

You can find more here: optimizer README

Available Models

Name                                    Implementation
CTM (Bianchi et al. 2020)               https://github.com/MilaNLProc/contextualized-topic-models
ETM (Dieng et al. 2020)                 https://github.com/adjidieng/ETM
HDP (Blei et al. 2004)                  https://radimrehurek.com/gensim/
LDA (Blei et al. 2003)                  https://radimrehurek.com/gensim/
LSI (Landauer et al. 1998)              https://radimrehurek.com/gensim/
NMF (Lee and Seung 2000)                https://radimrehurek.com/gensim/
NeuralLDA (Srivastava and Sutton 2017)  https://github.com/estebandito22/PyTorchAVITM
ProdLDA (Srivastava and Sutton 2017)    https://github.com/estebandito22/PyTorchAVITM

If you use one of these implementations, make sure to cite the right paper.

If you implemented a model and wish to update any part of it, or do not want your model to be included in this library, please get in touch through a GitHub issue.

If you implemented a model and wish to include your model in this library, please get in touch through a GitHub issue. Otherwise, if you want to include the model by yourself, see the following section.

Implement your own Model

Models inherit from the class AbstractModel defined in octis/models/model.py. To build your own model, your class must override the train_model(self, dataset, hyperparameters) method, which always requires at least a Dataset object and a dictionary of hyperparameters as input and should return a dictionary containing the output of the model.

To better understand how a model works, let's have a look at the LDA implementation. The first step in developing a custom model is to define the dictionary of default hyperparameter values:

hyperparameters = {'corpus': None, 'num_topics': 100, 'id2word': None, 'alpha': 'symmetric',
    'eta': None, # ...
    'callbacks': None}

Defining the default hyperparameter values allows users to work on a subset of them without having to assign a value to each parameter.
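
A minimal sketch of this idea, assuming a shortened defaults dictionary and a hypothetical helper named resolve_hyperparameters (neither is taken from the OCTIS source): the user-supplied values simply override a copy of the defaults.

```python
# Shortened, illustrative defaults (the real LDA dictionary has more entries)
DEFAULT_HYPERPARAMETERS = {"num_topics": 100, "alpha": "symmetric", "eta": None}


def resolve_hyperparameters(user_hyperparameters=None):
    """Return the defaults updated with the values supplied by the user."""
    resolved = dict(DEFAULT_HYPERPARAMETERS)  # copy, so the defaults stay untouched
    resolved.update(user_hyperparameters or {})
    return resolved


params = resolve_hyperparameters({"num_topics": 25})
print(params)  # {'num_topics': 25, 'alpha': 'symmetric', 'eta': None}
```

Copying before updating keeps the defaults intact across calls, so one training run cannot leak its settings into the next.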

The following step is the train_model() override:

def train_model(self, dataset, hyperparameters={}, top_words=10):

The LDA method requires a dataset, the hyperparameters dictionary and an extra (optional) argument used to select how many of the most significant words to track for each topic.

With the default hyperparameters, the ones passed as input, and the dataset, you should be able to write your own code and return a dictionary with at least 3 entries:

  • topics: the list of the most significant words for each topic (list of lists of strings).
  • topic-word-matrix: an NxV matrix of weights where N is the number of topics and V is the vocabulary length.
  • topic-document-matrix: an NxD matrix of weights where N is the number of topics and D is the number of documents in the corpus.

If your model supports the training/test partitioning, it should also return:

  • test-topic-document-matrix: the document topic matrix of the test set.
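
The output format can be illustrated with a toy model. This is only a sketch under stated assumptions: RandomTopicModel is a hypothetical class that does not inherit from AbstractModel, it takes a tokenized corpus and a vocabulary directly instead of a Dataset object, and it assigns random weights rather than learning anything. Its sole purpose is to show the three required entries and their shapes.

```python
import random


class RandomTopicModel:
    """Toy stand-in for a custom model: weights are random, nothing is learned."""

    def __init__(self, num_topics=2, seed=0):
        self.hyperparameters = {"num_topics": num_topics}
        self.rng = random.Random(seed)

    def _random_distribution(self, size):
        weights = [self.rng.random() + 1e-9 for _ in range(size)]
        total = sum(weights)
        return [w / total for w in weights]

    def train_model(self, corpus, vocabulary, hyperparameters=None, top_words=3):
        # merge the user-supplied hyperparameters into the defaults
        self.hyperparameters.update(hyperparameters or {})
        n = self.hyperparameters["num_topics"]

        # N x V: one distribution over the vocabulary per topic
        topic_word = [self._random_distribution(len(vocabulary)) for _ in range(n)]
        # one distribution over topics per document, transposed below to N x D
        doc_topic = [self._random_distribution(n) for _ in corpus]
        topic_document = [[doc[k] for doc in doc_topic] for k in range(n)]

        # top words of each topic, sorted by decreasing weight
        topics = [
            [vocabulary[i] for i in sorted(range(len(vocabulary)), key=lambda i: -row[i])[:top_words]]
            for row in topic_word
        ]
        return {
            "topics": topics,
            "topic-word-matrix": topic_word,
            "topic-document-matrix": topic_document,
        }


vocabulary = ["topic", "model", "neural", "bayesian", "search"]
corpus = [["topic", "model"], ["neural", "topic", "model"], ["bayesian", "search"]]
output = RandomTopicModel(num_topics=2).train_model(corpus, vocabulary)
```

With 2 topics, 5 vocabulary words and 3 documents, the topic-word matrix is 2x5 and the topic-document matrix is 2x3.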

Dashboard

OCTIS includes a user-friendly graphical interface for creating, monitoring and viewing experiments. If you follow the implementation standards for datasets, models and metrics, the dashboard will automatically update and let you use your own custom implementations.

To run the dashboard, run the following command from the project directory:

python OCTIS/dashboard/server.py

The browser will open and you will be redirected to the dashboard. In the dashboard you can:

  • Create new experiments organized in batches
  • Visualize and compare all the experiments
  • Visualize a custom experiment
  • Manage the experiment queue

How to cite our work

This work has been accepted at the demo track of EACL 2021! You can find it here: https://www.aclweb.org/anthology/2021.eacl-demos.31/

If you decide to use this resource, please cite:

@inproceedings{terragni2020octis,
    title={{OCTIS}: Comparing and Optimizing Topic Models is Simple!},
    author={Terragni, Silvia and Fersini, Elisabetta and Galuzzi, Bruno Giovanni and Tropeano, Pietro and Candelieri, Antonio},
    booktitle={Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations},
    month = apr,
    year = "2021",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-demos.31",
    pages = "263--270",
}

Team

Project and Development Lead

Current Contributors

Past Contributors

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template. Thanks to all the developers that released their topic models' implementations.

Comments
  • ETM - Possibility of using KeyedVectors input for pre-trained W2V embeddings

    • OCTIS version: 1.9.0
    • Python version: 3.7.6
    • Operating System: Ubuntu 20.04 LTS

    Description

    Hi, this is more of a question than anything else. I've seen that for ETM model training, we must pass an embeddings path corresponding to a "pickled" file. However, I need to execute ETM with rather large embeddings. Is there any plan to implement a gensim.models.KeyedVectors-based (or similar) embeddings input for this model? I've implemented something like that for an etm package of mine, but yours has all I need to execute model optimization. Would a PR on this matter be accepted?

    Anyway, cheers for the nice work, this package is really great!

    What I Did

    Gave a look at here.

    opened by lffloyd 8
  • How do I load a dataset? How to do multi-label classification with OCTIS?

    • OCTIS version: any
    • Python version: any
    • Operating System: any

    Description

    I am trying to evaluate topic model algorithms with a provided dataset, without success.

    What I Did

    I am trying to run the following code:

    from octis.evaluation_metrics.classification_metrics import AccuracyScore
    from octis.dataset.dataset import Dataset
    from octis.models.LDA import LDA
    
    
    dataset = Dataset(corpus=X, labels=y)
    model = LDA(num_topics=5, alpha=0.1)
    
    acc = AccuracyScore(dataset)
    output = model.train_model(dataset)
    

    Where X is my text data and y is the topics (multilabel) for the given text. The last line returns this error:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-29-99a4fc73752b> in <module>
          1 acc = AccuracyScore(dataset)
    ----> 2 output = model.train_model(dataset)
    
    ~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/models/LDA.py in train_model(self, dataset, hyperparams, top_words)
        164 
        165         if self.use_partitions:
    --> 166             train_corpus, test_corpus = dataset.get_partitioned_corpus(use_validation=False)
        167         else:
        168             train_corpus = dataset.get_corpus()
    
    ~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/dataset/dataset.py in get_partitioned_corpus(self, use_validation)
         41     # Partitioned Corpus getter
         42     def get_partitioned_corpus(self, use_validation=True):
    ---> 43         last_training_doc = self.__metadata["last-training-doc"]
         44         # gestire l'eccezione se last_validation_doc non è definito, restituire
         45         # il validation vuoto
    
    TypeError: 'NoneType' object is not subscriptable
    
    enhancement 
    opened by jadermcs 6
  • Evaluate 3 different topic modeling algorithms

    • OCTIS version:
    • Python version: 3.7
    • Operating System: linux

    Description

    I am a PhD candidate and I need to evaluate the performance of three different topic modeling algorithms: LDA, LSI and BERTopic (LDA and LSI were trained using the Gensim package). Which relevance metrics should I use apart from the coherence score? I would like to include in my paper a table or graph that shows an evaluation in terms of the accuracy of the model (coherence score) and the relevance of the topics (should I use the topic diversity metric?). Thank you

    opened by hajarzankadi 5
  • CTM training fails.

    • OCTIS version: 1.8.0
    • Python version: 3.8.10
    • Operating System: Ubuntu 20.04.02

    Description

    CTM training fails.

    What I Did

    dataset = Dataset()
    dataset.load_custom_dataset_from_folder(DATASET_PATH)
    model = CTM(num_topics=TOPIC_SIZE)
    model_output = model.train_model(dataset)
    save_model_output(model_output, MODEL_OUTPUT_PATH)
    save_model_output(model, MODEL_PATH)
    

    The following error message was displayed.

    Batches:  84%|████████████████████████████████████████████████████████████████████████████████████████▌                 | 21790/26093 [59:43<11:47,  6.08it/s]
    Traceback (most recent call last):
      File "train.py", line 62, in <module>
        model = ProdLDA(num_topics=TOPIC_SIZE)
      File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 95, in train_model
        x_train, x_test, x_valid, input_size = self.preprocess(
      File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 175, in preprocess
        b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
      File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 208, in load_bert_data
        bert_ouput = bert_embeddings_from_list(texts, bert_model)
      File "/usr/local/lib/python3.8/dist-packages/octis/models/contextualized_topic_models/utils/data_preparation.py", line 35, in bert_embeddings_from_list
        return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))
      File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/SentenceTransformer.py", line 160, in encode
        out_features = self.forward(features)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
        input = module(input)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/models/Transformer.py", line 51, in forward
        output_states = self.auto_model(**trans_features, return_dict=False)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 991, in forward
        encoder_outputs = self.encoder(
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 582, in forward
        layer_outputs = layer_module(
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 470, in forward
        self_attention_outputs = self.attention(
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 401, in forward
        self_outputs = self.self(
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 305, in forward
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
    
    opened by cidrugHug8 5
  • OCTIS could not evaluate an external result?

    • OCTIS version: 1.10.4
    • Python version: 3.9
    • Operating System: Windows 11

    Description

    I got an error: unable to interpret topic as either a list of tokens or a list of ids.

    What I Did

    I use another method to get the topics of 20newsgroup, and I want to use the metrics provided by octis to test their quality.

    So, I have many lists of topics. for example, one list is: ['cheek', 'yep', 'huh', 'ken', 'lets', 'ignore', 'forget', 'art', 'dilemma', 'dilemna']. I need to calculate the topic cohesion between these topics and the document(corpus).

    As a topic modeling metrics system, I thought OCTIS may do this for me. However, it is hard.

    I got this error because: among my result topics, some of the words are not in the corpus of 20newsgroup provided by OCTIS. I got my data from scikit-learn's 20newsgroup. So I think the only explanation is that the corpus of 20newsgroup from scikit-learn and OCTIS is different.

    Therefore, it seems that the only solution is to use OCTIS's dataset to do the training. And then use OCTIS's evaluation system to do the topic cohesion. Does this mean that OCTIS is not accepting external topics?

    Not sure if there are any other solutions for this case. I believe OCTIS should be able to work with external topic modeling methods. I just did not find the way. So please tell me if there is any suggestions.

    opened by KesselZ 4
  • load a custom preprocessed dataset  Error

    • OCTIS version: 1.10.4
    • Python version: 3.9
    • Operating System: MacOS

    Description

    I am trying to use evaluation metrics from the OCTIS package on my own dataset. I followed the guide in the main README on how to load a custom preprocessed dataset. I even used your sample .tsv file, but I got the following error: NotADirectoryError: [Errno 20] Not a directory: '/Users/shabnam.rashtchi/DEB/topicModeling_project_folder/scratches/metadata.json/corpus.tsv'

    What I Did

    from octis.dataset.dataset import Dataset
    dataset = Dataset()
    dataset.load_custom_dataset_from_folder('/Users/../scratches/corpus.tsv')
    

    Traceback (most recent call last):
      File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/octis/dataset/dataset.py", line 327, in load_custom_dataset_from_folder
        df = pd.read_csv(self.dataset_path + "/corpus.tsv", sep='\t', header=None)
      File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
        return func(*args, **kwargs)
      File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
        return _read(filepath_or_buffer, kwds)
      File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
        self._engine = self._make_engine(f, self.engine)
      File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1218, in _make_engine
        self.handles = get_handle(  # type: ignore[call-overload]
      File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/common.py", line 786, in get_handle
        handle = open(
    NotADirectoryError: [Errno 20] Not a directory: '/Users/shabnam.rashtchi/DEB/topicModeling_project_folder/scratches/metadata.json/corpus.tsv'

    opened by srashtchi 4
  • Does contextual information always increase coherence?

    Hi, I've been working on topic models for tweets. I trained my corpus on LDA model as well as CTM and ProdLDA. However, the coherence score for LDA is always higher for different number of topics. I was wondering why that would be? Are there any specific cases where LDA might perform better than CTM and ProdLDA? I've read through papers for both CTM and ProdLDA and looks like for different datasets, CTM performed better but for my use case, it isn't the case. I would really appreciate if I could get some help/examples on why contextual information might not help. Thanks

    opened by PearlSikka 4
  • cannot fetch dataset

    Hello,

    Dataset.fetch_dataset() leads to an error, due to lines 71-73 in octis/dataset/downloader.py returning [404] responses. Root Dataset URL (line 69) seems incorrect. Manual fetching through a browser also leads to 404. downloader.py current version is the same as the local version in octis 1.10.0

    thank you, Damir

    • OCTIS version: 1.10.0
    • Python version: 3.9.4
    • Operating System: Ubuntu 20.04
    opened by dkorenci 4
  • Not able to run dashboard

    • Python version: 3.6
    • Operating System: Windows

    Description

    I tried: python octis\dashboard\server.py

    and i get this error:

    import octis.dashboard.experimentManager as expManager
    AttributeError: module 'octis' has no attribute 'dashboard'

    opened by rsreetech 4
  • load_custom_dataset_from_folder

    Hi Silvia

    I managed to get my code running fine, thanks for your response.

    I have another question. I am trying to make the code smoother: right now, in order to create a dataset object, I have to save my variable to a .tsv file first and then use the load_custom_dataset_from_folder method to load the data from the .tsv file into an empty dataset object. Without this object, the get_corpus() method obviously wouldn't do its magic. See the sample code below.

    So basically the question is: is there a way to directly pass my variable to a dataset object without saving and loading?

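    from pathlib import Path
    from octis.dataset.dataset import Dataset
    f = Path('/myFolderPath/corpus.tsv')
    # df is assumed to be an existing pandas DataFrame with a 'document' column
    df.to_csv(f, sep="\t", index=False, header=False, columns=['document'])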
    
    dataset = Dataset()
    dataset.load_custom_dataset_from_folder('/myFolderPath/')
    
    texts=dataset.get_corpus()
    

    Originally posted by @srashtchi in https://github.com/MIND-Lab/OCTIS/issues/68#issuecomment-1221302310

    opened by srashtchi 3
  • ETM training leading to NaN loss

    • OCTIS version: 1.10.4
    • Python version: 3.7.13
    • Operating System: Windows

    Description

    I'm running a topic model for tweets using the ETM model. During training, the loss became NaN in the first epoch, so the training does not proceed to further epochs. The ETM model is being trained with default parameters.

    model = ETM(num_topics=10)  # command run
    output = model.train_model(dataset)

    Output: Epoch: 1 .. batch: 20/25 .. LR: 0.005 .. KL_theta: nan .. Rec_loss: nan .. NELBO: nan

    
    ![tm_fail](https://user-images.githubusercontent.com/70057374/177056948-277f8d0f-9b57-4884-ab60-c79827ff5b8b.png)
    
    
    opened by PearlSikka 3
  • Adding top_words parameter to CTM model

    The train_model function in CTM has the top_words parameter but it doesn't get passed to the ctm class as it doesn't have any argument corresponding to the same. This results in CTM returning topics of length 10 regardless of the values of top_words in the train_model function. For example the below code will return topics of length 10 even though we've set the value of top_words to 5.

    from octis.models.CTM import CTM
    from octis.dataset.dataset import Dataset
    from octis.evaluation_metrics.coherence_metrics import Coherence
    
    dataset = Dataset()
    dataset.fetch_dataset("M10")
    
    model = CTM(num_topics=10)
    output = model.train_model(dataset, top_words=5)
    npmi = Coherence(texts=dataset.get_corpus(), topk=5)
    

    This PR adds the top_words parameter to the ctm class and modifies the get_topics function to use the value of top_words passed by the user (the default value is still 10).

    opened by arijitgupta42 0
  • Bump wheel from 0.33.6 to 0.38.1

    Bumps wheel from 0.33.6 to 0.38.1.

    dependencies python 
    opened by dependabot[bot] 0
  • python error

    python error

    • OCTIS version:
    • Python version:

    Description

    runtime error

    What I Did

    from octis.models.LDA import LDA

    This import raises the following error:
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-54-5a8917c9d1ca> in <module>
    ----> 1 import octis.models.LDA
    
    6 frames
    /usr/local/lib/python3.8/dist-packages/gensim/_matutils.pyx in init gensim._matutils()
    
    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
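
This ValueError typically appears when a compiled extension (here gensim's `_matutils`) was built against a different NumPy version than the one installed. As a first diagnostic step, the sketch below lists the installed versions of the packages involved. The helper name `installed_versions` is hypothetical and not part of OCTIS.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not present in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["numpy", "gensim", "octis"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Including this output in the report makes it easier to tell whether the NumPy and gensim builds are mutually compatible. Which version pairs work depends on the environment.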
    
    
    opened by fatmas1982 0
  • Evaluating a custom topic model

    Evaluating a custom topic model

    Is there any way to use your package to evaluate a custom topic model? What outputs, and in what formats, should I get from my model in order to compute the metrics?
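
As a rough sketch of what such an evaluation needs: diversity-style metrics only require the top words of each topic. The snippet below assumes the custom model's output is wrapped in a dictionary with a `topics` key holding one list of words per topic. This layout is an assumption here and should be checked against the model output format described in the README. The function `topic_diversity` is a hand-written illustration of the fraction-of-unique-words idea, not the OCTIS implementation.

```python
def topic_diversity(model_output, topk=10):
    """Fraction of unique words among the top-k words of all topics."""
    # Collect the first topk words of every topic into one flat list.
    top_words = [word for topic in model_output["topics"] for word in topic[:topk]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical output of a custom model: two topics, three words each.
custom_output = {
    "topics": [
        ["game", "team", "season"],
        ["space", "nasa", "team"],
    ]
}

print(topic_diversity(custom_output, topk=3))
```

Here "team" appears in both topics, so five of the six top words are unique. Metrics that rely on document statistics, such as coherence, would additionally need access to a reference corpus.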

    opened by nfsedaghat 0
  • No parameter for topk words per topic

    No parameter for topk words per topic

    • OCTIS version: 1.83
    • Python version: 3.8
    • Operating System: Windows

    Description

    When training a model, I want to adjust the number of words per topic. However, I did not see this option in the source code of various models. Can you add this option for the different models?
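
If a model does not expose such an option, the top words can be recomputed from a topic-word weight matrix. The sketch below is a minimal illustration. It assumes the model provides one row of weights per topic, aligned with a vocabulary list. The names `top_words_per_topic`, `topic_word_matrix` and `vocabulary` are hypothetical.

```python
def top_words_per_topic(topic_word_matrix, vocabulary, topk=10):
    """Return the topk highest-weighted words for every topic row."""
    topics = []
    for weights in topic_word_matrix:
        # Rank word indices by descending weight; ties keep vocabulary order.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([vocabulary[i] for i in ranked[:topk]])
    return topics


# Hypothetical toy data: two topics over a four-word vocabulary.
vocabulary = ["ball", "goal", "orbit", "rocket"]
topic_word_matrix = [
    [0.5, 0.3, 0.1, 0.1],
    [0.05, 0.05, 0.4, 0.5],
]

print(top_words_per_topic(topic_word_matrix, vocabulary, topk=2))
```

Because the ranking is done after training, `topk` can be changed without retraining the model.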

    opened by ERijck 1
  • Octis startup argparse issue

    Octis startup argparse issue

    • OCTIS version: latest
    • Python version: 3.8
    • Operating System: linux

    Description

    Supplying --host as a command-line argument does not change the host when starting up the server.

    What I Did

    python octis/dashboard/server.py --host 0.0.0.0

    The server still starts up on 127.0.0.1:

    * Running on http://127.0.0.1:5000
    Press CTRL+C to quit
    

    Error here
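
The symptom is consistent with the parsed `--host` value never reaching the call that starts the server, although this has not been checked against `octis/dashboard/server.py`. The sketch below shows the intended wiring in isolation. `parse_args`, `server_url` and the default values are hypothetical stand-ins, and the Flask call in the final comment marks where the parsed values would be forwarded.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line options for a dashboard-style server."""
    parser = argparse.ArgumentParser(description="Start the dashboard server")
    parser.add_argument("--host", default="127.0.0.1", help="interface to bind to")
    parser.add_argument("--port", type=int, default=5000, help="port to listen on")
    return parser.parse_args(argv)


def server_url(args):
    """Build the address the server would be started on from the parsed options."""
    return f"http://{args.host}:{args.port}"


args = parse_args(["--host", "0.0.0.0"])
print(server_url(args))
# With Flask, the same values would be forwarded as:
#     app.run(host=args.host, port=args.port)
```

If the server is started with hard-coded values instead of the parsed ones, the command-line option is silently ignored, which would match the reported behaviour.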

    opened by Alig1493 0
Owner
MIND
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.

MILA will stop developing Theano <https:

null 9.6k Jan 6, 2023
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.

MILA will stop developing Theano <https:

null 9.3k Feb 12, 2021
The Narya API allows you to track soccer players from camera inputs, and evaluate them with an Expected Discounted Goal (EDG) Agent

Narya The Narya API allows you to track soccer players from camera inputs, and evaluate them with an Expected Discounted Goal (EDG) Agent. This repository

Paul Garnier 121 Dec 30, 2022
Quickly comparing your image classification models with the state-of-the-art models (such as DenseNet, ResNet, ...)

Image Classification Project Killer in PyTorch This repo is designed for those who want to start their experiments two days before the deadline and ki

null 349 Dec 8, 2022
Official PyTorch implementation of the paper "Recycling Discriminator: Towards Opinion-Unaware Image Quality Assessment Using Wasserstein GAN", accepted to ACM MM 2021 BNI Track.

RecycleD Official PyTorch implementation of the paper "Recycling Discriminator: Towards Opinion-Unaware Image Quality Assessment Using Wasserstein GAN

Yunan Zhu 23 Nov 5, 2022
HeatNet is a python package that provides tools to build, train and evaluate neural networks designed to predict extreme heat wave events globally on daily to subseasonal timescales.

HeatNet HeatNet is a python package that provides tools to build, train and evaluate neural networks designed to predict extreme heat wave events glob

Google Research 6 Jul 7, 2022
ONNX Runtime Web demo is an interactive demo portal showing real use cases running ONNX Runtime Web in VueJS.

ONNX Runtime Web demo is an interactive demo portal showing real use cases running ONNX Runtime Web in VueJS. It currently supports four examples for you to quickly experience the power of ONNX Runtime Web.

Microsoft 58 Dec 18, 2022
A toolkit for developing and comparing reinforcement learning algorithms.

Status: Maintenance (expect bug fixes and minor updates) OpenAI Gym OpenAI Gym is a toolkit for developing and comparing reinforcement learning algori

OpenAI 29.6k Jan 8, 2023
Project looking into the use of autoencoders for semi-supervised learning and comparing their data requirements with those of supervised learning.

Tom-R.T.Kvalvaag 2 Dec 17, 2021
Ludwig is a toolbox that allows users to train and evaluate deep learning models without the need to write code.

Ludwig is a toolbox that allows users to train and test deep learning models without the need to write code. It is built on

Ludwig 8.7k Jan 5, 2023
Ludwig is a toolbox that allows users to train and evaluate deep learning models without the need to write code.

Ludwig is a toolbox that allows users to train and test deep learning models without the need to write code. It is built on

Ludwig 8.7k Dec 31, 2022
Let Python optimize the best stop loss and take profits for your TradingView strategy.

TradingView Machine Learning TradeView is a free and open source Trading View bot written in Python. It is designed to support all major exchanges. It

Robert Roman 473 Jan 9, 2023
TensorFlow implementation for Bayesian Modeling and Uncertainty Quantification for Learning to Optimize: What, Why, and How

Bayesian Modeling and Uncertainty Quantification for Learning to Optimize: What, Why, and How TensorFlow implementation for Bayesian Modeling and Unce

Shen Lab at Texas A&M University 8 Sep 2, 2022
Open-L2O: A Comprehensive and Reproducible Benchmark for Learning to Optimize Algorithms

Open-L2O This repository establishes the first comprehensive benchmark efforts of existing learning to optimize (L2O) approaches on a number of proble

VITA 161 Jan 2, 2023
Official implementation for "Symbolic Learning to Optimize: Towards Interpretability and Scalability"

Symbolic Learning to Optimize This is the official implementation for ICLR-2022 paper "Symbolic Learning to Optimize: Towards Interpretability and Sca

VITA 8 Dec 19, 2022
Sequential model-based optimization with a `scipy.optimize` interface

Scikit-Optimize Scikit-Optimize, or skopt, is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements

Scikit-Optimize 2.5k Jan 4, 2023
An end-to-end machine learning library to directly optimize AUC loss

LibAUC An end-to-end machine learning library for AUC optimization. Why LibAUC? Deep AUC Maximization (DAM) is a paradigm for learning a deep neural n

Andrew 75 Dec 12, 2022
Optimize Trading Strategies Using Freqtrade

Optimize trading strategy using Freqtrade Short demo on building, testing and optimizing a trading strategy using Freqtrade. The DevBootstrap YouTube

DevBootstrap 139 Jan 1, 2023