Top2Vec is an algorithm for topic modeling and semantic search.

Overview

Update: Pre-trained Universal Sentence Encoders and BERT Sentence Transformer are now available for embedding.

Top2Vec

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors. Once you train the Top2Vec model you can:

  • Get number of detected topics.
  • Get topics.
  • Get topic sizes.
  • Get hierarchical topics.
  • Search topics by keywords.
  • Search documents by topic.
  • Search documents by keywords.
  • Find similar words.
  • Find similar documents.
  • Expose model with RESTful-Top2Vec.

See the paper for more details on how it works.

Benefits

  1. Automatically finds number of topics.
  2. No stop word lists required.
  3. No need for stemming/lemmatization.
  4. Works on short text.
  5. Creates jointly embedded topic, document, and word vectors.
  6. Has search functions built in.

How does it work?

The assumption the algorithm makes is that many semantically similar documents are indicative of an underlying topic. The first step is to create a joint embedding of document and word vectors. Once documents and words are embedded in a vector space the goal of the algorithm is to find dense clusters of documents, then identify which words attracted those documents together. Each dense area is a topic and the words that attracted the documents to the dense area are the topic words.

The Algorithm:

1. Create jointly embedded document and word vectors using Doc2Vec, Universal Sentence Encoder, or BERT Sentence Transformer.

Documents will be placed close to other similar documents and close to the most distinguishing words.

2. Create lower dimensional embedding of document vectors using UMAP.

Document vectors in high dimensional space are very sparse, so dimension reduction helps with finding dense areas. Each point is a document vector.

3. Find dense areas of documents using HDBSCAN.

The colored areas are the dense areas of documents. Red points are outliers that do not belong to a specific cluster.

4. For each dense area, calculate the centroid of the document vectors in the original dimension; this is the topic vector.

The red points are outlier documents and do not get used for calculating the topic vector. The purple points are the document vectors that belong to a dense area, from which the topic vector is calculated.

5. Find n-closest word vectors to the resulting topic vector.

The closest word vectors in order of proximity become the topic words.
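Steps 4 and 5 can be illustrated with a minimal, dependency-free sketch. It assumes toy 2-D vectors and cluster labels that were already produced by some clustering step, with outliers labelled -1 (the convention HDBSCAN uses). The helper names `centroid`, `cosine_similarity` and `topic_words_for_cluster` are hypothetical and are not part of Top2Vec.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def topic_words_for_cluster(doc_vectors, labels, word_vectors, cluster, n=2):
    """Return the topic vector of one cluster and its n closest words."""
    # Step 4: only documents assigned to the cluster contribute to the centroid;
    # outliers (label -1, an assumed convention) are ignored.
    members = [vec for vec, label in zip(doc_vectors, labels) if label == cluster]
    topic_vector = centroid(members)
    # Step 5: rank every word by cosine similarity to the topic vector.
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine_similarity(word_vectors[word], topic_vector),
        reverse=True,
    )
    return topic_vector, ranked[:n]


# Toy data: four documents, two clusters (0 and 1) and one outlier (-1).
doc_vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [-1.0, -1.0]]
labels = [0, 0, 1, -1]
word_vectors = {"space": [0.9, 0.1], "nasa": [1.0, 0.3], "recipe": [0.0, 1.0]}

topic_vector, words = topic_words_for_cluster(
    doc_vectors, labels, word_vectors, cluster=0, n=2
)
print(topic_vector, words)
```

In this toy setup the outlier document has no influence on the topic vector, and the words are returned in order of proximity to it.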

Installation

The easy way to install Top2Vec is:

pip install top2vec

To install pre-trained universal sentence encoder options:

pip install top2vec[sentence_encoders]

To install pre-trained BERT sentence transformer options:

pip install top2vec[sentence_transformers]

To install indexing options:

pip install top2vec[indexing]

Usage

from top2vec import Top2Vec

model = Top2Vec(documents)

Important parameters:

  • documents: Input corpus, should be a list of strings.

  • speed: This parameter determines how long the model takes to train. The 'fast-learn' option is the fastest and generates the lowest quality vectors. The 'learn' option learns better quality vectors but takes longer to train. The 'deep-learn' option learns the best quality vectors but takes significant time to train.

  • workers: The number of worker threads to use when training the model. More workers generally lead to faster training.
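As a hedged illustration of these parameters, the sketch below wraps the constructor in a small helper that checks the arguments before training. The helper `train_top2vec` and the tuple `VALID_SPEEDS` are hypothetical names introduced here; the check only covers the three speed options described above, and the installed version of Top2Vec may accept others.

```python
# The three speed options described above (the library may accept more).
VALID_SPEEDS = ("fast-learn", "learn", "deep-learn")


def train_top2vec(documents, speed="learn", workers=4):
    """Validate the arguments, then train and return a Top2Vec model."""
    if not isinstance(documents, list) or not all(
        isinstance(doc, str) for doc in documents
    ):
        raise TypeError("documents should be a list of strings")
    if speed not in VALID_SPEEDS:
        raise ValueError(f"speed should be one of {VALID_SPEEDS}, got {speed!r}")
    if workers < 1:
        raise ValueError("workers should be a positive integer")
    # Imported lazily so the checks above also work where top2vec is not installed.
    from top2vec import Top2Vec

    return Top2Vec(documents=documents, speed=speed, workers=workers)


# Usage, assuming `documents` is a list of strings large enough to train on:
# model = train_top2vec(documents, speed="learn", workers=8)
```

Failing early on a mistyped speed value avoids discovering the problem only after pre-processing a large corpus.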

Trained models can be saved and loaded.

model.save("filename")
model = Top2Vec.load("filename")

For more information view the API guide.

Pretrained Models

Doc2Vec is used by default to generate the joint word and document embeddings. However, there are also pretrained embedding_model options for generating joint word and document embeddings:

  • universal-sentence-encoder
  • universal-sentence-encoder-multilingual
  • distiluse-base-multilingual-cased

from top2vec import Top2Vec

model = Top2Vec(documents, embedding_model='universal-sentence-encoder')

For large data sets and data sets with a very distinctive vocabulary, doc2vec could produce better results. This trains a doc2vec model from scratch. The method is language agnostic; however, multiple languages will not be aligned.

Using the universal sentence encoder options will be much faster since those are pre-trained and efficient models. They are suggested for smaller data sets. They are also good options for large data sets that are in English or in languages covered by the multilingual model, and for data sets that are multilingual.

The distiluse-base-multilingual-cased pre-trained sentence transformer is suggested for multilingual datasets and languages that are not covered by the multilingual universal sentence encoder. The transformer is significantly slower than the universal sentence encoder options.
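The guidance above can be condensed into a small decision helper. This is only one possible reading of the suggestions, not a rule taken from the library: the function `choose_embedding_model` is hypothetical, the set of languages covered by the multilingual universal sentence encoder must be supplied by the caller, and the language codes are placeholders. The string 'doc2vec' is assumed to be the value that selects the default Doc2Vec embedding.

```python
def choose_embedding_model(languages, covered_languages, distinctive_vocabulary=False):
    """Suggest an embedding_model value from the languages in a corpus.

    languages: language codes present in the corpus (placeholder codes).
    covered_languages: codes the caller knows the multilingual universal
        sentence encoder supports (supplied by the caller, not hard-coded).
    distinctive_vocabulary: True for corpora with very distinctive vocabulary.
    """
    languages = set(languages)
    # A single-language corpus with distinctive vocabulary: train doc2vec from scratch.
    if distinctive_vocabulary and len(languages) == 1:
        return "doc2vec"
    # English only: the plain universal sentence encoder.
    if languages == {"en"}:
        return "universal-sentence-encoder"
    # Every language is covered by the multilingual universal sentence encoder.
    if languages <= set(covered_languages):
        return "universal-sentence-encoder-multilingual"
    # Otherwise fall back to the slower multilingual sentence transformer.
    return "distiluse-base-multilingual-cased"


covered = {"en", "fr", "de"}  # placeholder set, for illustration only
print(choose_embedding_model(["en"], covered))
print(choose_embedding_model(["en", "fr"], covered))
print(choose_embedding_model(["xx"], covered))
```

The returned string can then be passed as the embedding_model argument shown earlier.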

More information on universal-sentence-encoder, universal-sentence-encoder-multilingual, and distiluse-base-multilingual-cased.

Citation

If you would like to cite Top2Vec in your work this is the current reference:

@article{angelov2020top2vec,
      title={Top2Vec: Distributed Representations of Topics}, 
      author={Dimo Angelov},
      year={2020},
      eprint={2008.09470},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Example

Train Model

Train a Top2Vec model on the 20newsgroups dataset.

from top2vec import Top2Vec
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)

Get Number of Topics

This will return the number of topics that Top2Vec has found in the data.

>>> model.get_num_topics()
77

Get Topic Sizes

This will return the number of documents most similar to each topic. Topics are in decreasing order of size.

topic_sizes, topic_nums = model.get_topic_sizes()

Returns:

  • topic_sizes: The number of documents most similar to each topic.

  • topic_nums: The unique index of every topic will be returned.

Get Topics

This will return the topics in decreasing order of size.

topic_words, word_scores, topic_nums = model.get_topics(77)

Returns:

  • topic_words: For each topic the top 50 words are returned, in order of semantic similarity to topic.

  • word_scores: For each topic the cosine similarity scores of the top 50 words to the topic are returned.

  • topic_nums: The unique index of every topic will be returned.
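The three return values line up index by index, so they can be walked together. The sketch below uses made-up stand-in lists shaped like the return values described above (shortened to three words per topic); the helper `summarize_topics` is a hypothetical name.

```python
def summarize_topics(topic_words, word_scores, topic_nums, top_n=3):
    """Build one summary line per topic from the three parallel sequences."""
    lines = []
    for words, scores, num in zip(topic_words, word_scores, topic_nums):
        # Words are already ordered by similarity, so slicing keeps the best ones.
        best = list(zip(words, scores))[:top_n]
        described = ", ".join(f"{word} ({score:.2f})" for word, score in best)
        lines.append(f"Topic {num}: {described}")
    return lines


# Made-up stand-in values, not real model output.
topic_words = [["space", "nasa", "shuttle"], ["medicine", "doctor", "patients"]]
word_scores = [[0.71, 0.66, 0.60], [0.68, 0.59, 0.55]]
topic_nums = [0, 1]

summary = summarize_topics(topic_words, word_scores, topic_nums, top_n=2)
for line in summary:
    print(line)
```

With a trained model, the same helper could be given the values returned by model.get_topics directly.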

Search Topics

We are going to search for topics most similar to medicine.

topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["medicine"], num_topics=5)

Returns:

  • topic_words: For each topic the top 50 words are returned, in order of semantic similarity to topic.

  • word_scores: For each topic the cosine similarity scores of the top 50 words to the topic are returned.

  • topic_scores: For each topic the cosine similarity to the search keywords will be returned.

  • topic_nums: The unique index of every topic will be returned.

>>> topic_nums
[21, 29, 9, 61, 48]

>>> topic_scores
[0.4468, 0.381, 0.2779, 0.2566, 0.2515]

Topic 21 was the most similar topic to "medicine", with a cosine similarity of 0.4468. (Cosine similarity is at most 1; higher values mean more similar.)
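Because the scores come back alongside the topic numbers, weak matches can be filtered out before further use. The sketch below reuses the values shown above; the cutoff of 0.3 is an arbitrary assumption chosen for illustration, and `topics_above` is a hypothetical helper.

```python
# Values copied from the search output above.
topic_nums = [21, 29, 9, 61, 48]
topic_scores = [0.4468, 0.381, 0.2779, 0.2566, 0.2515]


def topics_above(topic_nums, topic_scores, min_score):
    """Keep (topic number, score) pairs whose score reaches min_score."""
    return [
        (num, score)
        for num, score in zip(topic_nums, topic_scores)
        if score >= min_score
    ]


strong_matches = topics_above(topic_nums, topic_scores, min_score=0.3)
print(strong_matches)
```

A suitable cutoff depends on the corpus and the embedding model, so it is worth inspecting a few topics on either side of it.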

Generate Word Clouds

Using a topic number you can generate a word cloud. We are going to generate word clouds for the top 5 most similar topics to our medicine topic search from above.

topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["medicine"], num_topics=5)
for topic in topic_nums:
    model.generate_topic_wordcloud(topic)

Search Documents by Topic

We are going to search by topic 48, a topic that appears to be about science.

documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)

Returns:

  • documents: The documents in a list, the most similar are first.

  • doc_scores: Semantic similarity of document to topic. The cosine similarity of the document and topic vector.

  • doc_ids: Unique ids of documents. If ids were not given, the index of document in the original corpus.

For each of the returned documents we are going to print its content, score and document number.

documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

Document: 15227, Score: 0.6322
-----------
  Evolution is both fact and theory.  The THEORY of evolution represents the
scientific attempt to explain the FACT of evolution.  The theory of evolution
does not provide facts; it explains facts.  It can be safely assumed that ALL
scientific theories neither provide nor become facts but rather EXPLAIN facts.
I recommend that you do some appropriate reading in general science.  A good
starting point with regard to evolution for the layman would be "Evolution as
Fact and Theory" in "Hen's Teeth and Horse's Toes" [pp 253-262] by Stephen Jay
Gould.  There is a great deal of other useful information in this publication.
-----------

Document: 14515, Score: 0.6186
-----------
Just what are these "scientific facts"?  I have never heard of such a thing.
Science never proves or disproves any theory - history does.

-Tim
-----------

Document: 9433, Score: 0.5997
-----------
The same way that any theory is proven false.  You examine the predicitions
that the theory makes, and try to observe them.  If you don't, or if you
observe things that the theory predicts wouldn't happen, then you have some 
evidence against the theory.  If the theory can't be modified to 
incorporate the new observations, then you say that it is false.

For example, people used to believe that the earth had been created
10,000 years ago.  But, as evidence showed that predictions from this 
theory were not true, it was abandoned.
-----------

Document: 11917, Score: 0.5845
-----------
The point about its being real or not is that one does not waste time with
what reality might be when one wants predictions. The questions if the
atoms are there or if something else is there making measurements indicate
atoms is not necessary in such a system.

And one does not have to write a new theory of existence everytime new
models are used in Physics.
-----------

...

Semantic Search Documents by Keywords

Search documents for content semantically similar to cryptography and privacy.

documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["cryptography", "privacy"], num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

Document: 16837, Score: 0.6112
-----------
...
Email and account privacy, anonymity, file encryption,  academic 
computer policies, relevant legislation and references, EFF, and 
other privacy and rights issues associated with use of the Internet
and global networks in general.
...

Document: 16254, Score: 0.5722
-----------
...
The President today announced a new initiative that will bring
the Federal Government together with industry in a voluntary
program to improve the security and privacy of telephone
communications while meeting the legitimate needs of law
enforcement.
...
-----------
...

Similar Keywords

Search for similar words to space.

words, word_scores = model.similar_words(keywords=["space"], keywords_neg=[], num_words=20)
for word, score in zip(words, word_scores):
    print(f"{word} {score}")

space 1.0
nasa 0.6589
shuttle 0.5976
exploration 0.5448
planetary 0.5391
missions 0.5069
launch 0.4941
telescope 0.4821
astro 0.4696
jsc 0.4549
ames 0.4515
satellite 0.446
station 0.4445
orbital 0.4438
solar 0.4386
astronomy 0.4378
observatory 0.4355
facility 0.4325
propulsion 0.4251
aerospace 0.4226
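The call above also takes a keywords_neg list. One common way to use negative keywords in embedding search is to subtract their vectors from the positive ones before ranking by cosine similarity. The sketch below illustrates that idea on toy 2-D vectors; it is an assumption made for illustration and not necessarily how Top2Vec combines keywords internally. The functions defined here are standalone helpers, not methods of the model.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def combine_keywords(word_vectors, keywords, keywords_neg=()):
    """Add unit vectors of positive keywords and subtract those of negative ones."""
    dims = len(next(iter(word_vectors.values())))
    combined = [0.0] * dims
    for sign, words in ((1.0, keywords), (-1.0, keywords_neg)):
        for word in words:
            for i, x in enumerate(normalize(word_vectors[word])):
                combined[i] += sign * x
    return normalize(combined)


def similar_words(word_vectors, keywords, keywords_neg=(), num_words=3):
    """Rank all words by cosine similarity to the combined query vector."""
    query = combine_keywords(word_vectors, keywords, keywords_neg)
    scored = []
    for word, vector in word_vectors.items():
        score = sum(a * b for a, b in zip(normalize(vector), query))
        scored.append((word, round(score, 4)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_words]


# Toy vectors: "space" sits between an astronomy sense and a room sense.
toy_vectors = {"space": [0.7, 0.7], "nasa": [1.0, 0.1], "room": [0.2, 1.0]}

plain = similar_words(toy_vectors, ["space"])
steered = similar_words(toy_vectors, ["space"], keywords_neg=["room"])
print([word for word, _ in plain])
print([word for word, _ in steered])
```

In the toy data, adding "room" as a negative keyword moves "nasa" ahead of "room" in the ranking, which is the kind of steering negative keywords are meant to provide.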

Comments
  • numpy causing various errors

    I've been having trouble with numpy when using Top2Vec version 1.0.20 with Python 3.8.0 on Ubuntu 18.04; I experience the same problems using Python 3.7.5. I've tried installing numpy 1.0.20, numpy 1.19.5.

    See this issue for the HDBSCAN error,

    and this issue for the UMAP error.

    UMAP

    PicklingError:
    
    (snip)
    
    /data/.top2vec/lib/python3.8/site-packages/umap/umap_.py in fit(self, X, y)
       2571 
       2572         numba.set_num_threads(self._original_n_threads)
    -> 2573         self._input_hash = joblib.hash(self._raw_data)
       2574 
       2575         return self
    
    /data/.top2vec/lib/python3.8/site-packages/joblib/hashing.py in hash(obj, hash_name, coerce_mmap)
        259     else:
        260         hasher = Hasher(hash_name=hash_name)
    --> 261     return hasher.hash(obj)
    
    /data/.top2vec/lib/python3.8/site-packages/joblib/hashing.py in hash(self, obj, return_digest)
         61     def hash(self, obj, return_digest=True):
         62         try:
    ---> 63             self.dump(obj)
         64         except pickle.PicklingError as e:
         65             e.args += ('PicklingError while hashing %r: %r' % (obj, e),)
    
    (snip)
    
    PicklingError: ("Can't pickle <class 'numpy.dtype[float32]'>: it's not found as numpy.dtype[float32]", 'PicklingError while hashing array([[ 0.002187  , -0.00357572, -0.00279311, ...,  0.00120361,\n        -0.00115495,  0.00059189],\n       [-0.05823869,  0.01436491,  0.02220243, ...,  0.00703284,\n        -0.01716192, -0.01003473],\n       [-0.00334117,  0.00051066,  0.00269544, ...,  0.00070796,\n        -0.00202038, -0.00233051],\n       ...,\n       [ 0.00062888,  0.0027382 ,  0.0044361 , ..., -0.00229976,\n         0.00057765, -0.00033288],\n       [-0.00081269,  0.00099852, -0.00054314, ...,  0.00133646,\n        -0.00026089, -0.00150439],\n       [-0.01297437,  0.0104734 ,  0.01563089, ..., -0.00051685,\n        -0.00144138, -0.00556232]], dtype=float32): PicklingError("Can\'t pickle <class \'numpy.dtype[float32]\'>: it\'s not found as numpy.dtype[float32]")')
    

    HDBSCAN

    from top2vec import Top2Vec
    
    (snip)
    
    /data/.top2vec/lib/python3.8/site-packages/hdbscan/hdbscan_.py in <module>
         19 from scipy.sparse import csgraph
         20 
    ---> 21 from ._hdbscan_linkage import (single_linkage,
         22                                mst_linkage_core,
         23                                mst_linkage_core_vector,
    
    hdbscan/_hdbscan_linkage.pyx in init hdbscan._hdbscan_linkage()
    
    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    
    
    opened by AltfunsMA 12
  • Run out of memory on 1.6m point dataset with 300 dimensions.

    Hi, great work on Top2Vec. I am trying to apply it to my dataset, which has 1.6 million instances. I successfully trained Doc2Vec inside Top2Vec with the default 300 dimensions, but I run out of memory in the UMAP step within 2 minutes. I have 32 GB of memory. I also tried low_memory=True and got the same OOM error.

    So I wonder how much memory UMAP will need for 2m points with 300 dimensions. As a precaution, how much more memory will HDBSCAN need?

    Thank you!

    opened by kongyq 11
  • How to display Top2Vec Model in HDBSCAN or UMAP ?

    Hello,

    Forgive me for the newbie question, but having successfully built and saved a Top2Vec model:

         How can a saved Top2Vec model be viewed (visually rendered) in HDBSCAN or UMAP?
    

    I may be overlooking the obvious, but in reading through the documentation and Googling for answers, nothing has jumped out so far.

    Most grateful,

    Chris

    opened by None-Such 10
  • TypeError: __init__() got an unexpected keyword argument 'vector_size'

    Hi,

    I created a conda env with Python 3.6 and installed top2vec.

    I then tried the example below to test the install

    from top2vec import Top2Vec
    from sklearn.datasets import fetch_20newsgroups

    newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
    model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)

    and I get the following output/error:

    2020-12-18 00:20:24,861 - top2vec - INFO - Pre-processing documents for training
    2020-12-18 00:20:31,459 - top2vec - INFO - Creating joint document/word embedding
    Traceback (most recent call last):
      File "", line 1, in
      File "conda_envs/top2vec/lib/python3.6/site-packages/top2vec/Top2Vec.py", line 285, in __init__
        self.model = Doc2Vec(**doc2vec_args)
      File "/home/.local/lib/python3.6/site-packages/gensim/models/doc2vec.py", line 634, in __init__
        **kwargs)
    TypeError: __init__() got an unexpected keyword argument 'vector_size'

    Can you please help me with it?

    Thanks

    opened by gianfilippo 10
  • [Installation Issue] Unable to install dependencies(tensorflow-text) while installing Top2Vec

    I am trying to install top2vec but getting the following error when I do 'pip install top2vec==1.0.15'

    ERROR: Could not find a version that satisfies the requirement tensorflow-text (from top2vec) (from versions: none)
    ERROR: No matching distribution found for tensorflow-text (from top2vec)

    I have Windows 10, Python 3.7, x64.

    From what I understand, currently, tensorflow-text isn't available for Windows, so could you guys provide any resolution for this?

    opened by Alisha1992 10
  • ValueError: numpy.ndarray size changed, may indicate binary incompatibility.

    A few days ago, the error "ValueError: numpy.ndarray size changed, may indicate binary incompatibility." started occurring when executing code that had worked a week earlier without any problems.

    The same issue is with BERTopic (https://github.com/MaartenGr/BERTopic/issues/392), so I thought maybe it would be beneficial to link it there. For now, it seems there is no easy solution to that problem

    opened by maciejbiesek 9
  • What would be the best way to incorporate NER?

    I'd like to use an NER to embed broader terms instead of just the unigrams.

    I'm not 100% sure how the unigrams are consumed. So if I wanted to embed "New York" instead of splitting it, what does the format of the output of the tokenizer need to be?

    opened by datavistics 9
  • ValueError: list.remove(x): x not in list in model.hierarchical_topic_reduction()

    I created a model with

    model= Top2Vec(documents_text2, min_count = 4,
                           speed = "fast-learn", 
                           document_ids=document_ids2, 
                           workers = workers_n,keep_documents=False)
    

    Then I tried to reduce the number of topics with

    model.hierarchical_topic_reduction()

    and get this error

    model10 = model.hierarchical_topic_reduction(1000)
    Traceback (most recent call last):
    
      File "<ipython-input-12-4ee6e263e4a0>", line 1, in <module>
        model10 = model.hierarchical_topic_reduction(1000)
    
      File "C:\Users\anaconda\.conda\envs\top2vec_final\lib\site-packages\top2vec\Top2Vec.py", line 1215, in hierarchical_topic_reduction
        ix_keep.remove(most_sim)
    
    ValueError: list.remove(x): x not in list
    
    opened by p-dre 9
  • ImportError: universal-sentence-encoder is not available.

    Hi! I'm getting the above error on the following code:

    from top2vec import Top2Vec
    model = Top2Vec(documents=df['transcript'].values, speed="learn", embedding_model='universal-sentence-encoder')
    

    Full Exception Traceback:

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-3-12fb6ba4e3a8> in <module>
          1 from top2vec import Top2Vec
          2 
    ----> 3 model = Top2Vec(documents=df['transcript'].values, speed="learn", embedding_model='universal-sentence-encoder')
    
    ~\Anaconda3\lib\site-packages\top2vec\Top2Vec.py in __init__(self, documents, min_count, embedding_model, embedding_model_path, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, verbose)
        278             self.embedding_model = embedding_model
        279 
    --> 280             self._check_import_status()
        281 
        282             logger.info('Pre-processing documents for training')
    
    ~\Anaconda3\lib\site-packages\top2vec\Top2Vec.py in _check_import_status(self)
        642         if self.embedding_model != 'distiluse-base-multilingual-cased':
        643             if not _HAVE_TENSORFLOW:
    --> 644                 raise ImportError(f"{self.embedding_model} is not available.\n\n"
        645                                   "Try: pip install top2vec[sentence_encoders]\n\n"
        646                                   "Alternatively try: pip install tensorflow tensorflow_hub tensorflow_text")
    
    ImportError: universal-sentence-encoder is not available.
    
    Try: pip install top2vec[sentence_encoders]
    
    Alternatively try: pip install tensorflow tensorflow_hub tensorflow_text
    

    I have all of these libraries installed (see below), but this error won't go away.

    (base) C:\Users\rsiddiqui>pip install top2vec[sentence_encoders] Requirement already satisfied: top2vec[sentence_encoders] in c:\users\rsiddiqui\anaconda3\lib\site-packages (1.0.16) Requirement already satisfied: numpy in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from top2vec[sentence_encoders]) (1.18.5) Requirement already satisfied: umap-learn in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (0.4.6) Requirement already satisfied: gensim in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (3.8.3) Requirement already satisfied: pandas in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (1.1.3) Requirement already satisfied: wordcloud in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (1.8.1) Requirement already satisfied: hdbscan in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (0.8.26) Requirement already satisfied: pynndescent>=0.4 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (0.5.1) Requirement already satisfied: tensorflow-text; extra == "sentence_encoders" in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (2.4.0rc0) Requirement already satisfied: tensorflow-hub; extra == "sentence_encoders" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from top2vec[sentence_encoders]) (0.9.0) Requirement already satisfied: tensorflow; extra == "sentence_encoders" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from top2vec[sentence_encoders]) (2.3.1) Requirement already satisfied: numba!=0.47,>=0.46 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from umap-learn->top2vec[sentence_encoders]) (0.51.2) Requirement already satisfied: scikit-learn>=0.20 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from umap-learn->top2vec[sentence_encoders]) (0.23.2) Requirement 
already satisfied: scipy>=1.3.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from umap-learn->top2vec[sentence_encoders]) (1.5.4) Requirement already satisfied: smart-open>=1.8.1 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from gensim->top2vec[sentence_encoders]) (3.0.0) Requirement already satisfied: six>=1.5.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from gensim->top2vec[sentence_encoders]) (1.15.0) Requirement already satisfied: Cython==0.29.14 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from gensim->top2vec[sentence_encoders]) (0.29.14) Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from pandas->top2vec[sentence_encoders]) (2.8.1) Requirement already satisfied: pytz>=2017.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from pandas->top2vec[sentence_encoders]) (2020.4) Requirement already satisfied: pillow in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from wordcloud->top2vec[sentence_encoders]) (8.0.1) Requirement already satisfied: matplotlib in c:\users\rsiddiqui\anaconda3\lib\site-packages (from wordcloud->top2vec[sentence_encoders]) (3.2.2) Requirement already satisfied: joblib in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from hdbscan->top2vec[sentence_encoders]) (0.15.1) Requirement already satisfied: llvmlite>=0.30 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from pynndescent>=0.4->top2vec[sentence_encoders]) (0.34.0) Requirement already satisfied: protobuf>=3.8.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow-hub; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.13.0) Requirement already satisfied: tensorflow-estimator<2.4.0,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (2.3.0) 
Requirement already satisfied: google-pasta>=0.1.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.2.0) Requirement already satisfied: wheel>=0.26 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.35.1) Requirement already satisfied: absl-py>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.10.0) Requirement already satisfied: h5py<2.11.0,>=2.10.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (2.10.0) Requirement already satisfied: termcolor>=1.1.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.1.0) Requirement already satisfied: keras-preprocessing<1.2,>=1.1.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.1.2) Requirement already satisfied: opt-einsum>=2.3.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.3.0) Requirement already satisfied: wrapt>=1.11.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.12.1) Requirement already satisfied: grpcio>=1.8.6 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.32.0) Requirement already satisfied: gast==0.3.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == 
"sentence_encoders"->top2vec[sentence_encoders]) (0.3.3) Requirement already satisfied: astunparse==1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.6.3) Requirement already satisfied: tensorboard<3,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (2.4.0) Requirement already satisfied: setuptools in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from numba!=0.47,>=0.46->umap-learn->top2vec[sentence_encoders]) (50.3.2) Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from scikit-learn>=0.20->umap-learn->top2vec[sentence_encoders]) (2.1.0) Requirement already satisfied: requests in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (2.25.0) Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from matplotlib->wordcloud->top2vec[sentence_encoders]) (1.3.1) Requirement already satisfied: cycler>=0.10 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from matplotlib->wordcloud->top2vec[sentence_encoders]) (0.10.0) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from matplotlib->wordcloud->top2vec[sentence_encoders]) (2.4.7) Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.4.2) Requirement already satisfied: werkzeug>=0.11.15 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == 
"sentence_encoders"->top2vec[sentence_encoders]) (1.0.1) Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.7.0) Requirement already satisfied: markdown>=2.6.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.3.3) Requirement already satisfied: google-auth<2,>=1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.23.0) Requirement already satisfied: idna<3,>=2.5 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (2.10) Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (1.26.2) Requirement already satisfied: certifi>=2017.4.17 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (2020.11.8) Requirement already satisfied: chardet<4,>=3.0.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (3.0.4) Requirement already satisfied: requests-oauthlib>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.3.0) Requirement already satisfied: cachetools<5.0,>=2.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == 
"sentence_encoders"->top2vec[sentence_encoders]) (4.1.1) Requirement already satisfied: pyasn1-modules>=0.2.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.2.8) Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3.5" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (4.6) Requirement already satisfied: oauthlib>=3.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.1.0) Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.4.8)

    (base) C:\Users\rsiddiqui>pip install tensorflow tensorflow_hub tensorflow_text
    Requirement already satisfied: tensorflow in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (2.3.1)
    Requirement already satisfied: tensorflow_hub in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (0.9.0)
    Requirement already satisfied: tensorflow_text in c:\users\rsiddiqui\anaconda3\lib\site-packages (2.4.0rc0)
    Requirement already satisfied: protobuf>=3.9.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (3.13.0)
    Requirement already satisfied: gast==0.3.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.3.3)
    Requirement already satisfied: termcolor>=1.1.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.1.0)
    Requirement already satisfied: tensorboard<3,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (2.4.0)
    Requirement already satisfied: grpcio>=1.8.6 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.32.0)
    Requirement already satisfied: tensorflow-estimator<2.4.0,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (2.3.0)
    Requirement already satisfied: astunparse==1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.6.3)
    Requirement already satisfied: six>=1.12.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.15.0)
    Requirement already satisfied: wrapt>=1.11.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.12.1)
    Requirement already satisfied: google-pasta>=0.1.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.2.0)
    Requirement already satisfied: keras-preprocessing<1.2,>=1.1.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.1.2)
    Requirement already satisfied: wheel>=0.26 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.35.1)
    Requirement already satisfied: opt-einsum>=2.3.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (3.3.0)
    Requirement already satisfied: h5py<2.11.0,>=2.10.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (2.10.0)
    Requirement already satisfied: numpy<1.19.0,>=1.16.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.18.5)
    Requirement already satisfied: absl-py>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.10.0)
    Requirement already satisfied: setuptools in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from protobuf>=3.9.2->tensorflow) (50.3.2)
    Requirement already satisfied: requests<3,>=2.21.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (2.25.0)
    Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.7.0)
    Requirement already satisfied: werkzeug>=0.11.15 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.0.1)
    Requirement already satisfied: google-auth<2,>=1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.23.0)
    Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (0.4.2)
    Requirement already satisfied: markdown>=2.6.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (3.3.3)
    Requirement already satisfied: chardet<4,>=3.0.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (3.0.4)
    Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (1.26.2)
    Requirement already satisfied: idna<3,>=2.5 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (2.10)
    Requirement already satisfied: certifi>=2017.4.17 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (2020.11.8)
    Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3.5" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (4.6)
    Requirement already satisfied: pyasn1-modules>=0.2.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (0.2.8)
    Requirement already satisfied: cachetools<5.0,>=2.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (4.1.1)
    Requirement already satisfied: requests-oauthlib>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow) (1.3.0)
    Requirement already satisfied: pyasn1>=0.1.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from rsa<5,>=3.1.4; python_version >= "3.5"->google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (0.4.8)
    Requirement already satisfied: oauthlib>=3.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow) (3.1.0)

    opened by Rmsharks4 9
  • "embedding_model" parameter in Top2Vec is unrecognized

    The code documentation says that a pretrained model can be selected with the embedding_model parameter, but the parameter is not recognized. I have already updated the library to the latest version.

    opened by Prashant118 9
  • TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N

    I'm using a set of text documents (PDF documents converted into text) for topic modeling. While training the model I'm getting this error. It would be a great help if someone could help me sort this out.

    C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\umap_.py:1678: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
      warn(
    C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py:1590: RuntimeWarning: k >= N for N * N square matrix. Attempting to use scipy.linalg.eigh instead.
      warnings.warn("k >= N for N * N square matrix. "
    Traceback (most recent call last):
      File "c:/Users/prabo/Desktop/Topic modeling pipeline/test.py", line 27, in
        model = Top2Vec(documents=df.text, speed="learn", workers=8)
      File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\top2vec\Top2Vec.py", line 222, in init
        umap_model = umap.UMAP(n_neighbors=15,
      File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\umap_.py", line 1965, in fit
        self.embedding_ = simplicial_set_embedding(
      File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\umap_.py", line 1033, in simplicial_set_embedding
        initialisation = spectral_layout(
      File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\spectral.py", line 324, in spectral_layout
        eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
      File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py", line 1595, in eigsh
        raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
    TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.
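
The two warnings in this report (`n_neighbors is larger than the dataset size` and `k >= N`) suggest that the corpus holds fewer documents than the UMAP neighbourhood size, which the traceback shows as `n_neighbors=15`. A minimal sketch of a check to run before training follows. The helper name `check_corpus_size` and the decision to raise `ValueError` are assumptions made for illustration; this is not part of Top2Vec.

```python
def check_corpus_size(documents, n_neighbors=15):
    """Fail early when the corpus is too small for the UMAP neighbourhood size.

    n_neighbors=15 mirrors the value visible in the traceback; this helper is
    hypothetical and is not provided by Top2Vec.
    """
    # Keep only non-empty documents, since blank strings carry no signal.
    usable = [doc for doc in documents if doc and doc.strip()]
    if len(usable) <= n_neighbors:
        raise ValueError(
            f"Only {len(usable)} usable documents; more than "
            f"{n_neighbors} are needed for n_neighbors={n_neighbors}."
        )
    return usable
```

Calling `check_corpus_size(df.text.tolist())` before constructing the model would surface the problem as a readable message instead of a failure deep inside the spectral initialisation.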

    opened by dulanafdo 9
  • How can extract topics from new added documents in inference

    Hey, this is an amazing project to work with. I was wondering whether there is any way to extract topics from newly added documents at inference time. Thanks in advance.

    opened by meetttttt 0
  • Stop words are included in the model and topics are generated with them

    Here is my topic_words outputs :

    0 Words: ['and' 'the' 'in' 'to' 'of' 'games' 'or' 'first' 'game' 'that' 'by' 'at' 'is' 'released' 'with' 'as' 'its' 'was' 'from' 'developed' 'for' 'it' 'series' 'video' 'were' 'produced' 'an' 'on' 'designed' 'aircraft' 'published' 'built']
    1 Words: ['series' 'games' 'an' 'was' 'by' 'with' 'and' 'first' 'published' 'in' 'is' 'from' 'released' 'of' 'to' 'as' 'the' 'it' 'at' 'were' 'designed' 'for' 'or' 'game' 'aircraft' 'its' 'on' 'built' 'that' 'produced' 'video' 'developed']

    The documentation says that no stop word removal is needed before using Top2Vec, and in a YouTube tutorial the presenter simply called Top2Vec without any extra parameters and the topics came out without stop words. Am I doing something wrong, or is this a bug?

    Thanks

    opened by cuneyttyler 0
  • AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names'

    Hello,

    Following the examples in the readme I created this code:

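    import multiprocessing

    from top2vec import Top2Vec

    # df is assumed to be an existing pandas DataFrame with a "cleaned_message" column.
    documents = df["cleaned_message"].tolist()
    model = Top2Vec(
        documents,
        embedding_model="universal-sentence-encoder",
        speed="learn",
        workers=multiprocessing.cpu_count() - 1,  # leave one core free
    )

    print(f"Num topics: {model.get_num_topics()}")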

    And that is throwing the following error:

    AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names'
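
This error typically appears with scikit-learn 1.2 or later, where `CountVectorizer.get_feature_names` was removed in favour of `get_feature_names_out`. A hedged sketch of a compatibility shim follows. The helper name `vocabulary_terms` is hypothetical, and the sketch only illustrates the version difference; it does not patch Top2Vec itself.

```python
def vocabulary_terms(vectorizer):
    """Return the fitted vocabulary as a list on old and new scikit-learn.

    Hypothetical helper: prefers get_feature_names_out (newer releases) and
    falls back to get_feature_names (older releases).
    """
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


# Usage with scikit-learn (assumed to be installed):
# from sklearn.feature_extraction.text import CountVectorizer
# vectorizer = CountVectorizer().fit(["topic models", "semantic search"])
# terms = vocabulary_terms(vectorizer)
```

Because the failing call sits inside the library, the practical options are usually to upgrade Top2Vec to a release that supports the installed scikit-learn, or to pin scikit-learn below 1.2.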

    opened by jmorenobl 1
  • access topic/document/etc. vectors

    First of all, great package! It is awesome to use!

    I was wondering if it is possible to access individual vectors at different levels of the model. For example, to extract the 3 topics that cover the most documents, I would want to use a combination of the between-topic spread and the within-topic spread of the vectors. Is it possible to extract these from the trained model?

    Thanks in advance!

    opened by SjoerdBraaksma 0
  • how to get bi-gram and tri-gram and n-gram topic words ?

    I remember that in LDA and NMF pipelines there is a configuration parameter called ngram_range; setting it to (2,2) or (3,3) yields topic words as bigrams or trigrams. Is there a similar setting in Top2Vec that produces bigram, trigram, or general n-gram topic words?
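
To make the `ngram_range` idea in this question concrete, here is a minimal sketch that builds word n-grams from a token list for an inclusive size range, so `(2, 2)` gives bigrams only. The function name `word_ngrams` is an assumption for illustration, and this is not how Top2Vec builds phrases internally.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return space-joined word n-grams for every size in the inclusive range."""
    n_min, n_max = ngram_range
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


bigrams = word_ngrams(["topic", "modeling", "with", "embeddings"], (2, 2))
```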

    opened by sivachaitanya 2
Releases(1.0.27)
  • 1.0.27(Apr 3, 2022)

    • New pre-trained transformer models available
    • Ability to use any embedding model by passing callable to embedding_model
    • New embedding_batch_size option
    • Document chunking options for long documents
    • Phrases in topics by setting ngram_vocab=True
  • 1.0.25(Jun 23, 2021)

    Added query_documents and query_topics methods which allow for using a sequence of text such as a question, a sentence, a paragraph or a document to query documents or topics.

    Added num_topics parameter to get_documents_topics method which allows retrieving multiple topics per document.

  • 1.0.24(Apr 1, 2021)

  • 1.0.23(Feb 12, 2021)

  • 1.0.22(Feb 12, 2021)

  • 1.0.21(Feb 5, 2021)

  • 1.0.20(Jan 9, 2021)

    Added use_embedding_model_tokenizer parameter. If set to True and if using an embedding_model other than doc2vec, use the model's tokenizer for document embedding.

    Fixed dependency issue with joblib.

    Fixed issues with wordclouds caused by negative similarity scores.

  • 1.0.19(Dec 10, 2020)

  • 1.0.18(Dec 10, 2020)

    Added an option for indexing word vectors, which speeds up search for models with large vocabularies, specifically search_words_by_vector and similar_words.

    Added new method search_words_by_vector.

  • 1.0.17(Dec 7, 2020)

    Added an option for indexing document vectors, which speeds up search for models with a large number of documents, specifically search_documents_by_vector, search_documents_by_keywords, and search_documents_by_documents.

    Added new method search_documents_by_vector.

    Added code to prevent hierarchical topic reduction error #79.

  • 1.0.16(Nov 10, 2020)

    Dependencies for the universal sentence encoder and BERT sentence transformer options are now optional. Install them with pip install top2vec[sentence_encoders] and pip install top2vec[sentence_transformers].

    Faster cosine similarity.

  • 1.0.15(Oct 16, 2020)

    The verbose parameter will be set to True by default.

    Fixed a bug that stopped showing logging updates after downloading pre-trained models.

  • 1.0.12(Oct 15, 2020)

    Top2Vec now has an option to choose the embedding model with doc2vec, universal-sentence-encoder, universal-sentence-encoder-multilingual, and distiluse-base-multilingual-cased as the options.

    A get_documents_topics method was added.

  • 1.0.11(Oct 8, 2020)

    Added a method for deleting documents from model.

    Fixed bug when using corpus_file that resulted in documents getting dropped. Fixed bug when using add_documents and delete_documents which resulted in improper ordering of topic words.

  • 1.0.10(Aug 29, 2020)

    Fixed an issue with the UMAP install that was caused by a missing comma in the setup.py file. An optional min_count parameter has been added; the default is still 50. All words with a total frequency lower than min_count are ignored by the model.
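
To illustrate the effect of a `min_count` threshold, here is a minimal sketch that keeps only words whose total corpus frequency reaches the threshold. The function name `filter_vocabulary` and the lowercase whitespace tokenisation are assumptions for illustration; Top2Vec's own tokenizer and vocabulary handling may differ.

```python
from collections import Counter


def filter_vocabulary(documents, min_count=50):
    """Return the set of words that occur at least min_count times in total."""
    # Naive tokenisation (an assumption): lowercase, split on whitespace.
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return {word for word, count in counts.items() if count >= min_count}


vocab = filter_vocabulary(["topic model", "topic search", "Topic vectors"], min_count=2)
```

On a small corpus the default of 50 can remove nearly every word, so lowering `min_count` is worth considering when few documents are available.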

  • 1.0.9(Jun 26, 2020)

    Added functionality to perform hierarchical topic reduction. Added the ability to add new documents to an already trained model. Added use_corpus option which may lead to faster training with very large datasets in multi-worker environments.

  • 1.0.8(Apr 18, 2020)

    Added an option for custom document ids, which can be strings or ints. Added an option to not save documents in the model, which allows the trained model to be used as an index and makes saved models smaller. Added the ability to pass in a custom tokenizer that overrides the default. Added a verbose mode that logs the status of training. Also added the ability to search documents by multiple documents, with positive and negative semantic search.

  • 1.0.7(Apr 7, 2020)

    Topic size is defined as the number of document vectors that have the topic as their nearest topic vector. Search by topic has been modified to show only documents that have the topic as their nearest topic, in order to avoid overlapping results from similar topics.
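
The topic size definition in this release note can be sketched in a few lines: assign each document vector to its nearest topic vector, then count the assignments per topic. The function names are hypothetical, and using cosine similarity as the measure of nearness is an assumption made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_topic(doc_vector, topic_vectors):
    """Index of the topic vector most similar to doc_vector."""
    return max(
        range(len(topic_vectors)),
        key=lambda i: cosine_similarity(doc_vector, topic_vectors[i]),
    )


def topic_sizes(doc_vectors, topic_vectors):
    """Number of documents whose nearest topic is each topic, in topic order."""
    counts = Counter(nearest_topic(doc, topic_vectors) for doc in doc_vectors)
    return [counts.get(i, 0) for i in range(len(topic_vectors))]


sizes = topic_sizes(
    [[0.9, 0.1], [0.8, 0.3], [0.1, 1.0]],  # toy document vectors
    [[1.0, 0.0], [0.0, 1.0]],              # toy topic vectors
)
```

Since every document is counted under exactly one topic, the sizes sum to the number of documents, which is what prevents overlapping results between similar topics.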

    Topic deduplication is added to make topics more robust.

  • 1.0.6(Mar 25, 2020)

Owner
Dimo Angelov
Data Scientist
Fast topic modeling platform

The state-of-the-art platform for topic modeling. Full Documentation User Mailing List Download Releases User survey What is BigARTM? BigARTM is a pow

BigARTM 633 Dec 21, 2022
This repo stores the codes for topic modeling on palliative care journals.

This repo stores the codes for topic modeling on palliative care journals. Data Preparation You first need to download the journal papers. bash 1_down

null 3 Dec 20, 2022
topic modeling on unstructured data in Space news articles retrieved from the Guardian (UK) newspaper using API

NLP Space News Topic Modeling Photos by nasa.gov (1, 2, 3, 4, 5) and extremetech.com Table of Contents Project Idea Data acquisition Primary data sour

edesz 1 Jan 3, 2022
Biterm Topic Model (BTM): modeling topics in short texts

Biterm Topic Model Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actua

Maksim Terpilowski 49 Dec 30, 2022
The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank

Main Idea The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank Semantic Search Re

Sergio Arnaud Gomez 2 Jan 28, 2022
Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.

CTC Decoding Algorithms Update 2021: installable Python package Python implementation of some common Connectionist Temporal Classification (CTC) decod

Harald Scheidl 736 Jan 3, 2023
Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx

Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge Correlation Explanation (CorEx) is a topic model that yields rich topics tha

Greg Ver Steeg 592 Dec 18, 2022
Generate custom detailed survey paper with topic clustered sections and proper citations, from just a single query in just under 30 mins !!

Auto-Research A no-code utility to generate a detailed well-cited survey with topic clustered sections (draft paper format) and other interesting arti

Sidharth Pal 20 Dec 14, 2022
Topic Modelling for Humans

gensim – Topic Modelling in Python Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Targ

RARE Technologies 13.8k Jan 2, 2023
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 37 Nov 6, 2022
NLP topic model LDA - Gathered from New York Times website

NLP topic model LDA - Gathered from New York Times website

null 1 Oct 14, 2021
Topic Inference with Zeroshot models

zeroshot_topics Table of Contents Installation Usage License Installation zeroshot_topics is distributed on PyPI as a universal wheel and is available

Rita Anjana 55 Nov 28, 2022
Blue Brain text mining toolbox for semantic search and structured information extraction

Blue Brain Search Source Code DOI Data & Models DOI Documentation Latest Release Python Versions License Build Status Static Typing Code Style Securit

The Blue Brain Project 29 Dec 1, 2022
Bpe algorithm can finetune tokenizer - Bpe algorithm can finetune tokenizer

bpe_algorithm_can_finetune_tokenizer: this is an implementation for https://github

张博 1 Feb 2, 2022
Improving Elasticsearch for semantic search using Sentence Embeddings

Improving Elasticsearch for semantic search using Sentence Embeddings In this article I will use the pretrained model SimCS

Vo Van Phuc 18 Nov 25, 2022
txtai: Build AI-powered semantic search applications in Go

txtai: Build AI-powered semantic search applications in Go txtai executes machine-learning workflows to transform data and build AI-powered semantic s

NeuML 49 Dec 6, 2022
Create a semantic search engine with a neural network (i.e. BERT) whose knowledge base can be updated

Create a semantic search engine with a neural network (i.e. BERT) whose knowledge base can be updated. This engine can later be used for downstream tasks in NLP such as Q&A, summarization, generation, and natural language understanding (NLU).

Diego 1 Mar 20, 2022