Topic Modelling for Humans

Overview

gensim – Topic Modelling in Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) communities.

Features

  • All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core).
  • Intuitive interfaces
    • easy to plug in your own input corpus/datastream (trivial streaming API)
    • easy to extend with other Vector Space algorithms (trivial transformation API)
  • Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning.
  • Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
  • Extensive documentation and Jupyter Notebook tutorials.

If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.

Installation

This software depends on NumPy and SciPy, two Python packages for scientific computing. You must have them installed prior to installing gensim.

It is also recommended you install a fast BLAS library before installing NumPy. This is optional, but using an optimized BLAS such as ATLAS or OpenBLAS is known to improve performance by as much as an order of magnitude. On OS X, NumPy picks up the BLAS that comes with it automatically, so you don’t need to do anything special.

Install the latest version of gensim:

    pip install --upgrade gensim

Or, if you have instead downloaded and unzipped the source tar.gz package:

    python setup.py install

For alternative modes of installation, see the documentation.

Gensim is being continuously tested under Python 3.6, 3.7 and 3.8. Support for Python 2.7 was dropped in gensim 4.0.0 – install gensim 3.8.3 if you must use Python 2.7.

How come gensim is so fast and memory efficient? Isn’t it pure Python, and isn’t Python slow and greedy?

Many scientific algorithms can be expressed in terms of large matrix operations (see the BLAS note above). Gensim taps into these low-level BLAS libraries, by means of its dependency on NumPy. So while gensim-the-top-level-code is pure Python, it actually executes highly optimized Fortran/C under the hood, including multithreading (if your BLAS is so configured).

Memory-wise, gensim makes heavy use of Python’s built-in generators and iterators for streamed data processing. Memory efficiency was one of gensim’s design goals, and is a central feature of gensim, rather than something bolted on as an afterthought.
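To illustrate the streaming idea, here is a minimal sketch in plain Python that makes no gensim calls. The class name StreamedCorpus and the toy in-memory lines are assumptions made for this example. The point is only that a corpus can be any iterable that yields one document at a time, so a single document is held in memory at once.

```python
from collections import Counter


class StreamedCorpus:
    """Hypothetical corpus wrapper: yields one tokenized document at a time."""

    def __init__(self, lines):
        # `lines` can be any iterable of strings, e.g. an open file handle.
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            # Tokenize lazily; nothing is accumulated between iterations.
            yield line.lower().split()


def count_tokens(corpus):
    """Consume the stream once, keeping only the running counts in memory."""
    counts = Counter()
    for document in corpus:
        counts.update(document)
    return counts


raw_lines = ["Human machine interface", "Graph of trees", "Human graph"]
token_counts = count_tokens(StreamedCorpus(raw_lines))
print(token_counts["human"], token_counts["graph"])
```

Replacing raw_lines with an open file handle would stream a corpus of any size through the same code.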

Documentation

Support

Ask open-ended or research questions on the Gensim Mailing List.

Raise bugs on GitHub, but make sure you follow the issue template. Issues that are not bugs or that fail to follow the issue template will be closed without inspection.


Adopters

Company | Industry | Use of Gensim
RARE Technologies | ML & NLP consulting | Creators of Gensim – this is us!
Amazon | Retail | Document similarity.
National Institutes of Health | Health | Processing grants and publications with word2vec.
Cisco Security | Security | Large-scale fraud detection.
Mindseye | Legal | Similarities in legal documents.
Channel 4 | Media | Recommendation engine.
Talentpair | HR | Candidate matching in high-touch recruiting.
Juju | HR | Provide non-obvious related job suggestions.
Tailwind | Media | Post interesting and relevant content to Pinterest.
Issuu | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about.
Search Metrics | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation.
12K Research | Media | Document similarity analysis on media articles.
Stillwater Supercomputing | Hardware | Document comprehension and association with word2vec.
SiteGround | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA.
Capital One | Finance | Topic modeling for customer complaints exploration.

Citing gensim

When citing gensim in academic papers and theses, please use this BibTeX entry:

@inproceedings{rehurek_lrec,
      title = {{Software Framework for Topic Modelling with Large Corpora}},
      author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
      booktitle = {{Proceedings of the LREC 2010 Workshop on New
           Challenges for NLP Frameworks}},
      pages = {45--50},
      year = 2010,
      month = May,
      day = 22,
      publisher = {ELRA},
      address = {Valletta, Malta},
      note={\url{http://is.muni.cz/publication/884893/en}},
      language={English}
}
Comments
  • NMF metrics and wikipedia

    Add clean up and fixes on top of #2361:

    opened by anotherbugmaster 113
  • File-based fast training for Any2Vec models

    Tutorial explaining the whats & hows: Jupyter notebook

    note: all preliminary discussions are in https://github.com/RaRe-Technologies/gensim/pull/2048

    This PR summarizes all my work during GSoC 2018. To better understand what's going on, follow the links:

    • My proposal: https://persiyanov.github.io/jekyll/update/2018/04/24/accepted-to-gsoc-2018.html
    • First benchmarks: https://persiyanov.github.io/jekyll/update/2018/05/28/gsoc-first-weeks.html
    • Last blog post about the almost final solution: https://persiyanov.github.io/2018/07/06/gsoc-midreport.html
    • Links to all benchmarks: https://gist.github.com/persiyanov/84b806233947e0069a243433579b35db
    • Previous PR about vocab building : https://github.com/RaRe-Technologies/gensim/pull/2078 (reverted these changes in current PR because of API design issues)
    • Previous PR about multistream training (all useful changes in this PR): https://github.com/RaRe-Technologies/gensim/pull/2048

    Summary

    This pull request adds a new corpus_file argument to the Word2Vec, FastText and Doc2Vec models. Use corpus_file instead of the standard sentences argument if you have the preprocessed dataset on disk and want a significant speedup during model training.

    In our benchmarks, training Word2Vec on an English Wikipedia dump is 370% faster with corpus_file than with sentences (see the attached Jupyter notebook with the code).

    Look at this chart for Word2Vec: word2vec_file_scaling

    Usage

    The usage is really simple. I'll provide examples for Word2Vec while the usage for FastText and Doc2Vec is identical. The corpus_file argument is supported for:

    Constructor

    # Standard way
    model = Word2Vec(sentences=my_corpus, <...other arguments...>)
    
    # New way
    model = Word2Vec(corpus_file='my_corpus_saved.txt', <...other arguments...>)
    
    # You can save your own corpus using
    gensim.utils.save_as_line_sentence(my_corpus, 'my_corpus_saved.txt')
    
    

    build_vocab

    # Create the model without training
    model = Word2Vec(<...other arguments...>)
    
    # Standard way
    model.build_vocab(sentences=my_corpus, ...)
    
    # New way
    model.build_vocab(corpus_file='my_corpus_saved.txt', ...)
    

    train

    # Create the model without training
    model = Word2Vec(<...other arguments...>)
    
    # Build vocab (with `sentences` or `corpus_file` way, choose what you like)
    model.build_vocab(corpus_file='my_corpus_saved.txt')
    
    # Train the model (old way)
    model.train(sentences=my_corpus, total_examples=model.corpus_count, ...)
    
    # Train the model (new way)
    model.train(corpus_file='my_corpus_saved.txt', total_words=model.corpus_total_words, ...)
    

    That's it! Everything else remains the same as before.
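For readers unfamiliar with the on-disk format that corpus_file expects (one sentence per line, tokens separated by whitespace), the following plain-Python sketch writes and reads such a file. The helper names write_line_sentences and read_line_sentences are hypothetical and only mimic the format. In practice you would use gensim.utils.save_as_line_sentence as shown above.

```python
import os
import tempfile


def write_line_sentences(sentences, path):
    """Hypothetical helper: one sentence per line, tokens joined by spaces."""
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in sentences:
            handle.write(" ".join(tokens) + "\n")


def read_line_sentences(path):
    """Stream the file back as lists of tokens, one list per line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.split()


my_corpus = [["hello", "world"], ["topic", "modelling", "for", "humans"]]
corpus_path = os.path.join(tempfile.mkdtemp(), "my_corpus_saved.txt")
write_line_sentences(my_corpus, corpus_path)
round_trip = list(read_line_sentences(corpus_path))
print(round_trip == my_corpus)
```

Because tokens are split on whitespace, any token that itself contains a space would not survive the round trip.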

    Details

    Firstly, let me describe the standard approach to train *2Vec models:

    1. A user provides input data stream (python iterable object)
    2. One job_producer python thread is created. This thread reads data from the input stream and pushes batches into the python threading.Queue (job_queue).
    3. Several worker threads pull batches from job_queue and perform model updates. Batches are python lists of lists of tokens. They are first translated into C structures and then a model update is performed without GIL.

    This approach lets model updates scale linearly, but producing batches (everything from reading the input to filling C structures from Python objects) is a bottleneck in this pipeline.

    We can't optimize batch generation for an arbitrary Python stream with custom user logic. Instead, we optimized only the case where the data is stored on disk in the gensim.models.word2vec.LineSentence format (one sentence per line, words separated by whitespace).

    This restriction allowed us to read the data directly at the C++ level without the GIL and then immediately perform model updates. The result is linear scaling during training.
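The standard pipeline from steps 1 to 3 can be sketched with the standard library alone. This illustrates the producer/worker pattern and is not gensim's actual code: the batch size, the sentinel value and the word count used as a stand-in for a model update are all assumptions.

```python
import queue
import threading

BATCH_SIZE = 2   # assumed batch size, for illustration only
NUM_WORKERS = 3
STOP = None      # sentinel telling a worker to exit


def job_producer(sentences, job_queue):
    """Read the input stream and push batches (lists of token lists)."""
    batch = []
    for sentence in sentences:
        batch.append(sentence)
        if len(batch) == BATCH_SIZE:
            job_queue.put(batch)
            batch = []
    if batch:
        job_queue.put(batch)
    # One sentinel per worker so that every worker terminates.
    for _ in range(NUM_WORKERS):
        job_queue.put(STOP)


def worker(job_queue, results, lock):
    """Pull batches and apply a stand-in for the model update."""
    while True:
        batch = job_queue.get()
        if batch is STOP:
            break
        words_in_batch = sum(len(sentence) for sentence in batch)
        with lock:
            results.append(words_in_batch)


sentences = [["a", "b"], ["c"], ["d", "e", "f"], ["g"], ["h", "i"]]
job_queue = queue.Queue(maxsize=4)
results, lock = [], threading.Lock()

workers = [threading.Thread(target=worker, args=(job_queue, results, lock))
           for _ in range(NUM_WORKERS)]
for thread in workers:
    thread.start()
producer = threading.Thread(target=job_producer, args=(sentences, job_queue))
producer.start()
producer.join()
for thread in workers:
    thread.join()

total_words = sum(results)
print(total_words)
```

In this layout, all reading and batching happens in the single producer thread, which is where the bottleneck described above arises.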

    opened by persiyanov 102
  • Use FastSS for fast kNN over Levenshtein distance

    Introduction

    The LevenshteinSimilarityIndex term similarity index in the termsim.levenshtein module implements the lexical text similarity search technique described by Charlet and Damnati (2017) in their paper describing their winning system at SemEval-2017 Task 3: Community Question Answering.

    We are showing a related semantic similarity search technique using the WordEmbeddingSimilarityIndex term similarity index in our Soft Cosine Similarity autoexample, which enjoys some popularity among our users. We would like to also advertise LevenshteinSimilarityIndex, which provides a different but equally useful kind of search. However, the current implementation uses brute-force kNN search over the Levenshtein distance to produce a term similarity matrix, which is so slow that it can take years to produce a matrix even for medium-sized corpora such as the English Wikipedia.

    Following the discussion in https://github.com/RaRe-Technologies/gensim/issues/2541, @piskvorky and I implemented indexing using the FastSS algorithm for kNN search over the Levenshtein distance in hopes that this would speed LevenshteinSimilarityIndex up by at least three orders of magnitude (1,000×), so that it can compete with WordEmbeddingSimilarityIndex. As an added bonus, using the FastSS algorithm allows us to remove our external dependence on the python-Levenshtein library.
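As background, the core idea of FastSS can be sketched in a few lines: index every vocabulary word under all strings obtained by deleting up to k characters, then look up the same deletion variants of the query and verify the candidates with a real edit-distance computation. The sketch below is a simplified illustration with hypothetical function names, not the implementation from this pull request.

```python
from itertools import combinations


def deletion_variants(word, max_dist):
    """All strings obtainable from `word` by deleting up to `max_dist` characters."""
    variants = {word}
    for num_deleted in range(1, min(max_dist, len(word)) + 1):
        for positions in combinations(range(len(word)), num_deleted):
            skip = set(positions)
            variants.add("".join(ch for i, ch in enumerate(word) if i not in skip))
    return variants


def levenshtein(a, b):
    """Plain dynamic-programming edit distance, used to verify candidates."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def build_index(vocabulary, max_dist):
    """Map every deletion variant to the vocabulary words that produce it."""
    index = {}
    for word in vocabulary:
        for variant in deletion_variants(word, max_dist):
            index.setdefault(variant, set()).add(word)
    return index


def query(index, term, max_dist):
    """Return (distance, word) pairs within `max_dist` edits, nearest first."""
    candidates = set()
    for variant in deletion_variants(term, max_dist):
        candidates |= index.get(variant, set())
    scored = [(levenshtein(term, word), word) for word in candidates]
    return sorted((dist, word) for dist, word in scored if dist <= max_dist)


vocabulary = ["graph", "grape", "trees", "tree", "human"]
index = build_index(vocabulary, max_dist=2)
neighbors = query(index, "grap", max_dist=2)
print(neighbors)
```

The index trades memory for speed: each query computes the edit distance only against the few words that share a deletion variant with it, instead of against the whole vocabulary.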

    Speed comparison

    Below, I will show a before-and-after speed comparison of LevenshteinSimilarityIndex compared to the standard WordEmbeddingSimilarityIndex shown in the Soft Cosine Similarity autoexample. We are measuring how many kNN searches per second, k = 100, a term similarity index can perform. To produce my dictionary (253,854 terms) and word embeddings, I will use the text8 corpus (100 MB). I am running the code on a Dell Inspiron 15 7559.

    Before the change

    We can see that even with our tiny corpus, the LevenshteinSimilarityIndex takes over three days to find the 100 nearest neighbors for all 253,854 terms in our vocabulary. Contrast this with the WordEmbeddingSimilarityIndex, which finishes in under four minutes even though we are using exact nearest-neighbor search and we could get further speed-up by using e.g. the Annoy index.

    $ pip install gensim==4.0.1 python-Levenshtein
    $ wget http://mattmahoney.net/dc/text8.zip
    $ unzip text8.zip
    $ python
    >>> from gensim.corpora import Dictionary
    >>> from gensim.models.word2vec import LineSentence, Word2Vec
    >>> from gensim.similarities import (
    ...     SparseTermSimilarityMatrix,
    ...     WordEmbeddingSimilarityIndex,
    ...     LevenshteinSimilarityIndex,
    ... )
    >>> 
    >>> corpus = LineSentence('text8')
    >>> dictionary = Dictionary(corpus)
    >>> w2v_model = Word2Vec(sentences=corpus)
    >>> embedding_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
    >>> levenshtein_index = LevenshteinSimilarityIndex(dictionary)
    >>>
    >>> SparseTermSimilarityMatrix(embedding_index, dictionary)
    100%|███████████████████████████████| 253854/253854 [04:04<00:00, 1037.97it/s]
    >>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
      0%|                               | 124/253854 [02:24<80:18:05,  1.14s/it]
    

    After the change

    With the FastSS algorithm, the LevenshteinSimilarityIndex receives a 1,500× speed-up and is now not only not slower than the WordEmbeddingSimilarityIndex, but 1.5× faster. Both term similarity indexes now find the 100 nearest neighbors for all 253,854 terms in our vocabulary in under 4 minutes.

    $ pip install lexpy git+https://github.com/witiko/gensim@7054f90
    $ python
    >>> from gensim.corpora import Dictionary
    >>> from gensim.models.word2vec import LineSentence, Word2Vec
    >>> from gensim.similarities import (
    ...     SparseTermSimilarityMatrix,
    ...     WordEmbeddingSimilarityIndex,
    ...     LevenshteinSimilarityIndex,
    ... )
    >>> 
    >>> corpus = LineSentence('text8')
    >>> dictionary = Dictionary(corpus)
    >>> w2v_model = Word2Vec(sentences=corpus)
    >>> embedding_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
    >>> levenshtein_index = LevenshteinSimilarityIndex(dictionary)
    >>>
    >>> SparseTermSimilarityMatrix(embedding_index, dictionary)
    100%|███████████████████████████████| 253854/253854 [03:57<00:00, 1070.14it/s]
    >>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
    100%|███████████████████████████████| 253854/253854 [02:34<00:00, 1639.23it/s]
    

    Conclusion

    Using the FastSS algorithm for kNN search over the Levenshtein distance, we managed to increase the speed of the LevenshteinSimilarityIndex term similarity index by three orders of magnitude (1,500×) on the text8 corpus. As an added bonus, using the FastSS algorithm allowed us to remove our external dependence on the python-Levenshtein library. Closes #2541.
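As a rough sanity check, the speed-ups can be recomputed from the rates and times printed in the progress bars above. The "before" rate is only an estimate taken from the first 124 terms, so the result should be read as an order of magnitude rather than an exact figure. The helper to_seconds is a hypothetical convenience function.

```python
def to_seconds(clock):
    """Convert an 'HH:MM:SS' or 'MM:SS' progress-bar time into seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Figures copied from the progress bars above.
before_seconds_per_term = 1.14     # brute-force rate, estimated from the first 124 terms
after_terms_per_second = 1639.23   # FastSS rate over the full vocabulary

levenshtein_speedup = before_seconds_per_term * after_terms_per_second
embedding_vs_levenshtein = to_seconds("03:57") / to_seconds("02:34")

print(f"Levenshtein speed-up: about {levenshtein_speedup:.0f}x")
print(f"FastSS vs. word embeddings: about {embedding_vs_levenshtein:.2f}x")
```

Both values are consistent with the orders of magnitude quoted in the text: a speed-up in the thousands for LevenshteinSimilarityIndex and roughly 1.5× relative to WordEmbeddingSimilarityIndex.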

    opened by Witiko 70
  • numpy 1.19.2 incompatible with gensim 4.1.0

    Problem description

    When importing gensim I get the following error

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/__init__.py", line 11, in <module>
        from gensim import parsing, corpora, matutils, interfaces, models, similarities, utils  # noqa:F401
      File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/corpora/__init__.py", line 6, in <module>
        from .indexedcorpus import IndexedCorpus  # noqa:F401 must appear before the other classes
      File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/corpora/indexedcorpus.py", line 14, in <module>
        from gensim import interfaces, utils
      File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/interfaces.py", line 19, in <module>
        from gensim import utils, matutils
      File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/matutils.py", line 1024, in <module>
        from gensim._matutils import logsumexp, mean_absolute_difference, dirichlet_expectation
      File "gensim/_matutils.pyx", line 1, in init gensim._matutils
    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    

    Steps/code/corpus to reproduce

    conda create --name=test python=3.7 -y
    conda install -y numpy==1.19.2
    pip install gensim
    

    Versions

    Linux-5.11.0-25-generic-x86_64-with-debian-bullseye-sid Python 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0] Bits 64 NumPy 1.19.2 SciPy 1.7.1

    opened by martinobertoni 58
  • Doc2vec not parallelizing

    Doc2vec does not use all my cores despite my setting workers=8 when I instantiate it.

    My install passes the assert below:

    import gensim.models.doc2vec
    assert gensim.models.doc2vec.FAST_VERSION > -1
    

    Do I have to do something else?

    bug difficulty hard 
    opened by fccoelho 55
  • LSI worker getting "stuck"

    Description

    When building an LsiModel in distributed mode, one of the workers gets "stuck" while orthonormalizing the action matrix. This stalls the whole process of building the model, as the dispatcher hangs on "reached the end of input; now waiting for all remaining jobs to finish".

    Steps/Code/Corpus to Reproduce

    from gensim.models import LsiModel

    # bow and corpus are prepared elsewhere (not shown)
    lsi_model = LsiModel(
        id2word=bow,
        num_topics=300,
        chunksize=5000,
        distributed=True,
    )
    lsi_model.add_documents(corpus)
    

    The LSI dispatcher and workers are initialized in a separate bash script. I have tried with the number of LSI workers set to 16 and 8.

    Gensim version: 3.6.0 Pyro4 version: 4.63

    Expected Results

    Process should run to completion

    Actual Results

    Main script output:

    [2019-01-06 04:04:09,862] [23465] [gensim.models.lsimodel] [INFO] {add_documents:462} updating model with new documents
    [2019-01-06 04:04:09,862] [23465] [gensim.models.lsimodel] [INFO] {add_documents:485} initializing 8 workers
    [2019-01-06 04:05:12,131] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
    [2019-01-06 04:05:12,135] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
    [2019-01-06 04:05:12,497] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #0
    [2019-01-06 04:05:12,541] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #5000
    [2019-01-06 04:06:46,191] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
    [2019-01-06 04:06:46,200] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
    [2019-01-06 04:06:46,618] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #1
    [2019-01-06 04:06:46,682] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #10000
    [2019-01-06 04:08:11,839] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
    [2019-01-06 04:08:11,843] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
    [2019-01-06 04:08:12,561] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #2
    [2019-01-06 04:08:12,786] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #15000
    [2019-01-06 04:09:48,217] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
    [2019-01-06 04:09:48,230] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
    [2019-01-06 04:09:48,700] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #3
    [2019-01-06 04:09:48,786] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #20000
    [2019-01-06 04:09:48,938] [23465] [gensim.models.lsimodel] [INFO] {add_documents:518} reached the end of input; now waiting for all remaining jobs to finish
    

    Output of LSI worker that is stuck:

    2019-01-06 04:04:09,867 - INFO - resetting worker #1
    2019-01-06 04:06:46,705 - INFO - worker #1 received job #208
    2019-01-06 04:06:46,705 - INFO - updating model with new documents
    2019-01-06 04:06:46,705 - INFO - using 100 extra samples and 2 power iterations
    2019-01-06 04:06:46,705 - INFO - 1st phase: constructing (500000, 400) action matrix
    2019-01-06 04:06:48,402 - INFO - orthonormalizing (500000, 400) action matrix
    

    CPU for that LSI worker has been ~100% for >24 hours.

    Versions

    Linux-4.10.0-38-generic-x86_64-with-Ubuntu-16.04-xenial Python 3.5.2 (default, Nov 23 2017, 16:37:01) [GCC 5.4.0 20160609] NumPy 1.15.2 SciPy 1.1.0 gensim 3.6.0 FAST_VERSION 1

    need info 
    opened by robguinness 53
  • Loading fastText binary output to gensim like word2vec

    Facebook's recently open-sourced fastText (https://github.com/facebookresearch/fastText) improves on the word2vec skip-gram model. It uses a similar output format of word-vector key-value pairs, and the similarity calculation is about the same too, but its binary output format differs from the binary format of the C version of word2vec. Do we want to support loading fastText model output in gensim? Thanks.

    need info 
    opened by phunterlau 49
  • Add nmslib indexer

    Hi, I added nmslib indexer.

    Some research shows nmslib is better than annoy indexer. https://erikbern.com/2018/06/17/new-approximate-nearest-neighbor-benchmarks.html https://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/

    This is my first contribution to gensim. If I missed something, please let me know.

    feature 
    opened by masa3141 47
  • Distributed Representations of Sentences and Documents

    A reasonable approximation of the method described in the paper Distributed Representations of Sentences and Documents (http://cs.stanford.edu/~quocle/paragraph_vector.pdf).

    Python isn't my first language, so I don't pretend that I did everything in the most "pythonic" way here, but I pulled out the portions of the word2vec code that needed to be modified into their own methods, then added another class extending word2vec which just modifies those few functions.

    I also don't really know what to do in terms of refactoring cython code to limit code duplication, so doc2vec has its own set of cython functions which are completely independent of the word2vec ones.

    Hope that helps. Tim

    opened by temerick 47
  • Easy import of GloVe vectors using Gensim

    word2vec text embeddings start with a header line giving the number of vectors (tokens) and the number of dimensions in the file. This lets gensim allocate memory up front when loading the model; higher dimensionality means more memory is held. GloVe embeddings files lack this header, so it has to be inserted before gensim can load them.
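One way to insert that header line is sketched below in plain Python. The function name add_word2vec_header and the toy file contents are assumptions for this example. The sketch assumes a well-formed GloVe text file with one token followed by space-separated vector components per line.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Prepend '<num_vectors> <num_dimensions>' to a GloVe text file."""
    num_vectors = 0
    num_dimensions = 0
    # First pass: count vectors and infer dimensionality without loading the file.
    with open(glove_path, encoding="utf-8") as source:
        for line in source:
            if not line.strip():
                continue
            if num_vectors == 0:
                # First field is the token, the rest are vector components.
                num_dimensions = len(line.rstrip().split(" ")) - 1
            num_vectors += 1
    # Second pass: write the header followed by the original lines.
    with open(glove_path, encoding="utf-8") as source, \
            open(output_path, "w", encoding="utf-8") as target:
        target.write(f"{num_vectors} {num_dimensions}\n")
        for line in source:
            if line.strip():
                target.write(line)
    return num_vectors, num_dimensions


workdir = tempfile.mkdtemp()
glove_file = os.path.join(workdir, "toy_glove.txt")
w2v_file = os.path.join(workdir, "toy_word2vec.txt")
with open(glove_file, "w", encoding="utf-8") as handle:
    handle.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

shape = add_word2vec_header(glove_file, w2v_file)
with open(w2v_file, encoding="utf-8") as handle:
    header = handle.readline().strip()
print(header)
```

Reading the file twice keeps memory use constant, which matters for GloVe files with millions of vectors.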

    feature 
    opened by manasRK 46
  • Implement Okapi BM25 variants in Gensim

    This pull request implements the gensim.models.bm25model module, which contains an implementation of the Okapi BM25 model and its modifications (Lucene BM25 and ATIRE) as discussed in https://github.com/RaRe-Technologies/gensim/issues/2592#issuecomment-866799145. The module acts as a replacement for the gensim.summarization.bm25model module deprecated and removed in Gensim 4. The module should supersede the gensim.models.tfidfmodel module as the baseline weighting function for information retrieval and related NLP tasks.

    Most implementations of BM25 such as the rank-bm25 library combine indexing with weighting and often forgo dictionary building for a speed improvement at indexing time (but a hefty penalty at retrieval time). To give an example, here is how a user would search for documents with rank-bm25:

    >>> from rank_bm25 import BM25Okapi
    >>>
    >>> corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
    >>> bm25_model = BM25Okapi(corpus)
    >>>
    >>> query = ["Hello", "bar"]
    >>> similarities = bm25_model.get_scores(query)
    >>> similarities
    
    array([0.51082562, 0.09121886, 0.0638532 ])
    
    >>> best_document, = bm25_model.get_top_n(query, corpus, n=1)
    >>> best_document
    
    ['Hello', 'world']
    

    As you can see, the interface is convenient, but retrieval is slow due to the lack of a dictionary. Furthermore, any advanced operations such as pruning the dictionary, applying semantic matching (e.g. SCM) and query expansion (e.g. RM3), or sharding the index are unavailable.

    By contrast, the gensim.models.bm25model module separates the three operations (dictionary building, weighting, and indexing). To give an example, here is how a user would search for documents with the gensim.models.bm25model module:

    >>> from gensim.corpora import Dictionary
    >>> from gensim.models import TfidfModel, OkapiBM25Model
    >>> from gensim.similarities import SparseMatrixSimilarity
    >>> import numpy as np
    >>>
    >>> corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
    >>> dictionary = Dictionary(corpus)
    >>> bm25_model = OkapiBM25Model(dictionary=dictionary)
    >>> bm25_corpus = bm25_model[list(map(dictionary.doc2bow, corpus))]
    >>> bm25_index = SparseMatrixSimilarity(bm25_corpus, num_docs=len(corpus), num_terms=len(dictionary),
    ...                                     normalize_queries=False, normalize_documents=False)
    >>>
    >>> query = ["Hello", "bar"]
    >>> tfidf_model = TfidfModel(dictionary=dictionary, smartirs='bnn')  # Enforce binary weighting of queries
    >>> tfidf_query = tfidf_model[dictionary.doc2bow(query)]
    >>>
    >>> similarities = bm25_index[tfidf_query]
    >>> similarities
    
    array([0.51082563, 0.09121886, 0.0638532 ], dtype=float32)
    
    >>> best_document = corpus[np.argmax(similarities)]
    >>> best_document
    
    ['Hello', 'world']
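For reference, the BM25 weighting behind both snippets can be sketched in plain Python. This is a minimal illustration, not the code from this pull request. It assumes the always-positive IDF log(1 + (N - n + 0.5) / (n + 0.5)) and the common defaults k1=1.5 and b=0.75, so the absolute scores differ from the arrays shown above while the ranking is the same.

```python
import math
from collections import Counter


def bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against `query`."""
    num_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / num_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in set(query):
            if term not in term_freq:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
scores = bm25_scores(corpus, ["Hello", "bar"])
best_document = corpus[scores.index(max(scores))]
print(best_document)
```

The rare term "Hello" receives a higher IDF than the common term "bar", which is why the first document ranks highest.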
    

    Tasks:

    • [x] Add Okapi BM25, ~~BM25L and BM25+~~ [1, 2], Lucene BM25 [3, 4], and ATIRE BM25 [3, 5].
    • [x] Add comments and docstrings to models.bm25.
    • [x] Add comments and docstrings to similarities.docsim.
    • [x] Add BM25 to the run_topics_and_transformations autoexample.
    • [x] Add normalize_queries=True, normalize_documents=True named parameters to SparseMatrixSimilarity, DenseMatrixSimilarity, and SoftCosineSimilarity classes as discussed in https://github.com/RaRe-Technologies/gensim/pull/3304#issuecomment-1061031969 and on the Gensim mailing list. Deprecate the normalize named parameter of SoftCosineSimilarity. Add normalize_queries=False, normalize_documents=False to TF-IDF and BM25 examples.
    opened by Witiko 44
  • Unnecessary dependency on FuzzyTM pulls in many libraries

    Problem description

    I'm trying to upgrade to the new Gensim 4.3.0 release. My colleague @juhoinkinen noticed in https://github.com/NatLibFi/Annif/pull/660 that Gensim 4.3.0 pulls in more dependencies than the previous release 4.2.0, including pandas. I suspect that at least the FuzzyTM dependency (which in turn pulls in pandas) is actually unused and thus unnecessary.

    Steps/code/corpus to reproduce

    Installing Gensim 4.2.0 into an empty venv (only four packages installed):

    $ pip install gensim==4.2.0
    Collecting gensim==4.2.0
      Downloading gensim-4.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.0 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.0/24.0 MB 2.0 MB/s eta 0:00:00
    Collecting scipy>=0.18.1
      Downloading scipy-1.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 3.3 MB/s eta 0:00:00
    Collecting numpy>=1.17.0
      Downloading numpy-1.24.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 10.6 MB/s eta 0:00:00
    Collecting smart-open>=1.8.1
      Downloading smart_open-6.3.0-py3-none-any.whl (56 kB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.8/56.8 KB 9.7 MB/s eta 0:00:00
    Installing collected packages: smart-open, numpy, scipy, gensim
    Successfully installed gensim-4.2.0 numpy-1.24.1 scipy-1.10.0 smart-open-6.3.0
    

    Installing Gensim 4.3.0 into an empty venv (18 packages installed):

    $ pip install gensim==4.3.0
    Collecting gensim==4.3.0
      Downloading gensim-4.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.1 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.1/24.1 MB 6.9 MB/s eta 0:00:00
    
    [...skipping downloads...]
    
    Installing collected packages: pytz, urllib3, smart-open, six, numpy, idna, charset-normalizer, certifi, scipy, requests, python-dateutil, simpful, pandas, miniful, fst-pso, pyfume, FuzzyTM, gensim
      Running setup.py install for miniful ... done
      Running setup.py install for fst-pso ... done
    Successfully installed FuzzyTM-2.0.5 certifi-2022.12.7 charset-normalizer-2.1.1 fst-pso-1.8.1 gensim-4.3.0 idna-3.4 miniful-0.0.6 numpy-1.24.1 pandas-1.5.2 pyfume-0.2.25 python-dateutil-2.8.2 pytz-2022.7 requests-2.28.1 scipy-1.10.0 simpful-2.9.0 six-1.16.0 smart-open-6.3.0 urllib3-1.26.13
    

    The size of the venv has grown from 249MB to 318MB, an increase of 69MB.

    Here is what pipdeptree shows - FuzzyTM appears to be the main reason why so many libraries are pulled in:

    gensim==4.3.0
      - FuzzyTM [required: >=0.4.0, installed: 2.0.5]
        - numpy [required: Any, installed: 1.24.1]
        - pandas [required: Any, installed: 1.5.2]
          - numpy [required: >=1.21.0, installed: 1.24.1]
          - python-dateutil [required: >=2.8.1, installed: 2.8.2]
            - six [required: >=1.5, installed: 1.16.0]
          - pytz [required: >=2020.1, installed: 2022.7]
        - pyfume [required: Any, installed: 0.2.25]
          - fst-pso [required: Any, installed: 1.8.1]
            - miniful [required: Any, installed: 0.0.6]
              - numpy [required: >=1.12.0, installed: 1.24.1]
              - scipy [required: >=1.0.0, installed: 1.10.0]
                - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
            - numpy [required: Any, installed: 1.24.1]
          - numpy [required: Any, installed: 1.24.1]
          - scipy [required: Any, installed: 1.10.0]
            - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
          - simpful [required: Any, installed: 2.9.0]
            - numpy [required: >=1.12.0, installed: 1.24.1]
            - requests [required: Any, installed: 2.28.1]
              - certifi [required: >=2017.4.17, installed: 2022.12.7]
              - charset-normalizer [required: >=2,<3, installed: 2.1.1]
              - idna [required: >=2.5,<4, installed: 3.4]
              - urllib3 [required: >=1.21.1,<1.27, installed: 1.26.13]
            - scipy [required: >=1.0.0, installed: 1.10.0]
              - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
        - scipy [required: Any, installed: 1.10.0]
          - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
      - numpy [required: >=1.18.5, installed: 1.24.1]
      - scipy [required: >=1.7.0, installed: 1.10.0]
        - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
      - smart-open [required: >=1.8.1, installed: 6.3.0]
    pip==22.0.2
    pipdeptree==2.3.3
    setuptools==59.6.0
    

    It appears that the FuzzyTM dependency was added in PR #3398 (Flsamodel) by @ERijck. The first commits in this PR depended on the library, but a subsequent commit 9fec00b32d281e795f3b4701bf11fa1c97780227 reworked the code so that it doesn't need to import FuzzyTM at all. However, the dependency in setup.py was never removed and is still there: https://github.com/RaRe-Technologies/gensim/blob/f35faae7a7b0c3c8586fb61208560522e37e0e7e/setup.py#L347

    I think the FuzzyTM dependency could be safely dropped, as the library is not actually imported. It would reduce the number of libraries Gensim pulls in and thus reduce the size of installations, including Docker images where minimal size is often required.
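A quick way to check whether an installed distribution still declares such a dependency is to read its metadata with the standard library (Python 3.8 or later). The helper names below are hypothetical, and the sketch only inspects declared requirements. It does not tell you whether the dependency is actually imported.

```python
import re
from importlib import metadata


def declared_requirements(package_name):
    """Return the requirement strings a distribution declares, or [] if absent."""
    try:
        requirements = metadata.requires(package_name)
    except metadata.PackageNotFoundError:
        return []
    return list(requirements or [])


def declares_dependency(package_name, dependency_name):
    """True if `package_name` lists `dependency_name` among its requirements."""
    wanted = dependency_name.lower()
    for requirement in declared_requirements(package_name):
        # The project name is the leading run of name characters.
        match = re.match(r"[A-Za-z0-9._-]+", requirement)
        if match and match.group(0).lower() == wanted:
            return True
    return False


print(declares_dependency("gensim", "FuzzyTM"))
```

If gensim is not installed in the current environment, the function simply returns False.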

    Versions

    I'm using Ubuntu Linux 22.04.

    Linux-5.15.0-56-generic-x86_64-with-glibc2.35
    Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
    Bits 64
    NumPy 1.24.1
    SciPy 1.10.0
    gensim 4.3.0
    FAST_VERSION 0

    bug difficulty easy impact HIGH reach HIGH 
    opened by osma 3
  • Gensim LdaMulticore can't work on cloud function

    Problem description

    I want to use the gensim LDA module in a cloud function, but it times out and shows "/layers/google.python.pip/pip/lib/python3.8/site-packages/past/builtins/misc.py:45: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses".

    The same code works on Colab (Python 3.8.16), and I didn't find any bug in it. In the cloud function it prints 'LDA1' and 'LDA2', then times out.

    Steps/code/corpus to reproduce

    1. I have tried different Python versions: 3.10, 3.8 and 3.7.

    2. I added import warnings; warnings.filterwarnings("ignore", category=DeprecationWarning).

    3. It works on Colab, where 300 texts take just 10 seconds, but I need it to work in a cloud function.

    import time
    import gensim

    def LDA(corpus, dictionary, NumTopic):
        # Progress markers: in the cloud function 'LDA1' and 'LDA2' are printed, then it times out.
        print('LDA1')
        time1 = time.time()
        print('LDA2')
        lda = gensim.models.LdaMulticore(
            corpus=corpus, id2word=dictionary, num_topics=NumTopic,
            chunksize=1000, iterations=200, passes=20,
            per_word_topics=False, random_state=100)
        print('LDA3')
        corpus_lda = lda[corpus]  # apply the trained model to the corpus
        print("LDA takes %2.2f seconds." % (time.time() - time1))
        return lda, corpus_lda
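
The print markers in the function above show roughly where execution stops. A small timing helper built only on the standard library can record the same information together with elapsed times, which may be easier to read in a cloud function's logs. This is a sketch for narrowing the problem down, not a fix; timed and the timings dictionary are hypothetical names.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("lda_timing")


@contextmanager
def timed(label, timings=None):
    """Log when a step starts and how long it ran, even if it raises."""
    logger.info("start %s", label)
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if timings is not None:
            timings[label] = elapsed  # keep the measurement for later inspection
        logger.info("left %s after %.2f s", label, elapsed)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    measured = {}
    with timed("dummy step", measured):
        time.sleep(0.05)  # stand-in for the real work
    print(sorted(measured))
```

Wrapping the LdaMulticore call and the lda[corpus] step in separate with timed(...) blocks would show which of the two never finishes.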
    

    Versions

    Please provide the output of:

    from __future__ import unicode_literals
    import base64
    import importlib
    import re
    import os
    import sys
    import numpy as np
    import pandas as pd
    import gensim
    import gensim.corpora as corpora
    from gensim.utils import simple_preprocess
    from gensim.models import CoherenceModel
    from gensim import corpora, models, similarities
    from google.cloud import bigquery
    import pandas_gbq
    import requests
    import tqdm
    import json
    import pyLDAvis
    import pyLDAvis.gensim_models
    import matplotlib.pyplot as plt
    import logging
    import time
    
    opened by tinac5519 1
  • pip install gensim==4.2.0 raises deprecation warning

    When installing gensim in a fresh environment, I get the following warnings, and the build then fails with a compiler error. Sorry, it is a lot of output. The command is pip3 install gensim==4.2.0; my pip version is 22.3.1 and my Python version is 3.11 (see below).

    Because of character limits, I have cut chunks out of the output, keeping mainly the warnings:

          ...
          running egg_info
          writing gensim.egg-info/PKG-INFO
          writing dependency_links to gensim.egg-info/dependency_links.txt
          writing requirements to gensim.egg-info/requires.txt
          writing top-level names to gensim.egg-info/top_level.txt
          reading manifest file 'gensim.egg-info/SOURCES.txt'
          reading manifest template 'MANIFEST.in'
          warning: no files found matching 'COPYING.LESSER'
          warning: no files found matching 'ez_setup.py'
          warning: no files found matching 'gensim/models/doc2vec_inner.c'
          adding license file 'COPYING'
          writing manifest file 'gensim.egg-info/SOURCES.txt'
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.DTM' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.DTM' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.DTM' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.DTM' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.PathLineSentences' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.PathLineSentences' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.PathLineSentences' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.PathLineSentences' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.old_d2v_models' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.old_d2v_models' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.old_d2v_models' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.old_d2v_models' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.old_w2v_models' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.old_w2v_models' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.old_w2v_models' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.old_w2v_models' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          copying gensim/_matutils.c -> build/lib.macosx-10.9-universal2-cpython-311/gensim
          copying gensim/_matutils.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim
          ...
          copying gensim/corpora/_mmreader.c -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
          copying gensim/corpora/_mmreader.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
          running build_ext
          building 'gensim.models.word2vec_inner' extension
          creating build/temp.macosx-10.9-universal2-cpython-311
          creating build/temp.macosx-10.9-universal2-cpython-311/gensim
          creating build/temp.macosx-10.9-universal2-cpython-311/gensim/models
          clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.11/include/python3.11 -I/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include -c gensim/models/word2vec_inner.c -o build/temp.macosx-10.9-universal2-cpython-311/gensim/models/word2vec_inner.o
          In file included from gensim/models/word2vec_inner.c:706:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/arrayobject.h:5:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarrayobject.h:12:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarraytypes.h:1948:
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: "Using deprecated NumPy API, disable it with "          "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
          #warning "Using deprecated NumPy API, disable it with " \
           ^
          gensim/models/word2vec_inner.c:12424:5: error: incomplete definition of type 'struct _frame'
              __Pyx_PyFrame_SetLineNumber(py_frame, py_line);
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          gensim/models/word2vec_inner.c:457:62: note: expanded from macro '__Pyx_PyFrame_SetLineNumber'
            #define __Pyx_PyFrame_SetLineNumber(frame, lineno)  (frame)->f_lineno = (lineno)
                                                                ~~~~~~~^
          /Library/Frameworks/Python.framework/Versions/3.11/include/python3.11/pytypedefs.h:22:16: note: forward declaration of 'struct _frame'
          typedef struct _frame PyFrameObject;
                         ^
          1 warning and 1 error generated.
          error: command '/usr/bin/clang' failed with exit code 1
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for gensim
      Running setup.py clean for gensim
    Failed to build gensim
    Installing collected packages: gensim
      Running setup.py install for gensim ... error
      error: subprocess-exited-with-error
      
      × Running setup.py install for gensim did not run successfully.
      │ exit code: 1
      ╰─> [540 lines of output]
          running install
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
            warnings.warn(
          running build
          running build_py
          creating build
          creating build/lib.macosx-10.9-universal2-cpython-311
          creating build/lib.macosx-10.9-universal2-cpython-311/gensim
          ...
          copying gensim/corpora/svmlightcorpus.py -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
          copying gensim/corpora/hashdictionary.py -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
          running egg_info
          writing gensim.egg-info/PKG-INFO
          writing dependency_links to gensim.egg-info/dependency_links.txt
          writing requirements to gensim.egg-info/requires.txt
          writing top-level names to gensim.egg-info/top_level.txt
          reading manifest file 'gensim.egg-info/SOURCES.txt'
          reading manifest template 'MANIFEST.in'
          warning: no files found matching 'COPYING.LESSER'
          warning: no files found matching 'ez_setup.py'
          warning: no files found matching 'gensim/models/doc2vec_inner.c'
          adding license file 'COPYING'
          writing manifest file 'gensim.egg-info/SOURCES.txt'
          ...
          copying gensim/_matutils.c -> build/lib.macosx-10.9-universal2-cpython-311/gensim
          copying gensim/_matutils.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim
         ...
          copying gensim/corpora/_mmreader.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
          running build_ext
          building 'gensim.models.word2vec_inner' extension
          creating build/temp.macosx-10.9-universal2-cpython-311
          creating build/temp.macosx-10.9-universal2-cpython-311/gensim
          creating build/temp.macosx-10.9-universal2-cpython-311/gensim/models
          clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.11/include/python3.11 -I/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include -c gensim/models/word2vec_inner.c -o build/temp.macosx-10.9-universal2-cpython-311/gensim/models/word2vec_inner.o
          In file included from gensim/models/word2vec_inner.c:706:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/arrayobject.h:5:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarrayobject.h:12:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarraytypes.h:1948:
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: "Using deprecated NumPy API, disable it with "          "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
          #warning "Using deprecated NumPy API, disable it with " \
           ^
          gensim/models/word2vec_inner.c:12424:5: error: incomplete definition of type 'struct _frame'
              __Pyx_PyFrame_SetLineNumber(py_frame, py_line);
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          gensim/models/word2vec_inner.c:457:62: note: expanded from macro '__Pyx_PyFrame_SetLineNumber'
            #define __Pyx_PyFrame_SetLineNumber(frame, lineno)  (frame)->f_lineno = (lineno)
                                                                ~~~~~~~^
          /Library/Frameworks/Python.framework/Versions/3.11/include/python3.11/pytypedefs.h:22:16: note: forward declaration of 'struct _frame'
          typedef struct _frame PyFrameObject;
                         ^
          1 warning and 1 error generated.
          error: command '/usr/bin/clang' failed with exit code 1
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: legacy-install-failure
    
    × Encountered error while trying to install package.
    ╰─> gensim
    
    note: This is an issue with the package mentioned above, not pip.
    hint: See above for output from the failure.
    
    

    Versions

    Please provide the output of:

    >>> import platform; print(platform.platform())
    macOS-12.5-arm64-arm-64bit
    >>> import sys; print("Python", sys.version)
    Python 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)]
    >>> import struct; print("Bits", 8 * struct.calcsize("P"))
    Bits 64
    >>> import numpy; print("NumPy", numpy.__version__)
    NumPy 1.23.5
    >>> import scipy; print("SciPy", scipy.__version__)
    SciPy 1.9.3
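
The log above ends in a compiler error (incomplete definition of type 'struct _frame') while building gensim 4.2.0 from source under Python 3.11, and the 4.3.0 release notes on this page list "Add support for Python 3.11" among the changes. A minimal sketch of a pre-install check based on that one fact follows; predates_py311_support and the threshold constant are illustrative names, not part of gensim or pip.

```python
import sys

# Assumption taken from the 4.3.0 release notes: Python 3.11 support
# first appears in gensim 4.3.0.
FIRST_GENSIM_WITH_PY311 = (4, 3, 0)


def parse_version(text):
    """Turn a plain 'X.Y.Z' string into a tuple of ints (no pre-release tags)."""
    return tuple(int(part) for part in text.split("."))


def predates_py311_support(gensim_version, python_version=None):
    """Return True when the interpreter is 3.11+ and the requested gensim is older than 4.3.0."""
    if python_version is None:
        python_version = sys.version_info[:2]
    return (tuple(python_version) >= (3, 11)
            and parse_version(gensim_version) < FIRST_GENSIM_WITH_PY311)


if __name__ == "__main__":
    print(predates_py311_support("4.2.0"))
```

When the check returns True, choosing gensim 4.3.0 or an older Python interpreter is probably the simpler route than trying to patch the build.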
    
    opened by labouz 2
  • Parameter shardsize ignored on queries

    Problem description

    When I pass the shardsize parameter to the similarities.Similarity constructor, the same parameter does not seem to be applied when I query the index, which causes errors:

    self._similarity_index = similarities.Similarity(MODELS_PATH + f'/{model}', sim_vectors, num_features=len(self._dictionary), shardsize=50000)
    
    sims = self._similarity_index[doc_vector]
    

    image

    PS: If I don't use the parameter shardsize, the error already occurs in the similarities.Similarity call.
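
For readers unfamiliar with sharding, here is a toy pure-Python sketch of the general idea: documents are split into fixed-size shards when the index is built, and a query visits every shard and concatenates the per-shard results. It is not gensim's implementation and says nothing about the cause of the error; ShardedIndex and cosine are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ShardedIndex:
    """Toy in-memory index that stores documents in shards of at most `shardsize` vectors."""

    def __init__(self, vectors, shardsize):
        if shardsize < 1:
            raise ValueError("shardsize must be positive")
        self.shardsize = shardsize
        # Build time: cut the document list into consecutive chunks.
        self.shards = [vectors[i:i + shardsize]
                       for i in range(0, len(vectors), shardsize)]

    def __getitem__(self, query):
        # Query time: score the query against each shard in turn and
        # concatenate, so the result has one entry per indexed document.
        sims = []
        for shard in self.shards:
            sims.extend(cosine(query, doc) for doc in shard)
        return sims


if __name__ == "__main__":
    index = ShardedIndex([[1, 0], [0, 1], [1, 1]], shardsize=2)
    print(len(index.shards), index[[1, 0]])
```

In this toy version only a few vectors are scored at a time, which is the usual motivation for keeping shards small.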

    Steps/code/corpus to reproduce

    Save the .py files in the pruvo folder (package) and the .parquet file in the data folder, then run this script:

    import pandas as pd
    
    from pruvo.embedding import Corpus
    
    df = pd.read_parquet('data/preprocess.parquet')
    
    corpus = Corpus()
    corpus.add(list(df['bookingRoomType'].unique()), pre_processed=True)
    corpus.add(list(df['mappedRoomType'].unique()), pre_processed=True)
    
    w2v = corpus.train(model='word2vec')
    
    w2v_similars = corpus.get_similars('apartment 1 king bed in neverland')
    w2v_similars.head(10)
    

    Versions

    Please provide the output of:

    import platform; print(platform.platform())
    import sys; print("Python", sys.version)
    import struct; print("Bits", 8 * struct.calcsize("P"))
    import numpy; print("NumPy", numpy.__version__)
    import scipy; print("SciPy", scipy.__version__)
    import gensim; print("gensim", gensim.__version__)
    from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
    

    image

    files.zip

    opened by MaickelHubner 0
  • Add parameter from_topn in evaluate_word_analogies

    With from_topn, an analogy is marked correct if the expected word is among the from_topn most similar results, rather than only when it is the single most similar one.

    This is useful for evaluating vectors such as confusion vectors, where an analogy should count as correct if either of the top two results matches.
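
The scoring rule can be illustrated without gensim. Below is a minimal pure-Python sketch, assuming each analogy already has a ranked list of candidate answers; analogy_accuracy is a hypothetical helper, and from_topn mirrors the proposed parameter rather than an existing argument of evaluate_word_analogies.

```python
def analogy_accuracy(ranked_predictions, expected, from_topn=1):
    """Fraction of analogies whose expected word appears in the top `from_topn` candidates.

    ranked_predictions: one list per analogy, most similar candidate first.
    expected: the gold answer for each analogy, in the same order.
    """
    if from_topn < 1:
        raise ValueError("from_topn must be at least 1")
    if len(ranked_predictions) != len(expected):
        raise ValueError("need exactly one expected answer per analogy")
    if not expected:
        return 0.0
    correct = sum(
        1
        for candidates, gold in zip(ranked_predictions, expected)
        if gold in candidates[:from_topn]
    )
    return correct / len(expected)


if __name__ == "__main__":
    # Made-up rankings, purely for illustration.
    predictions = [["queen", "princess"], ["france", "paris"], ["cat", "dog"]]
    gold = ["queen", "paris", "dog"]
    print(analogy_accuracy(predictions, gold, from_topn=1))
    print(analogy_accuracy(predictions, gold, from_topn=2))
```

With from_topn=1 this reduces to the usual strict check, so the default leaves existing scores unchanged.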

    opened by divyanx 3
Releases(4.3.0)
  • 4.3.0(Dec 21, 2022)

    What's Changed

    • Allow overriding the Cython version requirement by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3323
    • Update Python module MANIFEST by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3343
    • Clean up references to Morfessor, tox and gensim.models.wrappers by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3345
    • Disable the Gensim 3=>4 warning in docs by @piskvorky in https://github.com/RaRe-Technologies/gensim/pull/3346
    • pin sphinx versions, add explicit gallery_top label by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3383
    • Declare variables prior to for loop in fastss.pyx for ANSI C compatibility by @hstk30 in https://github.com/RaRe-Technologies/gensim/pull/3378
    • Fix typo in word2vec and KeyedVectors docstrings by @dymil in https://github.com/RaRe-Technologies/gensim/pull/3365
    • Replace np.multiply with np.square and copyedit in translation_matrix.py by @dymil in https://github.com/RaRe-Technologies/gensim/pull/3374
    • Copyedit and fix outdated statements in translation matrix tutorial by @dymil in https://github.com/RaRe-Technologies/gensim/pull/3375
    • Implement Okapi BM25 variants in Gensim by @Witiko in https://github.com/RaRe-Technologies/gensim/pull/3304
    • Giving missing credit in EnsembleLDA to Alex in docs by @sezanzeb in https://github.com/RaRe-Technologies/gensim/pull/3393
    • PERF: pyemd to POT for EMD computation in wmdistance by @TLouf in https://github.com/RaRe-Technologies/gensim/pull/3327
    • Fixed bug in loss computation for Word2Vec with hierarchical softmax by @TalIfargan in https://github.com/RaRe-Technologies/gensim/pull/3397
    • fix deprecation warning from pytest by @martino-vic in https://github.com/RaRe-Technologies/gensim/pull/3354
    • Switch to Cython language level 3 by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3344
    • Implement numpy hack in setup.py to enable install under Poetry by @jaymegordo in https://github.com/RaRe-Technologies/gensim/pull/3363
    • Fixed the broken link in readme.md by @aswin2108 in https://github.com/RaRe-Technologies/gensim/pull/3409
    • Path Coherence Model to correctly handle empty documents by @PrimozGodec in https://github.com/RaRe-Technologies/gensim/pull/3406
    • Add support for Python 3.11 and drop support for Python 3.7 by @acul3 in https://github.com/RaRe-Technologies/gensim/pull/3402
    • clarify runtime expectations by @gojomo in https://github.com/RaRe-Technologies/gensim/pull/3381
    • Fix bug that prevents loading old models by @funasshi in https://github.com/RaRe-Technologies/gensim/pull/3359
    • refactor wheel building and testing workflow by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3410
    • Fixed FastTextKeyedVectors handling in add_vector by @globba in https://github.com/RaRe-Technologies/gensim/pull/3389
    • Flsamodel by @ERijck in https://github.com/RaRe-Technologies/gensim/pull/3398
    • Fix backwards compatibility bug in Word2Vec by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3415
    • fix numpy hack in setup.py by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3416
    • updated changelog for next release by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3412

    New Contributors

    • @hstk30 made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3378
    • @TLouf made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3327
    • @TalIfargan made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3397
    • @martino-vic made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3354
    • @jaymegordo made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3363
    • @aswin2108 made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3409
    • @acul3 made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3402
    • @funasshi made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3359
    • @globba made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3389
    • @ERijck made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3398

    Full Changelog: https://github.com/RaRe-Technologies/gensim/compare/4.2.0...4.3.0

  • 4.2.0(May 1, 2022)

  • 4.1.2(Sep 18, 2021)

    4.1.2, 2021-09-17

    This is a bugfix release that addresses leftover compatibility issues with older versions of numpy and macOS.

    4.1.1, 2021-09-14

    This is a bugfix release that addresses compatibility issues with older versions of numpy.

    4.1.0, 2021-08-15

    Gensim 4.1 brings two major new functionalities:

    • EnsembleLda for stable LDA topics (#2980)
    • FastSS for fast kNN over Levenshtein distance (#3146)

    There are several minor changes that are not backwards compatible with previous versions of Gensim. The affected functionality is rarely used and unlikely to affect most users, so we have opted not to require a major version bump. Nevertheless, we describe them below.

    Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods

    We now handle both positive and negative keyword parameters consistently. Each may now be any of the following:

    1. A string, in which case the value is reinterpreted as a list of one element (the string value)
    2. A vector, in which case the value is reinterpreted as a list of one element (the vector)
    3. A list of strings
    4. A list of vectors

    So you can now simply do:

        model.most_similar(positive='war', negative='peace')

    instead of the slightly more involved

        model.most_similar(positive=['war'], negative=['peace'])

    Both invocations remain correct, so you can use whichever is most convenient. If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.

        model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])

    then you will need to specify the lists explicitly in gensim 4.1.
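
The four accepted input forms listed above can be sketched in plain Python. This is only an illustration of the described behaviour, not gensim's actual implementation: the helper name `normalize_keys` is hypothetical, and a "vector" is approximated here as any non-empty sequence of numbers rather than a NumPy array.

```python
from numbers import Real


def normalize_keys(value):
    """Hypothetical helper: turn any accepted form into a list of keys/vectors."""
    if value is None:
        return []
    if isinstance(value, str):
        # Case 1: a lone string becomes a one-element list (not a list of characters).
        return [value]
    if value and all(isinstance(x, Real) for x in value):
        # Case 2: a lone vector (sequence of numbers) becomes a one-element list.
        return [value]
    # Cases 3 and 4: already a list of strings or a list of vectors.
    return list(value)


print(normalize_keys('war'))            # ['war']
print(normalize_keys(['w', 'a', 'r']))  # ['w', 'a', 'r']
print(normalize_keys([0.1, 0.2]))       # [[0.1, 0.2]]
```

Under these assumptions, `'war'` and `['war']` normalize to the same value, which is why both invocations shown above remain equivalent.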

    Deprecated obsolete steps parameter from doc2vec

    With the newer version, do this:

        model.infer_vector(..., epochs=123)

    instead of this:

        model.infer_vector(..., steps=123)
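
For code that still builds its keyword arguments with the old name, a small shim can do the rename before the call. This is a minimal sketch: `rename_steps_kwarg` is a hypothetical helper and is not part of gensim.

```python
def rename_steps_kwarg(kwargs):
    """Return a copy of kwargs with the old `steps` key renamed to `epochs`."""
    updated = dict(kwargs)
    if "steps" in updated:
        if "epochs" in updated:
            # Refuse ambiguous input instead of silently picking one value.
            raise TypeError("pass either steps or epochs, not both")
        updated["epochs"] = updated.pop("steps")
    return updated


old_style = {"steps": 123}
print(rename_steps_kwarg(old_style))  # {'epochs': 123}
```

The result could then be unpacked into the call, e.g. `model.infer_vector(words, **rename_steps_kwarg(old_style))`, where `words` stands for your own token list.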
    

    Plus a large number of smaller improvements and fixes, as usual.

    ⚠️ If migrating from old Gensim 3.x, read the Migration guide first.

    👍 New features

    • #3169: Implement shrink_windows argument for Word2Vec, by @M-Demay
    • #3163: Optimize word mover distance (WMD) computation, by @flowlight0
    • #3157: New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by @Witiko
    • #3153: Vectorize word2vec.predict_output_word for speed, by @M-Demay
    • #3146: Use FastSS for fast kNN over Levenshtein distance, by @Witiko
    • #3128: Materialize and copy the corpus passed to SoftCosineSimilarity, by @Witiko
    • #3115: Make LSI dispatcher CLI param for number of jobs optional, by @robguinness
    • #3091: LsiModel: Only log top words that actually exist in the dictionary, by @kmurphy4
    • #2980: Added EnsembleLda for stable LDA topics, by @sezanzeb
    • #2978: Optimize performance of Author-Topic model, by @horpto
    • #3000: Tidy up KeyedVectors.most_similar() API, by @simonwiles

    📚 Tutorials and docs

    🔴 Bug fixes

    • #3178: Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by @Witiko
    • #3174: Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by @emgucv
    • #3136: Fix indexing error in word2vec_inner.pyx, by @bluekura
    • #3131: Add missing import to NMF docs and models/__init__.py, by @properGrammar
    • #3116: Fix bug where saved Phrases model did not load its connector_words, by @aloknayak29
    • #2830: Fixed KeyError in coherence model, by @pietrotrope

    ⚠️ Removed functionality & deprecations

    • #3176: Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by @rock420
    • #2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro
    • #3180: Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by @rock420

    🔮 Testing, CI, housekeeping

    • #3156: Update Numpy minimum version to 1.17.0, by @PrimozGodec
    • #3143: replace _mul function with explicit casts, by @mpenkov
    • #2952: Allow newer versions of the Morfessor module for the tests, by @pabs3
    • #2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro
  • 4.1.1(Sep 14, 2021)

  • 4.1.0(Aug 29, 2021)

    4.0.1, 2021-04-01

    Bugfix release to address issues with Wheels on Windows:

    • https://github.com/RaRe-Technologies/gensim/issues/3095
    • https://github.com/RaRe-Technologies/gensim/issues/3097

    4.0.0, 2021-03-24

    ⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

    Gensim 4.0 is a major release with lots of performance & robustness improvements, and a new website.

    Main highlights

    • Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

      a. Efficiency

      | model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
      |----------|------------|--------|
      | fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s |
      | word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s |

      In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)

      b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

      c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch & co.

      These improvements come to you transparently, aka "for free", but see the Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

    • Dropped a bunch of externally contributed modules and wrappers: summarization, pivoted TFIDF, Mallet…

      • Code quality was not up to our standards. Also there was no one to maintain these modules, answer user questions, support them.

        So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork & publish into your own repo. They can live happily outside of Gensim.

    • Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

      • If you still need Python 2 for some reason, stay at Gensim 3.8.3.
    • A new Gensim website – finally! 🙃

    So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

    This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting concrete NLP & document similarity use-cases.
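
To make the efficiency table in the highlights easier to read, the ratios between 3.8.3 and 4.0.0 can be computed directly from its figures. The sketch below only does arithmetic on the numbers quoted in that table. The dictionary layout and the `improvement` helper are illustrative choices, not part of Gensim.

```python
# Figures copied from the efficiency table:
# (wall time in hours, peak RAM in GB, throughput in thousand words/s).
benchmarks = {
    "fastText": {"3.8.3": (2.9, 4.11, 822), "4.0.0": (2.3, 1.26, 914)},
    "word2vec": {"3.8.3": (1.7, 0.36, 1685), "4.0.0": (1.2, 0.33, 1762)},
}


def improvement(model):
    """Return old/new ratios for time and RAM, and new/old for throughput."""
    old_time, old_ram, old_rate = benchmarks[model]["3.8.3"]
    new_time, new_ram, new_rate = benchmarks[model]["4.0.0"]
    return {
        "time_speedup": round(old_time / new_time, 2),
        "ram_reduction": round(old_ram / new_ram, 2),
        "throughput_gain": round(new_rate / old_rate, 2),
    }


for name in benchmarks:
    print(name, improvement(name))
```

For fastText the peak RAM ratio comes out at about 3.26, which matches the "3x less RAM" summary. The word2vec "2x faster init" figure is not derivable from the table, since the table reports total wall time only.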

    👍 New features

    📚 Tutorials and docs

    🔴 Bug fixes

    • #2891: Fix fastText word-vectors with ngrams off, by @gojomo
    • #2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo
    • #2899: Fix similarity bug in NMSLIB indexer, by @piskvorky
    • #2899: Fix deprecation warnings in Annoy integration, by @piskvorky
    • #2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah
    • #2940: Fix deprecations in SoftCosineSimilarity, by @Witiko
    • #2944: Fix save_facebook_model failure after update-vocab & other initialization streamlining, by @gojomo
    • #2846: Fix for Python 3.9/3.10: remove xml.etree.cElementTree, by @hugovk
    • #2973: phrases.export_phrases() doesn't yield all bigrams, by @piskvorky
    • #2942: Segfault when training doc2vec, by @gojomo
    • #3041: Fix RuntimeError in export_phrases (change defaultdict to dict), by @thalishsajeed
    • #3059: Fix race condition in FastText tests, by @sleepy-owl

    ⚠️ Removed functionality & deprecations

    🔮 Testing, CI, housekeeping

    4.0.0.rc1, 2021-03-19

    ⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

    Gensim 4.0 is a major release with lots of performance & robustness improvements and a new website.

    Main highlights (see also 👍 Improvements below)

    • Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

      a. Efficiency

      | model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
      |----------|------------|--------|
      | fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s |
      | word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s |

      In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)

      b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

      c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch & co.

      These improvements come to you transparently, aka "for free", but see the Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

    • Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, FIXME.

      • Code quality was not up to our standards. Also there was no one to maintain them, answer user questions, support these modules.

        So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork into your own repo. They can live happily outside of Gensim.

    • Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

      • If you still need Python 2 for some reason, stay at Gensim 3.8.3.
    • A new Gensim website – finally! 🙃

    So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

    This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting common concrete NLP & document similarity use-cases.

    🌟 New Features

    🔴 Bug fixes

    📚 Tutorial and doc improvements

    • fix various documentation warnings (mpenkov, #3077)
    • Fix broken link in run_doc how-to (sezanzeb, #2991)
    • Point WordEmbeddingSimilarityIndex documentation to gensim.similarities (Witiko, #3003)
    • Make the link to the Gensim 3.8.3 documentation dynamic (Witiko, #2996)

    ⚠️ Removed functionality

    🔮 Miscellaneous

    4.0.0beta, 2020-10-31

    ⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

    Gensim 4.0 is a major release with lots of performance & robustness improvements and a new website.

    Main highlights (see also 👍 Improvements below)

    • Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

      a. Efficiency

      | model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
      |----------|------------|--------|
      | fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s |
      | word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s |

      In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)

      b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

      c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch & co.

      These improvements come to you transparently, aka "for free", but see the Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

    • Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, FIXME.

      • Code quality was not up to our standards. Also there was no one to maintain them, answer user questions, support these modules.

        So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork into your own repo. They can live happily outside of Gensim.

    • Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

      • If you still need Python 2 for some reason, stay at Gensim 3.8.3.
    • A new Gensim website – finally! 🙃

    So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

    This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting common concrete NLP & document similarity use-cases.

    Why pre-release?

    This 4.0.0beta pre-release is for users who want cutting-edge performance and bug fixes, and for users who want to help out by testing and providing feedback on code, documentation and workflows. Please let us know on the mailing list!

    Install the pre-release with:

        pip install --pre --upgrade gensim

    What will change between this pre-release and a "full" 4.0 release?

    Production stability is important to Gensim, so we're improving the process of upgrading already-trained saved models. There'll be an explicit model upgrade script between each 4.n to 4.(n+1) Gensim release. Check progress here.

    👍 Improvements

    📚 Tutorials and docs

    🔴 Bug fixes

    • #2891: Fix fastText word-vectors with ngrams off, by @gojomo
    • #2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo
    • #2899: Fix similarity bug in NMSLIB indexer, by @piskvorky
    • #2899: Fix deprecation warnings in Annoy integration, by @piskvorky
    • #2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah
    • #2940: Fix deprecations in SoftCosineSimilarity, by @Witiko
    • #2944: Fix save_facebook_model failure after update-vocab & other initialization streamlining, by @gojomo
    • #2846: Fix for Python 3.9/3.10: remove xml.etree.cElementTree, by @hugovk
    • #2973: phrases.export_phrases() doesn't yield all bigrams
    • #2942: Segfault when training doc2vec

    ⚠️ Removed functionality & deprecations

    • #6: No more binary wheels for x32 platforms, by @menshikh-iv
    • #2899: Renamed overly broad similarities.index to the more appropriate similarities.annoy, by @piskvorky
    • #2958: Remove gensim.summarization subpackage, docs and test data, by @mpenkov
    • #2926: Rename num_words to topn in dtm_coherence, by @MeganStodel
    • #2937: Remove Keras dependency, by @piskvorky
    • Removed all code, methods, attributes and functions marked as deprecated in Gensim 3.8.3.
    • Removed pattern dependency (PR #3012, @mpenkov). If you need to lemmatize, do it prior to passing the corpus to gensim.
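
With the pattern dependency gone, lemmatization becomes a preprocessing step that runs before the corpus reaches gensim. The sketch below shows the general shape of such a step as a streaming generator. The lookup table `LEMMAS` and the functions `lemmatize` and `lemmatized_corpus` are hypothetical stand-ins for whatever lemmatizer you actually use.

```python
# Hypothetical toy lemma table; a real pipeline would call an external lemmatizer here.
LEMMAS = {"wars": "war", "fought": "fight", "was": "be"}


def lemmatize(token):
    """Map a token to its lemma, falling back to the token itself."""
    return LEMMAS.get(token, token)


def lemmatized_corpus(docs):
    """Yield one list of lemmatized, lowercased tokens per document (streamed)."""
    for doc in docs:
        yield [lemmatize(tok) for tok in doc.lower().split()]


corpus = list(lemmatized_corpus(["Wars were fought", "Peace was made"]))
print(corpus)  # [['war', 'were', 'fight'], ['peace', 'be', 'made']]
```

Token lists of this form are the kind of input that `gensim.corpora.Dictionary` accepts, so the generator can sit directly in front of the rest of the pipeline.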
Owner
RARE Technologies
Commercial Machine Learning & NLP
RARE Technologies
A Topic Modeling toolbox

Topik A Topic Modeling toolbox. Introduction The aim of topik is to provide a full suite and high-level interface for anyone interested in applying to

Anaconda, Inc. (formerly Continuum Analytics, Inc.) 93 Dec 1, 2022
Fast, flexible and easy to use probabilistic modelling in Python.

Please consider citing the JMLR-MLOSS Manuscript if you've used pomegranate in your academic work! pomegranate is a package for building probabilistic

Jacob Schreiber 3k Dec 29, 2022
A standard framework for modelling Deep Learning Models for tabular data

PyTorch Tabular aims to make Deep Learning with Tabular data easy and accessible to real-world cases and research alike.

null 801 Jan 8, 2023
Supervised domain-agnostic prediction framework for probabilistic modelling

A supervised domain-agnostic framework that allows for probabilistic modelling, namely the prediction of probability distributions for individual data

The Alan Turing Institute 112 Oct 23, 2022
:boar: :bear: Deep Learning based Python Library for Stock Market Prediction and Modelling

bulbea "Deep Learning based Python Library for Stock Market Prediction and Modelling." Table of Contents Installation Usage Documentation Dependencies

Achilles Rasquinha 1.8k Jan 5, 2023
Civsim is a basic civilisation simulation and modelling system built in Python 3.8.

Civsim Introduction Civsim is a basic civilisation simulation and modelling system built in Python 3.8. It requires the following packages: perlin_noi

null 17 Aug 8, 2022
Dataloader tools for language modelling

Installation: pip install lm_dataloader Design Philosophy A library to unify lm dataloading at large scale Simple interface, any tokenizer can be inte

null 5 Mar 25, 2022
A Tensorflow based library for Time Series Modelling with Gaussian Processes

Markovflow Documentation | Tutorials | API reference | Slack What does Markovflow do? Markovflow is a Python library for time-series analysis via prob

Secondmind Labs 24 Dec 12, 2022
Pre-trained BERT Models for Ancient and Medieval Greek, and associated code for LaTeCH 2021 paper titled - "A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek"

Ancient Greek BERT The first and only available Ancient Greek sub-word BERT model! State-of-the-art post fine-tuning on Part-of-Speech Tagging and Mor

Pranaydeep Singh 22 Dec 8, 2022
Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations

Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations Code repo for paper Trans-Encoder: Unsupervised sentence-pa

Amazon 101 Dec 29, 2022
Reaction SMILES-AA mapping via language modelling

rxn-aa-mapper Reactions SMILES-AA sequence mapping setup conda env create -f conda.yml conda activate rxn_aa_mapper In the following we consider on ex

null 16 Dec 13, 2022
Computational modelling of ray propagation through optical elements using the principles of geometric optics (Ray Tracer)

Computational modelling of ray propagation through optical elements using the principles of geometric optics (Ray Tracer) Introduction By applying the

Son Gyo Jung 1 Jul 9, 2022
CLNTM - Contrastive Learning for Neural Topic Model

Contrastive Learning for Neural Topic Model This repository contains the impleme

Thong Thanh Nguyen 25 Nov 24, 2022
Deep Learning for humans

Keras: Deep Learning for Python Under Construction In the near future, this repository will be used once again for developing the Keras codebase. For

Keras 57k Jan 9, 2023
Machine Learning toolbox for Humans

Reproducible Experiment Platform (REP) REP is ipython-based environment for conducting data-driven research in a consistent and reproducible way. Main

Yandex 662 Nov 20, 2022
Deep Learning for humans

Keras: Deep Learning for Python Under Construction In the near future, this repository will be used once again for developing the Keras codebase. For

Keras 50.7k Feb 12, 2021
Knowledge Management for Humans using Machine Learning & Tags

HyperTag HyperTag helps humans intuitively express how they think about their files using tags and machine learning.

Ravn Tech, Inc. 165 Nov 4, 2022
Reimplementation of the paper `Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words? (ACL2020)`

Human Attention for Text Classification Re-implementation of the paper Human Attention Maps for Text Classification: Do Humans and Neural Networks Foc

Shunsuke KITADA 15 Dec 13, 2021
Synthetic Humans for Action Recognition, IJCV 2021

SURREACT: Synthetic Humans for Action Recognition from Unseen Viewpoints Gül Varol, Ivan Laptev and Cordelia Schmid, Andrew Zisserman, Synthetic Human

Gul Varol 59 Dec 14, 2022