Topic Modelling for Humans

RARE Technologies

Last update: Jan 2, 2023

Related tags

Text Data & NLP python nlp data-science machine-learning natural-language-processing information-retrieval data-mining neural-network word2vec word-embeddings topic-modeling gensim fasttext document-similarity word-similarity

Overview

gensim – Topic Modelling in Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.

Features

All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core).
Intuitive interfaces
- easy to plug in your own input corpus/datastream (trivial streaming API)
- easy to extend with other Vector Space algorithms (trivial transformation API)
Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning.
Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
Extensive documentation and Jupyter Notebook tutorials.

If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.

Installation

This software depends on NumPy and SciPy, two Python packages for scientific computing. You must have them installed prior to installing gensim.

It is also recommended you install a fast BLAS library before installing NumPy. This is optional, but using an optimized BLAS such as ATLAS or OpenBLAS is known to improve performance by as much as an order of magnitude. On OS X, NumPy picks up the BLAS that comes with it automatically, so you don’t need to do anything special.

Install the latest version of gensim:

    pip install --upgrade gensim

Or, if you have instead downloaded and unzipped the source tar.gz package:

    python setup.py install

For alternative modes of installation, see the documentation.

Gensim is being continuously tested under Python 3.6, 3.7 and 3.8. Support for Python 2.7 was dropped in gensim 4.0.0 – install gensim 3.8.3 if you must use Python 2.7.

How come gensim is so fast and memory efficient? Isn’t it pure Python, and isn’t Python slow and greedy?

Many scientific algorithms can be expressed in terms of large matrix operations (see the BLAS note above). Gensim taps into these low-level BLAS libraries, by means of its dependency on NumPy. So while gensim-the-top-level-code is pure Python, it actually executes highly optimized Fortran/C under the hood, including multithreading (if your BLAS is so configured).

Memory-wise, gensim makes heavy use of Python’s built-in generators and iterators for streamed data processing. Memory efficiency was one of gensim’s design goals, and is a central feature of gensim, rather than something bolted on as an afterthought.
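As an illustration of the streaming idea, here is a minimal sketch in plain Python. It does not use gensim itself; the class name `StreamedCorpus` and the helper `count_documents` are hypothetical names chosen for this example. The point is only that a corpus can be any iterable that yields one document at a time, so the full dataset never has to fit in RAM.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


class StreamedCorpus:
    """Hypothetical corpus that yields one bag-of-words document at a time."""

    def __init__(self, lines: Iterable[str]) -> None:
        # `lines` may be a list or an open file handle; with a file handle,
        # only the current line is held in memory.
        self.lines = lines

    def __iter__(self) -> Iterator[List[Tuple[str, int]]]:
        for line in self.lines:
            tokens = line.lower().split()
            # Emit sorted (token, count) pairs for this document only.
            yield sorted(Counter(tokens).items())


def count_documents(corpus: Iterable) -> int:
    """Consume the stream in one pass, keeping only a counter in memory."""
    return sum(1 for _ in corpus)


docs = ["human machine interface", "graph of trees", "graph minors graph"]
corpus = StreamedCorpus(docs)
for bow in corpus:
    print(bow)
print(count_documents(corpus))
```

Note that a file handle can be iterated only once unless it is rewound, so a real streamed corpus would typically reopen its source inside `__iter__`.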

Documentation

Support

Ask open-ended or research questions on the Gensim Mailing List.

Raise bugs on Github but make sure you follow the issue template. Issues that are not bugs or fail to follow the issue template will be closed without inspection.

Adopters

Company	Industry	Use of Gensim
RARE Technologies	ML & NLP consulting	Creators of Gensim – this is us!
Amazon	Retail	Document similarity.
National Institutes of Health	Health	Processing grants and publications with word2vec.
Cisco Security	Security	Large-scale fraud detection.
Mindseye	Legal	Similarities in legal documents.
Channel 4	Media	Recommendation engine.
Talentpair	HR	Candidate matching in high-touch recruiting.
Juju	HR	Provide non-obvious related job suggestions.
Tailwind	Media	Post interesting and relevant content to Pinterest.
Issuu	Media	Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about.
Search Metrics	Content Marketing	Gensim word2vec used for entity disambiguation in Search Engine Optimisation.
12K Research	Media	Document similarity analysis on media articles.
Stillwater Supercomputing	Hardware	Document comprehension and association with word2vec.
SiteGround	Web hosting	An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA.
Capital One	Finance	Topic modeling for customer complaints exploration.

Citing gensim

When citing gensim in academic papers and theses, please use this BibTeX entry:

@inproceedings{rehurek_lrec,
      title = {{Software Framework for Topic Modelling with Large Corpora}},
      author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
      booktitle = {{Proceedings of the LREC 2010 Workshop on New
           Challenges for NLP Frameworks}},
      pages = {45--50},
      year = 2010,
      month = May,
      day = 22,
      publisher = {ELRA},
      address = {Valletta, Malta},
      note={\url{http://is.muni.cz/publication/884893/en}},
      language={English}
}

Comments

NMF metrics and wikipedia
Add clean up and fixes on top of #2361:

[x] Fix l2 of all models

[x] Fix perplexity of the sklearn NMF

[x] Change loss function in the logs of the model

[x] Print topics in the wikipedia notebook

[x] Reorder the benchmark tables

[x] Fix an out-of-bounds error in grouper

[x] Specify the shape of the corpus in the module docstring

[x] Fix the generator issue

[x] Add more unittests

[x] More explanations in wikipedia

[x] Fix flake8

[x] Add comment about sparsity method

[x] Mention pass parameter behavior for generators/iterators

[x] Fix smart_open import

[x] Fix tests
opened by anotherbugmaster 113
File-based fast training for Any2Vec models
Tutorial explaining the whats & hows: Jupyter notebook

note: all preliminary discussions are in https://github.com/RaRe-Technologies/gensim/pull/2048

This PR summarizes all my work during GSoC 2018. To better understand what's going on, follow the links:

My proposal: https://persiyanov.github.io/jekyll/update/2018/04/24/accepted-to-gsoc-2018.html

First benchmarks: https://persiyanov.github.io/jekyll/update/2018/05/28/gsoc-first-weeks.html

Last blog post about the almost final solution: https://persiyanov.github.io/2018/07/06/gsoc-midreport.html

Links to all benchmarks: https://gist.github.com/persiyanov/84b806233947e0069a243433579b35db

Previous PR about vocab building : https://github.com/RaRe-Technologies/gensim/pull/2078 (reverted these changes in current PR because of API design issues)

Previous PR about multistream training (all useful changes in this PR): https://github.com/RaRe-Technologies/gensim/pull/2048

Summary

This pull request proposes a new corpus_file argument for the Word2Vec, FastText and Doc2Vec models. Use corpus_file instead of the standard sentences argument if you have the preprocessed dataset on disk and want a significant speedup during model training.

On our benchmarks, training Word2Vec on an English Wikipedia dump is 370% faster with corpus_file than training with sentences (see the attached Jupyter notebook with the code).

Look at this chart for Word2Vec:

Usage

The usage is really simple. I'll provide examples for Word2Vec while the usage for FastText and Doc2Vec is identical. The corpus_file argument is supported for:

Constructor

    # Standard way
    model = Word2Vec(sentences=my_corpus, <...other arguments...>)

    # New way
    model = Word2Vec(corpus_file='my_corpus_saved.txt', <...other arguments...>)

    # You can save your own corpus using
    gensim.utils.save_as_line_sentence(my_corpus, 'my_corpus_saved.txt')

build_vocab

    # Create the model without training
    model = Word2Vec(<...other arguments...>)

    # Standard way
    model.build_vocab(sentences=my_corpus, ...)

    # New way
    model.build_vocab(corpus_file='my_corpus_saved.txt', ...)

train

    # Create the model without training
    model = Word2Vec(<...other arguments...>)

    # Build vocab (with `sentences` or `corpus_file` way, choose what you like)
    model.build_vocab(corpus_file='my_corpus_saved.txt')

    # Train the model (old way)
    model.train(sentences=my_corpus, total_examples=model.corpus_count, ...)

    # Train the model (new way)
    model.train(corpus_file='my_corpus_saved.txt', total_words=model.corpus_total_words, ...)

That's it! Everything else remains the same as before.

Details

First, let me describe the standard approach to training *2Vec models:

The user provides an input data stream (a Python iterable).

One job_producer Python thread is created. This thread reads data from the input stream and pushes batches into a Python queue (job_queue).

Several worker threads pull batches from job_queue and perform model updates. Batches are Python lists of lists of tokens. They are first translated into C structures, and then a model update is performed without the GIL.

This approach lets model updates scale linearly, but batch production (everything from reading the input to filling C structures from Python objects) is the bottleneck in this pipeline.
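The producer/worker pipeline described above can be sketched in plain Python. This is an illustration only, not gensim's actual training code: the function names, `BATCH_SIZE`, `NUM_WORKERS`, and the use of a `None` sentinel are assumptions made for the example, and the "model update" is replaced by a simple token count.

```python
import queue
import threading
from typing import Iterable, List

BATCH_SIZE = 2   # sentences per job (illustrative value)
NUM_WORKERS = 3  # number of worker threads (illustrative value)


def job_producer(sentences: Iterable[List[str]], job_queue: queue.Queue) -> None:
    """Read the input stream and push batches of sentences into the queue."""
    batch: List[List[str]] = []
    for sentence in sentences:
        batch.append(sentence)
        if len(batch) == BATCH_SIZE:
            job_queue.put(batch)
            batch = []
    if batch:
        job_queue.put(batch)
    for _ in range(NUM_WORKERS):
        job_queue.put(None)  # one sentinel per worker signals end of input


def worker(job_queue: queue.Queue, results: List[int], lock: threading.Lock) -> None:
    """Pull batches and 'update the model' (here: just count tokens)."""
    processed = 0
    while True:
        batch = job_queue.get()
        if batch is None:
            break
        processed += sum(len(sentence) for sentence in batch)
    with lock:
        results.append(processed)


def train(sentences: Iterable[List[str]]) -> int:
    """Run one producer and several workers; return the total tokens processed."""
    job_queue: queue.Queue = queue.Queue(maxsize=2 * NUM_WORKERS)
    results: List[int] = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(job_queue, results, lock))
        for _ in range(NUM_WORKERS)
    ]
    for thread in workers:
        thread.start()
    producer = threading.Thread(target=job_producer, args=(sentences, job_queue))
    producer.start()
    producer.join()
    for thread in workers:
        thread.join()
    return sum(results)


print(train([["a", "b"], ["c"], ["d", "e", "f"]]))
```

In this sketch the single producer thread does all the reading and batching, which is where the bottleneck described above arises once the workers become fast enough.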

Clearly, we can't optimize batch generation for an arbitrary Python stream with custom user logic. Instead, we optimized only for data stored on disk in the gensim.models.word2vec.LineSentence format (one sentence per line, words separated by whitespace).
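To make the on-disk format concrete, here is a minimal sketch of writing and reading a corpus with one sentence per line and whitespace-separated words. The helpers `write_line_sentences` and `read_line_sentences` are hypothetical stand-ins written for this example; they are not gensim functions, and they assume that tokens contain no whitespace.

```python
import os
import tempfile
from typing import Iterable, Iterator, List


def write_line_sentences(sentences: Iterable[List[str]], path: str) -> None:
    """Write one sentence per line, with tokens joined by single spaces."""
    with open(path, "w", encoding="utf-8") as fout:
        for tokens in sentences:
            fout.write(" ".join(tokens) + "\n")


def read_line_sentences(path: str) -> Iterator[List[str]]:
    """Stream the file back, yielding one list of tokens per line."""
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            yield line.split()


my_corpus = [["hello", "world"], ["topic", "modelling", "for", "humans"]]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_corpus_saved.txt")
    write_line_sentences(my_corpus, path)
    restored = list(read_line_sentences(path))
print(restored)
```

Because every sentence is a single line of plain text, a reader can split the file into byte ranges and parse each range independently, which is what makes this format convenient for parallel readers.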

This restriction allowed us to read the data directly at the C++ level without the GIL and then immediately perform model updates. As a result, training now scales linearly.
opened by persiyanov 102
Use FastSS for fast kNN over Levenshtein distance
Introduction

The LevenshteinSimilarityIndex term similarity index in the termsim.levenshtein module implements the lexical text similarity search technique described by Charlet and Damnati (2017) in their paper describing their winning system at SemEval-2017 Task 3: Community Question Answering.

We are showing a related semantic similarity search technique using the WordEmbeddingSimilarityIndex term similarity index in our Soft Cosine Similarity autoexample, which enjoys some popularity among our users. We would like to also advertise LevenshteinSimilarityIndex, which provides a different but equally useful kind of search. However, the current implementation uses brute-force kNN search over the Levenshtein distance to produce a term similarity matrix, which is so slow that it can take years to produce a matrix even for medium-sized corpora such as the English Wikipedia.

Following the discussion in https://github.com/RaRe-Technologies/gensim/issues/2541, @piskvorky and I implemented indexing using the FastSS algorithm for kNN search over the Levenshtein distance in hopes that this would speed LevenshteinSimilarityIndex up by at least three orders of magnitude (1,000×), so that it can compete with WordEmbeddingSimilarityIndex. As an added bonus, using the FastSS algorithm allows us to remove our external dependence on the python-Levenshtein library.
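For readers unfamiliar with FastSS, the core idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation from this pull request: the names `deletion_variants`, `levenshtein` and `FastSSIndex` are made up for the example. Every word is indexed under all strings obtained by deleting up to `max_dist` characters; two words within Levenshtein distance `max_dist` always share at least one such variant, so a query only needs to verify the candidates that share a variant with it.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def deletion_variants(word: str, max_dist: int) -> Set[str]:
    """All strings obtained by deleting up to `max_dist` characters from `word`."""
    variants = {word}
    for num_deleted in range(1, min(max_dist, len(word)) + 1):
        for positions in combinations(range(len(word)), num_deleted):
            skipped = set(positions)
            variants.add("".join(ch for i, ch in enumerate(word) if i not in skipped))
    return variants


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


class FastSSIndex:
    """Hypothetical minimal index: deletion variant -> words that produce it."""

    def __init__(self, words: Iterable[str], max_dist: int = 2) -> None:
        self.max_dist = max_dist
        self.index: Dict[str, Set[str]] = defaultdict(set)
        for word in words:
            for variant in deletion_variants(word, max_dist):
                self.index[variant].add(word)

    def query(self, word: str) -> List[Tuple[int, str]]:
        """Return (distance, word) pairs within `max_dist`, nearest first."""
        candidates: Set[str] = set()
        for variant in deletion_variants(word, self.max_dist):
            candidates |= self.index.get(variant, set())
        hits = [(levenshtein(word, candidate), candidate) for candidate in candidates]
        return sorted(hit for hit in hits if hit[0] <= self.max_dist)


index = FastSSIndex(["hello", "hallo", "help", "world"], max_dist=2)
print(index.query("hello"))
```

The trade-off is memory for speed: the index stores many variants per word, but a query touches only a small candidate set instead of computing the distance to every term in the vocabulary.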

Speed comparison

Below, I will show a before-and-after speed comparison of LevenshteinSimilarityIndex compared to the standard WordEmbeddingSimilarityIndex shown in the Soft Cosine Similarity autoexample. We are measuring how many kNN searches per second, k = 100, a term similarity index can perform. To produce my dictionary (253,854 terms) and word embeddings, I will use the text8 corpus (100 MB). I am running the code on a Dell Inspiron 15 7559.

Before the change

We can see that even with our tiny corpus, the LevenshteinSimilarityIndex takes over three days to find the 100 nearest neighbors for all 253,854 terms in our vocabulary. Contrast this with the WordEmbeddingSimilarityIndex, which finishes in about four minutes even though we are using exact nearest-neighbor search and could get a further speed-up by using e.g. the Annoy index.

    $ pip install gensim==4.0.1 python-Levenshtein
    $ wget http://mattmahoney.net/dc/text8.zip
    $ unzip text8.zip
    $ python
    >>> from gensim.corpora import Dictionary
    >>> from gensim.models.word2vec import LineSentence, Word2Vec
    >>> from gensim.similarities import (
    ...     SparseTermSimilarityMatrix,
    ...     WordEmbeddingSimilarityIndex,
    ...     LevenshteinSimilarityIndex,
    ... )
    >>>
    >>> corpus = LineSentence('text8')
    >>> dictionary = Dictionary(corpus)
    >>> w2v_model = Word2Vec(sentences=corpus)
    >>> embedding_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
    >>> levenshtein_index = LevenshteinSimilarityIndex(dictionary)
    >>>
    >>> SparseTermSimilarityMatrix(embedding_index, dictionary)
    100%|███████████████████████████████| 253854/253854 [04:04<00:00, 1037.97it/s]
    >>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
      0%|                               | 124/253854 [02:24<80:18:05,  1.14s/it]

After the change

With the FastSS algorithm, the LevenshteinSimilarityIndex receives a 1,500× speed-up and is now not just on par with the WordEmbeddingSimilarityIndex but 1.5× faster. Both term similarity indexes now find the 100 nearest neighbors for all 253,854 terms in our vocabulary in under 4 minutes.

    $ pip install lexpy git+https://github.com/witiko/gensim@7054f90
    $ python
    >>> from gensim.corpora import Dictionary
    >>> from gensim.models.word2vec import LineSentence, Word2Vec
    >>> from gensim.similarities import (
    ...     SparseTermSimilarityMatrix,
    ...     WordEmbeddingSimilarityIndex,
    ...     LevenshteinSimilarityIndex,
    ... )
    >>>
    >>> corpus = LineSentence('text8')
    >>> dictionary = Dictionary(corpus)
    >>> w2v_model = Word2Vec(sentences=corpus)
    >>> embedding_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
    >>> levenshtein_index = LevenshteinSimilarityIndex(dictionary)
    >>>
    >>> SparseTermSimilarityMatrix(embedding_index, dictionary)
    100%|███████████████████████████████| 253854/253854 [03:57<00:00, 1070.14it/s]
    >>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
    100%|███████████████████████████████| 253854/253854 [02:34<00:00, 1639.23it/s]

Conclusion

Using the FastSS algorithm for kNN search over the Levenshtein distance, we managed to increase the speed of the LevenshteinSimilarityIndex term similarity index by three orders of magnitude (1,500×) on the text8 corpus. As an added bonus, using the FastSS algorithm allowed us to remove our external dependence on the python-Levenshtein library. Closes #2541.
opened by Witiko 70

numpy 1.19.2 incompatible with gensim 4.1.0

Problem description

When importing gensim I get the following error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/__init__.py", line 11, in <module>
    from gensim import parsing, corpora, matutils, interfaces, models, similarities, utils  # noqa:F401
  File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/corpora/__init__.py", line 6, in <module>
    from .indexedcorpus import IndexedCorpus  # noqa:F401 must appear before the other classes
  File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/corpora/indexedcorpus.py", line 14, in <module>
    from gensim import interfaces, utils
  File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/interfaces.py", line 19, in <module>
    from gensim import utils, matutils
  File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/matutils.py", line 1024, in <module>
    from gensim._matutils import logsumexp, mean_absolute_difference, dirichlet_expectation
  File "gensim/_matutils.pyx", line 1, in init gensim._matutils
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

Steps/code/corpus to reproduce

conda create --name=test python=3.7 -y
conda install -y numpy==1.19.2
pip install gensim

Versions

Linux-5.11.0-25-generic-x86_64-with-debian-bullseye-sid
Python 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
Bits 64
NumPy 1.19.2
SciPy 1.7.1

opened by martinobertoni 58

Doc2vec not parallelizing
Doc2vec does not use all my cores despite my setting workers=8 when I instantiate it.

My install passes the assert below:

assert gensim.models.doc2vec.FAST_VERSION > -1

Do I have to do something else?
bug difficulty hard
opened by fccoelho 55

LSI worker getting "stuck"

Description

When building an LsiModel in distributed mode, one of the workers gets "stuck" while orthonormalizing the action matrix. This stalls the whole process of building the model, as the dispatcher hangs on "reached the end of input; now waiting for all remaining jobs to finish".

Steps/Code/Corpus to Reproduce

lsi_model = LsiModel(
        id2word=bow,
        num_topics=300,
        chunksize=5000,
        distributed=True
    )
lsi_model.add_documents(corpus)

The LSI dispatcher and workers are initialized in a separate bash script. I have tried with the number of LSI workers set to 16 and 8.

Gensim version: 3.6.0
Pyro4 version: 4.63

Expected Results

Process should run to completion

Actual Results

Main script output:

[2019-01-06 04:04:09,862] [23465] [gensim.models.lsimodel] [INFO] {add_documents:462} updating model with new documents
[2019-01-06 04:04:09,862] [23465] [gensim.models.lsimodel] [INFO] {add_documents:485} initializing 8 workers
[2019-01-06 04:05:12,131] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
[2019-01-06 04:05:12,135] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
[2019-01-06 04:05:12,497] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #0
[2019-01-06 04:05:12,541] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #5000
[2019-01-06 04:06:46,191] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
[2019-01-06 04:06:46,200] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
[2019-01-06 04:06:46,618] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #1
[2019-01-06 04:06:46,682] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #10000
[2019-01-06 04:08:11,839] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
[2019-01-06 04:08:11,843] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
[2019-01-06 04:08:12,561] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #2
[2019-01-06 04:08:12,786] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #15000
[2019-01-06 04:09:48,217] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
[2019-01-06 04:09:48,230] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
[2019-01-06 04:09:48,700] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #3
[2019-01-06 04:09:48,786] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #20000
[2019-01-06 04:09:48,938] [23465] [gensim.models.lsimodel] [INFO] {add_documents:518} reached the end of input; now waiting for all remaining jobs to finish

Output of LSI worker that is stuck:

2019-01-06 04:04:09,867 - INFO - resetting worker #1
2019-01-06 04:06:46,705 - INFO - worker #1 received job #208
2019-01-06 04:06:46,705 - INFO - updating model with new documents
2019-01-06 04:06:46,705 - INFO - using 100 extra samples and 2 power iterations
2019-01-06 04:06:46,705 - INFO - 1st phase: constructing (500000, 400) action matrix
2019-01-06 04:06:48,402 - INFO - orthonormalizing (500000, 400) action matrix

CPU for that LSI worker has been ~100% for >24 hours.

Versions

Linux-4.10.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01) [GCC 5.4.0 20160609]
NumPy 1.15.2
SciPy 1.1.0
gensim 3.6.0
FAST_VERSION 1

need info

opened by robguinness 53

Loading fastText binary output to gensim like word2vec

Facebook recently open-sourced fastText (https://github.com/facebookresearch/fastText), which improves on the word2vec SkipGram model. It uses a similar output format of word-vector key-value pairs, and the similarity calculation is about the same too, but its binary output format differs from that of the C version of word2vec. Do we want to support loading fastText model output in gensim? Thanks.
need info

opened by phunterlau 49
Add nmslib indexer

Hi, I added an nmslib indexer.

Some benchmarks show that nmslib performs better than the Annoy indexer:
https://erikbern.com/2018/06/17/new-approximate-nearest-neighbor-benchmarks.html
https://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/

This is my first contribution to gensim. If I missed something, please let me know.
feature

opened by masa3141 47
Distributed Representations of Sentences and Documents

A reasonable approximation of the method described in the paper Distributed Representations of Sentences and Documents (http://cs.stanford.edu/~quocle/paragraph_vector.pdf).

Python isn't my first language, so I don't pretend that I did everything in the most "pythonic" way here, but I pulled out the portions of the word2vec code that needed to be modified into their own methods, then added another class extending word2vec which just modifies those few functions.

I also don't really know what to do in terms of refactoring cython code to limit code duplication, so doc2vec has its own set of cython functions which are completely independent of the word2vec ones.

Hope that helps. Tim

opened by temerick 47
Easy import of GloVe vectors using Gensim

word2vec embedding files start with a header line giving the number of lines (tokens?) and the number of dimensions in the file. This allows gensim to allocate memory accordingly when loading the model for querying; more dimensions mean more memory is reserved. Accordingly, this line has to be inserted into the GloVe embeddings file.
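The conversion described here can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code proposed in this issue: the function name `glove_to_word2vec` and the file names are hypothetical, and it assumes a text-format GloVe file in which each line is a token followed by its space-separated vector components.

```python
import os
import tempfile
from typing import Tuple


def glove_to_word2vec(glove_path: str, word2vec_path: str) -> Tuple[int, int]:
    """Copy a GloVe text file, prepending a '<num_vectors> <num_dims>' header."""
    num_vectors = 0
    num_dims = 0
    # First pass: count the vectors and infer the dimensionality from line one.
    with open(glove_path, encoding="utf-8") as fin:
        for line in fin:
            if num_vectors == 0:
                num_dims = len(line.rstrip().split(" ")) - 1
            num_vectors += 1
    # Second pass: write the header, then stream the vectors through unchanged.
    with open(glove_path, encoding="utf-8") as fin, \
            open(word2vec_path, "w", encoding="utf-8") as fout:
        fout.write(f"{num_vectors} {num_dims}\n")
        for line in fin:
            fout.write(line)
    return num_vectors, num_dims


with tempfile.TemporaryDirectory() as tmp_dir:
    glove_file = os.path.join(tmp_dir, "glove.txt")
    w2v_file = os.path.join(tmp_dir, "glove_w2v.txt")
    with open(glove_file, "w", encoding="utf-8") as fout:
        fout.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
    shape = glove_to_word2vec(glove_file, w2v_file)
    with open(w2v_file, encoding="utf-8") as fin:
        header = fin.readline().strip()
print(shape, header)
```

Reading the file twice keeps memory use constant regardless of the size of the embeddings, at the cost of one extra pass over the disk.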
feature

opened by manasRK 46
Implement Okapi BM25 variants in Gensim
This pull request implements the gensim.models.bm25model module, which contains an implementation of the Okapi BM25 model and its modifications (Lucene BM25 and ATIRE) as discussed in https://github.com/RaRe-Technologies/gensim/issues/2592#issuecomment-866799145. The module acts as a replacement for the gensim.summarization.bm25 module deprecated and removed in Gensim 4. The module should supersede the gensim.models.tfidfmodel module as the baseline weighting function for information retrieval and related NLP tasks.
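For orientation, here is a minimal sketch of BM25-style scoring in plain Python. It is not the code from this pull request: the function `bm25_scores` is a hypothetical name, the defaults `k1=1.5` and `b=0.75` are common choices assumed for the example, and the smoothed, non-negative IDF used here is just one of several variants, so the numbers it produces will generally differ from those of other implementations.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: Sequence[str], corpus: Sequence[Sequence[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document in `corpus` against `query` (binary query weights)."""
    num_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / num_docs
    # Document frequency: in how many documents does each term occur?
    doc_freqs = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freqs = Counter(doc)
        # Longer-than-average documents are penalized, controlled by `b`.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in set(query):
            tf = term_freqs.get(term, 0)
            if tf == 0:
                continue
            n = doc_freqs[term]
            # Smoothed IDF that stays non-negative even for very common terms.
            idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
            # Term-frequency saturation, controlled by `k1`.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
scores = bm25_scores(["Hello", "bar"], corpus)
best_document = corpus[scores.index(max(scores))]
print(best_document)
```

The sketch computes document frequencies on every call and scans every document, which is fine for a toy corpus but is exactly the kind of work that a prebuilt dictionary and index are meant to avoid.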

Most implementations of BM25 such as the rank-bm25 library combine indexing with weighting and often forgo dictionary building for a speed improvement at indexing time (but a hefty penalty at retrieval time). To give an example, here is how a user would search for documents with rank-bm25:

    >>> from rank_bm25 import BM25Okapi
    >>>
    >>> corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
    >>> bm25_model = BM25Okapi(corpus)
    >>>
    >>> query = ["Hello", "bar"]
    >>> similarities = bm25_model.get_scores(query)
    >>> similarities
    array([0.51082562, 0.09121886, 0.0638532 ])
    >>> best_document, = bm25_model.get_top_n(query, corpus, n=1)
    >>> best_document
    ['Hello', 'world']

As you can see, the interface is convenient, but retrieval is slow due to the lack of a dictionary. Furthermore, any advanced operations such as pruning the dictionary, applying semantic matching (e.g. SCM) and query expansion (e.g. RM3), or sharding the index are unavailable.

By contrast, the gensim.models.bm25model module separates the three operations: dictionary building, weighting, and indexing. To give an example, here is how a user would search for documents with the gensim.models.bm25model module:

    >>> from gensim.corpora import Dictionary
    >>> from gensim.models import TfidfModel, OkapiBM25Model
    >>> from gensim.similarities import SparseMatrixSimilarity
    >>> import numpy as np
    >>>
    >>> corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
    >>> dictionary = Dictionary(corpus)
    >>> bm25_model = OkapiBM25Model(dictionary=dictionary)
    >>> bm25_corpus = bm25_model[list(map(dictionary.doc2bow, corpus))]
    >>> bm25_index = SparseMatrixSimilarity(bm25_corpus, num_docs=len(corpus), num_terms=len(dictionary),
    ...                                     normalize_queries=False, normalize_documents=False)
    >>>
    >>> query = ["Hello", "bar"]
    >>> tfidf_model = TfidfModel(dictionary=dictionary, smartirs='bnn')  # Enforce binary weighting of queries
    >>> tfidf_query = tfidf_model[dictionary.doc2bow(query)]
    >>>
    >>> similarities = bm25_index[tfidf_query]
    >>> similarities
    array([0.51082563, 0.09121886, 0.0638532 ], dtype=float32)
    >>> best_document = corpus[np.argmax(similarities)]
    >>> best_document
    ['Hello', 'world']

Tasks:

[x] Add Okapi BM25, ~~BM25L and BM25⁺~~ [1, 2], Lucene BM25 [3, 4], and ATIRE BM25 [3, 5].

[x] Add comments and docstrings to models.bm25.

[x] Add comments and docstrings to similarities.docsim.

[x] Add BM25 to the run_topics_and_transformations autoexample.

[x] Add normalize_queries=True, normalize_documents=True named parameters to SparseMatrixSimilarity, DenseMatrixSimilarity, and SoftCosineSimilarity classes as discussed in https://github.com/RaRe-Technologies/gensim/pull/3304#issuecomment-1061031969 and on the Gensim mailing list. Deprecate the normalize named parameter of SoftCosineSimilarity. Add normalize_queries=False, normalize_documents=False to TF-IDF and BM25 examples.
opened by Witiko 44

pip install gensim==4.2.0 raises deprecation warning

When installing gensim in a fresh environment I get the following warning. Sorry, it is a lot of output. The command is pip3 install gensim==4.2.0; my pip version is 22.3.1 and my Python version is 3.11 (see below).

Because of character limits, I have cut the output down to just the warnings:

      ...
      running egg_info
      writing gensim.egg-info/PKG-INFO
      writing dependency_links to gensim.egg-info/dependency_links.txt
      writing requirements to gensim.egg-info/requires.txt
      writing top-level names to gensim.egg-info/top_level.txt
      reading manifest file 'gensim.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no files found matching 'COPYING.LESSER'
      warning: no files found matching 'ez_setup.py'
      warning: no files found matching 'gensim/models/doc2vec_inner.c'
      adding license file 'COPYING'
      writing manifest file 'gensim.egg-info/SOURCES.txt'
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data' as data is deprecated, please list it in `packages`.
          !!
      
      
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'gensim.test.test_data' as an importable package,
          but it is not listed in the `packages` configuration of setuptools.
      
          'gensim.test.test_data' has been automatically added to the distribution only
          because it may contain data files, but this behavior is likely to change
          in future versions of setuptools (and therefore is considered deprecated).
      
          Please make sure that 'gensim.test.test_data' is included as a package by using
          the `packages` configuration field or the proper discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
      
          You can read more about "package discovery" and "data files" on setuptools
          documentation page.
      
      
      !!
      
        check.warn(importable)
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.DTM' as data is deprecated, please list it in `packages`.
          !!
      
      
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'gensim.test.test_data.DTM' as an importable package,
          but it is not listed in the `packages` configuration of setuptools.
      
          'gensim.test.test_data.DTM' has been automatically added to the distribution only
          because it may contain data files, but this behavior is likely to change
          in future versions of setuptools (and therefore is considered deprecated).
      
          Please make sure that 'gensim.test.test_data.DTM' is included as a package by using
          the `packages` configuration field or the proper discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
      
          You can read more about "package discovery" and "data files" on setuptools
          documentation page.
      
      
      !!
      
        check.warn(importable)
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.PathLineSentences' as data is deprecated, please list it in `packages`.
          !!
      
      
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'gensim.test.test_data.PathLineSentences' as an importable package,
          but it is not listed in the `packages` configuration of setuptools.
      
          'gensim.test.test_data.PathLineSentences' has been automatically added to the distribution only
          because it may contain data files, but this behavior is likely to change
          in future versions of setuptools (and therefore is considered deprecated).
      
          Please make sure that 'gensim.test.test_data.PathLineSentences' is included as a package by using
          the `packages` configuration field or the proper discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
      
          You can read more about "package discovery" and "data files" on setuptools
          documentation page.
      
      
      !!
      
        check.warn(importable)
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.old_d2v_models' as data is deprecated, please list it in `packages`.
          !!
      
      
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'gensim.test.test_data.old_d2v_models' as an importable package,
          but it is not listed in the `packages` configuration of setuptools.
      
          'gensim.test.test_data.old_d2v_models' has been automatically added to the distribution only
          because it may contain data files, but this behavior is likely to change
          in future versions of setuptools (and therefore is considered deprecated).
      
          Please make sure that 'gensim.test.test_data.old_d2v_models' is included as a package by using
          the `packages` configuration field or the proper discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
      
          You can read more about "package discovery" and "data files" on setuptools
          documentation page.
      
      
      !!
      
        check.warn(importable)
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.old_w2v_models' as data is deprecated, please list it in `packages`.
          !!
      
      
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'gensim.test.test_data.old_w2v_models' as an importable package,
          but it is not listed in the `packages` configuration of setuptools.
      
          'gensim.test.test_data.old_w2v_models' has been automatically added to the distribution only
          because it may contain data files, but this behavior is likely to change
          in future versions of setuptools (and therefore is considered deprecated).
      
          Please make sure that 'gensim.test.test_data.old_w2v_models' is included as a package by using
          the `packages` configuration field or the proper discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
      
          You can read more about "package discovery" and "data files" on setuptools
          documentation page.
      
      
      !!
      
        check.warn(importable)
      copying gensim/_matutils.c -> build/lib.macosx-10.9-universal2-cpython-311/gensim
      copying gensim/_matutils.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim
      ...
      copying gensim/corpora/_mmreader.c -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
      copying gensim/corpora/_mmreader.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
      running build_ext
      building 'gensim.models.word2vec_inner' extension
      creating build/temp.macosx-10.9-universal2-cpython-311
      creating build/temp.macosx-10.9-universal2-cpython-311/gensim
      creating build/temp.macosx-10.9-universal2-cpython-311/gensim/models
      clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.11/include/python3.11 -I/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include -c gensim/models/word2vec_inner.c -o build/temp.macosx-10.9-universal2-cpython-311/gensim/models/word2vec_inner.o
      In file included from gensim/models/word2vec_inner.c:706:
      In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/arrayobject.h:5:
      In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarrayobject.h:12:
      In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarraytypes.h:1948:
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: "Using deprecated NumPy API, disable it with "          "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
      #warning "Using deprecated NumPy API, disable it with " \
       ^
      gensim/models/word2vec_inner.c:12424:5: error: incomplete definition of type 'struct _frame'
          __Pyx_PyFrame_SetLineNumber(py_frame, py_line);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      gensim/models/word2vec_inner.c:457:62: note: expanded from macro '__Pyx_PyFrame_SetLineNumber'
        #define __Pyx_PyFrame_SetLineNumber(frame, lineno)  (frame)->f_lineno = (lineno)
                                                            ~~~~~~~^
      /Library/Frameworks/Python.framework/Versions/3.11/include/python3.11/pytypedefs.h:22:16: note: forward declaration of 'struct _frame'
      typedef struct _frame PyFrameObject;
                     ^
      1 warning and 1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for gensim
  Running setup.py clean for gensim
Failed to build gensim
Installing collected packages: gensim
  Running setup.py install for gensim ... error
  error: subprocess-exited-with-error
  
  × Running setup.py install for gensim did not run successfully.
  │ exit code: 1
  ╰─> [540 lines of output]
      running install
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_py
      creating build
      creating build/lib.macosx-10.9-universal2-cpython-311
      creating build/lib.macosx-10.9-universal2-cpython-311/gensim
      ...
      copying gensim/corpora/svmlightcorpus.py -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
      copying gensim/corpora/hashdictionary.py -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
      running egg_info
      writing gensim.egg-info/PKG-INFO
      writing dependency_links to gensim.egg-info/dependency_links.txt
      writing requirements to gensim.egg-info/requires.txt
      writing top-level names to gensim.egg-info/top_level.txt
      reading manifest file 'gensim.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no files found matching 'COPYING.LESSER'
      warning: no files found matching 'ez_setup.py'
      warning: no files found matching 'gensim/models/doc2vec_inner.c'
      adding license file 'COPYING'
      writing manifest file 'gensim.egg-info/SOURCES.txt'
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data' as data is deprecated, please list it in `packages`.
          !!
      
      
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'gensim.test.test_data' as an importable package,
          but it is not listed in the `packages` configuration of setuptools.
      
          'gensim.test.test_data' has been automatically added to the distribution only
          because it may contain data files, but this behavior is likely to change
          in future versions of setuptools (and therefore is considered deprecated).
      
          Please make sure that 'gensim.test.test_data' is included as a package by using
          the `packages` configuration field or the proper discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
      
          You can read more about "package discovery" and "data files" on setuptools
          documentation page.
      
      
      !!
      
        check.warn(importable)
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.DTM' as data is deprecated, please list it in `packages`.
          !!
      
      
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'gensim.test.test_data.DTM' as an importable package,
          but it is not listed in the `packages` configuration of setuptools.
      
          'gensim.test.test_data.DTM' has been automatically added to the distribution only
          because it may contain data files, but this behavior is likely to change
          in future versions of setuptools (and therefore is considered deprecated).
      
          Please make sure that 'gensim.test.test_data.DTM' is included as a package by using
          the `packages` configuration field or the proper discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
      
          You can read more about "package discovery" and "data files" on setuptools
          documentation page.
      
      
      !!
      
        check.warn(importable)
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.PathLineSentences' as data is deprecated, please list it in `packages`.
          !!
      
      
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'gensim.test.test_data.PathLineSentences' as an importable package,
          but it is not listed in the `packages` configuration of setuptools.
      
          'gensim.test.test_data.PathLineSentences' has been automatically added to the distribution only
          because it may contain data files, but this behavior is likely to change
          in future versions of setuptools (and therefore is considered deprecated).
      
          Please make sure that 'gensim.test.test_data.PathLineSentences' is included as a package by using
          the `packages` configuration field or the proper discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
      
          You can read more about "package discovery" and "data files" on setuptools
          documentation page.
      
      
      !!
      
        check.warn(importable)
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.old_d2v_models' as data is deprecated, please list it in `packages`.
          !!
      
      
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'gensim.test.test_data.old_d2v_models' as an importable package,
          but it is not listed in the `packages` configuration of setuptools.
      
          'gensim.test.test_data.old_d2v_models' has been automatically added to the distribution only
          because it may contain data files, but this behavior is likely to change
          in future versions of setuptools (and therefore is considered deprecated).
      
          Please make sure that 'gensim.test.test_data.old_d2v_models' is included as a package by using
          the `packages` configuration field or the proper discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
      
          You can read more about "package discovery" and "data files" on setuptools
          documentation page.
      
      
      !!
      
        check.warn(importable)
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.old_w2v_models' as data is deprecated, please list it in `packages`.
          !!
      
      
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'gensim.test.test_data.old_w2v_models' as an importable package,
          but it is not listed in the `packages` configuration of setuptools.
      
          'gensim.test.test_data.old_w2v_models' has been automatically added to the distribution only
          because it may contain data files, but this behavior is likely to change
          in future versions of setuptools (and therefore is considered deprecated).
      
          Please make sure that 'gensim.test.test_data.old_w2v_models' is included as a package by using
          the `packages` configuration field or the proper discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
      
          You can read more about "package discovery" and "data files" on setuptools
          documentation page.
      
      
      !!
      
        check.warn(importable)
      copying gensim/_matutils.c -> build/lib.macosx-10.9-universal2-cpython-311/gensim
      copying gensim/_matutils.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim
     ...
      copying gensim/corpora/_mmreader.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
      running build_ext
      building 'gensim.models.word2vec_inner' extension
      creating build/temp.macosx-10.9-universal2-cpython-311
      creating build/temp.macosx-10.9-universal2-cpython-311/gensim
      creating build/temp.macosx-10.9-universal2-cpython-311/gensim/models
      clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.11/include/python3.11 -I/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include -c gensim/models/word2vec_inner.c -o build/temp.macosx-10.9-universal2-cpython-311/gensim/models/word2vec_inner.o
      In file included from gensim/models/word2vec_inner.c:706:
      In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/arrayobject.h:5:
      In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarrayobject.h:12:
      In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarraytypes.h:1948:
      /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: "Using deprecated NumPy API, disable it with "          "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
      #warning "Using deprecated NumPy API, disable it with " \
       ^
      gensim/models/word2vec_inner.c:12424:5: error: incomplete definition of type 'struct _frame'
          __Pyx_PyFrame_SetLineNumber(py_frame, py_line);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      gensim/models/word2vec_inner.c:457:62: note: expanded from macro '__Pyx_PyFrame_SetLineNumber'
        #define __Pyx_PyFrame_SetLineNumber(frame, lineno)  (frame)->f_lineno = (lineno)
                                                            ~~~~~~~^
      /Library/Frameworks/Python.framework/Versions/3.11/include/python3.11/pytypedefs.h:22:16: note: forward declaration of 'struct _frame'
      typedef struct _frame PyFrameObject;
                     ^
      1 warning and 1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> gensim

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

Versions

Please provide the output of:

>>> import platform; print(platform.platform())
macOS-12.5-arm64-arm-64bit
>>> import sys; print("Python", sys.version)
Python 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.23.5
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.9.3

opened by labouz 2

Parameter shardsize ignored on queries

Problem description

When I pass the shardsize parameter to similarities.Similarity, the same setting is not applied when the index is queried, which causes errors:

self._similarity_index = similarities.Similarity(MODELS_PATH + f'/{model}', sim_vectors, num_features=len(self._dictionary), shardsize=50000)

sims = self._similarity_index[doc_vector]

PS: Without the shardsize parameter, the error already occurs in the similarities.Similarity call itself.

Steps/code/corpus to reproduce

Save the .py files in the pruvo folder (package) and the .parquet file in the data folder, then run this script:

import pandas as pd

from pruvo.embedding import Corpus

df = pd.read_parquet('data/preprocess.parquet')

corpus = Corpus()
corpus.add(list(df['bookingRoomType'].unique()), pre_processed=True)
corpus.add(list(df['mappedRoomType'].unique()), pre_processed=True)

w2v = corpus.train(model='word2vec')

w2v_similars = corpus.get_similars('apartment 1 king bed in neverland')
w2v_similars.head(10)

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import struct; print("Bits", 8 * struct.calcsize("P"))
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

files.zip

opened by MaickelHubner 0

Add parameter from_topn in evaluate_word_analogies

With from_topn, an analogy is marked correct if the expected vector is among the from_topn most similar results, even when it is not the single most similar one.

This is useful for evaluating vectors such as confusion vectors, where an answer is marked correct if either of the top two results matches.
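The rule described above can be sketched in plain Python. This is a minimal illustration and not gensim's implementation: the helper name analogy_accuracy, its signature, and the toy rankings are all assumptions made for the example.

```python
def analogy_accuracy(ranked_predictions, expected_words, from_topn=1):
    """Fraction of questions whose expected word is among the top `from_topn` answers.

    Hypothetical helper for illustration only; it is not part of gensim.
    """
    if from_topn < 1:
        raise ValueError("from_topn must be at least 1")
    if not expected_words:
        return 0.0
    correct = 0
    for ranked, expected in zip(ranked_predictions, expected_words):
        # Accept the answer if it appears anywhere in the first from_topn results.
        if expected in ranked[:from_topn]:
            correct += 1
    return correct / len(expected_words)


# Toy data: each inner list holds one question's answers, most similar first.
ranked_predictions = [
    ["queen", "princess", "monarch"],
    ["london", "paris", "berlin"],
    ["walked", "ran", "walking"],
]
expected_words = ["queen", "paris", "swam"]

strict = analogy_accuracy(ranked_predictions, expected_words, from_topn=1)
relaxed = analogy_accuracy(ranked_predictions, expected_words, from_topn=2)
```

With from_topn=1 only the first question counts as correct; with from_topn=2 the second question is also accepted because the expected word is ranked second.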

opened by divyanx 3
BUG: word2vec skipgram model wont work with numpy array
Problem description

I have a language with 240 distinct words. Because each word fits in 1 byte, I map each word to a byte and store the sentences in NumPy uint8 arrays to minimize the memory footprint. Doing this significantly reduces memory consumption. However, because of the check at "gensim\models\word2vec_inner.pyx", line 542, NumPy arrays can't be used and training throws the error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()".

The line in question checks whether the sentence is empty, but it does so with "if not sent:". A more generic check, "if len(sent) == 0:", would fix the problem.

The workaround is to cast each NumPy array to a Python list. However, this significantly increases the memory footprint and is time-consuming on a big dataset.

What are you trying to achieve? What is the expected result? What are you seeing instead?

Steps/code/corpus to reproduce

reproduce:

import numpy as np
import gensim

class SentenceIterator:
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        # Yields each sentence as a NumPy array, which triggers the error.
        for sentence in self.dataset:
            yield sentence

data = []
data.append(np.array([22, 33, 44, 55, 1, 2, 3, 5, 4, 100]))
data.append(np.array([100, 100, 100, 100, 11]))
sentences = SentenceIterator(data)
model = gensim.models.Word2Vec(sentences, vector_size=32, window=3, workers=4, sg=1, negative=10)

PS: Casting each np.array to a Python list fixes the issue, but casting is very slow on a big dataset and significantly increases the memory footprint.

workaround:

class SentenceIterator:
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        for sentence in self.dataset:
            yield sentence.tolist()

possible fix

Change the "if not sent:" checks to "if len(sent) == 0:".
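The difference between the two checks can be reproduced outside gensim. The sketch below is only an illustration under stated assumptions: the helper names is_empty_by_truthiness and is_empty_by_len are hypothetical and merely mimic the two conditions quoted above.

```python
import numpy as np


def is_empty_by_truthiness(sent):
    """Mimics `if not sent:` (hypothetical helper, not gensim code)."""
    return not sent


def is_empty_by_len(sent):
    """Mimics the proposed `if len(sent) == 0:` check (hypothetical helper)."""
    return len(sent) == 0


sentence = np.array([22, 33, 44], dtype=np.uint8)

# Truthiness of a multi-element array is ambiguous, so NumPy raises ValueError.
try:
    is_empty_by_truthiness(sentence)
    truthiness_raised = False
except ValueError:
    truthiness_raised = True

# The length-based check works for arrays and for plain lists alike.
array_is_empty = is_empty_by_len(sentence)
empty_array_is_empty = is_empty_by_len(np.array([], dtype=np.uint8))
empty_list_is_empty = is_empty_by_len([])
```

The length-based check relies only on `len()`, which lists, tuples and NumPy arrays all support, so it avoids the list conversion described in the workaround.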

Versions

Python 3.9.13 (main, Oct 13 2022, 21:23:06) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.

import platform; print(platform.platform())
Windows-10-10.0.19044-SP0
import sys; print("Python", sys.version)
Python 3.9.13 (main, Oct 13 2022, 21:23:06) [MSC v.1916 64 bit (AMD64)]
import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
import numpy; print("NumPy", numpy.__version__)
NumPy 1.23.4
import scipy; print("SciPy", scipy.__version__)
SciPy 1.9.3
import gensim; print("gensim", gensim.__version__)
gensim 4.2.0
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0
opened by isimsizolan 1
Add tox environments to setup.cfg, fix napoleon import
Without this change, tox -e compile,docs fails on my machine with "ERROR: tox config file (either pyproject.toml, tox.ini, setup.cfg) not found". With it, tox -e ALL can now be used.

Fixed the napoleon import, which failed with "Could not import extension sphinxcontrib.napoleon (exception: cannot import name 'Callable' from 'collections' (/usr/lib/python3.10/collections/__init__.py))". See https://github.com/sphinx-doc/sphinx/issues/10378#issuecomment-1107455569

Python 3.10.6, 5.15.65-1-MANJARO
documentation housekeeping
opened by sezanzeb 6
FastTextKeyedVectors.add_vectors is not adding vectors
Problem description

I have been trying to create a FastTextKeyedVectors and add vectors to it using either add_vector or add_vectors, but the methods are not adding anything. After looking at the implementation of those methods, I think there is an error in the check for whether a key has already been added.

Steps/code/corpus to reproduce

I create a FastTextKeyedVectors using the defaults used by the FastText model, then try to add vectors to it using add_vector or add_vectors:

wv = FastTextKeyedVectors(vector_size=2, min_n=3, max_n=6, bucket=2000000)
wv.add_vector("test", [0.5, 0.5])
print(wv.key_to_index)
>> {}
print(wv.index_to_key)
>> []
print(wv.vectors)
>> []
wv.add_vectors(["test"], [[0.5, 0.5]])
print(wv.key_to_index)
>> {}
print(wv.index_to_key)
>> []
print(wv.vectors)
>> []

wv.key_to_index, wv.index_to_key and wv.vectors are all empty.

FastTextKeyedVectors is a child of KeyedVectors where the add_vector/s methods are implemented. add_vector does a few checks then calls add_vectors. In add_vectors, there is an in_vocab_mask, which is a list of booleans indicating if a key is already present in the KeyedVectors.

in_vocab_mask = np.zeros(len(keys), dtype=bool)
for idx, key in enumerate(keys):
    if key in self:
        in_vocab_mask[idx] = True

Since Gensim 4.0, key in wv will always return True with FastText by design. The proper way of checking if a key exists is by calling key in wv.key_to_index (See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#10-check-if-a-word-is-fully-oov-out-of-vocabulary-for-fasttext)

So replacing the above code by

in_vocab_mask = np.zeros(len(keys), dtype=bool)
for idx, key in enumerate(keys):
    if key in self.key_to_index:
        in_vocab_mask[idx] = True

seems to fix the issue.

wv = FastTextKeyedVectors(vector_size=2, min_n=3, max_n=6, bucket=2000000)
wv.add_vectors(["test"], [[0.5, 0.5]])
print(wv.key_to_index)
>> {'test': 0}
print(wv.index_to_key)
>> ['test']
print(wv.vectors)
>> [[0.5 0.5]]
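The effect of the two membership checks can be shown without gensim. The following is a minimal sketch under stated assumptions: ToyFastTextVectors is a hypothetical stand-in whose `in` operator always answers True, mirroring the behaviour described above, and the two mask helpers are invented for the illustration.

```python
class ToyFastTextVectors:
    """Hypothetical stand-in for FastText-style keyed vectors (not gensim code)."""

    def __init__(self):
        self.key_to_index = {}

    def __contains__(self, key):
        # A vector can generally be synthesized for any key from its character
        # n-grams, so membership reports True even for keys never stored.
        return True


def mask_via_contains(wv, keys):
    """Mirrors the `if key in self` check."""
    return [key in wv for key in keys]


def mask_via_key_to_index(wv, keys):
    """Mirrors the proposed `if key in self.key_to_index` check."""
    return [key in wv.key_to_index for key in keys]


wv = ToyFastTextVectors()
wv.key_to_index["known"] = 0
keys = ["known", "test"]

buggy_mask = mask_via_contains(wv, keys)
fixed_mask = mask_via_key_to_index(wv, keys)

# Only keys that the mask reports as absent would actually be appended.
keys_added_buggy = [k for k, seen in zip(keys, buggy_mask) if not seen]
keys_added_fixed = [k for k, seen in zip(keys, fixed_mask) if not seen]
```

With the first mask every key looks as if it were already present, so nothing is left to add; with the second mask the genuinely new key "test" is selected.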

I am not sure how FastText models are able to add vectors to FastTextKeyedVectors the proper way when training without encountering this issue as I have not looked at the training code in detail.

Versions

Linux-5.10.0-17-amd64-x86_64-with-glibc2.31
Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
Bits 64
NumPy 1.21.6
SciPy 1.7.3
gensim 4.2.0
FAST_VERSION 1
bug
opened by globba 2

Releases(4.3.0)

4.3.0(Dec 21, 2022)
What's Changed

Allow overriding the Cython version requirement by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3323

Update Python module MANIFEST by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3343

Clean up references to Morfessor, tox and gensim.models.wrappers by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3345

Disable the Gensim 3=>4 warning in docs by @piskvorky in https://github.com/RaRe-Technologies/gensim/pull/3346

pin sphinx versions, add explicit gallery_top label by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3383

Declare variables prior to for loop in fastss.pyx for ANSI C compatibility by @hstk30 in https://github.com/RaRe-Technologies/gensim/pull/3378

Fix typo in word2vec and KeyedVectors docstrings by @dymil in https://github.com/RaRe-Technologies/gensim/pull/3365

Replace np.multiply with np.square and copyedit in translation_matrix.py by @dymil in https://github.com/RaRe-Technologies/gensim/pull/3374

Copyedit and fix outdated statements in translation matrix tutorial by @dymil in https://github.com/RaRe-Technologies/gensim/pull/3375

Implement Okapi BM25 variants in Gensim by @Witiko in https://github.com/RaRe-Technologies/gensim/pull/3304

Giving missing credit in EnsembleLDA to Alex in docs by @sezanzeb in https://github.com/RaRe-Technologies/gensim/pull/3393

PERF: pyemd to POT for EMD computation in wmdistance by @TLouf in https://github.com/RaRe-Technologies/gensim/pull/3327

Fixed bug in loss computation for Word2Vec with hierarchical softmax by @TalIfargan in https://github.com/RaRe-Technologies/gensim/pull/3397

fix deprecation warning from pytest by @martino-vic in https://github.com/RaRe-Technologies/gensim/pull/3354

Switch to Cython language level 3 by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3344

Implement numpy hack in setup.py to enable install under Poetry by @jaymegordo in https://github.com/RaRe-Technologies/gensim/pull/3363

Fixed the broken link in readme.md by @aswin2108 in https://github.com/RaRe-Technologies/gensim/pull/3409

Path Coherence Model to correctly handle empty documents by @PrimozGodec in https://github.com/RaRe-Technologies/gensim/pull/3406

Add support for Python 3.11 and drop support for Python 3.7 by @acul3 in https://github.com/RaRe-Technologies/gensim/pull/3402

clarify runtime expectations by @gojomo in https://github.com/RaRe-Technologies/gensim/pull/3381

Fix bug that prevents loading old models by @funasshi in https://github.com/RaRe-Technologies/gensim/pull/3359

refactor wheel building and testing workflow by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3410

Fixed FastTextKeyedVectors handling in add_vector by @globba in https://github.com/RaRe-Technologies/gensim/pull/3389

Flsamodel by @ERijck in https://github.com/RaRe-Technologies/gensim/pull/3398

Fix backwards compatibility bug in Word2Vec by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3415

fix numpy hack in setup.py by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3416

updated changelog for next release by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3412

New Contributors

@hstk30 made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3378

@TLouf made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3327

@TalIfargan made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3397

@martino-vic made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3354

@jaymegordo made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3363

@aswin2108 made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3409

@acul3 made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3402

@funasshi made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3359

@globba made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3389

@ERijck made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3398

Full Changelog: https://github.com/RaRe-Technologies/gensim/compare/4.2.0...4.3.0
4.2.0(May 1, 2022)

A number of incremental improvements, optimizations and bugfixes: CHANGELOG
4.1.2(Sep 18, 2021)
4.1.2, 2021-09-17

This is a bugfix release that addresses leftover compatibility issues with older versions of NumPy and macOS.

4.1.1, 2021-09-14

This is a bugfix release that addresses compatibility issues with older versions of numpy.

4.1.0, 2021-08-15

Gensim 4.1 brings two major new functionalities:

Ensemble LDA for robust training, selection and comparison of LDA models.

FastSS module for super fast Levenshtein "fuzzy search" queries. Used e.g. for "soft term similarity" calculations.

There are several minor changes that are not backwards compatible with previous versions of Gensim. The affected functionality is rarely used and unlikely to affect most users, so we have opted not to require a major version bump. Nevertheless, we describe them below.

Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods

We now handle both positive and negative keyword parameters consistently. They may now be either:

A string, in which case the value is reinterpreted as a list of one element (the string value)

A vector, in which case the value is reinterpreted as a list of one element (the vector)

A list of strings

A list of vectors

So you can now simply do:

model.most_similar(positive='war', negative='peace')

instead of the slightly more involved

model.most_similar(positive=['war'], negative=['peace'])

Both invocations remain correct, so you can use whichever is most convenient. If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.

model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])

then you will need to specify the lists explicitly in gensim 4.1.
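The accepted argument forms listed above amount to normalizing a single value into a one-element list. A minimal sketch of that idea follows; the helper name ensure_list is hypothetical, vectors are assumed to be NumPy arrays, and this is not presented as gensim's own code.

```python
import numpy as np


def ensure_list(value):
    """Wrap a lone key or vector in a list; pass lists through unchanged.

    Hypothetical helper for illustration; vectors are assumed to be np.ndarray.
    """
    if value is None:
        return []
    if isinstance(value, (str, np.ndarray)):
        # A string is treated as one key, never as a sequence of characters.
        return [value]
    return list(value)


single_key = ensure_list("war")
key_list = ensure_list(["war"])
characters = ensure_list(["w", "a", "r"])

vector = np.array([0.1, 0.2, 0.3])
single_vector = ensure_list(vector)
```

Treating a string as a single key is what makes positive='war' equivalent to positive=['war'], and it is also why a list of characters has to be spelled out explicitly.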

Deprecated obsolete steps parameter from doc2vec

With the newer version, do this:

model.infer_vector(..., epochs=123)

instead of this:

model.infer_vector(..., steps=123)

Plus a large number of smaller improvements and fixes, as usual.

⚠️ If migrating from old Gensim 3.x, read the Migration guide first.

:+1: New features

#3169: Implement shrink_windows argument for Word2Vec, by @M-Demay

#3163: Optimize word mover distance (WMD) computation, by @flowlight0

#3157: New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by @Witiko

#3153: Vectorize word2vec.predict_output_word for speed, by @M-Demay

#3146: Use FastSS for fast kNN over Levenshtein distance, by @Witiko

#3128: Materialize and copy the corpus passed to SoftCosineSimilarity, by @Witiko

#3115: Make LSI dispatcher CLI param for number of jobs optional, by @robguinness

#3091: LsiModel: Only log top words that actually exist in the dictionary, by @kmurphy4

#2980: Added EnsembleLda for stable LDA topics, by @sezanzeb

#2978: Optimize performance of Author-Topic model, by @horpto

#3000: Tidy up KeyedVectors.most_similar() API, by @simonwiles

:books: Tutorials and docs

#3155: Correct parameter name in documentation of fasttext.py, by @bizzyvinci

#3148: Fix broken link to mycorpus.txt in documentation, by @rohit901

#3142: Use more permanent pdf link and update code link, by @dymil

#3141: Update link for online LDA paper, by @dymil

#3133: Update link to Hoffman paper (online VB LDA), by @jonaschn

#3129: [MRG] Add bronze sponsor: TechTarget, by @piskvorky

#3126: Fix typos in make_wiki_online.py and make_wikicorpus.py, by @nicolasassi

#3125: Improve & unify docs for dirichlet priors, by @jonaschn

#3123: Fix hyperlink for doc2vec tutorial, by @AdityaSoni19031997

#3121: [MRG] Add bronze sponsor: eaccidents.com, by @piskvorky

#3120: Fix URL for ldamodel.py, by @jonaschn

#3118: Fix URL in doc string, by @jonaschn

#3107: Draw attention to sponsoring in README, by @piskvorky

#3105: Fix documentation links: Travis to Github Actions, by @piskvorky

#3057: Clarify doc comment in LdaModel.inference(), by @yocen

#2964: Document that preprocessing.strip_punctuation is limited to ASCII, by @sciatro

:red_circle: Bug fixes

#3178: Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by @Witiko

#3174: Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by @emgucv

#3136: Fix indexing error in word2vec_inner.pyx, by @bluekura

#3131: Add missing import to NMF docs and models/__init__.py, by @properGrammar

#3116: Fix bug where saved Phrases model did not load its connector_words, by @aloknayak29

#2830: Fixed KeyError in coherence model, by @pietrotrope

⚠️ Removed functionality & deprecations

#3176: Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by @rock420

#2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro

#3180: Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by @rock420

🔮 Testing, CI, housekeeping

#3156: Update Numpy minimum version to 1.17.0, by @PrimozGodec

#3143: replace _mul function with explicit casts, by @mpenkov

#2952: Allow newer versions of the Morfessor module for the tests, by @pabs3

#2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro

4.1.1(Sep 14, 2021)
4.1.1, 2021-09-14

This is a bugfix release that addresses compatibility issues with older versions of numpy.

4.1.0(Aug 29, 2021)
4.1.0, 2021-08-15

Gensim 4.1 brings two major new functionalities:

Ensemble LDA for robust training, selection and comparison of LDA models.

FastSS module for super fast Levenshtein "fuzzy search" queries. Used e.g. for "soft term similarity" calculations.

There are several minor changes that are not backwards compatible with previous versions of Gensim. The affected functionality is rarely used and unlikely to affect most users, so we have opted not to require a major version bump. Nevertheless, we describe the changes below.

Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods

We now handle both the positive and negative keyword parameters consistently. Each may now be any of the following:

A string, in which case the value is reinterpreted as a list of one element (the string value)

A vector, in which case the value is reinterpreted as a list of one element (the vector)

A list of strings

A list of vectors

So you can now simply do:

model.most_similar(positive='war', negative='peace')

instead of the slightly more involved

model.most_similar(positive=['war'], negative=['peace'])

Both invocations remain correct, so you can use whichever is most convenient. If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.

model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])

then you will need to specify the lists explicitly in gensim 4.1.
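
The normalisation rule described above can be sketched in a few lines. This is only an illustration of the rule, not Gensim's actual code: `as_key_list` is a hypothetical helper, and `array.array` stands in for a NumPy vector so that the sketch needs nothing beyond the standard library.

```python
from array import array


def as_key_list(value):
    """Normalise a `positive`/`negative` argument to a list of keys or vectors.

    Hypothetical helper for illustration only; not part of the Gensim API.
    """
    # A lone string or a lone vector becomes a one-element list...
    if isinstance(value, (str, array)):
        return [value]
    # ...while anything already list-like is kept, element by element.
    return list(value)


print(as_key_list('war'))             # a single key, wrapped in a list
print(as_key_list(['war', 'peace']))  # a list of keys, kept as it is
print(as_key_list(list('war')))       # characters have to be listed explicitly
```

Under this rule a bare string is never split into characters, which is why the character-list case has to be spelled out by the caller.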

Deprecated obsolete steps parameter from doc2vec

With the newer version, do this:

model.infer_vector(..., epochs=123)

instead of this:

model.infer_vector(..., steps=123)
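
If older call sites still build their keyword arguments with `steps`, one possible way to migrate them gradually is to rename the keyword just before the call. The sketch below is an assumption, not something Gensim ships: `rename_steps_kwarg` is a hypothetical helper, and the `model` and `doc_words` names in the closing comment are placeholders for your own trained Doc2Vec model and tokenised document.

```python
def rename_steps_kwarg(kwargs):
    """Return a copy of `kwargs` with a legacy `steps` key renamed to `epochs`.

    Hypothetical migration helper; not part of Gensim.
    """
    kwargs = dict(kwargs)  # work on a copy, never mutate the caller's dict
    if 'steps' in kwargs:
        if 'epochs' in kwargs:
            raise TypeError("pass either 'steps' or 'epochs', not both")
        kwargs['epochs'] = kwargs.pop('steps')
    return kwargs


legacy_kwargs = {'steps': 123}
new_kwargs = rename_steps_kwarg(legacy_kwargs)
print(new_kwargs)
# With a trained Doc2Vec `model` and a tokenised `doc_words` (both assumed
# to exist), the call would then be:
#     model.infer_vector(doc_words, **new_kwargs)
```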

Plus a large number of smaller improvements and fixes, as usual.

⚠️ If migrating from old Gensim 3.x, read the Migration guide first.

👍 New features

#3169: Implement shrink_windows argument for Word2Vec, by @M-Demay

#3163: Optimize word mover distance (WMD) computation, by @flowlight0

#3157: New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by @Witiko

#3153: Vectorize word2vec.predict_output_word for speed, by @M-Demay

#3146: Use FastSS for fast kNN over Levenshtein distance, by @Witiko

#3128: Materialize and copy the corpus passed to SoftCosineSimilarity, by @Witiko

#3115: Make LSI dispatcher CLI param for number of jobs optional, by @robguinness

#3091: LsiModel: Only log top words that actually exist in the dictionary, by @kmurphy4

#2980: Added EnsembleLda for stable LDA topics, by @sezanzeb

#2978: Optimize performance of Author-Topic model, by @horpto

#3000: Tidy up KeyedVectors.most_similar() API, by @simonwiles

📚 Tutorials and docs

#3155: Correct parameter name in documentation of fasttext.py, by @bizzyvinci

#3148: Fix broken link to mycorpus.txt in documentation, by @rohit901

#3142: Use more permanent pdf link and update code link, by @dymil

#3141: Update link for online LDA paper, by @dymil

#3133: Update link to Hoffman paper (online VB LDA), by @jonaschn

#3129: [MRG] Add bronze sponsor: TechTarget, by @piskvorky

#3126: Fix typos in make_wiki_online.py and make_wikicorpus.py, by @nicolasassi

#3125: Improve & unify docs for dirichlet priors, by @jonaschn

#3123: Fix hyperlink for doc2vec tutorial, by @AdityaSoni19031997

#3121: [MRG] Add bronze sponsor: eaccidents.com, by @piskvorky

#3120: Fix URL for ldamodel.py, by @jonaschn

#3118: Fix URL in doc string, by @jonaschn

#3107: Draw attention to sponsoring in README, by @piskvorky

#3105: Fix documentation links: Travis to Github Actions, by @piskvorky

#3057: Clarify doc comment in LdaModel.inference(), by @yocen

#2964: Document that preprocessing.strip_punctuation is limited to ASCII, by @sciatro

🔴 Bug fixes

#3178: Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by @Witiko

#3174: Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by @emgucv

#3136: Fix indexing error in word2vec_inner.pyx, by @bluekura

#3131: Add missing import to NMF docs and models/__init__.py, by @properGrammar

#3116: Fix bug where saved Phrases model did not load its connector_words, by @aloknayak29

#2830: Fixed KeyError in coherence model, by @pietrotrope

⚠️ Removed functionality & deprecations

#3176: Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by @rock420

#2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro

#3180: Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by @rock420

🔮 Testing, CI, housekeeping

#3156: Update Numpy minimum version to 1.17.0, by @PrimozGodec

#3143: replace _mul function with explicit casts, by @mpenkov

#2952: Allow newer versions of the Morfessor module for the tests, by @pabs3

#2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro

4.0.1, 2021-04-01

Bugfix release to address issues with Wheels on Windows:

https://github.com/RaRe-Technologies/gensim/issues/3095

https://github.com/RaRe-Technologies/gensim/issues/3097

4.0.0, 2021-03-24

⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

Gensim 4.0 is a major release with lots of performance & robustness improvements, and a new website.

Main highlights

Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

a. Efficiency

| model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
|----------|------------|--------|
| fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s |
| word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s |

In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)
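
The ratios behind these claims can be recomputed from the benchmark table. The snippet below is a small worked example: the numbers are copied from the table, while the dictionary layout and the `ratios` function name are illustrative choices rather than anything from Gensim.

```python
# Each tuple: (wall time in hours, peak RAM in GB, throughput in thousand words/s)
BENCHMARKS = {
    'fastText': {'3.8.3': (2.9, 4.11, 822), '4.0.0': (2.3, 1.26, 914)},
    'word2vec': {'3.8.3': (1.7, 0.36, 1685), '4.0.0': (1.2, 0.33, 1762)},
}


def ratios(model):
    """Return the improvement factors of 4.0.0 over 3.8.3 for one model."""
    old_time, old_ram, old_tput = BENCHMARKS[model]['3.8.3']
    new_time, new_ram, new_tput = BENCHMARKS[model]['4.0.0']
    return {
        'wall_time': old_time / new_time,   # > 1 means training finished sooner
        'peak_ram': old_ram / new_ram,      # > 1 means less memory was needed
        'throughput': new_tput / old_tput,  # > 1 means more words per second
    }


for name in BENCHMARKS:
    summary = ', '.join(f'{key} x{value:.2f}' for key, value in ratios(name).items())
    print(f'{name}: {summary}')
```

For fastText the peak RAM ratio is 4.11 / 1.26, or roughly 3.3, which is where the "3x less RAM" figure comes from.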

b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.

These improvements come to you transparently aka "for free", but see Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

Dropped a bunch of externally contributed modules and wrappers: summarization, pivoted TFIDF, Mallet…

Code quality was not up to our standards. Also there was no one to maintain these modules, answer user questions, support them.

So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork & publish into your own repo. They can live happily outside of Gensim.

Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

If you still need Python 2 for some reason, stay at Gensim 3.8.3.

A new Gensim website – finally! 🙃

So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting concrete NLP & document similarity use-cases.

👍 New features

#2947: Bump minimum Python version to 3.6, by @gojomo

#2300: Use less RAM in LdaMulticore, by @horpto

#2698: Streamline KeyedVectors & X2Vec API, by @gojomo

#2864: Speed up random number generation in word2vec, by @zygm0nt

#2976: Speed up phrase (collocation) detection, by @piskvorky

#2979: Allow skipping common English words in multi-word phrases, by @piskvorky

#2867: Expose max_final_vocab parameter in fastText constructor, by @mpenkov

#2931: Clear up job queue parameters in word2vec, by @lunastera

#2939: X2Vec SaveLoad improvements, by @piskvorky

#3060: Record lifecycle events in Gensim models, by @piskvorky

#3073: Make WMD normalization optional, by @piskvorky

#3065: Default to pickle protocol 4 when saving models, by @piskvorky

#3069: Add Github sponsor + donation nags, by @piskvorky

📚 Tutorials and docs

#3082: Make LDA tutorial read NIPS data on the fly, by @jonaschn

#2954: New theme for the Gensim website, by @dvorakvaclav

#2960: Added Gensim and Compatibility Wiki page, by @piskvorky

#2960: Reworked & simplified the Developer Wiki page, by @piskvorky

#2968: Migrate tutorials & how-tos to 4.0.0, by @piskvorky

#2899: Clean up of language and formatting of docstrings, by @piskvorky

#2899: Added documentation for NMSLIB indexer, by @piskvorky

#2832: Clear up LdaModel documentation, by @FyzHsn

#2871: Clarify that license is LGPL-2.1, by @pombredanne

#2896: Make docs clearer on alpha parameter in LDA model, by @xh2

#2897: Update Hoffman paper link for Online LDA, by @xh2

#2910: Refresh docs for run_annoy tutorial, by @piskvorky

#2935: Fix "generator" language in word2vec docs, by @polm

#3077: Fix various documentation warnings, by @mpenkov

#2991: Fix broken link in run_doc How-To, by @sezanzeb

#3003: Point WordEmbeddingSimilarityIndex documentation to gensim.similarities, by @Witiko

#2996: Make the website link to the old Gensim 3.8.3 documentation dynamic, by @Witiko

#3063: Update link to papers in LSI model, by @jonaschn

#3080: Fix some of the warnings/deprecated functions, by @FredHappyface

🔴 Bug fixes

#2891: Fix fastText word-vectors with ngrams off, by @gojomo

#2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo

#2899: Fix similarity bug in NMSLIB indexer, by @piskvorky

#2899: Fix deprecation warnings in Annoy integration, by @piskvorky

#2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah

#2940: Fix deprecations in SoftCosineSimilarity, by @Witiko

#2944: Fix save_facebook_model failure after update-vocab & other initialization streamlining, by @gojomo

#2846: Fix for Python 3.9/3.10: remove xml.etree.cElementTree, by @hugovk

#2973: Fix phrases.export_phrases() not yielding all bigrams, by @piskvorky

#2942: Fix segfault when training doc2vec, by @gojomo

#3041: Fix RuntimeError in export_phrases (change defaultdict to dict), by @thalishsajeed

#3059: Fix race condition in FastText tests, by @sleepy-owl

⚠️ Removed functionality & deprecations

Removed all code, methods, attributes and functions marked as deprecated in Gensim 3.8.3.

#6: No more binary wheels for x32 platforms, by @menshikh-iv

#2899: Renamed overly broad similarities.index to the more appropriate similarities.annoy, by @piskvorky

#2958: Remove gensim.summarization subpackage, docs and test data, by @mpenkov

#2926: Rename num_words to topn in dtm_coherence, by @MeganStodel

#2937: Remove Keras dependency, by @piskvorky

#3078: Remove on_batch_begin and on_batch_end callbacks, by @mpenkov

#3012: Remove pattern dependency, by @mpenkov

#3055: Remove gensim.viz subpackage, by @mpenkov

🔮 Testing, CI, housekeeping

#2939 + #2984: Code style & py3 migration clean up, by @piskvorky

#3058: Add py39 wheels to Travis/Azure, by @FredHappyface

#3035: Update repos before trying to install gdb, by @janaknat

#3026: Move x86 tests from Travis to GHA, add aarch64 wheel build to Travis, by @janaknat

#3033: Transformed camelCase to snake_case test names, by @sezanzeb

#3024: Add Github Actions x86 and mac jobs to build python wheels, by @janaknat

4.0.0.rc1, 2021-03-19

⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

Gensim 4.0 is a major release with lots of performance & robustness improvements and a new website.

Main highlights (see also 👍 Improvements below)

Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

a. Efficiency

| model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
|----------|------------|--------|
| fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s |
| word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s |

In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)

b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.

These improvements come to you transparently aka "for free", but see Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, FIXME.

Code quality was not up to our standards. Also there was no one to maintain them, answer user questions, support these modules.

So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork them into your own repo; they can live happily outside of Gensim.

Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

If you still need Python 2 for some reason, stay at Gensim 3.8.3.

A new Gensim website – finally! 🙃

So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting common concrete NLP & document similarity use-cases.

🌟 New Features

Default to pickle protocol 4 when saving models (piskvorky, #3065)

Record lifecycle events in Gensim models (piskvorky, #3060)

Make WMD normalization optional (piskvorky, #3073)

🔴 Bug fixes

Fix RuntimeError in export_phrases (change defaultdict to dict) (thalishsajeed, #3041)

📚 Tutorial and doc improvements

Fix various documentation warnings (mpenkov, #3077)

Fix broken link in run_doc how-to (sezanzeb, #2991)

Point WordEmbeddingSimilarityIndex documentation to gensim.similarities (Witiko, #3003)

Make the link to the Gensim 3.8.3 documentation dynamic (Witiko, #2996)

⚠️ Removed functionality

Remove on_batch_begin and on_batch_end callbacks (mpenkov, #3078)

Remove pattern dependency (mpenkov, #3012)

Remove gensim.viz submodule (mpenkov, #3055)

🔮 Miscellaneous

[MRG] Add Github sponsor + donation nags (piskvorky, #3069)

Update URLs (jonaschn, #3063)

Fix race condition in FastText tests (sleepy-owl, #3059)

Add py39 wheels to travis/azure (FredHappyface, #3058)

Update repos before trying to install gdb (janaknat, #3035)

Transformed camelCase to snake_case test names (sezanzeb, #3033)

Move x86 tests from Travis to GHA, add aarch64 wheel build to Travis (janaknat, #3026)

Add Github Actions x86 and mac jobs to build python wheels (janaknat, #3024)

4.0.0beta, 2020-10-31

⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

Gensim 4.0 is a major release with lots of performance & robustness improvements and a new website.

Main highlights (see also 👍 Improvements below)

Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

a. Efficiency

| model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
|----------|------------|--------|
| fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s |
| word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s |

In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)

b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.

These improvements come to you transparently aka "for free", but see Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, FIXME.

Code quality was not up to our standards. Also there was no one to maintain them, answer user questions, support these modules.

So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork them into your own repo; they can live happily outside of Gensim.

Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

If you still need Python 2 for some reason, stay at Gensim 3.8.3.

A new Gensim website – finally! 🙃

So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting common concrete NLP & document similarity use-cases.

Why pre-release?

This 4.0.0beta pre-release is for users who want the cutting-edge performance and bug fixes, and for users who want to help out by testing and providing feedback on code, documentation, workflows… Please let us know on the mailing list!

Install the pre-release with:

pip install --pre --upgrade gensim

What will change between this pre-release and a "full" 4.0 release?

Production stability is important to Gensim, so we're improving the process of upgrading already-trained saved models. There'll be an explicit model upgrade script between each 4.n to 4.(n+1) Gensim release. Check progress here.

👍 Improvements

#2947: Bump minimum Python version to 3.6, by @gojomo

#2939 + #2984: Code style & py3 migration clean up, by @piskvorky

#2300: Use less RAM in LdaMulticore, by @horpto

#2698: Streamline KeyedVectors & X2Vec API, by @gojomo

#2864: Speed up random number generation in word2vec, by @zygm0nt

#2976: Speed up phrase (collocation) detection, by @piskvorky

#2979: Allow skipping common English words in multi-word phrases, by @piskvorky

#2867: Expose max_final_vocab parameter in fastText constructor, by @mpenkov

#2931: Clear up job queue parameters in word2vec, by @lunastera

#2939: X2Vec SaveLoad improvements, by @piskvorky

📚 Tutorials and docs

#2954: New theme for the Gensim website, by @dvorakvaclav

#2960: Added Gensim and Compatibility Wiki page, by @piskvorky

#2960: Reworked & simplified the Developer Wiki page, by @piskvorky

#2968: Migrate tutorials & how-tos to 4.0.0, by @piskvorky

#2899: Clean up of language and formatting of docstrings, by @piskvorky

#2899: Added documentation for NMSLIB indexer, by @piskvorky

#2832: Clear up LdaModel documentation, by @FyzHsn

#2871: Clarify that license is LGPL-2.1, by @pombredanne

#2896: Make docs clearer on alpha parameter in LDA model, by @xh2

#2897: Update Hoffman paper link for Online LDA, by @xh2

#2910: Refresh docs for run_annoy tutorial, by @piskvorky

#2935: Fix "generator" language in word2vec docs, by @polm

🔴 Bug fixes

#2891: Fix fastText word-vectors with ngrams off, by @gojomo

#2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo

#2899: Fix similarity bug in NMSLIB indexer, by @piskvorky

#2899: Fix deprecation warnings in Annoy integration, by @piskvorky

#2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah

#2940: Fix deprecations in SoftCosineSimilarity, by @Witiko

#2944: Fix save_facebook_model failure after update-vocab & other initialization streamlining, by @gojomo

#2846: Fix for Python 3.9/3.10: remove xml.etree.cElementTree, by @hugovk

#2973: Fix phrases.export_phrases() not yielding all bigrams, by @piskvorky

#2942: Fix segfault when training doc2vec, by @gojomo

⚠️ Removed functionality & deprecations

#6: No more binary wheels for x32 platforms, by @menshikh-iv

#2899: Renamed overly broad similarities.index to the more appropriate similarities.annoy, by @piskvorky

#2958: Remove gensim.summarization subpackage, docs and test data, by @mpenkov

#2926: Rename num_words to topn in dtm_coherence, by @MeganStodel

#2937: Remove Keras dependency, by @piskvorky

Removed all code, methods, attributes and functions marked as deprecated in Gensim 3.8.3.

Removed pattern dependency (PR #3012, @mpenkov). If you need to lemmatize, do it prior to passing the corpus to gensim.
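
Since lemmatization now has to happen before the corpus reaches Gensim, one approach is to lemmatize each tokenised document in a separate preprocessing step. The sketch below assumes documents are lists of string tokens; `TOY_LEMMAS` is a placeholder lookup table standing in for a real lemmatizer of your choice, and `lemmatize_corpus` is a hypothetical helper, not a Gensim function.

```python
# Placeholder lemma table; a real project would call a proper lemmatizer instead.
TOY_LEMMAS = {'cats': 'cat', 'running': 'run', 'mice': 'mouse'}


def toy_lemmatize(token):
    """Look the token up in the toy table, falling back to the token itself."""
    return TOY_LEMMAS.get(token, token)


def lemmatize_corpus(documents, lemmatize=toy_lemmatize):
    """Yield each tokenised document with every token lemmatized."""
    for tokens in documents:
        yield [lemmatize(token) for token in tokens]


raw_docs = [['cats', 'running'], ['mice', 'sleep']]
lemmatized_docs = list(lemmatize_corpus(raw_docs))
print(lemmatized_docs)
# `lemmatized_docs` is what you would then hand over to Gensim in place of
# the raw token lists.
```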

