BERT, LDA, and TFIDF based keyword extraction in Python

Last update: Dec 27, 2022

Overview

kwx is a toolkit for multilingual keyword extraction based on Google's BERT and Latent Dirichlet Allocation. The package provides a suite of methods to process texts of any language to varying degrees and then extract and analyze keywords from the created corpus (see kwx.languages for the various degrees of language support). A particular focus is letting users decide which words to exclude from outputs, which helps keep results sensible and in line with user intuitions.

For a thorough overview of the process and techniques see the Google slides, and reference the documentation for explanations of the models and visualization methods.

Installation
Models
Usage
Visuals
To-Do

Installation `⇧`

kwx can be downloaded from PyPI via pip or sourced directly from this repository:

pip install kwx

git clone https://github.com/andrewtavis/kwx.git
cd kwx
python setup.py install

import kwx

Models `⇧`

Implemented NLP modeling methods within kwx.model include:

BERT

Bidirectional Encoder Representations from Transformers derives representations of words based on NLP models run over open-source Wikipedia data. These representations are then leveraged to derive corpus topics.

kwx uses sentence-transformers pretrained models. See their GitHub and documentation for the available models.

LDA

Latent Dirichlet Allocation is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of kwx, documents or text entries are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics.
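The mixture-of-topics idea can be illustrated with a minimal sketch of the generative view: each word in a document is produced by first picking a topic according to the document's topic mixture and then picking a word from that topic. This is not how kwx or gensim fit an LDA model (fitting runs the process in reverse, inferring topics from observed words); the `TOPICS` table, its probabilities, and the `generate_document` helper are all hypothetical and exist only for illustration.

```python
import random

# Hypothetical topics: each maps words to probabilities (illustrative values only).
TOPICS = {
    "delays": {"delay": 0.5, "late": 0.3, "hour": 0.2},
    "service": {"service": 0.5, "customer": 0.3, "love": 0.2},
}


def generate_document(topic_mixture, num_words, seed=0):
    """Sample words by first picking a topic, then a word from that topic."""
    rng = random.Random(seed)
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(num_words):
        # Step 1: choose a topic according to the document's topic mixture.
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        # Step 2: choose a word according to that topic's word distribution.
        vocab = list(TOPICS[topic])
        word_weights = [TOPICS[topic][word] for word in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


# A document that is mostly about delays, with some service vocabulary mixed in.
doc = generate_document({"delays": 0.8, "service": 0.2}, num_words=10)
print(doc)
```

Setting one topic's weight to 1.0 yields a document drawn from that topic's vocabulary alone, which is the degenerate single-topic case.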

Although not as computationally robust as some machine learning models, LDA provides quick results that are suitable for many applications. Specifically for keyword extraction, in most settings the results are similar to those of BERT in a fraction of the time.

Other Methods

The user can also choose to simply query the most common words from a text corpus or compute TFIDF (Term Frequency Inverse Document Frequency) keywords - those that are distinctive of a text body relative to one or more other text bodies it is compared against. The former method is used in kwx as a baseline to check model efficacy, and the latter is a useful baseline when a user has another text or text body to compare the target corpus against.
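Both baselines can be sketched in a few lines of plain Python. This is an illustrative approximation rather than the kwx implementation (which passes its options to sklearn's TfidfVectorizer): the helper names `most_frequent` and `tfidf_keywords` are hypothetical, tokenization is a simple whitespace split, and each corpus is assumed to be treated as a single document.

```python
import math
from collections import Counter


def most_frequent(texts, num_keywords):
    """Frequency baseline: the most common whitespace-separated words."""
    counts = Counter(word for text in texts for word in text.split())
    return [word for word, _ in counts.most_common(num_keywords)]


def tfidf_keywords(target_texts, comparison_corpuses, num_keywords):
    """Rank target-corpus words by TF-IDF, treating each corpus as one document."""
    documents = [" ".join(target_texts)] + [" ".join(c) for c in comparison_corpuses]
    tokenized = [doc.split() for doc in documents]
    # Document frequency: in how many corpuses does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    target_counts = Counter(tokenized[0])
    total = sum(target_counts.values())
    scores = {
        word: (count / total) * math.log(len(documents) / doc_freq[word])
        for word, count in target_counts.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:num_keywords]


airline = ["flight delay flight", "late flight baggage"]
other = ["great movie late show", "movie ticket"]

print(most_frequent(airline, 1))
print(tfidf_keywords(airline, [other], 2))
```

In this toy data the word "late" appears in both corpuses, so its inverse document frequency is zero and it ranks last for the airline corpus even though the frequency baseline counts it like any other word.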

Usage `⇧`

Keyword extraction can be useful to analyze surveys, tweets and other kinds of social media posts, research papers, and further classes of texts. examples/kw_extraction provides an example of how to use kwx by deriving keywords from tweets in the Kaggle Twitter US Airline Sentiment dataset.

The following outlines using kwx to derive keywords from a text corpus with prompt_remove_words as True (the user will be asked whether any of the extracted words should be removed, and the removed words are replaced on the next run):

Text Cleaning

from kwx.utils import prepare_data

input_language = "english" # see kwx.languages for options

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data="df_or_csv_xlsx_path",
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

Keyword Extraction

from kwx.model import extract_kws

num_keywords = 15
num_topics = 10
ignore_words = ["words", "user", "knows", "they", "don't", "want"]

# Remove n-grams for BERT training
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]

# We can pass keywords for sentence_transformers.SentenceTransformer.encode,
# gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer
bert_kws = extract_kws(
    method="BERT", # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=32,
)

The BERT keywords are:

['time', 'flight', 'plane', 'southwestair', 'ticket', 'cancel', 'united', 'baggage',
'love', 'virginamerica', 'service', 'customer', 'delay', 'late', 'hour']

Should words be removed [y/n]? y
Type or copy word(s) to be removed: southwestair, united, virginamerica

The new BERT keywords are:

['late', 'baggage', 'service', 'flight', 'time', 'love', 'book', 'customer',
'response', 'hold', 'hour', 'cancel', 'cancelled_flighted', 'delay', 'plane']

Should words be removed [y/n]? n

The model will be rerun until all words known to be unreasonable are removed for a suitable output. kwx.model.gen_files could also be used as a run-all function that produces a directory with a keyword text file and visuals (for experienced users wanting quick results).
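The rerun-until-accepted loop described above can be sketched as follows. This is a simplified stand-in, not kwx's code: `frequency_keywords` is a hypothetical extractor used in place of a real model, and the `review` callable is a hypothetical replacement for the interactive `[y/n]` prompt so that the example runs without user input.

```python
from collections import Counter


def frequency_keywords(texts, num_keywords, ignore_words):
    """Stand-in extractor: most common words that are not in ignore_words."""
    counts = Counter(
        word for text in texts for word in text.split() if word not in ignore_words
    )
    return [word for word, _ in counts.most_common(num_keywords)]


def extract_with_review(texts, num_keywords, ignore_words, review):
    """Rerun extraction until review() flags no new words for removal."""
    ignored = set(ignore_words)
    while True:
        keywords = frequency_keywords(texts, num_keywords, ignored)
        # Only words not already ignored count as new removals.
        to_remove = set(review(keywords)) - ignored
        if not to_remove:
            return keywords
        ignored |= to_remove


tweets = ["united flight late", "united flight delay", "flight cancel united"]


def drop_airline_names(keywords):
    """Hypothetical reviewer that rejects airline names."""
    return [word for word in keywords if word == "united"]


final_kws = extract_with_review(tweets, 2, [], drop_airline_names)
print(final_kws)
```

Because removed words are added to the ignore set before the next run, each rerun fills the freed slots with the next-best candidates, which is why the second keyword list in the session above contains words that were absent from the first.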

Visuals `⇧`

kwx.visuals includes the following functions for presenting and analyzing the results of keyword extraction:

Topic Number Evaluation

A graph of topic coherence and overlap given a variable number of topics to derive keywords from.

from kwx.visuals import graph_topic_num_evals
import matplotlib.pyplot as plt

graph_topic_num_evals(
    method=["lda", "bert"],
    text_corpus=text_corpus,
    num_keywords=num_keywords,
    topic_nums_to_compare=list(range(5, 15)),
    metrics=True, #  stability and coherence
)
plt.show()
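To give a feel for what an overlap measure between topics can look like, the following sketch computes the mean pairwise Jaccard similarity between the keyword lists of each topic. This is an assumption made for illustration: the exact metrics computed by graph_topic_num_evals may be defined differently, and the `jaccard` and `mean_topic_overlap` helpers are hypothetical names that are not part of kwx.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two keyword collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_topic_overlap(topic_keywords):
    """Average Jaccard similarity over all pairs of topics."""
    pairs = list(combinations(topic_keywords, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical keyword lists for three derived topics.
topics = [
    ["flight", "delay", "late"],
    ["flight", "service", "customer"],
    ["baggage", "ticket", "hour"],
]

print(round(mean_topic_overlap(topics), 4))
```

Under this definition a lower value means the topics share fewer keywords, so comparing the value across different numbers of topics gives one rough signal of when additional topics start to repeat each other.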

t-SNE

t-SNE allows the user to visualize their topic distribution in both two and three dimensions. Currently available just for LDA, this technique provides another check for model suitability.

from kwx.visuals import t_sne
import matplotlib.pyplot as plt

t_sne(
    dimension="both",  # 2d and 3d are options
    text_corpus=text_corpus,
    num_topics=10,
    remove_3d_outliers=True,
)
plt.show()

pyLDAvis

pyLDAvis is included so that users can inspect LDA extracted topics, and further so that it can easily be generated for output files.

from kwx.visuals import pyLDAvis_topics

pyLDAvis_topics(
    method="lda",
    text_corpus=text_corpus,
    num_topics=10,
    display_ipython=False,  # For Jupyter integration
)

Word Cloud

Word clouds via wordcloud are included for a basic representation of the text corpus, specifically as a way to convey basic visual information to potential stakeholders. The following figure from examples/kw_extraction shows a word cloud generated from tweets of US air carrier passengers:

from kwx.visuals import gen_word_cloud

ignore_words = ["words", "user", "knows", "they", "don't", "want"]

gen_word_cloud(
    text_corpus=text_corpus,
    ignore_words=ignore_words,
    height=500,
)

To-Do `⇧`

Please see the contribution guidelines if you are interested in contributing to this project. Work that is in progress or could be implemented includes:

Including more methods to extract keywords (see issue)
Adding key phrase extraction as an option for kwx.model.extract_kws (see issues)
Adding t-SNE and pyLDAvis style visualizations for BERT models
Converting the translation feature over to use another translation api rather than py-googletrans (see issue)
Updates to kwx.languages as lemmatization and other linguistic package dependencies evolve
Creating, improving and sharing examples
Improving tests for greater code coverage
Updating and refining the documentation

Comments

Text by text keyword extraction in dataset

First of all, thank you for the model. I want to do something like this: for example, my dataset contains 20 texts, and I want to extract the keywords of each text individually. How can I do that?
bug question

opened by AhmetCakar 6
Error "__init__() got an unexpected keyword argument 'common_terms'" occurred when running example kw_extraction.ipynb

Hi, I am trying to run the notebook "kw_extraction.ipynb" given as an example in Google Colab. At the data preparation step, I got the error "__init__() got an unexpected keyword argument 'common_terms'".

May I know how to solve this? It seems like it is using a parameter that does not exist in gensim_models.phrases anymore, so should I downgrade gensim to an earlier version...?
bug

opened by Y-H-Lai 5
ModuleNotFoundError: No module named 'pyLDAvis.gensim'

Hi Andrew, I found this ModuleNotFoundError while running the line

from kwx.model import extract_kws

Error description:

     25 import pandas as pd
     26 import pyLDAvis
---> 27 import pyLDAvis.gensim
     28 import seaborn as sns
     29 from gensim import corpora

ModuleNotFoundError: No module named 'pyLDAvis.gensim'

But, it can be solved by installing : pip install pyLDAvis==3.2.2
bug

opened by AbhiPawar5 5
[WinError 3] The system cannot find the path specified: 'C:\\mysystem/.cache\\torch\\sentence_transformers\\sbert.net_models_xlm-r-bert-base-nli-stsb-mean-tokens'

[WinError 3] The system cannot find the path specified: 'C:\\mysystem/.cache\\torch\\sentence_transformers\\sbert.net_models_xlm-r-bert-base-nli-stsb-mean-tokens'

I get this error in different percentages while trying to make keyword extraction with BERT. For example, 96 percent gave this error first, then 100 percent gave this error. The last 26 percent gave this error. Can you help me?

opened by AhmetCakar 4
Bump certifi from 2021.10.8 to 2022.12.7
Bumps certifi from 2021.10.8 to 2022.12.7.

Commits

9e9e840 2022.12.07

b81bdb2 2022.09.24

939a28f 2022.09.14

aca828a 2022.06.15.2

de0eae1 Only use importlib.resources's new files() / Traversable API on Python ≥3.11 ...

b8eb5e9 2022.06.15.1

47fb7ab Fix deprecation warning on Python 3.11 (#199)

b0b48e0 fixes #198 -- update link in license

9d514b4 2022.06.15

4151e88 Add py.typed to MANIFEST.in to package in sdist (#196)

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot use these labels will set the current labels as the default for future PRs for this repo and language

@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language

@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language

@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies
opened by dependabot[bot] 3
Keyword extraction for BERT does not work for less samples

Hi Andrew, I tried the keyword extraction API for just 5 samples in a dataframe.

bert_kws = extract_kws(
    method="BERT",  # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=3,
)

Which returns, ValueError: n_samples=5 should be >= n_clusters=10 for batch_size. I wonder why that's happening? Thanks!
bug question

opened by AbhiPawar5 3
Bump ipython from 7.10.0 to 7.16.3
Bumps ipython from 7.10.0 to 7.16.3.

Commits

d43c7c7 release 7.16.3

5fa1e40 Merge pull request from GHSA-pq7m-3gw7-gq5x

8df8971 back to dev

9f477b7 release 7.16.2

138f266 bring back release helper from master branch

5aa3634 Merge pull request #13341 from meeseeksmachine/auto-backport-of-pr-13335-on-7...

bcae8e0 Backport PR #13335: What's new 7.16.2

8fcdcd3 Pin Jedi to <0.17.2.

2486838 release 7.16.1

20bdc6f fix conda build

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 2
Bump tensorflow from 2.4.1 to 2.5.0
Bumps tensorflow from 2.4.1 to 2.5.0.

Release notes

Sourced from tensorflow's releases.

TensorFlow 2.5.0

Release 2.5.0

Major Features and Improvements

Support for Python3.9 has been added.

tf.data:

tf.data service now supports strict round-robin reads, which is useful for synchronous training workloads where example sizes vary. With strict round robin reads, users can guarantee that consumers get similar-sized examples in the same step.

tf.data service now supports optional compression. Previously data would always be compressed, but now you can disable compression by passing compression=None to tf.data.experimental.service.distribute(...).

tf.data.Dataset.batch() now supports num_parallel_calls and deterministic arguments. num_parallel_calls is used to indicate that multiple input batches should be computed in parallel. With num_parallel_calls set, deterministic is used to indicate that outputs can be obtained in the non-deterministic order.

Options returned by tf.data.Dataset.options() are no longer mutable.

tf.data input pipelines can now be executed in debug mode, which disables any asynchrony, parallelism, or non-determinism and forces Python execution (as opposed to trace-compiled graph execution) of user-defined functions passed into transformations such as map. The debug mode can be enabled through tf.data.experimental.enable_debug_mode().

tf.lite

Enabled the new MLIR-based quantization backend by default

The new backend is used for 8 bits full integer post-training quantization

The new backend removes the redundant rescales and fixes some bugs (shared weight/bias, extremely small scales, etc)

Set experimental_new_quantizer in tf.lite.TFLiteConverter to False to disable this change

tf.keras

tf.keras.metrics.AUC now support logit predictions.

Enabled a new supported input type in Model.fit, tf.keras.utils.experimental.DatasetCreator, which takes a callable, dataset_fn. DatasetCreator is intended to work across all tf.distribute strategies, and is the only input type supported for Parameter Server strategy.

tf.distribute

tf.distribute.experimental.ParameterServerStrategy now supports training with Keras Model.fit when used with DatasetCreator.

Creating tf.random.Generator under tf.distribute.Strategy scopes is now allowed (except for tf.distribute.experimental.CentralStorageStrategy and tf.distribute.experimental.ParameterServerStrategy). Different replicas will get different random-number streams.

TPU embedding support

Added profile_data_directory to EmbeddingConfigSpec in _tpu_estimator_embedding.py. This allows embedding lookup statistics gathered at runtime to be used in embedding layer partitioning decisions.

PluggableDevice

Third-party devices can now connect to TensorFlow as plug-ins through the StreamExecutor C API and the PluggableDevice interface.

Add custom ops and kernels through kernel and op registration C API.

Register custom graph optimization passes with graph optimization C API.

oneAPI Deep Neural Network Library (oneDNN) CPU performance optimizations from Intel-optimized TensorFlow are now available in the official x86-64 Linux and Windows builds.

They are off by default. Enable them by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1.

We do not recommend using them in GPU systems, as they have not been sufficiently tested with GPUs yet.

TensorFlow pip packages are now built with CUDA11.2 and cuDNN 8.1.0

Breaking Changes

The TF_CPP_MIN_VLOG_LEVEL environment variable has been renamed to TF_CPP_MAX_VLOG_LEVEL which correctly describes its effect.

Bug Fixes and Other Changes

tf.keras:

Preprocessing layers API consistency changes:

StringLookup added output_mode, sparse, and pad_to_max_tokens arguments with same semantics as TextVectorization.

IntegerLookup added output_mode, sparse, and pad_to_max_tokens arguments with same semantics as TextVectorization. Renamed max_values, oov_value and mask_value to max_tokens, oov_token and mask_token to align with StringLookup and TextVectorization.

TextVectorization default for pad_to_max_tokens switched to False.

CategoryEncoding no longer supports adapt, IntegerLookup now supports equivalent functionality. max_tokens argument renamed to num_tokens.

Discretization added num_bins argument for learning bins boundaries through calling adapt on a dataset. Renamed bins argument to bin_boundaries for specifying bins without adapt.

Improvements to model saving/loading:

model.load_weights now accepts paths to saved models.

... (truncated)

Commits

a4dfb8d Merge pull request #49124 from tensorflow/mm-cherrypick-tf-data-segfault-fix-...

2107b1d Merge pull request #49116 from tensorflow-jenkins/version-numbers-2.5.0-17609

16b8139 Update snapshot_dataset_op.cc

86a0d86 Merge pull request #49126 from geetachavan1/cherrypicks_X9ZNY

9436ae6 Merge pull request #49128 from geetachavan1/cherrypicks_D73J5

6b2bf99 Validate that a and b are proper sparse tensors

c03ad1a Ensure validation sticks in banded_triangular_solve_op

12a6ead Merge pull request #49120 from geetachavan1/cherrypicks_KJ5M9

b67f5b8 Merge pull request #49118 from geetachavan1/cherrypicks_BIDTR

a13c0ad [tf.data][cherrypick] Fix snapshot segfault when using repeat and prefecth

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 2
Bump tensorflow from 2.5.0 to 2.5.1
Bumps tensorflow from 2.5.0 to 2.5.1.

Release notes

Sourced from tensorflow's releases.

TensorFlow 2.5.1

Release 2.5.1

This release introduces several vulnerability fixes:

Fixes a heap out of bounds access in sparse reduction operations (CVE-2021-37635)

Fixes a floating point exception in SparseDenseCwiseDiv (CVE-2021-37636)

Fixes a null pointer dereference in CompressElement (CVE-2021-37637)

Fixes a null pointer dereference in RaggedTensorToTensor (CVE-2021-37638)

Fixes a null pointer dereference and a heap OOB read arising from operations restoring tensors (CVE-2021-37639)

Fixes an integer division by 0 in sparse reshaping (CVE-2021-37640)

Fixes a division by 0 in ResourceScatterDiv (CVE-2021-37642)

Fixes a heap OOB in RaggedGather (CVE-2021-37641)

Fixes a std::abort raised from TensorListReserve (CVE-2021-37644)

Fixes a null pointer dereference in MatrixDiagPartOp (CVE-2021-37643)

Fixes an integer overflow due to conversion to unsigned (CVE-2021-37645)

Fixes a bad allocation error in StringNGrams caused by integer conversion (CVE-2021-37646)

Fixes a null pointer dereference in SparseTensorSliceDataset (CVE-2021-37647)

Fixes an incorrect validation of SaveV2 inputs (CVE-2021-37648)

Fixes a null pointer dereference in UncompressElement (CVE-2021-37649)

Fixes a segfault and a heap buffer overflow in {Experimental,}DatasetToTFRecord (CVE-2021-37650)

Fixes a heap buffer overflow in FractionalAvgPoolGrad (CVE-2021-37651)

Fixes a use after free in boosted trees creation (CVE-2021-37652)

Fixes a division by 0 in ResourceGather (CVE-2021-37653)

Fixes a heap OOB and a CHECK fail in ResourceGather (CVE-2021-37654)

Fixes a heap OOB in ResourceScatterUpdate (CVE-2021-37655)

Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToSparse (CVE-2021-37656)

Fixes an undefined behavior arising from reference binding to nullptr in MatrixDiagV* ops (CVE-2021-37657)

Fixes an undefined behavior arising from reference binding to nullptr in MatrixSetDiagV* ops (CVE-2021-37658)

Fixes an undefined behavior arising from reference binding to nullptr and heap OOB in binary cwise ops (CVE-2021-37659)

Fixes a division by 0 in inplace operations (CVE-2021-37660)

Fixes a crash caused by integer conversion to unsigned (CVE-2021-37661)

Fixes an undefined behavior arising from reference binding to nullptr in boosted trees (CVE-2021-37662)

Fixes a heap OOB in boosted trees (CVE-2021-37664)

Fixes vulnerabilities arising from incomplete validation in QuantizeV2 (CVE-2021-37663)

Fixes vulnerabilities arising from incomplete validation in MKL requantization (CVE-2021-37665)

Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToVariant (CVE-2021-37666)

Fixes an undefined behavior arising from reference binding to nullptr in unicode encoding (CVE-2021-37667)

Fixes an FPE in tf.raw_ops.UnravelIndex (CVE-2021-37668)

Fixes a crash in NMS ops caused by integer conversion to unsigned (CVE-2021-37669)

Fixes a heap OOB in UpperBound and LowerBound (CVE-2021-37670)

Fixes an undefined behavior arising from reference binding to nullptr in map operations (CVE-2021-37671)

Fixes a heap OOB in SdcaOptimizerV2 (CVE-2021-37672)

Fixes a CHECK-fail in MapStage (CVE-2021-37673)

Fixes a vulnerability arising from incomplete validation in MaxPoolGrad (CVE-2021-37674)

Fixes an undefined behavior arising from reference binding to nullptr in shape inference (CVE-2021-37676)

Fixes a division by 0 in most convolution operators (CVE-2021-37675)

Fixes vulnerabilities arising from missing validation in shape inference for Dequantize (CVE-2021-37677)

Fixes an arbitrary code execution due to YAML deserialization (CVE-2021-37678)

Fixes a heap OOB in nested tf.map_fn with RaggedTensors (CVE-2021-37679)

... (truncated)

Commits

8222c1c Merge pull request #51381 from tensorflow/mm-fix-r2.5-build

d584260 Disable broken/flaky test

f6c6ce3 Merge pull request #51367 from tensorflow-jenkins/version-numbers-2.5.1-17468

3ca7812 Update version numbers to 2.5.1

4fdf683 Merge pull request #51361 from tensorflow/mm-update-relnotes-on-r2.5

05fc01a Put CVE numbers for fixes in parentheses

bee1dc4 Update release notes for the new patch release

47beb4c Merge pull request #50597 from kruglov-dmitry/v2.5.0-sync-abseil-cmake-bazel

6f39597 Merge pull request #49383 from ashahab/abin-load-segfault-r2.5

0539b34 Merge pull request #48979 from liufengdb/r2.5-cherrypick

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 1
Update gensim LDA to 4.X

This issue is for discussing and eventually implementing an update for the gensim implementations of LDA in kwx. The package was originally written with 3.X versions of gensim, and 4.X versions apparently bring dramatic improvements in modeling options, efficiency, and n-gram creation (for kwx.utils.clean). Changes would need to be made in kwx.utils, kwx.model, and kwx.topic_model.

Documenting what would need to happen for the switch and then working towards implementing it would be very much appreciated :)

Thanks for your interest in contributing!
enhancement good first issue question

opened by andrewtavis 1
[ImgBot] Optimize images

Beep boop. Your images are optimized!

Your image file size has been reduced by 16% 🎉

Details

| File | Before | After | Percent reduction |
|:--|:--|:--|:--|
| /resources/gh_images/topic_num_eval.png | 419.53kb | 353.34kb | 15.78% |

opened by imgbot[bot] 1
Edit spaCy loading based on version

Later versions of spaCy have new loading mechanisms that produce errors in data preparation within kwx.utils. The scripts should be changed to check the spaCy version so that these changes are accounted for and the errors are avoided.
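A version check of this kind could be sketched as follows. One known difference is that spaCy v3 removed shortcut links, so a model must be loaded by its full package name; whether this is the exact cause of the errors in kwx.utils is an assumption. The helper names and the language mapping below are hypothetical and not part of kwx or spaCy.

```python
# Hypothetical mapping from language codes to spaCy model package names.
# Only a few languages are listed; this is not kwx's actual language table.
SHORTCUT_TO_PACKAGE = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
}


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '3.4.1'."""
    return int(version.split(".")[0])


def spacy_model_name(language: str, spacy_version: str) -> str:
    """Pick the name to pass to spacy.load() for the given spaCy version."""
    if major_version(spacy_version) >= 3:
        # v3 removed shortcut links, so the full package name is required.
        if language not in SHORTCUT_TO_PACKAGE:
            raise ValueError(f"No model package mapped for {language!r}")
        return SHORTCUT_TO_PACKAGE[language]
    # v2 accepted shortcut links such as "en" once they had been created.
    return language
```

In practice the result would be passed on as `spacy.load(spacy_model_name("en", spacy.__version__))`, which keeps the version logic in one place.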
bug good first issue

opened by andrewtavis 0
Adding t-SNE and pyLDA style visualizations for BERT

A major difference between BERT and LDA kwx implementations is that there are no visualization methods for BERT. It would be good to add a pyLDAvis style visualization of topic words as well as a t-SNE visualization of topic similarities. These would be added to kwx.visuals.
enhancement help wanted

opened by andrewtavis 0
Convert translation feature

The current translation feature found in kwx.utils.translate_output() is based on py-googletrans, which is becoming less and less maintained. A better option would be to convert the translation feature over to another Python package such as OpenNMT-py, argos-translate, textblob, 🤗 transformers or another machine translation package.
bug enhancement good first issue

opened by andrewtavis 1
Remove ngrams and topic number

Hi Andrew, it's me again :) I want to ask two questions about the algorithm. First, when using the first BERT model, why do we remove n-grams, and can we use the model without removing them? Second, when using BERT we provide the number of keywords and the number of topics. How does the number of topics work, i.e. what is the logic behind it?
question

opened by AhmetCakar 9
TFIDF requires a corpus to compare

Hi Andrew, I was trying the Keyword Extraction API with TF-IDF. The code is:

bert_kws = extract_kws(
    method="TFIDF",  # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=5,
)

This returns the error: AssertionError: TFIDF requires another text corpus to be passed to the corpuses_to_compare argument.

Why is a corpus to compare required for keyword extraction? Thanks!
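The release notes on this page describe the TFIDF method as finding words unique to one corpus when compared to others, which hints at the reason. The following is a minimal sketch of plain TF-IDF in pure Python, not kwx's implementation; the function name, the unsmoothed IDF formula, and the toy corpora are assumptions made for illustration. With only one corpus every word appears in every "document", so the inverse document frequency term is zero for all of them and no word can be ranked above another.

```python
import math
from collections import Counter


def tfidf_keywords(target, corpora, num_keywords=3):
    """Rank words of `target` by TF-IDF, treating each corpus as one document.

    `target` and every entry of `corpora` are lists of tokens, and
    `corpora` must include `target` itself.
    """
    num_docs = len(corpora)

    # Document frequency: in how many corpora does each word occur?
    doc_freq = Counter()
    for tokens in corpora:
        doc_freq.update(set(tokens))

    # Term frequency in the target times unsmoothed inverse document frequency.
    term_freq = Counter(target)
    scores = {
        word: (count / len(target)) * math.log(num_docs / doc_freq[word])
        for word, count in term_freq.items()
    }

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:num_keywords], scores


hotel = "clean room clean staff pool".split()
airline = "late flight rude staff late".split()

# Alone: every score is 0.0, so the ranking carries no information.
alone, alone_scores = tfidf_keywords(hotel, [hotel])

# Compared: words shared with the other corpus ("staff") drop out.
compared, _ = tfidf_keywords(hotel, [hotel, airline], num_keywords=2)
```

Libraries often smooth the IDF term so that scores do not collapse to exactly zero, but the ranking within a single corpus then reduces to plain word frequency, which is what the "frequency" method already offers.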
question

opened by AbhiPawar5 2
Adding TFIDF key-phrase extraction

This issue is for discussing and eventually implementing key-phrase extraction for TFIDF in kwx. It would be best to first collect code snippets and documentation links for how to best implement this with scikit-learn based TFIDF models, and then from there work on an implementation can begin :)

Thanks for your interest in contributing!
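As a starting point for that discussion, candidate phrases can be generated as contiguous n-grams and then scored like single words. The sketch below covers only the candidate generation in pure Python; the function name and defaults are hypothetical. In scikit-learn, the `ngram_range` parameter of `TfidfVectorizer` plays the same role.

```python
def candidate_phrases(tokens, min_n=2, max_n=3):
    """Return all contiguous n-grams of `tokens` with min_n <= n <= max_n."""
    phrases = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n across the token list.
        for start in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[start:start + n]))
    return phrases


phrases = candidate_phrases("topic model keyword extraction".split())
```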
enhancement help wanted

opened by andrewtavis 0

Releases(v1.0.0)

v1.0.0(Dec 28, 2021)
Release switches kwx over to semantic versioning and indicates that it is stable

v0.1.8(Apr 29, 2021)
Changes include:

Support has been added for gensim 3.8.x and 4.x

Dependencies in requirement and environment files are now condensed

An alert for users when the corpus size is too small for the number of topics was added

An import error for pyLDAvis was fixed

v0.1.7.3(Mar 30, 2021)
Changes include:

Switching over to an src structure

Removing the lda_bert method because its dependencies were causing breaks

Code quality is now checked with Codacy

Extensive code formatting to improve quality and style

Bug fixes and a more explicit use of exceptions

More extensive contributing guidelines

Tests now use random seeds and are thus more robust

v0.1.5(Mar 15, 2021)
Changes include:

Keyword extraction and selection are now separate steps, so modeling doesn't need to run again to get new keywords

Keyword extraction and cleaning are now fully separate processes

kwargs for sentence-transformers BERT, LDA, and TFIDF can now be passed

The cleaning process is verbose and uses multiprocessing

The user has greater control over the cleaning process

Reformatting of the code to make the process clearer

v0.1.0(Feb 17, 2021)
First stable release of kwx

Changes include:

Full documentation of the package

Virtual environment files

Bug fixes

Extensive testing of all modules with GH Actions and Codecov

Code of conduct and contribution guidelines

v0.0.2.2(Jan 31, 2021)
The minimum viable product of kwx:

Users are able to extract keywords using the following methods

Most frequent words

TFIDF words unique to one corpus when compared to others

Latent Dirichlet Allocation

Bidirectional Encoder Representations from Transformers

An autoencoder application of LDA and BERT combined

Users are able to tell the model to remove certain words to fine tune results

Support is offered for a universal cleaning process in all major languages

Visualization techniques to display keywords and topics are included

Outputs can be cleanly organized in a directory or zip file

Runtimes for topic number comparisons are estimated using tqdm
