BERT, LDA, and TFIDF based keyword extraction in Python

Overview

rtd ci codecov pyversions pypi pypistatus license coc codestyle colab

BERT, LDA, and TFIDF based keyword extraction in Python

kwx is a toolkit for multilingual keyword extraction based on Google's BERT and Latent Dirichlet Allocation. The package provides a suite of methods to process texts of any language to varying degrees and then extract and analyze keywords from the created corpus (see kwx.languages for the various degrees of language support). A unique focus is allowing users to decide which words to not include in outputs, thereby guaranteeing sensible results that are in line with user intuitions.

For a thorough overview of the process and techniques see the Google slides, and reference the documentation for explanations of the models and visualization methods.

Contents

Installation

kwx can be downloaded from PyPI via pip or sourced directly from this repository:

pip install kwx
git clone https://github.com/andrewtavis/kwx.git
cd kwx
python setup.py install
import kwx

Models

Implemented NLP modeling methods within kwx.model include:

BERT

Bidirectional Encoder Representations from Transformers derives representations of words based on nlp models ran over open-source Wikipedia data. These representations are then leveraged to derive corpus topics.

kwx uses sentence-transformers pretrained models. See their GitHub and documentation for the available models.

LDA

Latent Dirichlet Allocation is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of kwx, documents or text entries are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics.

Although not as computationally robust as some machine learning models, LDA provides quick results that are suitable for many applications. Specifically for keyword extraction, in most settings the results are similar to those of BERT in a fraction of the time.

Other Methods

The user can also choose to simply query the most common words from a text corpus or compute TFIDF (Term Frequency Inverse Document Frequency) keywords - those that are unique in a text body in comparison to another that's compared. The former method is used in kwx as a baseline to check model efficacy, and the latter is a useful baseline when a user has another text or text body to compare the target corpus against.

Usage

Keyword extraction can be useful to analyze surveys, tweets and other kinds of social media posts, research papers, and further classes of texts. examples/kw_extraction provides an example of how to use kwx by deriving keywords from tweets in the Kaggle Twitter US Airline Sentiment dataset.

The following outlines using kwx to derive keywords from a text corpus with prompt_remove_words as True (the user will be asked if some of the extracted words need to be replaced):

Text Cleaning

from kwx.utils import prepare_data

input_language = "english" # see kwx.languages for options

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data="df_or_csv_xlsx_path",
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

Keyword Extraction

from kwx.model import extract_kws

num_keywords = 15
num_topics = 10
ignore_words = ["words", "user", "knows", "they", "don't", "want"]

# Remove n-grams for BERT training
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]

# We can pass keywords for sentence_transformers.SentenceTransformer.encode,
# gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer
bert_kws = extract_kws(
    method="BERT", # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=32,
)
The BERT keywords are:

['time', 'flight', 'plane', 'southwestair', 'ticket', 'cancel', 'united', 'baggage',
'love', 'virginamerica', 'service', 'customer', 'delay', 'late', 'hour']

Should words be removed [y/n]? y
Type or copy word(s) to be removed: southwestair, united, virginamerica

The new BERT keywords are:

['late', 'baggage', 'service', 'flight', 'time', 'love', 'book', 'customer',
'response', 'hold', 'hour', 'cancel', 'cancelled_flighted', 'delay', 'plane']

Should words be removed [y/n]? n

The model will be rerun until all words known to be unreasonable are removed for a suitable output. kwx.model.gen_files could also be used as a run-all function that produces a directory with a keyword text file and visuals (for experienced users wanting quick results).

Visuals

kwx.visuals includes the following functions for presenting and analyzing the results of keyword extraction:

Topic Number Evaluation

A graph of topic coherence and overlap given a variable number of topics to derive keywords from.

from kwx.visuals import graph_topic_num_evals
import matplotlib.pyplot as plt

graph_topic_num_evals(
    method=["lda", "bert"],
    text_corpus=text_corpus,
    num_keywords=num_keywords,
    topic_nums_to_compare=list(range(5, 15)),
    metrics=True, #  stability and coherence
)
plt.show()

t-SNE

t-SNE allows the user to visualize their topic distribution in both two and three dimensions. Currently available just for LDA, this technique provides another check for model suitability.

from kwx.visuals import t_sne
import matplotlib.pyplot as plt

t_sne(
    dimension="both",  # 2d and 3d are options
    text_corpus=text_corpus,
    num_topics=10,
    remove_3d_outliers=True,
)
plt.show()

pyLDAvis

pyLDAvis is included so that users can inspect LDA extracted topics, and further so that it can easily be generated for output files.

from kwx.visuals import pyLDAvis_topics

pyLDAvis_topics(
    method="lda",
    text_corpus=text_corpus,
    num_topics=10,
    display_ipython=False,  # For Jupyter integration
)

Word Cloud

Word clouds via wordcloud are included for a basic representation of the text corpus - specifically being a way to convey basic visual information to potential stakeholders. The following figure from examples/kw_extraction shows a word cloud generated from tweets of US air carrier passengers:

from kwx.visuals import gen_word_cloud

ignore_words = ["words", "user", "knows", "they", "don't", "want"]

gen_word_cloud(
    text_corpus=text_corpus,
    ignore_words=None,
    height=500,
)

To-Do

Please see the contribution guidelines if you are interested in contributing to this project. Work that is in progress or could be implemented includes:

Comments
  • Text by text keyword extraction in dataset

    Text by text keyword extraction in dataset

    First of all thank you for the model. I want to do something like this; For example, there are 20 text data in my dataset. I want to extract the keyword of each text. How can I do that?

    bug question 
    opened by AhmetCakar 6
  • Error

    Error "__init__() got an unexpected keyword argument 'common_terms'" occured when running example kw_extraction.ipynb

    Hi, I am trying to run notebook "kw_extraction.ipynb" given as example in Google Colab. When I am at the step of preparing data, I got error "init() got an unexpected keyword argument 'common_terms'".

    image

    May I know how to solve this? It seems like it is using a parameter that does not exist in gensim_models.phrases anymore, so shall I change the version of gensim to a lower level...?

    bug 
    opened by Y-H-Lai 5
  • ModuleNotFoundError: No module named 'pyLDAvis.gensim'

    ModuleNotFoundError: No module named 'pyLDAvis.gensim'

    Hi Andrew, I found this ModuleNotFoundError while running the line

    from kwx.model import extract_kws

    Error description: 25 import pandas as pd 26 import pyLDAvis ---> 27 import pyLDAvis.gensim 28 import seaborn as sns 29 from gensim import corpora

    ModuleNotFoundError: No module named 'pyLDAvis.gensim'

    But, it can be solved by installing : pip install pyLDAvis==3.2.2

    bug 
    opened by AbhiPawar5 5
  • [WinError 3] The system cannot find the path specified: 'C:\\mysystem/.cache\\torch\\sentence_transformers\\sbert.net_models_xlm-r-bert-base-nli-stsb-mean-tokens'

    [WinError 3] The system cannot find the path specified: 'C:\\mysystem/.cache\\torch\\sentence_transformers\\sbert.net_models_xlm-r-bert-base-nli-stsb-mean-tokens'

    I get this error in different percentages while trying to make keyword extraction with BERT. For example, 96 percent gave this error first, then 100 percent gave this error. The last 26 percent gave this error. Can you help me? Screenshot_1

    opened by AhmetCakar 4
  • Bump certifi from 2021.10.8 to 2022.12.7

    Bump certifi from 2021.10.8 to 2022.12.7

    Bumps certifi from 2021.10.8 to 2022.12.7.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 3
  • Keyword extraction for BERT does not work for less samples

    Keyword extraction for BERT does not work for less samples

    Hi Andrew, I tried the keyword extraction API for just 5 samples in a dataframe.

    bert_kws = extract_kws( method="BERT", # "BERT", "LDA", "TFIDF", "frequency" bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens", text_corpus=corpus_no_ngrams, # automatically tokenized if using LDA input_language=input_language, output_language=None, # allows the output to be translated num_keywords=num_keywords, num_topics=num_topics, corpuses_to_compare=None, # for TFIDF ignore_words=ignore_words, prompt_remove_words=True, # check words with user show_progress_bar=True, batch_size=3, )

    Which returns, ValueError: n_samples=5 should be >= n_clusters=10 for batch_size. I wonder why that's happening? Thanks!

    bug question 
    opened by AbhiPawar5 3
  • Bump ipython from 7.10.0 to 7.16.3

    Bump ipython from 7.10.0 to 7.16.3

    ⚠️ Dependabot is rebasing this PR ⚠️

    Rebasing might not happen immediately, so don't worry if this takes some time.

    Note: if you make any changes to this PR yourself, they will take precedence over the rebase.


    Bumps ipython from 7.10.0 to 7.16.3.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 2
  • Bump tensorflow from 2.4.1 to 2.5.0

    Bump tensorflow from 2.4.1 to 2.5.0

    Bumps tensorflow from 2.4.1 to 2.5.0.

    Release notes

    Sourced from tensorflow's releases.

    TensorFlow 2.5.0

    Release 2.5.0

    Major Features and Improvements

    • Support for Python3.9 has been added.
    • tf.data:
      • tf.data service now supports strict round-robin reads, which is useful for synchronous training workloads where example sizes vary. With strict round robin reads, users can guarantee that consumers get similar-sized examples in the same step.
      • tf.data service now supports optional compression. Previously data would always be compressed, but now you can disable compression by passing compression=None to tf.data.experimental.service.distribute(...).
      • tf.data.Dataset.batch() now supports num_parallel_calls and deterministic arguments. num_parallel_calls is used to indicate that multiple input batches should be computed in parallel. With num_parallel_calls set, deterministic is used to indicate that outputs can be obtained in the non-deterministic order.
      • Options returned by tf.data.Dataset.options() are no longer mutable.
      • tf.data input pipelines can now be executed in debug mode, which disables any asynchrony, parallelism, or non-determinism and forces Python execution (as opposed to trace-compiled graph execution) of user-defined functions passed into transformations such as map. The debug mode can be enabled through tf.data.experimental.enable_debug_mode().
    • tf.lite
      • Enabled the new MLIR-based quantization backend by default
        • The new backend is used for 8 bits full integer post-training quantization
        • The new backend removes the redundant rescales and fixes some bugs (shared weight/bias, extremely small scales, etc)
        • Set experimental_new_quantizer in tf.lite.TFLiteConverter to False to disable this change
    • tf.keras
      • tf.keras.metrics.AUC now support logit predictions.
      • Enabled a new supported input type in Model.fit, tf.keras.utils.experimental.DatasetCreator, which takes a callable, dataset_fn. DatasetCreator is intended to work across all tf.distribute strategies, and is the only input type supported for Parameter Server strategy.
    • tf.distribute
      • tf.distribute.experimental.ParameterServerStrategy now supports training with Keras Model.fit when used with DatasetCreator.
      • Creating tf.random.Generator under tf.distribute.Strategy scopes is now allowed (except for tf.distribute.experimental.CentralStorageStrategy and tf.distribute.experimental.ParameterServerStrategy). Different replicas will get different random-number streams.
    • TPU embedding support
      • Added profile_data_directory to EmbeddingConfigSpec in _tpu_estimator_embedding.py. This allows embedding lookup statistics gathered at runtime to be used in embedding layer partitioning decisions.
    • PluggableDevice
    • oneAPI Deep Neural Network Library (oneDNN) CPU performance optimizations from Intel-optimized TensorFlow are now available in the official x86-64 Linux and Windows builds.
      • They are off by default. Enable them by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1.
      • We do not recommend using them in GPU systems, as they have not been sufficiently tested with GPUs yet.
    • TensorFlow pip packages are now built with CUDA11.2 and cuDNN 8.1.0

    Breaking Changes

    • The TF_CPP_MIN_VLOG_LEVEL environment variable has been renamed to to TF_CPP_MAX_VLOG_LEVEL which correctly describes its effect.

    Bug Fixes and Other Changes

    • tf.keras:
      • Preprocessing layers API consistency changes:
        • StringLookup added output_mode, sparse, and pad_to_max_tokens arguments with same semantics as TextVectorization.
        • IntegerLookup added output_mode, sparse, and pad_to_max_tokens arguments with same semantics as TextVectorization. Renamed max_values, oov_value and mask_value to max_tokens, oov_token and mask_token to align with StringLookup and TextVectorization.
        • TextVectorization default for pad_to_max_tokens switched to False.
        • CategoryEncoding no longer supports adapt, IntegerLookup now supports equivalent functionality. max_tokens argument renamed to num_tokens.
        • Discretization added num_bins argument for learning bins boundaries through calling adapt on a dataset. Renamed bins argument to bin_boundaries for specifying bins without adapt.
      • Improvements to model saving/loading:
        • model.load_weights now accepts paths to saved models.

    ... (truncated)

    Changelog

    Sourced from tensorflow's changelog.

    Release 2.5.0

    Breaking Changes

    • The TF_CPP_MIN_VLOG_LEVEL environment variable has been renamed to to TF_CPP_MAX_VLOG_LEVEL which correctly describes its effect.

    Known Caveats

    Major Features and Improvements

    • TPU embedding support

      • Added profile_data_directory to EmbeddingConfigSpec in _tpu_estimator_embedding.py. This allows embedding lookup statistics gathered at runtime to be used in embedding layer partitioning decisions.
    • tf.keras.metrics.AUC now support logit predictions.

    • Creating tf.random.Generator under tf.distribute.Strategy scopes is now allowed (except for tf.distribute.experimental.CentralStorageStrategy and tf.distribute.experimental.ParameterServerStrategy). Different replicas will get different random-number streams.

    • tf.data:

      • tf.data service now supports strict round-robin reads, which is useful for synchronous training workloads where example sizes vary. With strict round robin reads, users can guarantee that consumers get similar-sized examples in the same step.
      • tf.data service now supports optional compression. Previously data would always be compressed, but now you can disable compression by passing compression=None to tf.data.experimental.service.distribute(...).
      • tf.data.Dataset.batch() now supports num_parallel_calls and deterministic arguments. num_parallel_calls is used to indicate that multiple input batches should be computed in parallel. With num_parallel_calls set, deterministic is used to indicate that outputs can be obtained in the non-deterministic order.
      • Options returned by tf.data.Dataset.options() are no longer mutable.
      • tf.data input pipelines can now be executed in debug mode, which disables any asynchrony, parallelism, or non-determinism and forces Python execution (as opposed to trace-compiled graph execution) of user-defined functions passed into transformations such as map. The debug mode can be enabled through tf.data.experimental.enable_debug_mode().
    • tf.lite

      • Enabled the new MLIR-based quantization backend by default
        • The new backend is used for 8 bits full integer post-training quantization
        • The new backend removes the redundant rescales and fixes some bugs (shared weight/bias, extremely small scales, etc)

    ... (truncated)

    Commits
    • a4dfb8d Merge pull request #49124 from tensorflow/mm-cherrypick-tf-data-segfault-fix-...
    • 2107b1d Merge pull request #49116 from tensorflow-jenkins/version-numbers-2.5.0-17609
    • 16b8139 Update snapshot_dataset_op.cc
    • 86a0d86 Merge pull request #49126 from geetachavan1/cherrypicks_X9ZNY
    • 9436ae6 Merge pull request #49128 from geetachavan1/cherrypicks_D73J5
    • 6b2bf99 Validate that a and b are proper sparse tensors
    • c03ad1a Ensure validation sticks in banded_triangular_solve_op
    • 12a6ead Merge pull request #49120 from geetachavan1/cherrypicks_KJ5M9
    • b67f5b8 Merge pull request #49118 from geetachavan1/cherrypicks_BIDTR
    • a13c0ad [tf.data][cherrypick] Fix snapshot segfault when using repeat and prefecth
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 2
  • Bump tensorflow from 2.5.0 to 2.5.1

    Bump tensorflow from 2.5.0 to 2.5.1

    Bumps tensorflow from 2.5.0 to 2.5.1.

    Release notes

    Sourced from tensorflow's releases.

    TensorFlow 2.5.1

    Release 2.5.1

    This release introduces several vulnerability fixes:

    • Fixes a heap out of bounds access in sparse reduction operations (CVE-2021-37635)
    • Fixes a floating point exception in SparseDenseCwiseDiv (CVE-2021-37636)
    • Fixes a null pointer dereference in CompressElement (CVE-2021-37637)
    • Fixes a null pointer dereference in RaggedTensorToTensor (CVE-2021-37638)
    • Fixes a null pointer dereference and a heap OOB read arising from operations restoring tensors (CVE-2021-37639)
    • Fixes an integer division by 0 in sparse reshaping (CVE-2021-37640)
    • Fixes a division by 0 in ResourceScatterDiv (CVE-2021-37642)
    • Fixes a heap OOB in RaggedGather (CVE-2021-37641)
    • Fixes a std::abort raised from TensorListReserve (CVE-2021-37644)
    • Fixes a null pointer dereference in MatrixDiagPartOp (CVE-2021-37643)
    • Fixes an integer overflow due to conversion to unsigned (CVE-2021-37645)
    • Fixes a bad allocation error in StringNGrams caused by integer conversion (CVE-2021-37646)
    • Fixes a null pointer dereference in SparseTensorSliceDataset (CVE-2021-37647)
    • Fixes an incorrect validation of SaveV2 inputs (CVE-2021-37648)
    • Fixes a null pointer dereference in UncompressElement (CVE-2021-37649)
    • Fixes a segfault and a heap buffer overflow in {Experimental,}DatasetToTFRecord (CVE-2021-37650)
    • Fixes a heap buffer overflow in FractionalAvgPoolGrad (CVE-2021-37651)
    • Fixes a use after free in boosted trees creation (CVE-2021-37652)
    • Fixes a division by 0 in ResourceGather (CVE-2021-37653)
    • Fixes a heap OOB and a CHECK fail in ResourceGather (CVE-2021-37654)
    • Fixes a heap OOB in ResourceScatterUpdate (CVE-2021-37655)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToSparse (CVE-2021-37656)
    • Fixes an undefined behavior arising from reference binding to nullptr in MatrixDiagV* ops (CVE-2021-37657)
    • Fixes an undefined behavior arising from reference binding to nullptr in MatrixSetDiagV* ops (CVE-2021-37658)
    • Fixes an undefined behavior arising from reference binding to nullptr and heap OOB in binary cwise ops (CVE-2021-37659)
    • Fixes a division by 0 in inplace operations (CVE-2021-37660)
    • Fixes a crash caused by integer conversion to unsigned (CVE-2021-37661)
    • Fixes an undefined behavior arising from reference binding to nullptr in boosted trees (CVE-2021-37662)
    • Fixes a heap OOB in boosted trees (CVE-2021-37664)
    • Fixes vulnerabilities arising from incomplete validation in QuantizeV2 (CVE-2021-37663)
    • Fixes vulnerabilities arising from incomplete validation in MKL requantization (CVE-2021-37665)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToVariant (CVE-2021-37666)
    • Fixes an undefined behavior arising from reference binding to nullptr in unicode encoding (CVE-2021-37667)
    • Fixes an FPE in tf.raw_ops.UnravelIndex (CVE-2021-37668)
    • Fixes a crash in NMS ops caused by integer conversion to unsigned (CVE-2021-37669)
    • Fixes a heap OOB in UpperBound and LowerBound (CVE-2021-37670)
    • Fixes an undefined behavior arising from reference binding to nullptr in map operations (CVE-2021-37671)
    • Fixes a heap OOB in SdcaOptimizerV2 (CVE-2021-37672)
    • Fixes a CHECK-fail in MapStage (CVE-2021-37673)
    • Fixes a vulnerability arising from incomplete validation in MaxPoolGrad (CVE-2021-37674)
    • Fixes an undefined behavior arising from reference binding to nullptr in shape inference (CVE-2021-37676)
    • Fixes a division by 0 in most convolution operators (CVE-2021-37675)
    • Fixes vulnerabilities arising from missing validation in shape inference for Dequantize (CVE-2021-37677)
    • Fixes an arbitrary code execution due to YAML deserialization (CVE-2021-37678)
    • Fixes a heap OOB in nested tf.map_fn with RaggedTensors (CVE-2021-37679)

    ... (truncated)

    Changelog

    Sourced from tensorflow's changelog.

    Release 2.5.1

    This release introduces several vulnerability fixes:

    • Fixes a heap out of bounds access in sparse reduction operations (CVE-2021-37635)
    • Fixes a floating point exception in SparseDenseCwiseDiv (CVE-2021-37636)
    • Fixes a null pointer dereference in CompressElement (CVE-2021-37637)
    • Fixes a null pointer dereference in RaggedTensorToTensor (CVE-2021-37638)
    • Fixes a null pointer dereference and a heap OOB read arising from operations restoring tensors (CVE-2021-37639)
    • Fixes an integer division by 0 in sparse reshaping (CVE-2021-37640)
    • Fixes a division by 0 in ResourceScatterDiv (CVE-2021-37642)
    • Fixes a heap OOB in RaggedGather (CVE-2021-37641)
    • Fixes a std::abort raised from TensorListReserve (CVE-2021-37644)
    • Fixes a null pointer dereference in MatrixDiagPartOp (CVE-2021-37643)
    • Fixes an integer overflow due to conversion to unsigned (CVE-2021-37645)
    • Fixes a bad allocation error in StringNGrams caused by integer conversion (CVE-2021-37646)
    • Fixes a null pointer dereference in SparseTensorSliceDataset (CVE-2021-37647)
    • Fixes an incorrect validation of SaveV2 inputs (CVE-2021-37648)
    • Fixes a null pointer dereference in UncompressElement (CVE-2021-37649)
    • Fixes a segfault and a heap buffer overflow in {Experimental,}DatasetToTFRecord (CVE-2021-37650)
    • Fixes a heap buffer overflow in FractionalAvgPoolGrad (CVE-2021-37651)
    • Fixes a use after free in boosted trees creation (CVE-2021-37652)
    • Fixes a division by 0 in ResourceGather (CVE-2021-37653)
    • Fixes a heap OOB and a CHECK fail in ResourceGather (CVE-2021-37654)
    • Fixes a heap OOB in ResourceScatterUpdate (CVE-2021-37655)
    • Fixes an undefined behavior arising from reference binding to nullptr in RaggedTensorToSparse

    ... (truncated)

    Commits
    • 8222c1c Merge pull request #51381 from tensorflow/mm-fix-r2.5-build
    • d584260 Disable broken/flaky test
    • f6c6ce3 Merge pull request #51367 from tensorflow-jenkins/version-numbers-2.5.1-17468
    • 3ca7812 Update version numbers to 2.5.1
    • 4fdf683 Merge pull request #51361 from tensorflow/mm-update-relnotes-on-r2.5
    • 05fc01a Put CVE numbers for fixes in parentheses
    • bee1dc4 Update release notes for the new patch release
    • 47beb4c Merge pull request #50597 from kruglov-dmitry/v2.5.0-sync-abseil-cmake-bazel
    • 6f39597 Merge pull request #49383 from ashahab/abin-load-segfault-r2.5
    • 0539b34 Merge pull request #48979 from liufengdb/r2.5-cherrypick
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • Update gensim LDA to 4.X

    Update gensim LDA to 4.X

    This issue is for discussing and eventually implementing an update for gensim implementations of LDA in kwx. The package was originally written with 3.X versions of gensim, and 4.X versions apparently have some dramatic improvements as far as modeling options/efficency and n-gram creation (for kwx.utils.clean). Changes would need to be made in kwx.utils, kwx.model, and kwx.topic_model.

    Documenting what would need to happen for the switch and then work towards implementing it would be very much appreciated :)

    Thanks for your interest in contributing!

    enhancement good first issue question 
    opened by andrewtavis 1
  • [ImgBot] Optimize images

    [ImgBot] Optimize images

    opened by imgbot[bot] 1
  • Edit spaCy loading based on version

    Edit spaCy loading based on version

    spaCy has new loading mechanisms in the later versions that produce errors in data preparation within kwx.utils. The scripts should be changed to check the spaCy version so that these changes are accounted for and errors are produced.

    bug good first issue 
    opened by andrewtavis 0
  • Adding t-SNE and pyLDA style visualizations for BERT

    Adding t-SNE and pyLDA style visualizations for BERT

    A major difference between BERT and LDA kwx implementations is that there are no visualization methods for BERT. It would be good to add a pyLDAvis style visualization of topic words as well as a t-SNE visualization of topic similarities. These would be added to kwx.visuals.

    enhancement help wanted 
    opened by andrewtavis 0
  • Convert translation feature

    Convert translation feature

    opened by andrewtavis 1
  • Remove ngrams and topic number

    Remove ngrams and topic number

    Hi Andrew, again me :) I want to ask two questions about the algorithm. When using the first BERT model, why are we remove ngrams and can't we use them without remove ngrams? My second question is that when using BERT we give the number of keywords and the number of topics. How does the number of threads work, so what is the logic?

    question 
    opened by AhmetCakar 9
  • TFIDF requires a corpus to compare

    TFIDF requires a corpus to compare

    Hi Andrew, I was trying the Keyword Extraction API with TF-IDF, the code is: bert_kws = extract_kws( method="TFIDF", # "BERT", "LDA", "TFIDF", "frequency" bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens", text_corpus=corpus_no_ngrams, # automatically tokenized if using LDA input_language=input_language, output_language=None, # allows the output to be translated num_keywords=num_keywords, num_topics=num_topics, corpuses_to_compare=None, # for TFIDF ignore_words=ignore_words, prompt_remove_words=True, # check words with user show_progress_bar=True, batch_size=5, )

    Which returns the error, AssertionError: TFIDF requires another text corpus to be passed to the corpuses_to_compare argument.

    I wonder why we require corpus to compare for keyword extraction? Thanks!

    question 
    opened by AbhiPawar5 2
  • Adding TFIDF key-phrase extraction

    Adding TFIDF key-phrase extraction

    This issue is for discussing and eventually implementing key-phrase extraction for TFIDF in kwx. It would be best to first collect code snippets and documentation links for how to best implement this with scikit-learn based TFIDF models, and then from there work on an implementation can begin :)

    Thanks for your interest in contributing!

    enhancement help wanted 
    opened by andrewtavis 0
Releases(v1.0.0)
  • v1.0.0(Dec 28, 2021)

  • v0.1.8(Apr 29, 2021)

    Changes include:

    • Support has been added for gensim 3.8.x and 4.x
    • Dependencies in requirement and environment files are now condensed
    • An alert for users when the corpus size is to small for the number of topics was added
    • An import error for pyLDAvis was fixed
    Source code(tar.gz)
    Source code(zip)
  • v0.1.7.3(Mar 30, 2021)

    Changes include:

    • Switching over to an src structure
    • Removing the lda_bert method because its dependencies were causing breaks
    • Code quality is now checked with Codacy
    • Extensive code formatting to improve quality and style
    • Bug fixes and a more explicit use of exceptions
    • More extensive contributing guidelines
    • Tests now use random seeds and are thus more robust
    Source code(tar.gz)
    Source code(zip)
  • v0.1.5(Mar 15, 2021)

    Changes include:

    • Keyword extraction and selection are now disjointed so that modeling doesn't occur again to get new keywords

    • Keyword extraction and cleaning are now fully disjointed processes

    • kwargs for sentence-transformers BERT, LDA, and TFIDF can now be passed

    • The cleaning process is verbose and uses multiprocessing

    • The user has greater control over the cleaning process

    • Reformatting of the code to make the process more clear

    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Feb 17, 2021)

    First stable release of kwx

    Changes include:

    • Full documentation of the package

    • Virtual environment files

    • Bug fixes

    • Extensive testing of all modules with GH Actions and Codecov

    • Code of conduct and contribution guidelines

    Source code(tar.gz)
    Source code(zip)
  • v0.0.2.2(Jan 31, 2021)

    The minimum viable product of kwx:

    • Users are able to extract keywords using the following methods

      • Most frequent words
      • TFIDF words unique to one corpus when compared to others
      • Latent Dirichlet Allocation
      • Bidirectional Encoder Representations from Transformers
      • An autoencoder application of LDA and BERT combined
    • Users are able to tell the model to remove certain words to fine tune results

    • Support is offered for a universal cleaning process in all major languages

    • Visualization techniques to display keywords and topics are included

    • Outputs can be cleanly organized in a directory or zip file

    • Runtimes for topic number comparisons are estimated using tqdm

    Source code(tar.gz)
    Source code(zip)
Owner
Andrew Tavis McAllister
Data scientist, developer and designer. Humboldt University of Berlin (MS); University of Oregon (BA).
Andrew Tavis McAllister
A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Multilingual Latent Dirichlet Allocation (LDA) Pipeline This project is for text clustering using the Latent Dirichlet Allocation (LDA) algorithm. It

Artifici Online Services inc. 74 Oct 7, 2022
TFIDF-based QA system for AIO2 competition

AIO2 TF-IDF Baseline This is a very simple question answering system, which is developed as a lightweight baseline for AIO2 competition. In the traini

Masatoshi Suzuki 4 Feb 19, 2022
天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

zxx飞翔的鱼 751 Dec 30, 2022
NLP topic mdel LDA - Gathered from New York Times website

NLP topic mdel LDA - Gathered from New York Times website

null 1 Oct 14, 2021
Perform sentiment analysis and keyword extraction on Craigslist listings

craiglist-helper synopsis Perform sentiment analysis and keyword extraction on Craigslist listings Background I love Craigslist. I've found most of my

Mark Musil 1 Nov 8, 2021
Wake: Context-Sensitive Automatic Keyword Extraction Using Word2vec

Wake Wake: Context-Sensitive Automatic Keyword Extraction Using Word2vec Abstract استخراج خودکار کلمات کلیدی متون کوتاه فارسی با استفاده از word2vec ب

Omid Hajipoor 1 Dec 17, 2021
Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP meth

null 186 Dec 29, 2022
Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP meth

null 160 Feb 9, 2021
VD-BERT: A Unified Vision and Dialog Transformer with BERT

VD-BERT: A Unified Vision and Dialog Transformer with BERT PyTorch Code for the following paper at EMNLP2020: Title: VD-BERT: A Unified Vision and Dia

Salesforce 44 Nov 1, 2022
LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

Weihao Yu 14 Aug 24, 2022
Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Ubiquitous Knowledge Processing Lab 59 Dec 1, 2022
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

VinAI Research 109 Dec 2, 2022
Super easy library for BERT based NLP models

Fast-Bert New - Learning Rate Finder for Text Classification Training (borrowed with thanks from https://github.com/davidtvs/pytorch-lr-finder) Suppor

Utterworks 1.8k Dec 27, 2022
Super easy library for BERT based NLP models

Fast-Bert New - Learning Rate Finder for Text Classification Training (borrowed with thanks from https://github.com/davidtvs/pytorch-lr-finder) Suppor

Utterworks 1.5k Feb 18, 2021
TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA)

InstaDeep Ltd 72 Dec 9, 2022
A BERT-based reverse-dictionary of Korean proverbs

Wisdomify A BERT-based reverse-dictionary of Korean proverbs. 김유빈 : 모델링 / 데이터 수집 / 프로젝트 설계 / back-end 김종윤 : 데이터 수집 / 프로젝트 설계 / front-end Quick Start C

Eu-Bin KIM 94 Dec 8, 2022
PyTorch impelementations of BERT-based Spelling Error Correction Models.

PyTorch impelementations of BERT-based Spelling Error Correction Models. 基于BERT的文本纠错模型,使用PyTorch实现。

Heng Cai 209 Dec 30, 2022
PyTorch impelementations of BERT-based Spelling Error Correction Models

PyTorch impelementations of BERT-based Spelling Error Correction Models

Heng Cai 59 Jun 29, 2021