A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Artifici Online Services inc.

Last update: Oct 7, 2022

Overview

Multilingual Latent Dirichlet Allocation (LDA) Pipeline

This project clusters text with the Latent Dirichlet Allocation (LDA) algorithm. It can be adapted to any language supported by the Snowball stemmer, which is a dependency of this project.

Usage

from artifici_lda.lda_service import train_lda_pipeline_default


FR_STOPWORDS = [
    "le", "les", "la", "un", "de", "en",
    "a", "b", "c", "s",
    "est", "sur", "tres", "donc", "sont",
    # even slang/texto stop words:
    "ya", "pis", "yer"]
# Note: this list of stop words is poor and serves only as an example.

fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?"
]

transformed_comments, top_comments, _1_grams, _2_grams = train_lda_pipeline_default(
    fr_comments,
    n_topics=2,
    stopwords=FR_STOPWORDS,
    language='french')

print(transformed_comments)
print(top_comments)
print(_1_grams)
print(_2_grams)

Output:

array([[0.14218195, 0.85781805],
       [0.11032992, 0.88967008],
       [0.16960695, 0.83039305],
       [0.88967041, 0.11032959],
       [0.8578187 , 0.1421813 ],
       [0.83039303, 0.16960697]])

['Un super-chien aboie', 'Les super-chats aiment ronronner']

[[('chiens', 3.4911404011996545), ('super', 2.5000203653313933)],
 [('chats',  3.4911393765493255), ('super', 2.499979634668601 )]]

[[('super chiens', 2.4921035508342464)],
 [('super chats',  2.492102155345991 )]]
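Because the pipeline is meant for clustering, each row of the first array can be read as one comment's weight on each topic, and the largest entry gives a cluster label. The sketch below hard-codes rounded values from the output above instead of calling the library; the helper name `dominant_topic` is illustrative and is not part of `artifici_lda`.

```python
# Topic weights copied (rounded) from the example output above:
# one row per comment, one column per topic.
transformed_comments = [
    [0.1422, 0.8578],
    [0.1103, 0.8897],
    [0.1696, 0.8304],
    [0.8897, 0.1103],
    [0.8578, 0.1422],
    [0.8304, 0.1696],
]


def dominant_topic(weights):
    """Return the index of the largest topic weight (hypothetical helper)."""
    return max(range(len(weights)), key=lambda i: weights[i])


labels = [dominant_topic(row) for row in transformed_comments]
print(labels)  # [1, 1, 1, 0, 0, 0]
```

With the NumPy array printed in the example, `transformed_comments.argmax(axis=1)` should give the same labels: the three cat comments land in one cluster and the three dog comments in the other.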

How it works

See Multilingual-LDA-Pipeline-Tutorial for an exhaustive example (intended to be read from top to bottom, not skimmed). For more details on the Inverse Lemmatization step, see Stemming-words-from-multiple-languages.
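The linked articles describe the actual method. As rough intuition only, inverse stemming can be thought of as mapping each stem back to a readable word. The sketch below is an illustration under stated assumptions, not the project's code: `toy_stem` is a made-up stand-in for a Snowball stemmer, and picking the most frequent original word for each stem is just one plausible strategy.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Stand-in stemmer (assumption): lowercase and drop one trailing 's'.

    A real pipeline would call a Snowball stemmer here instead.
    """
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def build_inverse_stems(words):
    """Map each stem to the most frequent original word that produced it."""
    forms_by_stem = defaultdict(Counter)
    for word in words:
        forms_by_stem[toy_stem(word)][word.lower()] += 1
    return {
        stem: forms.most_common(1)[0][0]
        for stem, forms in forms_by_stem.items()
    }


words = ["chat", "chats", "chats", "chien", "chiens", "chiens"]
inverse = build_inverse_stems(words)
print(inverse)  # {'chat': 'chats', 'chien': 'chiens'}
```

Under this reading, topic words can be reported as surface forms such as "chiens" and "chats" rather than as bare stems, which matches the style of the n-gram output shown in the usage example.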

Supported Languages

The following languages are supported:

Danish
Dutch
English
Finnish
French
German
Hungarian
Italian
Norwegian
Porter (the original Porter stemming algorithm for English)
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish

You need to bring your own list of stop words. One way to build it is to compute term frequencies on your corpus (or on a bigger corpus in the same language) and use some of the most common words as stop words.
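A minimal sketch of that term-frequency approach, using only the standard library. The function name `candidate_stopwords`, the `top_k` parameter, and the simple `\w+` tokenizer are assumptions made for illustration; they are not part of this project.

```python
import re
from collections import Counter


def candidate_stopwords(documents, top_k=5):
    """Return the top_k most frequent lowercase tokens across documents."""
    counts = Counter()
    for doc in documents:
        # Naive tokenizer (assumption): runs of word characters.
        counts.update(re.findall(r"\w+", doc.lower()))
    return [word for word, _ in counts.most_common(top_k)]


fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?",
]

print(candidate_stopwords(fr_comments, top_k=3))
```

On this tiny corpus the most frequent token is "super", which is arguably a content word, so the result is best treated as a list of candidates to review by hand rather than a final stop word list.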

Dependencies and their license

numpy==1.14.3           # BSD-3-Clause and BSD-2-Clause BSD-like and Zlib
scikit-learn==0.19.1    # BSD-3-Clause
PyStemmer==1.3.0        # BSD-3-Clause and MIT
snowballstemmer==1.2.1  # BSD-3-Clause and BSD-2-Clause
translitcodec==0.4.0    # MIT License
scipy==1.1.0            # BSD-3-Clause and MIT-like

Unit tests

Run pytest with ./run_tests.sh. Coverage:

----------- coverage: platform linux, python 3.6.7-final-0 -----------
Name                                       Stmts   Miss  Cover
--------------------------------------------------------------
artifici_lda/__init__.py                       0      0   100%
artifici_lda/data_utils.py                    39      0   100%
artifici_lda/lda_service.py                   31      0   100%
artifici_lda/logic/__init__.py                 0      0   100%
artifici_lda/logic/count_vectorizer.py         9      0   100%
artifici_lda/logic/lda.py                     23      7    70%
artifici_lda/logic/letter_splitter.py         36      4    89%
artifici_lda/logic/stemmer.py                 60      3    95%
artifici_lda/logic/stop_words_remover.py      61      5    92%
--------------------------------------------------------------
TOTAL                                        259     19    93%

License

This project is published under the MIT License (MIT).

Coded by Guillaume Chevalier at Neuraxio Inc.

Comments

[Snyk] Security upgrade setuptools from 39.0.1 to 65.5.1
Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

Changes included in this PR

Changes to the following files to upgrade the vulnerable dependencies to a fixed version:

requirements.txt

Vulnerabilities that will be fixed

By pinning:

| Severity | Priority Score (*) | Issue | Upgrade | Breaking Change | Exploit Maturity |
| :---: | --- | :--- | :--- | :--- | :--- |
|  | 551/1000 (Why? Recently disclosed, Has a fix available, CVSS 5.3) | Regular Expression Denial of Service (ReDoS), SNYK-PYTHON-SETUPTOOLS-3180412 | setuptools: 39.0.1 -> 65.5.1 | No | No Known Exploit |

(*) Note that the real score may have changed since the PR was raised.

Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.

Check the changes in this PR to ensure they won't cause issues with your project.

Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.

opened by snyk-bot 0
[Snyk] Security upgrade numpy from 1.19.1 to 1.22.2
This PR was automatically created by Snyk using the credentials of a real user.

Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

Changes included in this PR

Changes to the following files to upgrade the vulnerable dependencies to a fixed version:

requirements.txt

⚠️ Warning

pytest-cov 2.6.0 requires coverage, which is not installed.

Vulnerabilities that will be fixed

By pinning:

| Severity | Priority Score (*) | Issue | Upgrade | Breaking Change | Exploit Maturity |
| :---: | --- | :--- | :--- | :--- | :--- |
|  | 506/1000 (Why? Proof of Concept exploit, Has a fix available, CVSS 3.7) | NULL Pointer Dereference, SNYK-PYTHON-NUMPY-2321964 | numpy: 1.19.1 -> 1.22.2 | No | Proof of Concept |
|  | 399/1000 (Why? Has a fix available, CVSS 3.7) | Buffer Overflow, SNYK-PYTHON-NUMPY-2321966 | numpy: 1.19.1 -> 1.22.2 | No | No Known Exploit |
|  | 506/1000 (Why? Proof of Concept exploit, Has a fix available, CVSS 3.7) | Buffer Overflow, SNYK-PYTHON-NUMPY-2321969 | numpy: 1.19.1 -> 1.22.2 | No | Proof of Concept |
|  | 506/1000 (Why? Proof of Concept exploit, Has a fix available, CVSS 3.7) | Denial of Service (DoS), SNYK-PYTHON-NUMPY-2321970 | numpy: 1.19.1 -> 1.22.2 | No | Proof of Concept |

(*) Note that the real score may have changed since the PR was raised.

Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.

Check the changes in this PR to ensure they won't cause issues with your project.

Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.

opened by brucelightyear 0
[Snyk] Security upgrade setuptools from 39.0.1 to 65.5.1
This PR was automatically created by Snyk using the credentials of a real user.

Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

Changes included in this PR

Changes to the following files to upgrade the vulnerable dependencies to a fixed version:

requirements.txt

Vulnerabilities that will be fixed

By pinning:

| Severity | Priority Score (*) | Issue | Upgrade | Breaking Change | Exploit Maturity |
| :---: | --- | :--- | :--- | :--- | :--- |
|  | 441/1000 (Why? Recently disclosed, Has a fix available, CVSS 3.1) | Regular Expression Denial of Service (ReDoS), SNYK-PYTHON-SETUPTOOLS-3113904 | setuptools: 39.0.1 -> 65.5.1 | No | No Known Exploit |

(*) Note that the real score may have changed since the PR was raised.

Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.

Check the changes in this PR to ensure they won't cause issues with your project.

Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.

opened by brucelightyear 0
