A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Overview

Multilingual Latent Dirichlet Allocation (LDA) Pipeline

This project is for text clustering using the Latent Dirichlet Allocation (LDA) algorithm. It can be adapted to many languages provided that the Snowball stemmer, a dependency of this project, supports it.

Usage

from artifici_lda.lda_service import train_lda_pipeline_default


FR_STOPWORDS = [
    "le", "les", "la", "un", "de", "en",
    "a", "b", "c", "s",
    "est", "sur", "tres", "donc", "sont",
    # even slang/texto stop words:
    "ya", "pis", "yer"]
# Note: this list of stop words is poor and is just as an example.

fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?"
]

transformed_comments, top_comments, _1_grams, _2_grams = train_lda_pipeline_default(
    fr_comments,
    n_topics=2,
    stopwords=FR_STOPWORDS,
    language='french')

print(transformed_comments)
print(top_comments)
print(_1_grams)
print(_2_grams)

Output:

array([[0.14218195, 0.85781805],
       [0.11032992, 0.88967008],
       [0.16960695, 0.83039305],
       [0.88967041, 0.11032959],
       [0.8578187 , 0.1421813 ],
       [0.83039303, 0.16960697]])

['Un super-chien aboie', 'Les super-chats aiment ronronner']

[[('chiens', 3.4911404011996545), ('super', 2.5000203653313933)],
 [('chats',  3.4911393765493255), ('super', 2.499979634668601 )]]

[[('super chiens', 2.4921035508342464)],
 [('super chats',  2.492102155345991 )]]

How it works

See Multilingual-LDA-Pipeline-Tutorial for an exhaustive example (intended to be read from top to bottom, not skimmed through). For more explanations on the Inverse Lemmatization, see Stemming-words-from-multiple-languages.

Supported Languages

Those languages are supported:

  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Porter
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish

You need to bring your own list of stop words. That could be achieved by computing the Term Frequencies on your corpus (or on a bigger corpus of the same language) and to use some of the most common words as stop words.

Dependencies and their license

numpy==1.14.3           # BSD-3-Clause and BSD-2-Clause BSD-like and Zlib
scikit-learn==0.19.1    # BSD-3-Clause
PyStemmer==1.3.0        # BSD-3-Clause and MIT
snowballstemmer==1.2.1  # BSD-3-Clause and BSD-2-Clause
translitcodec==0.4.0    # MIT License
scipy==1.1.0            # BSD-3-Clause and MIT-like

Unit tests

Run pytest with ./run_tests.sh. Coverage:

----------- coverage: platform linux, python 3.6.7-final-0 -----------
Name                                       Stmts   Miss  Cover
--------------------------------------------------------------
artifici_lda/__init__.py                       0      0   100%
artifici_lda/data_utils.py                    39      0   100%
artifici_lda/lda_service.py                   31      0   100%
artifici_lda/logic/__init__.py                 0      0   100%
artifici_lda/logic/count_vectorizer.py         9      0   100%
artifici_lda/logic/lda.py                     23      7    70%
artifici_lda/logic/letter_splitter.py         36      4    89%
artifici_lda/logic/stemmer.py                 60      3    95%
artifici_lda/logic/stop_words_remover.py      61      5    92%
--------------------------------------------------------------
TOTAL                                        259     19    93%

License

This project is published under the MIT License (MIT).

Copyright (c) 2018 Artifici online services inc.

Coded by Guillaume Chevalier at Neuraxio Inc.

Owner
Artifici Online Services inc.
Our mission is to highlight what people have in common.
Artifici Online Services inc.
本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。

【关于 NLP】那些你不知道的事 作者:杨夕、芙蕖、李玲、陈海顺、twilight、LeoLRH、JimmyDU、艾春辉、张永泰、金金金 介绍 本项目是作者们根据个人面试和经验总结出的自然语言处理(NLP)面试准备的学习笔记与资料,该资料目前包含 自然语言处理各领域的 面试题积累。 目录架构 一、【

null 363 Oct 17, 2021
Snowball compiler and stemming algorithms

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algori

Snowball Stemming language and algorithms 527 Oct 21, 2021
Snowball compiler and stemming algorithms

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algori

Snowball Stemming language and algorithms 471 Feb 9, 2021
NLP topic mdel LDA - Gathered from New York Times website

NLP topic mdel LDA - Gathered from New York Times website

null 1 Oct 14, 2021
Get list of common stop words in various languages in Python

Python Stop Words Table of contents Overview Available languages Installation Basic usage Python compatibility Overview Get list of common stop words

Alireza Savand 132 Oct 13, 2021
Get list of common stop words in various languages in Python

Python Stop Words Table of contents Overview Available languages Installation Basic usage Python compatibility Overview Get list of common stop words

Alireza Savand 121 Jan 6, 2021
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing Trankit is a light-weight Transformer-based Pyth

null 534 Oct 21, 2021
Beautiful visualizations of how language differs among document types.

Scattertext 0.1.0.0 A tool for finding distinguishing terms in corpora and displaying them in an interactive HTML scatter plot. Points corresponding t

Jason S. Kessler 1.7k Oct 23, 2021
Beautiful visualizations of how language differs among document types.

Scattertext 0.1.0.0 A tool for finding distinguishing terms in corpora and displaying them in an interactive HTML scatter plot. Points corresponding t

Jason S. Kessler 1.5k Feb 17, 2021
Top2Vec is an algorithm for topic modeling and semantic search.

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors.

Dimo Angelov 1.4k Oct 21, 2021
TextFlint is a multilingual robustness evaluation platform for natural language processing tasks,

TextFlint is a multilingual robustness evaluation platform for natural language processing tasks, which unifies general text transformation, task-specific transformation, adversarial attack, sub-population, and their combinations to provide a comprehensive robustness analysis.

TextFlint 514 Oct 17, 2021
Topic Modelling for Humans

gensim – Topic Modelling in Python Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Targ

RARE Technologies 12.5k Oct 15, 2021
Topic Modelling for Humans

gensim – Topic Modelling in Python Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Targ

RARE Technologies 11.7k Feb 12, 2021
Topic Modelling for Humans

gensim – Topic Modelling in Python Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Targ

RARE Technologies 11.7k Feb 18, 2021
🤗 Transformers: State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX.

English | 简体中文 | 繁體中文 State-of-the-art Natural Language Processing for Jax, PyTorch and TensorFlow ?? Transformers provides thousands of pretrained mo

Hugging Face 52.8k Oct 21, 2021
A simple implementation of N-gram language model.

About A simple implementation of N-gram language model. Requirements numpy Data preparation Corpus Training data for the N-gram model, a text file lik

null 3 Jul 3, 2021
null 114 Oct 20, 2021
A Paper List for Speech Translation

Keyword: Speech Translation, Spoken Language Processing, Natural Language Processing

null 89 Oct 21, 2021
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 12 Oct 12, 2021