Latent Semantic Analysis
Pipeline for training LSA models using Scikit-Learn.
Usage
Instead of writing custom code for latent semantic analysis, you only need to:
- install the pipeline:
  pip install latent-semantic-analysis
- run the pipeline:
  - either in the terminal:
    lsa-train --path_to_config config.yaml
  - or in Python:
    import latent_semantic_analysis
    latent_semantic_analysis.train(path_to_config="config.yaml")
NOTE: see the Config section for more about the config file.
No data preparation is needed: the only input is a CSV file with a raw text column (the column name is arbitrary).
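As an illustration of the expected input, here is a minimal sketch that writes such a CSV file using only the Python standard library. The sample sentences are made up; the path data/data.csv, the column name text, and the comma separator are taken from the default config.yaml and can be changed there.

```python
import csv
from pathlib import Path

# Hypothetical sample documents; replace with your own raw texts.
texts = [
    "The cat sat on the mat.",
    "Stock markets fell sharply today.",
    "Dogs, cats and other pets need regular care.",
]

# data/data.csv and the "text" column mirror data_path and text_column
# in the default config.yaml.
data_path = Path("data/data.csv")
data_path.parent.mkdir(parents=True, exist_ok=True)

with data_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)  # default delimiter "," matches sep: ','
    writer.writerow(["text"])  # header row with the text column name
    writer.writerows([text] for text in texts)  # one raw document per row
```

The csv module quotes fields that contain the separator, so texts with commas stay in a single column.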
Config
The user interface consists of only one file:
- config.yaml - general configuration with sklearn TF-IDF and SVD parameters
Edit config.yaml to set the desired configuration, then train the LSA model with one of the following commands:
- terminal:
lsa-train --path_to_config config.yaml
- python:
import latent_semantic_analysis
latent_semantic_analysis.train(path_to_config="config.yaml")
Default config.yaml:
seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack
NOTE: the tf-idf and svd sections hold the parameters of sklearn's TfidfVectorizer and TruncatedSVD respectively, so you can parameterize instances of these classes however you want.
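To make the mapping concrete, here is a rough sketch of what the tf-idf and svd sections correspond to in plain scikit-learn. It is not the package's actual code. The toy corpus, the step names, and passing the seed as random_state are assumptions made for illustration. n_components is lowered from the default 10 to 2 because, with algorithm arpack, it must be smaller than both the number of documents and the number of terms.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# In-memory equivalent of the tf-idf section of the default config.
tfidf_params = {"lowercase": True, "ngram_range": (1, 1), "max_df": 1.0, "min_df": 1}

# Equivalent of the svd section; n_components is reduced for the toy corpus,
# and random_state=42 is an assumed use of the config's seed.
svd_params = {"n_components": 2, "algorithm": "arpack", "random_state": 42}

# Step names "tfidf" and "svd" are hypothetical.
lsa = Pipeline([
    ("tfidf", TfidfVectorizer(**tfidf_params)),
    ("svd", TruncatedSVD(**svd_params)),
])

# Made-up documents standing in for the text column of the CSV file.
texts = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell today",
    "investors sold their stocks",
]

# One row per document, one column per latent topic.
doc_embeddings = lsa.fit_transform(texts)
print(doc_embeddings.shape)  # (4, 2)
```

Any other keyword accepted by TfidfVectorizer or TruncatedSVD can be added to the corresponding dictionary in the same way.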
Output
After training the model, the pipeline will return the following files:
- model.joblib - sklearn pipeline with LSA (TF-IDF and SVD steps)
- config.yaml - config that was used to train the model
- logging.txt - logging file
- doc2topic.json - document embeddings
- term2topic.json - term embeddings
Requirements
Python >= 3.6
Citation
If you use latent-semantic-analysis in a scientific publication, we would appreciate a citation using the following BibTeX entry:
@misc{dayyass2021lsa,
  author = {El-Ayyass, Dani},
  title = {Pipeline for training LSA models},
  howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
  year = {2021}
}