https://github.com/dayyass/latent-semantic-analysis
Pipeline for training LSA models using Scikit-Learn.
data-science hacktoberfest latent-semantic-analysis lsa machine-learning natural-language-processing nlp pipeline python topic-modeling
- Host: GitHub
- URL: https://github.com/dayyass/latent-semantic-analysis
- Owner: dayyass
- License: mit
- Created: 2021-09-26T10:40:10.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2021-10-12T05:47:03.000Z (over 3 years ago)
- Last Synced: 2025-04-13T07:55:25.869Z (3 months ago)
- Topics: data-science, hacktoberfest, latent-semantic-analysis, lsa, machine-learning, natural-language-processing, nlp, pipeline, python, topic-modeling
- Language: Python
- Homepage: https://pypi.org/project/latent-semantic-analysis/
- Size: 34.2 KB
- Stars: 23
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
### Latent Semantic Analysis
Pipeline for training **LSA** models using Scikit-Learn.

### Usage
Instead of writing custom code for latent semantic analysis, you only need to:
1. Install the pipeline:
```shell script
pip install latent-semantic-analysis
```
2. Run the pipeline:
- either in **terminal**:
```shell script
lsa-train --path_to_config config.yaml
```
- or in **python**:
```python3
import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")
```

**NOTE**: more about the config file [here](https://github.com/dayyass/latent-semantic-analysis/tree/main#config).
No data preparation is needed: the only input is a **csv** file with a raw text column (the column name is arbitrary).
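As an illustration of the expected input, the following minimal sketch writes such a file using only the standard library. The path `data/data.csv`, the `,` separator, and the column name `text` mirror the default **config.yaml** shown in the Config section; the helper `write_text_csv` and the sample texts are hypothetical and not part of the package.

```python
import csv
from pathlib import Path

# Hypothetical sample documents; replace with your own raw texts.
texts = [
    "Latent semantic analysis maps documents to topics.",
    "TF-IDF weighs terms by how rare they are across documents.",
    "Truncated SVD reduces the dimensionality of the TF-IDF matrix.",
]


def write_text_csv(path, texts, text_column="text", sep=","):
    """Write raw texts to a single-column CSV file with a header row.

    Hypothetical helper for illustration; not part of latent-semantic-analysis.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=sep)
        writer.writerow([text_column])  # header: name of the text column
        writer.writerows([text] for text in texts)  # one document per row
    return path


# Path, separator, and column name follow the default config.yaml.
csv_path = write_text_csv("data/data.csv", texts)
```

If you choose a different column name or separator, the `text_column` and `sep` values in the config need to match.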
#### Config
The user interface consists of only one file:
- [**config.yaml**](https://github.com/dayyass/latent-semantic-analysis/blob/main/config.yaml) - general configuration with sklearn **TF-IDF** and **SVD** parameters

Change **config.yaml** to create the desired configuration and train the LSA model with the following command:
- **terminal**:
```shell script
lsa-train --path_to_config config.yaml
```
- **python**:
```python3
import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")
```

Default **config.yaml**:
```yaml
seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack
```

**NOTE**: the `tf-idf` and `svd` sections hold sklearn [**TfidfVectorizer**](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidf#sklearn.feature_extraction.text.TfidfVectorizer) and [**TruncatedSVD**](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) parameters respectively, so you can parameterize instances of these classes however you want.
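To make the mapping between the config sections and the sklearn classes concrete, here is a minimal sketch that passes equivalent parameters to `TfidfVectorizer` and `TruncatedSVD` directly. It is not the package's actual implementation. Assumptions: `ngram_range` is given as a Python tuple, `seed` is forwarded as `random_state`, and `n_components` is lowered from 10 to 2 because the hypothetical toy corpus is too small for 10 components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Parameters mirroring the `tf-idf` section of the default config.
# Assumption: ngram_range is written as a tuple here; how the package
# parses the YAML value is not described in this README.
tfidf_params = {"lowercase": True, "ngram_range": (1, 1), "max_df": 1.0, "min_df": 1}

# Parameters mirroring the `svd` section. n_components is reduced to 2 for
# the toy corpus; random_state=42 assumes `seed` is used this way.
svd_params = {"n_components": 2, "algorithm": "arpack", "random_state": 42}

# Hypothetical toy corpus.
texts = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]

# LSA as a two-step sklearn pipeline: TF-IDF followed by truncated SVD.
lsa = Pipeline(
    [
        ("tfidf", TfidfVectorizer(**tfidf_params)),
        ("svd", TruncatedSVD(**svd_params)),
    ]
)

# One row per document, one column per SVD component.
doc_embeddings = lsa.fit_transform(texts)
print(doc_embeddings.shape)
```

Any other keyword argument accepted by these two classes can be added to the corresponding dictionary in the same way.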
#### Output
After training the model, the pipeline will return the following files:
- `model.joblib` - sklearn pipeline with LSA (TF-IDF and SVD steps)
- `config.yaml` - config that was used to train the model
- `logging.txt` - logging file
- `doc2topic.json` - document embeddings
- `term2topic.json` - term embeddings

### Requirements
Python >= 3.6

### Citation
If you use **latent-semantic-analysis** in a scientific publication, we would appreciate a reference to the following BibTeX entry:
```bibtex
@misc{dayyass2021lsa,
    author = {El-Ayyass, Dani},
    title = {Pipeline for training LSA models},
    howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
    year = {2021}
}
```