https://github.com/dayyass/latent-semantic-analysis
Pipeline for training LSA models using Scikit-Learn.
data-science hacktoberfest latent-semantic-analysis lsa machine-learning natural-language-processing nlp pipeline python topic-modeling
- Host: GitHub
- URL: https://github.com/dayyass/latent-semantic-analysis
- Owner: dayyass
- License: mit
- Created: 2021-09-26T10:40:10.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-10-12T05:47:03.000Z (over 3 years ago)
- Last Synced: 2024-09-19T14:47:45.623Z (5 months ago)
- Topics: data-science, hacktoberfest, latent-semantic-analysis, lsa, machine-learning, natural-language-processing, nlp, pipeline, python, topic-modeling
- Language: Python
- Homepage: https://pypi.org/project/latent-semantic-analysis/
- Size: 34.2 KB
- Stars: 24
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
[tests](https://github.com/dayyass/latent-semantic-analysis/actions/workflows/tests.yml)
[linter](https://github.com/dayyass/latent-semantic-analysis/actions/workflows/linter.yml)
[codecov](https://codecov.io/gh/dayyass/latent-semantic-analysis)
[python 3.6](https://github.com/dayyass/latent-semantic-analysis#requirements)
[release (latest by date)](https://github.com/dayyass/latent-semantic-analysis/releases/latest)
[license](https://github.com/dayyass/latent-semantic-analysis/blob/main/LICENSE)
[pre-commit](https://github.com/dayyass/latent-semantic-analysis/blob/main/.pre-commit-config.yaml)
[code style: black](https://github.com/psf/black)
[pypi version](https://pypi.org/project/latent-semantic-analysis)
[pypi downloads](https://pypi.org/project/latent-semantic-analysis)

### Latent Semantic Analysis
Pipeline for training **LSA** models using Scikit-Learn.

### Usage
Instead of writing custom code for latent semantic analysis, you only need to:
1. Install the pipeline:
```shell script
pip install latent-semantic-analysis
```
2. Run the pipeline:
- either in the **terminal**:
```shell script
lsa-train --path_to_config config.yaml
```
- or in **python**:
```python3
import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")
```

**NOTE**: more about the config file [here](https://github.com/dayyass/latent-semantic-analysis/tree/main#config).
No data preparation is needed: the only input is a **csv** file with a raw text column (the column name is arbitrary).
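As a minimal sketch of producing such an input file, the helper below writes a list of raw strings to a one-column CSV using only the standard library. The function name `write_corpus` is hypothetical and not part of the package; the default column name `text` and separator `,` simply mirror the default **config.yaml**.

```python
import csv
from pathlib import Path


def write_corpus(path, texts, text_column="text", sep=","):
    """Write raw texts to a single-column CSV file with a header row.

    Hypothetical helper for illustration; not part of latent-semantic-analysis.
    """
    path = Path(path)
    # Create the parent folder (e.g. `data/`) if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=sep)
        # The header must match `text_column` in config.yaml.
        writer.writerow([text_column])
        # One document per row; the csv module quotes commas and newlines.
        writer.writerows([text] for text in texts)
    return path
```

For example, `write_corpus("data/data.csv", ["first document", "second document"])` would create a file at the `data_path` used by the default config.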
#### Config
The user interface consists of a single file:
- [**config.yaml**](https://github.com/dayyass/latent-semantic-analysis/blob/main/config.yaml) - general configuration with sklearn **TF-IDF** and **SVD** parameters

Change **config.yaml** to create the desired configuration and train the LSA model with the following command:
- **terminal**:
```shell script
lsa-train --path_to_config config.yaml
```
- **python**:
```python3
import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")
```

Default **config.yaml**:
```yaml
seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack
```

**NOTE**: `tf-idf` and `svd` are sklearn [**TfidfVectorizer**](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidf#sklearn.feature_extraction.text.TfidfVectorizer) and [**TruncatedSVD**](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) parameters respectively, so you can parameterize instances of these classes however you want.
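To make the mapping from config sections to sklearn classes concrete, here is a rough hand-written equivalent. It is a sketch under assumptions, not the package's actual implementation: `build_lsa` is a hypothetical name, passing `seed` as `random_state` is assumed, and the `ast.literal_eval` step reflects only that a YAML loader reads `(1, 1)` as a string (how the package itself converts this value is not described here). `n_components` is reduced to 2 because the `arpack` solver needs it to be smaller than both the number of documents and the number of terms, and the toy corpus has only four documents.

```python
import ast

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def build_lsa(tfidf_params, svd_params, seed=42):
    """Build a TF-IDF -> TruncatedSVD pipeline from config-style dicts.

    Hypothetical helper for illustration; not the package's own code.
    """
    tfidf_params = dict(tfidf_params)  # copy so the caller's dict is untouched
    # A YAML loader yields the string "(1, 1)"; sklearn expects a tuple.
    if isinstance(tfidf_params.get("ngram_range"), str):
        tfidf_params["ngram_range"] = ast.literal_eval(tfidf_params["ngram_range"])
    return Pipeline(
        [
            ("tfidf", TfidfVectorizer(**tfidf_params)),
            # Assumption: the config `seed` is used as the SVD random state.
            ("svd", TruncatedSVD(random_state=seed, **svd_params)),
        ]
    )


docs = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]

lsa = build_lsa(
    {"lowercase": True, "ngram_range": "(1, 1)", "max_df": 1.0, "min_df": 1},
    {"n_components": 2, "algorithm": "arpack"},
)

# One row per document, one column per latent component.
doc_embeddings = lsa.fit_transform(docs)
# One row per vocabulary term, one column per latent component.
term_embeddings = lsa.named_steps["svd"].components_.T
```

Any other keyword accepted by `TfidfVectorizer` or `TruncatedSVD` could be added to the corresponding dict in the same way.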
#### Output
After training the model, the pipeline saves the following files:
- `model.joblib` - sklearn pipeline with LSA (TF-IDF and SVD steps)
- `config.yaml` - config that was used to train the model
- `logging.txt` - logging file
- `doc2topic.json` - document embeddings
- `term2topic.json` - term embeddings

### Requirements
Python >= 3.6

### Citation
If you use **latent-semantic-analysis** in a scientific publication, we would appreciate references to the following BibTeX entry:
```bibtex
@misc{dayyass2021lsa,
  author = {El-Ayyass, Dani},
  title = {Pipeline for training LSA models},
  howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
  year = {2021}
}
```