https://github.com/dayyass/latent-semantic-analysis

Pipeline for training LSA models using Scikit-Learn.
https://github.com/dayyass/latent-semantic-analysis

data-science hacktoberfest latent-semantic-analysis lsa machine-learning natural-language-processing nlp pipeline python topic-modeling

Last synced: 3 months ago
JSON representation

Pipeline for training LSA models using Scikit-Learn.

Host: GitHub
URL: https://github.com/dayyass/latent-semantic-analysis
Owner: dayyass
License: mit
Created: 2021-09-26T10:40:10.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2021-10-12T05:47:03.000Z (over 3 years ago)
Last Synced: 2025-04-13T07:55:25.869Z (3 months ago)
Topics: data-science, hacktoberfest, latent-semantic-analysis, lsa, machine-learning, natural-language-processing, nlp, pipeline, python, topic-modeling
Language: Python
Homepage: https://pypi.org/project/latent-semantic-analysis/
Size: 34.2 KB
Stars: 23
Watchers: 1
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        [![tests](https://github.com/dayyass/latent-semantic-analysis/actions/workflows/tests.yml/badge.svg)](https://github.com/dayyass/latent-semantic-analysis/actions/workflows/tests.yml)

[![linter](https://github.com/dayyass/latent-semantic-analysis/actions/workflows/linter.yml/badge.svg)](https://github.com/dayyass/latent-semantic-analysis/actions/workflows/linter.yml)

[![codecov](https://codecov.io/gh/dayyass/latent-semantic-analysis/branch/main/graph/badge.svg?token=Y39Q5786DL)](https://codecov.io/gh/dayyass/latent-semantic-analysis)

[![python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://github.com/dayyass/latent-semantic-analysis#requirements)

[![release (latest by date)](https://img.shields.io/github/v/release/dayyass/latent-semantic-analysis)](https://github.com/dayyass/latent-semantic-analysis/releases/latest)

[![license](https://img.shields.io/github/license/dayyass/latent-semantic-analysis?color=blue)](https://github.com/dayyass/latent-semantic-analysis/blob/main/LICENSE)

[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-black)](https://github.com/dayyass/latent-semantic-analysis/blob/main/.pre-commit-config.yaml)

[![code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

[![pypi version](https://img.shields.io/pypi/v/latent-semantic-analysis)](https://pypi.org/project/latent-semantic-analysis)

[![pypi downloads](https://img.shields.io/pypi/dm/latent-semantic-analysis)](https://pypi.org/project/latent-semantic-analysis)

### Latent Semantic Analysis

Pipeline for training **LSA** models using Scikit-Learn.

### Usage

Instead of writing custom code for latent semantic analysis, you just need:

1. install pipeline:

```shell script

pip install latent-semantic-analysis

```

2. run pipeline:

- either in **terminal**:

```shell script

lsa-train --path_to_config config.yaml

```

- or in **python**:

```python3

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

```

**NOTE**: more about config file [here](https://github.com/dayyass/latent-semantic-analysis/tree/main#config).

No data preparation is needed, only a **csv** file with raw text column (with arbitrary name).

#### Config

The user interface consists of only one files:

- [**config.yaml**](https://github.com/dayyass/latent-semantic-analysis/blob/main/config.yaml) - general configuration with sklearn **TF-IDF** and **SVD** parameters

Change **config.yaml** to create the desired configuration and train LSA model with the following command:

- **terminal**:

```shell script

lsa-train --path_to_config config.yaml

```

- **python**:

```python3

import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

```

Default **config.yaml**:

```yaml

seed: 42

path_to_save_folder: models

# data

data:

  data_path: data/data.csv

  sep: ','

  text_column: text

# tf-idf

tf-idf:

  lowercase: true

  ngram_range: (1, 1)

  max_df: 1.0

  min_df: 1

# svd

svd:

  n_components: 10

  algorithm: arpack

```

**NOTE**: `tf-idf` and `svd` are sklearn [**TfidfVectorizer**](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidf#sklearn.feature_extraction.text.TfidfVectorizer) and [**TruncatedSVD**](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) parameters correspondingly, so you can parameterize instances of these classes however you want.

#### Output

After training the model, the pipeline will return the following files:

- `model.joblib` - sklearn pipeline with LSA (TF-IDF and SVD steps)

- `config.yaml` - config that was used to train the model

- `logging.txt` - logging file

- `doc2topic.json` - document embeddings

- `term2topic.json` - term embeddings

### Requirements

Python >= 3.6

### Citation

If you use **latent-semantic-analysis** in a scientific publication, we would appreciate references to the following BibTex entry:

```bibtex

@misc{dayyass2021lsa,

    author       = {El-Ayyass, Dani},

    title        = {Pipeline for training LSA models},

    howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},

    year         = {2021}

}

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/dayyass/latent-semantic-analysis

Awesome Lists containing this project

README