Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/szymonrucinski/pippi-lang
Elegant 📑 text preprocessing pipeline 🚰 available as pip package 🐍 based on scikit-learn pipeline. Combines Transformer and Column Transformer into a single object.
data-cleaning data-science nlp pipeline scikit-learn
Last synced: 18 days ago
- Host: GitHub
- URL: https://github.com/szymonrucinski/pippi-lang
- Owner: szymonrucinski
- Created: 2023-02-04T15:34:31.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-02-11T00:03:52.000Z (over 1 year ago)
- Last Synced: 2024-04-29T08:02:44.438Z (6 months ago)
- Topics: data-cleaning, data-science, nlp, pipeline, scikit-learn
- Language: Python
- Homepage: https://pypi.org/project/pyppi/
- Size: 52.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Text cleaning Pipeline
[![Build package](https://github.com/szymonrucinski/pippi-lang/actions/workflows/build-pkg.yml/badge.svg)](https://github.com/szymonrucinski/pippi-lang/actions/workflows/build-pkg.yml) [![Check style](https://github.com/szymonrucinski/pippi-lang/actions/workflows/check-style.yml/badge.svg)](https://github.com/szymonrucinski/pippi-lang/actions/workflows/check-style.yml) [![Run Tests](https://github.com/szymonrucinski/pippi-lang/actions/workflows/run-tests.yml/badge.svg)](https://github.com/szymonrucinski/pippi-lang/actions/workflows/run-tests.yml)
___
## Description
This package provides a pipeline for pre-processing text data for sentiment analysis. It includes steps for removing stop words, stripping HTML tags, changing letter case, and removing punctuation.
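
To make the four steps concrete, here is a rough standard-library sketch of what each one does to a single string. It is not pippi's implementation: the function names and the tiny stop-word set are assumptions made up for illustration.

``` python
import re
import string

# Tiny illustrative stop-word set (an assumption); real pipelines use fuller lists.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of"}


def remove_html_tags(text: str) -> str:
    """Replace anything that looks like an HTML tag (e.g. <br />) with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def change_case(text: str, case: str = "lower") -> str:
    """Convert the text to upper or lower case."""
    return text.upper() if case == "upper" else text.lower()


def remove_stop_words(text: str) -> str:
    """Drop words found in STOP_WORDS and normalize whitespace."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def clean(text: str) -> str:
    """Apply the four steps in sequence."""
    text = remove_html_tags(text)
    text = remove_punctuation(text)
    text = change_case(text, "lower")
    return remove_stop_words(text)


print(clean("The movie was <br /> GREAT, and the cast is superb!"))
# prints: movie great cast superb
```

The order matters in this sketch: tags are stripped before punctuation, because removing `<`, `/`, and `>` first would leave the tag name behind as a stray word.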
*Future code will include text transformations such as word embedding and word vectorization.*

To install [this package](https://pypi.org/project/pippi-lang/), simply run:

``` bash
pip install pippi-lang
```

### Example
Elegant data pipelines are a key component of any data science project. They let you automate cleaning, transforming, and analyzing data. The example below builds a pipeline for text data from custom transformers and the sklearn `Pipeline` class.

``` python
import pandas as pd
from sklearn.pipeline import Pipeline

from pippi import (
    TransformLettersSize,
    RemoveStopWords,
    Lemmatize,  # available, but not used in this example
    RemovePunctuation,
    RemoveHTMLTags,
)

# Illustrative input (an assumption, not from the project):
# a DataFrame with the two text columns used below.
df = pd.DataFrame(
    {
        "review": ["The movie was <br /> great, and the cast is superb!"],
        "sentiment": ["positive"],
    }
)

# Each step is applied only to the columns it is given.
pipeline = Pipeline(
    steps=[
        ("remove_stop_words", RemoveStopWords(columns=["review", "sentiment"])),
        ("remove_html_tags", RemoveHTMLTags(columns=df.columns.to_list())),
        ("uppercase_letters", TransformLettersSize(columns=["sentiment"], case_transform="upper")),
        ("remove_punctuation", RemovePunctuation(columns=["review"])),
    ]
)
output = pipeline.fit_transform(df)
df = pd.DataFrame(output, columns=["review", "sentiment"])
```

Pipeline Visualization:

``` markdown
[RemoveStopWords] -> [RemoveHTMLTags] -> [TransformLettersSize] -> [RemovePunctuation]
```
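
The project description says the package combines a transformer and a column transformer into a single object. One way such a column-aware step could be built on scikit-learn is sketched below. `LowercaseColumns` is a hypothetical class written for this illustration and is not part of pippi; the sample data is likewise made up.

``` python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class LowercaseColumns(TransformerMixin, BaseEstimator):
    """Hypothetical step: lowercase the text in the selected columns only."""

    def __init__(self, columns=None):
        # Names of the columns to transform; None means every column.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; the step is stateless.
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        for col in self.columns or X.columns:
            X[col] = X[col].str.lower()
        return X


# Made-up sample data for the illustration.
sample = pd.DataFrame({"review": ["GREAT Film"], "sentiment": ["Positive"]})

pipe = Pipeline(steps=[("lowercase", LowercaseColumns(columns=["review"]))])
result = pipe.fit_transform(sample)
print(result)
```

Because the column selection lives inside the transformer, the step can be dropped straight into a `Pipeline` without wrapping it in a separate `ColumnTransformer`, and columns that are not listed (here `sentiment`) pass through unchanged.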