https://github.com/szymonrucinski/pippi-lang

Elegant 📑 text preprocessing pipeline 🚰 available as pip package 🐍 based on scikit-learn pipeline. Combines Transformer and Column Transformer into a single object.
Host: GitHub
URL: https://github.com/szymonrucinski/pippi-lang
Owner: szymonrucinski
Created: 2023-02-04T15:34:31.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-02-11T00:03:52.000Z (over 2 years ago)
Last Synced: 2024-04-29T08:02:44.438Z (about 1 year ago)
Topics: data-cleaning, data-science, nlp, pipeline, scikit-learn
Language: Python
Homepage: https://pypi.org/project/pyppi/
Size: 52.7 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Text cleaning Pipeline

[![Build package](https://github.com/szymonrucinski/pippi-lang/actions/workflows/build-pkg.yml/badge.svg)](https://github.com/szymonrucinski/pippi-lang/actions/workflows/build-pkg.yml) [![Check style](https://github.com/szymonrucinski/pippi-lang/actions/workflows/check-style.yml/badge.svg)](https://github.com/szymonrucinski/pippi-lang/actions/workflows/check-style.yml)[![Run Tests](https://github.com/szymonrucinski/pippi-lang/actions/workflows/run-tests.yml/badge.svg)](https://github.com/szymonrucinski/pippi-lang/actions/workflows/run-tests.yml)

___

## Description

This package provides a pipeline for pre-processing text data for sentiment analysis. It includes steps for removing stop words, stripping HTML tags, changing letter case, and removing punctuation.
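To make the cleaning steps concrete, here is a minimal standard-library sketch of what each one does to a single string. It is not the package's implementation: the function names, the tiny `STOP_WORDS` set, and the order of the steps are all assumptions chosen for illustration.

```python
import html
import re
import string

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of"}


def remove_html_tags(text: str) -> str:
    """Replace anything that looks like an HTML tag with a space, then unescape entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def remove_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(text: str) -> str:
    """Drop stop words (case-insensitive) and normalize whitespace."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def clean(text: str) -> str:
    """Apply the steps in one possible order and lowercase the result."""
    text = remove_html_tags(text)
    text = remove_punctuation(text)
    text = remove_stop_words(text)
    return text.lower()


print(clean("<br />The movie was GREAT, and the cast is superb!"))
# prints: movie great cast superb
```

Stripping tags before punctuation matters in this sketch: if punctuation were removed first, `<br />` would lose its angle brackets and no longer be recognized as a tag.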

*Future versions will include text transformations such as word embedding and word vectorization.*

To install [this package](https://pypi.org/project/pippi-lang/) simply run:

``` bash
pip install pippi-lang
```

### Example

Elegant data pipelines are a key component of any data science project. They let you automate cleaning, transforming, and analyzing data. The following is a simple example of how to build a pipeline for text data using custom transformers and the sklearn `Pipeline` class.

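``` python
from pippi import (
    TransformLettersSize,
    RemoveStopWords,
    Lemmatize,  # imported for completeness; not used below
    RemovePunctuation,
    RemoveHTMLTags,
)
from sklearn.pipeline import Pipeline
import pandas as pd

# Illustrative input only; replace with your own data.
df = pd.DataFrame(
    {
        "review": ["<br />This movie was great!"],
        "sentiment": ["positive"],
    }
)

# Each step cleans the listed columns; steps run in the order given.
pipeline = Pipeline(
    steps=[
        ("remove_stop_words", RemoveStopWords(columns=["review", "sentiment"])),
        ("remove_html_tags", RemoveHTMLTags(columns=df.columns.to_list())),
        ("uppercase_letters", TransformLettersSize(columns=["sentiment"], case_transform="upper")),
        ("remove_punctuation", RemovePunctuation(columns=["review"])),
    ]
)

# fit_transform applies every step in sequence to the DataFrame.
output = pipeline.fit_transform(df)
df = pd.DataFrame(output, columns=["review", "sentiment"])
```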

Pipeline Visualization:

``` markdown
[RemoveStopWords] -> [RemoveHTMLTags] -> [TransformLettersSize] -> [RemovePunctuation]
```
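For readers who want to write their own step, the sketch below shows the usual scikit-learn pattern for a custom column-wise transformer that can be chained in a `Pipeline`. The classes `LowercaseColumns` and `CollapseWhitespace` are hypothetical and are not part of this package; whether the package's transformers are built exactly this way is an assumption.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class LowercaseColumns(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: lowercases the given DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless step: nothing is learned from the data.
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for col in self.columns:
            X[col] = X[col].str.lower()
        return X


class CollapseWhitespace(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: squeezes runs of whitespace and trims the ends."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].str.replace(r"\s+", " ", regex=True).str.strip()
        return X


# Illustrative data only.
df = pd.DataFrame(
    {
        "review": ["  GREAT   Film ", "Bad  PLOT"],
        "sentiment": ["Positive", "Negative"],
    }
)

pipeline = Pipeline(
    steps=[
        ("lowercase", LowercaseColumns(columns=["review"])),
        ("collapse_whitespace", CollapseWhitespace(columns=["review"])),
    ]
)

result = pipeline.fit_transform(df)
print(result)
```

Because each transformer takes a `columns` argument and returns a full DataFrame, columns that are not listed (here `sentiment`) pass through unchanged, which is the behavior a separate `ColumnTransformer` would otherwise be needed for.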
