https://github.com/farhad-here/textprepx
A Multilingual Text Preprocessing Tool for English and Persian.
- Host: GitHub
- URL: https://github.com/farhad-here/textprepx
- Owner: farhad-here
- License: mit
- Created: 2025-05-06T13:11:23.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-05-06T23:43:29.000Z (10 months ago)
- Last Synced: 2025-05-07T18:13:04.553Z (10 months ago)
- Topics: cleantext, contractions, data-analysis, deep-learning, emoji, nlp, nltk, opp, parsivar, regex, streamlit, text-preprocessing, textblob
- Language: Python
- Homepage:
- Size: 3.74 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# TextPrepX (Multilingual Text Preprocessing)
TextPrepX is a Streamlit-based web application for preprocessing text data in both **English** and **Persian**. It supports common preprocessing steps like lowercasing, removing punctuation and emojis, handling contractions, stemming, spell correction, and more.
# 📄 Description:
This project is an interactive text preprocessing tool built with Streamlit, designed to clean and prepare both English and Persian texts for natural language processing (NLP) tasks.
It supports a wide range of preprocessing options, including:
- Lowercasing
- Removing punctuation, numbers, and emojis
- Expanding contractions
- Spell correction using TextBlob (for English) and Parsivar (for Persian)
- Stopword removal (customizable for Persian)
- Lemmatization and stemming
- Tokenization (word and sentence level)
- Repetition reduction and slang replacement
- Unicode normalization and formatting cleanup
The Persian module leverages the Parsivar library, while the English module relies on NLTK, TextBlob, and the `contractions` package for more nuanced cleaning. Users can either upload `.txt` files or enter raw text directly; results are displayed in a styled, readable format.
This toolkit is ideal for data preprocessing in NLP pipelines, educational purposes, and rapid text cleaning for bilingual corpora.
---
## ✨ Features
### ✅ English Text
- Lowercasing
- Removing numbers and punctuation
- Handling contractions (e.g., can't → cannot)
- Removing emojis
- Spell correction using TextBlob
- Stopword removal
- Lemmatization + Stemming
- Reducing repeated characters and slang normalization (e.g., gonna → going to)
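As a rough illustration of these English steps, here is a minimal pure-Python sketch. The app itself relies on NLTK, TextBlob, and the `contractions` package; the tiny maps below are placeholders, and the naive substring replacement is for illustration only:

```python
import re

# Placeholder maps; the real app uses the `contractions` package and a slang list.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}
SLANG = {"gonna": "going to", "wanna": "want to"}

def clean_english(text: str) -> str:
    text = text.lower()                                    # lowercasing
    for short, full in {**CONTRACTIONS, **SLANG}.items():  # expand contractions/slang
        text = text.replace(short, full)
    text = re.sub(r"\d+", "", text)                        # remove numbers
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)             # reduce repeats: soooo -> soo
    text = re.sub(r"[^\w\s]", "", text)                    # strip punctuation
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace

print(clean_english("I can't WAIT, it's soooo gooood!!! 123"))
# -> i cannot wait its soo good
```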
### ✅ Persian Text
- Normalization using Parsivar
- Custom stopword removal
- Tokenization (words & sentences)
- Stemming
- Spell correction using Parsivar
- Removing punctuation, numbers, and extra whitespaces
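In the app, Persian normalization and stemming are handled by Parsivar; as a rough pure-Python sketch of the character-level cleanup (Arabic-variant letters, Persian digits, punctuation, whitespace):

```python
import re
import unicodedata

# Map Arabic letter variants to their Persian forms, and Persian digits to ASCII.
ARABIC_TO_PERSIAN = str.maketrans({"ي": "ی", "ك": "ک", "ة": "ه"})
PERSIAN_DIGITS = str.maketrans("۰۱۲۳۴۵۶۷۸۹", "0123456789")

def normalize_persian(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # unify composed Unicode forms
    text = text.translate(ARABIC_TO_PERSIAN)    # Arabic variants -> Persian letters
    text = text.translate(PERSIAN_DIGITS)       # Persian digits -> ASCII digits
    text = re.sub(r"[،؛؟!.:]+", " ", text)      # punctuation -> space
    return re.sub(r"\s+", " ", text).strip()    # collapse extra whitespace

print(normalize_persian("كتاب‌هاي خوب، ۱۲۳!"))
# -> کتاب‌های خوب 123
```

Note that the zero-width non-joiner (U+200C) inside words like کتاب‌های is preserved, since it is not matched by `\s`.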
---
## 🚀 How to Run
1. Clone this repository or download the code.
2. Install the dependencies:
```bash
pip install -r requirements.txt
```
3. Set up Parsivar's spell-correction resources. First, create a `spell` folder at this path inside your virtual environment:
```
venv\Lib\site-packages\parsivar\resource
```
Then place these two files in the `spell` folder:
```
- onegram.pckl
- mybigram_lm.pckl
```
#### 🔽 Download the two files from here
4. Run the app:
```bash
streamlit run TEP.py
```
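The spell-folder setup can be sketched as a shell snippet. This assumes a virtual environment named `venv` and shows the README's Windows-style `site-packages` layout with forward slashes; adjust the path to your own environment:

```shell
# Create the Parsivar spell-resource folder inside the virtualenv.
SPELL_DIR="venv/Lib/site-packages/parsivar/resource/spell"
mkdir -p "$SPELL_DIR"

# Then copy the two downloaded language-model files into it, e.g.:
# cp onegram.pckl mybigram_lm.pckl "$SPELL_DIR"
ls -d "$SPELL_DIR"
```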
## 📁 Project Structure
```
TextPrepX/
├── TEP.py                        # Main Streamlit app
├── persianstopwords.txt          # Custom Persian stopword list
├── models/
│   └── cnn-lstm-probwordnoise/   # (Optional) NeuSpell model folder for advanced spellcheck
└── requirements.txt
```
# 📌 Notes
- Persian spell correction is handled by Parsivar.
- For advanced English spell correction (NeuSpell), set up the model separately.
- You can extend the tool further by adding Named Entity Recognition (NER) or keyword extraction.
# 📷 Screenshots





# 🧑💻 Author
Created by Farhad Ghaherdoost. Feel free to fork and customize. 😄