{"id":27957464,"url":"https://github.com/farhad-here/textprepx","last_synced_at":"2025-05-07T18:13:09.725Z","repository":{"id":291834540,"uuid":"978754843","full_name":"farhad-here/TextPrepX","owner":"farhad-here","description":"A Multilingual Text Preprocessing Tool for English and Persian.","archived":false,"fork":false,"pushed_at":"2025-05-06T23:43:29.000Z","size":3921,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-07T18:13:04.553Z","etag":null,"topics":["cleantext","contractions","data-analysis","deep-learning","emoji","nlp","nltk","opp","parsivar","regex","streamlit","text-preprocessing","textblob"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/farhad-here.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-06T13:11:23.000Z","updated_at":"2025-05-06T23:43:32.000Z","dependencies_parsed_at":"2025-05-06T19:56:25.344Z","dependency_job_id":null,"html_url":"https://github.com/farhad-here/TextPrepX","commit_stats":null,"previous_names":["farhad-here/textprepx"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farhad-here%2FTextPrepX","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farhad-here%2FTextPrepX/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/farhad-here%2FTextPrepX/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/farhad-here%2FTextPrepX/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/farhad-here","download_url":"https://codeload.github.com/farhad-here/TextPrepX/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252931553,"owners_count":21827112,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cleantext","contractions","data-analysis","deep-learning","emoji","nlp","nltk","opp","parsivar","regex","streamlit","text-preprocessing","textblob"],"created_at":"2025-05-07T18:13:08.936Z","updated_at":"2025-05-07T18:13:09.719Z","avatar_url":"https://github.com/farhad-here.png","language":"Python","readme":"# TextPrepX (Multilingual Text Preprocessing)\n\nTextPrepX is a Streamlit-based web application for preprocessing text data in both **English** and **Persian**. 
It supports common preprocessing steps like lowercasing, removing punctuation and emojis, handling contractions, stemming, spell correction, and more.\n\n# 📄 Description:\nThis project is an interactive text preprocessing tool built with Streamlit, designed to clean and prepare both English and Persian texts for natural language processing (NLP) tasks.\n\nIt supports a wide range of preprocessing options, including:\n\n- Lowercasing\n- Removing punctuation, numbers, and emojis\n- Expanding contractions\n- Spell correction using TextBlob (for English) and Parsivar (for Persian)\n- Stopword removal (customizable for Persian)\n- Lemmatization and stemming\n- Tokenization (word and sentence level)\n- Repetition reduction and slang replacement\n- Unicode normalization and formatting cleanup\n\nThe Persian module leverages the Parsivar library, while the English module utilizes NLTK, TextBlob, and contractions for more nuanced cleaning. Users can either upload .txt files or enter raw text directly. Results are displayed in a styled, readable format.\n\nThis toolkit is ideal for data preprocessing in NLP pipelines, educational purposes, and rapid text cleaning for bilingual corpora.\n\n---\n\n## ✨ Features\n\n### ✅ English Text\n- Lowercasing\n- Removing numbers and punctuation\n- Handling contractions (e.g., can't → cannot)\n- Removing emojis\n- Spell correction using TextBlob\n- Stopword removal\n- Lemmatization + Stemming\n- Reducing repeated characters and slang normalization (e.g., gonna → going to)\n\n### ✅ Persian Text\n- Normalization using Parsivar\n- Custom stopword removal\n- Tokenization (words \u0026 sentences)\n- Stemming\n- Spell correction using Parsivar\n- Removing punctuation, numbers, and extra whitespace\n\n---\n\n## 🚀 How to Run\n\n1. Clone this repository or download the code.\n2. 
Install dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\n3. Create a `spell` folder in this path (the layout shown is that of a Windows virtual environment named `venv`):\n\n```\nvenv\\Lib\\site-packages\\parsivar\\resource\n```\n\n4. Download the two files below and place them in the `spell` folder, replacing any existing copies:\n\n```\n- onegram.pckl\n- mybigram_lm.pckl\n```\n\n#### 🔽\u003ca href='https://www.dropbox.com/scl/fi/4lspgdqw0yym6w2ewhcs7/spell.zip?e=3\u0026file_subpath=%2Fmybigram_lm.pckl\u0026rlkey=fl0moighiw7s46pgorz1xjtg0\u0026dl=0'\u003eDownload two files from here\u003c/a\u003e\n\n5. Start the app:\n\n```bash\nstreamlit run TEP.py\n```\n\nThe repository is laid out as follows:\n\n```\nTextPrepX/\n├── TEP.py                      # Main Streamlit app\n├── persianstopwords.txt        # Custom Persian stopword list\n├── models/\n│   └── cnn-lstm-probwordnoise/ # (Optional) NeuSpell model folder for advanced spellcheck\n└── requirements.txt\n```\n\n# 📌 Notes\nPersian spell correction is handled by Parsivar.\n\nFor advanced English spell correction (NeuSpell), set up the model separately.\n\nYou can extend the tool further by adding Named Entity Recognition (NER) or keyword extraction.\n\n# 📷 Screenshots\n![tt1](https://github.com/user-attachments/assets/507e2c86-f4ce-4df3-bf8b-6812f4012268)\n![tt2](https://github.com/user-attachments/assets/25f1eac9-6af0-4897-8634-3c0bc34a6f3e)\n![tt3](https://github.com/user-attachments/assets/a4e963bf-bd56-45f6-a0df-a56a099a681b)\n![tt4](https://github.com/user-attachments/assets/dafaaa0c-d63b-45ce-986f-aaa89645bceb)\n![tt5](https://github.com/user-attachments/assets/11951d21-6ae3-4990-a016-da4af5215de5)\n\n# 🧑‍💻 Author\nCreated by Farhad Ghaherdoost – Feel free to fork and customize. 😄\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffarhad-here%2Ftextprepx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffarhad-here%2Ftextprepx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffarhad-here%2Ftextprepx/lists"}