Projects in Awesome Lists tagged with text-cleaning
A curated list of projects in awesome lists tagged with text-cleaning.
https://github.com/adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
article-extractor corpus corpus-builder corpus-tools crawler html-to-markdown html2text news news-aggregator news-crawler nlp readability rss-feed scraping tei text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 14 Mar 2025
https://github.com/blmoistawinde/harvesttext
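Text mining and preprocessing tools (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.) using unsupervised or weakly supervised methods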
dependency-parser gitee harvesttext keyword-extraction named-entity-recognition new-word-discovery nlp pyhanlp sentiment-analysis text-cleaning text-segmentation text-summarization unsupervised
Last synced: 14 May 2025
https://github.com/blmoistawinde/HarvestText
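Text mining and preprocessing tools (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.) using unsupervised or weakly supervised methods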
dependency-parser gitee harvesttext keyword-extraction named-entity-recognition new-word-discovery nlp pyhanlp sentiment-analysis text-cleaning text-segmentation text-summarization unsupervised
Last synced: 18 Mar 2025
https://github.com/wisupai/e2m
E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.
doc2x e2m llm markdown pdf-to-markdown text-cleaning
Last synced: 15 May 2025
https://github.com/jfilter/clean-text
🧹 Python package for text cleaning
natural-language-processing nlp python python-package scraping text-cleaning text-normalization text-preprocessing user-generated-content
Last synced: 15 May 2025
https://github.com/trinker/textclean
Tools for cleaning and normalizing text data
data-munging emoticons r regex text-analysis text-cleaning
Last synced: 05 Apr 2025
https://github.com/rezach/grammarify
Grammarify is an npm package that safely cleans up text containing misspellings, improper capitalization, and lexical illusions, among other things.
grammar-checker spelling-correction text-cleaning
Last synced: 23 Nov 2024
https://github.com/hscspring/pnlp
NLP pre-/post-processing tools.
chinese-nlp concurrency nlp nlp-enhancer nlp-preprocess normalization preprocessing text-cleaning text-extraction text-length text-processing
Last synced: 17 Jan 2025
https://github.com/aayushpatel007/topicrankpy
A Python package to get useful information from documents using TopicRank Algorithm.
data-preprocessing email-parsing graph-algorithms hierarchical-clustering keyphrase-extraction keywords-extraction named-entity-recognition network-x nlp pagerank-python phone-parse spacy text-cleaning textrank topicrank
Last synced: 15 Feb 2025
https://github.com/johnjago/deformat
Remove extra whitespace from text.
formatter linebreak newline text-cleaning whitespace
Last synced: 15 Apr 2025
https://github.com/andythefactory/article-extraction-dataset
Article title, authors, date and body extraction dataset.
article-extractor corpus corpus-builder corpus-tools dataset datasets html-to-markdown html2text news news-aggregator news-crawler readability scraping scraping-websites text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 18 Feb 2025
https://github.com/ternaus/ternaus-cleantext
Cleans text as in the CLIP model
Last synced: 12 Apr 2025
https://github.com/1994nikunj/nlp-toolkit-desktop-app
The code is a collection of NLP analyses, including text cleaning, most common words, n-grams generation, co-occurrence matrix generation, wordcloud generation, topic modeling (using Latent Dirichlet Allocation), and general text statistics.
data-analysis n-grams network-visualization nlp python text-cleaning topic-modeling wordcloud-generator
Last synced: 25 Nov 2024
https://github.com/bhattbhavesh91/clean-text-demo
Tutorial on clean-text, a Python package for text cleaning
machine-learning natural-language-processing nlp python text-cleaning text-preprocessing tutorial user-generated-content
Last synced: 09 Mar 2025
https://github.com/sayamalt/news-category-classification
A news category classification model built on fine-tuned BERT that classifies news text into one of four categories: Politics, Business, Technology, or Entertainment.
bert-embeddings exploratory-data-analysis feature-engineering fine-tuning-bert model-evaluation nlp text-classification text-cleaning text-preprocessing text-tokenization
Last synced: 19 Feb 2025
https://github.com/Shawn91/DocTor
A tabular/list/plain text cleaner
table-cleaning text-cleaning text-process
Last synced: 15 Apr 2025
https://github.com/infinitode/valx
ValX is an open-source Python package for text cleaning tasks, including profanity detection and removal. It now also includes sensitive information detection and removal.
ai cleaner datasets nlp profanity-detection profanity-filter python removal sensitive-data sensitive-data-detection text-cleaning
Last synced: 21 Feb 2025
https://github.com/mrqadeer/text_prettifier
Python library designed to clean and preprocess text data by removing unwanted elements such as HTML tags, URLs, numbers, special characters, emojis, contractions, and stopwords. It offers flexible functionality, including options to return text in lowercase and as a list of tokens.
nltk-library python3 regular-expressions text-cleaning text-preprocessing
Last synced: 23 Mar 2025
https://github.com/youssef155/sentiment_analysis
Sentiment Analysis For Restaurant Reviews
flask jupyter-notebook nlp pkl-model python stemming stopwords text-cleaning
Last synced: 25 Feb 2025