An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with text-cleaning

A curated list of projects in awesome lists tagged with text-cleaning.

https://github.com/blmoistawinde/harvesttext

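Text mining and preprocessing toolkit (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods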

dependency-parser gitee harvesttext keyword-extraction named-entity-recognition new-word-discovery nlp pyhanlp sentiment-analysis text-cleaning text-segmentation text-summarization unsupervised

Last synced: 14 May 2025

https://github.com/blmoistawinde/HarvestText

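Text mining and preprocessing toolkit (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods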

dependency-parser gitee harvesttext keyword-extraction named-entity-recognition new-word-discovery nlp pyhanlp sentiment-analysis text-cleaning text-segmentation text-summarization unsupervised

Last synced: 18 Mar 2025

https://github.com/wisupai/e2m

E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.

doc2x e2m llm markdown pdf-to-markdown text-cleaning

Last synced: 15 May 2025

https://github.com/trinker/textclean

Tools for cleaning and normalizing text data

data-munging emoticons r regex text-analysis text-cleaning

Last synced: 05 Apr 2025

https://github.com/rezach/grammarify

Grammarify is an npm package that safely cleans up text containing misspellings, improper capitalization, and lexical illusions, among other things.

grammar-checker spelling-correction text-cleaning

Last synced: 23 Nov 2024

https://github.com/johnjago/deformat

Remove extra whitespace from text.

formatter linebreak newline text-cleaning whitespace

Last synced: 15 Apr 2025

https://github.com/ternaus/ternaus-cleantext

Cleans text as in the CLIP model

python text-cleaning

Last synced: 12 Apr 2025

https://github.com/1994nikunj/nlp-toolkit-desktop-app

The code is a collection of NLP analyses, including text cleaning, most common words, n-grams generation, co-occurrence matrix generation, wordcloud generation, topic modeling (using Latent Dirichlet Allocation), and general text statistics.

data-analysis n-grams network-visualization nlp python text-cleaning topic-modeling wordcloud-generator

Last synced: 25 Nov 2024

https://github.com/sayamalt/news-category-classification

A news category classification model built on fine-tuned BERT that classifies news text into one of four categories: Politics, Business, Technology, and Entertainment.

bert-embeddings exploratory-data-analysis feature-engineering fine-tuning-bert model-evaluation nlp text-classification text-cleaning text-preprocessing text-tokenization

Last synced: 19 Feb 2025

https://github.com/Shawn91/DocTor

A tabular/list/plain text cleaner

table-cleaning text-cleaning text-process

Last synced: 15 Apr 2025

https://github.com/infinitode/valx

ValX is an open-source Python package for text cleaning tasks, including profanity detection and removal. It now also includes sensitive information detection and removal.

ai cleaner datasets nlp profanity-detection profanity-filter python removal sensitive-data sensitive-data-detection text-cleaning

Last synced: 21 Feb 2025

https://github.com/mrqadeer/text_prettifier

Python library designed to clean and preprocess text data by removing unwanted elements such as HTML tags, URLs, numbers, special characters, emojis, contractions, and stopwords. It offers flexible functionality, including options to return text in lowercase and as a list of tokens.

nltk-library python3 regular-expressions text-cleaning text-preprocessing

Last synced: 23 Mar 2025