An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with text-preprocessing

A curated list of projects in awesome lists tagged with text-preprocessing.

https://github.com/lyeoni/prenlp

Preprocessing Library for Natural Language Processing

natural-language-processing nlp preprocessing-library text-preprocessing text-processing

Last synced: 10 Apr 2025

https://github.com/ezgisubasi/turkish-tweets-sentiment-analysis

This sentiment analysis project determines whether the tweets posted in the Turkish language on Twitter are positive or negative.

data-visualization deep-learning glove glove-embeddings keras n-grams nlp sentiment-analysis text-preprocessing turkish-language turkish-nlp tweets twitter-sentiment-analysis zemberek zemberek-nlp

Last synced: 25 Oct 2025

https://github.com/Lipairui/textgo

Text preprocessing, representation, similarity calculation, text search and classification. Let's go and play with text!

bert nlp text-classification text-preprocessing text-representation text-search text-similarity

Last synced: 18 Jul 2025

https://github.com/CDSoft/panda

Panda is a Pandoc Lua filter that works on Pandoc's internal AST. Panda is heavily inspired by [abp](http://cdelord.fr/abp), reimplemented as a Pandoc Lua filter.

lua pandoc pandoc-filter text-preprocessing

Last synced: 10 May 2025

https://github.com/tesserato/inscribe

Markdown preprocessor that runs code fences

markdown rust text-preprocessing

Last synced: 19 Oct 2025

https://github.com/venkat-0706/sentimental-analysis

Build a model to classify text as positive, negative, or neutral. Apply NLP techniques for preprocessing and machine learning for classification. Aim for accurate sentiment prediction on various text formats.

data-visualization feature-engineering machine-learning natural-language-processing numpy pandas python scikit-learn sentiment-detection supervised-learning text-classification text-preprocessing tokenizaiton wordcloud

Last synced: 10 Mar 2025

https://github.com/lanl/t-elf

Tensor Extraction of Latent Features (T-ELF). Within T-ELF's arsenal are non-negative matrix and tensor factorization solutions, equipped with automatic model determination (also known as the estimation of latent factors - rank) for accurate data modeling. Our software suite encompasses cutting-edge data pre-processing and post-processing modules.

blind-source-separation dimensionality-reduction feature-extraction gpu high-performance-computing hpc latent-variables machine-learning matrix matrix-completion matrix-factorization non-negative-matrix-factorization pattern-extraction semi-supervised-learning tensor-decomposition tensor-factorization tensors text-preprocessing unsupervised-learning

Last synced: 12 Apr 2025

https://github.com/byam/mnlp

MNLP: Mongolian Natural Language Processing.

hacktoberfest mongolian mongolian-text-classification nlp text-preprocessing

Last synced: 13 Sep 2025

https://github.com/sayamalt/resume-classification-using-fine-tuned-bert

Successfully developed a resume classification model which can accurately classify the resume of any person into its corresponding job with a tremendously high accuracy of more than 99%.

bert-model exploratory-data-analysis fine-tuning-bert model-evaluation nlp text-preprocessing text-tokenization word-embeddings

Last synced: 31 Aug 2025

https://github.com/sayamalt/language-detection-using-fine-tuned-xlm-roberta-base-transformer-model

Successfully developed a language detection transformer model that can accurately recognize the language in which any given text is written.

bert-fine-tuning feature-engineering fine-tuning model-evaluation model-evaluation-metrics nlp text-classification text-preprocessing xlm-roberta

Last synced: 31 Aug 2025

https://github.com/ailln/proces

๐Ÿจ text preprocess.

python-package python3 text-preprocessing text-processing

Last synced: 08 May 2025

https://github.com/sayamalt/emotion-detection-using-fine-tuned-bert-transformer

Successfully developed a fine-tuned BERT transformer model which can effectively perform emotion classification on any given piece of text to identify a suitable human emotion based on the semantic meaning of the text.

bert-transformer emotion-classification feature-engineering fine-tuning-bert natural-language-processing natural-language-understanding text-preprocessing

Last synced: 01 Sep 2025

https://github.com/bhattbhavesh91/texthero-demo

Tutorial demonstrating the power of Texthero, a library for text preprocessing, representation and visualization, from zero to hero.

nlp nlp-pipeline text-clustering text-mining text-preprocessing text-representation text-visualization texthero texthero-tutorial word-embeddings

Last synced: 26 Oct 2025

https://github.com/prashver/movie-recommendation-system

This recommendation system employs content-based filtering and NLP preprocessing to suggest similar movies based on user preferences and movie data. It fetches movie posters via APIs and is deployed on Streamlit for easy access.

api-request natural-language-processing nltk-python numpy pandas recommender-system streamlit-deployment text-preprocessing

Last synced: 12 Oct 2025

https://github.com/sayamalt/abstractive-text-summarization-of-news-articles

Successfully developed an encoder-decoder based sequence-to-sequence (Seq2Seq) model which can summarize the entire text of an Indian news summary into a short paragraph with a limited number of words.

attention-is-all-you-need attention-mechanism lstm-neural-networks natural-language-processing sequence-to-sequence-models text-generation text-preprocessing

Last synced: 09 Nov 2025

https://github.com/adilrasheed139/ai-powered-resume-screening-using-bert

Successfully developed a resume classification model which can accurately classify the resume of any person into its corresponding job with a tremendously high accuracy of more than 99%.

bert-model deep-learning exploratory-data-analysis-eda fine-tuning-bert model-evaluation nlp nlp-machine-learning text-preprocessing text-tokenization word-embeddings word-embeddings-for-nlp

Last synced: 03 Apr 2025

https://github.com/jesly-joji/spam-ham-classifier

Used Naive Bayes Algorithm, NLP Text Preprocessing Techniques

naive-bayes-classifier nlp scikit-learn streamlit text-preprocessing

Last synced: 18 Aug 2025

https://github.com/sd7campeon/yelp-sentiment-analysis-with-python-bs4-and-llm

A scalable pipeline for automated extraction, preprocessing, and sentiment analysis of Yelp reviews. Uses advanced HTTP requests, HTML parsing, and text normalization (tokenization, stopword removal, lemmatization) to enable precise polarity and subjectivity analysis for consumer insights and business analytics.

beautifulsoup beautifulsoup4 business-analytics cuda data-analysis nlp-machine-learning nltk opinion-mining pandas python python3 requests-library-python sentiment-analysis text-preprocessing textblob torch web-scraping yelp-reviews

Last synced: 18 Oct 2025

https://github.com/sayamalt/news-category-classification

Successfully developed a news category classification model using fine-tuned BERT which can accurately classify any news text into its respective category i.e. Politics, Business, Technology and Entertainment.

bert-embeddings exploratory-data-analysis feature-engineering fine-tuning-bert model-evaluation nlp text-classification text-cleaning text-preprocessing text-tokenization

Last synced: 15 Jun 2025

https://github.com/ssciwr/mailcom

Recognize and pseudonymize named entities in emails

anonymization data-privacy llm-inference pseudonymization text-preprocessing

Last synced: 21 Apr 2025

https://github.com/ajaykumar095/natural_language_processing

Explore cutting-edge Natural Language Processing (NLP) techniques in this GitHub repository. Includes pre-trained models, custom NLP pipelines, text preprocessing tools, sentiment analysis, text classification, and more. Ideal for research, learning, and deploying NLP solutions in Python.

ann nltk-python python rnn spacy tensorflow text-preprocessing textblob

Last synced: 20 Sep 2025

https://github.com/theveryhim/massive-text-processing-1

Cleaning, processing and analysis of a papers dataset using the PySpark (RDD) framework

big-data data-analysis frequent-itemsets massive-datasets pyspark text-preprocessing

Last synced: 03 Jul 2025

https://github.com/bhargav-joshi/nlp-practicals

Natural Language Processing practicals on different concepts, to analyze and understand their practical implementation and actual use.

chunking morphological-analysis n-grams name-entity-recognition natural-language-processing nlp nlp-machine-learning pos-tagging text-classification text-preprocessing topic-modeling

Last synced: 09 Apr 2025

https://github.com/imdeepmind/textpreprocessingscript

Text Preprocessing Script: This is a simple Python script that I use for preprocessing text using NLTK.

deep-learning machine-learning natural-language-processing nltk python3 text-preprocessing

Last synced: 22 Feb 2025

https://github.com/jasoncobra3/natural_language_processing

Natural Language Processing (NLP) is a captivating field at the intersection of computer science and linguistics. It enables machines to understand, interpret, and respond to human language in a way that is both meaningful and useful. From chatbots to sentiment analysis, NLP applications are transforming industries and enhancing user experiences.

artificial-intelligence artificial-neural-networks data-science deep-learning google-news-scraper machine-learning natural-language-processing nlp nlp-pipeline parts-of-speech pos python text text-analysis text-classification text-preprocessing text-representation word2vec-embeddinngs word2vec-model

Last synced: 16 Mar 2025

https://github.com/sayamalt/mental-health-classification-using-fine-tuned-distilbert

Successfully established a multiclass text classification model by fine-tuning a pretrained DistilBERT transformer model to classify several distinct types of mental health statuses such as anxiety, stress, personality disorder, etc. with an accuracy of 77%.

data-visualization deep-learning distilbert-fine-tuning distilbert-model model-evaluation model-inference model-training-and-evaluation multiclass-text-classification natural-language-processing text-classification text-preprocessing text-tokenization

Last synced: 08 Oct 2025

https://github.com/sayamalt/luxury-apparel-product-category-classification-using-fine-tuned-distilbert

Successfully developed a multiclass text classification model by fine-tuning a pretrained DistilBERT transformer model to classify various distinct types of luxury apparel into their respective categories, i.e. pants, accessories, underwear, shoes, etc.

deep-learning distilbert-fine-tuning distilbert-model exploratory-data-analysis fine-tuning-bert model-inference model-training-and-evaluation multiclass-classification natural-language-processing text-classification text-preprocessing text-tokenization

Last synced: 08 Oct 2025

https://github.com/antoniskl/un-general-debate-corpus-classification

The aim of this project is to classify UNGDC speeches with regard to climate change. As a secondary objective, the correlation between these speeches and the forestation and happiness index of the countries is examined.

classification data-science jupyter-notebook machine-learning nlp python regression scikit-learn text-classification text-preprocessing

Last synced: 09 Oct 2025

https://github.com/somjit101/nlp-casestudy-amazon-fine-foods-review

Efficient sentence encoding and vectorization techniques applied to customer reviews on a product page of the popular e-commerce website Amazon, using proven NLP techniques for the purpose of sentiment analysis.

amazon-fine-food-reviews amazon-fine-food-reviews-dataset featurization natural-language-processing nlp text-classification text-preprocessing tfidf-vectorizer vectorization word2vec

Last synced: 06 Mar 2025

https://github.com/moustafamohamed01/web-summarizer-ai

A Python tool to scrape and summarize website content using AI. Built with Selenium, BeautifulSoup, and Google's Gemini AI, this project extracts the main text from any website and generates a concise summary in markdown format. Perfect for quickly understanding long articles, blogs, or news pages.

ai beautifulsoup gemini-ai python selenium text-preprocessing web-scraping

Last synced: 17 Mar 2025

https://github.com/bilalhameed248/faq-chat-bot-using-vertexai

A generative AI-based FAQ Chat-Bot with a Flask Back-End, designed to operate within an organization's internal domain. - Jul 2023 - Oct 2023

csv embeddings flask gecko html java jquery natural-language-processing nlp python python3 pytorch text-bison text-embedding text-preprocessing vertex-ai

Last synced: 30 Dec 2025

https://github.com/vlada-pv/prediction-sociolinguistic-data-based-on-the-diaries-texts-of-the-prozhito-project

The repository contains notebooks created for collecting and preprocessing the corpus of diary entries and for experiments on creating models for predicting gender, age groups of authors and the time period of text creation.

author-profiling bag-of-words bilstm convol convolutional-neural-networks deep-learning diary-entries logistic-regression naive-bayes-classifier neural-networks recurrent-neural-networks sociolinguistics text-preprocessing text-vectorization tf-idf-vectorizer word-embeddings

Last synced: 13 Jul 2025

https://github.com/theveryhim/massive-text-processing

Cleaning, processing and analysis of a papers dataset using the PySpark (RDD) framework

big-data data-analysis frequent-itemsets massive-datasets pyspark text-preprocessing

Last synced: 18 Jul 2025

https://github.com/nurfawaiq/nlp-text-preprocessing

Natural Language Processing - Text Preprocessing

crawling nlp python3 text-preprocessing twitter

Last synced: 11 Sep 2025

https://github.com/abinashsahoo007/project-resume-classification

The document classification solution should significantly reduce the manual human effort in the HRM. It should achieve a higher level of accuracy and automation with minimal human intervention.

corpus count-vectorizer label-encoding lemmitization machine-learning nltk part-of-speech-tagging resume-classification spacy stemming text-mining text-preprocessing textract tfidf-vectorizer tokenization wordcloud

Last synced: 17 Jun 2025

https://github.com/arnab-0053/song-identifier

It identifies songs and artists from lyric snippets using two distinct methods - a simple NLP-based approach and a BM25 (Best Match 25) approach.

bm25 nlp nltk python rank-bm25 scikit-learn song-lyrics spotify-dataset text-preprocessing

Last synced: 05 Mar 2025
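
For readers unfamiliar with BM25, the ranking idea mentioned in the entry above can be sketched roughly as follows. This is a minimal standard-library illustration of the common Okapi BM25 formula, not the code of the listed project; the function names, the tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the sample snippets are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query.

    k1 and b are the usual BM25 tuning parameters; the defaults here
    are common choices, not values taken from the listed project.
    """
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            # Term frequency saturated by k1 and normalized by document length.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical snippets, invented for illustration only.
snippets = [
    "blue river runs slow",
    "dancing under city lights",
    "city lights fade at dawn",
]
scores = bm25_scores("dancing city lights", snippets)
best = scores.index(max(scores))
```

The snippet with the highest score is the best match for the query; snippets sharing no term with the query score zero.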

https://github.com/farhad-here/textprepx

A Multilingual Text Preprocessing Tool for English and Persian.

cleantext contractions data-analysis deep-learning emoji nlp nltk opp parsivar regex streamlit text-preprocessing textblob

Last synced: 07 May 2025

https://github.com/sayamalt/fake-news-classification-using-fine-tuned-bert

Successfully developed a text classification model to predict whether a given news text is fake or not by fine-tuning a pretrained BERT transformer model imported from Hugging Face.

bert-embeddings bert-model data-analysis data-visualization deep-learning fine-tuning-bert model-evaluation model-training-and-evaluation text-classification text-preprocessing text-tokenization tokenizer-nlp wordcloud-visualization

Last synced: 05 Apr 2025

https://github.com/farshad-hasanpour/textfeature

Transforms unstructured text into feature vectors using word2vec, lexicon and ...

bag-of-words python text-preprocessing text2vec word2vec

Last synced: 26 Feb 2025

https://github.com/kunalpisolkar24/ir_lab

Collection of practical codes for Savitribai Phule Pune University's Information Retrieval Lab (410247).

cosine-similarity information-retrieval map-reduce pagerank sppu-computer-engineering text-preprocessing web-crawling

Last synced: 05 Mar 2025

https://github.com/sayamalt/cyberbullying-classification-using-fine-tuned-distilbert

Successfully fine-tuned a pretrained DistilBERT transformer model that can classify social media text data into one of 4 cyberbullying labels i.e. ethnicity/race, gender/sexual, religion and not cyberbullying with a remarkable accuracy of 99%.

cyberbullying-detection data-exploration distilbert-model exploratory-data-analysis fine-tune-bert-tensorflow llm model-inference model-training-and-evaluation multiclass-classification natural-language-processing text-classification text-preprocessing text-tokenization

Last synced: 09 Nov 2025

https://github.com/sayamalt/text-similarity-quantifier

Successfully developed a machine learning model for computing the similarity score between two text paragraphs taken as input from a webpage.

bag-of-words cosine-similarity cosine-similarity-scores countvectorizer flask machine-learning nlp pandas python text-preprocessing tfidf

Last synced: 09 Nov 2025
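
The tags of the entry above mention bag-of-words and cosine similarity. As a rough illustration of how those two pieces fit together, here is a minimal standard-library sketch; it is not the listed project's implementation, and the function names and the simple regex tokenizer are assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


similarity = cosine_similarity("the cat sat", "the cat ran")
```

Identical texts score 1.0 (up to floating-point rounding), and texts with no words in common score 0.0. A TF-IDF weighting, as the tags suggest, would replace the raw counts with weighted ones while leaving the cosine step unchanged.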

https://github.com/sayamalt/quora-duplicate-question-pairs-identification

Successfully developed a machine learning model which can accurately detect whether any given pair of Quora questions are duplicates or not.

data-visualization machine-learning natural-language-processing nltk paraphrase-detection text-preprocessing

Last synced: 09 Nov 2025

https://github.com/sayamalt/e-commerce-text-classification

Successfully established a machine learning model that can accurately classify an e-commerce product into one of four categories, namely "Books", "Clothing & Accessories", "Household" and "Electronics", based on the product's description.

categorical-encoding cross-validation exploratory-data-analysis hyperparameter-optimization machine-learning model-deployment model-training-and-evaluation text-classification text-preprocessing text-vectorization

Last synced: 09 Nov 2025

https://github.com/sayamalt/detection-of-disaster-from-tweets

Successfully established a machine learning model for detecting whether a given tweet is about a real disaster or not.

data-cleaning eda feature-engineering feature-extraction machine-learning natural-language-processing sklearn text-preprocessing

Last synced: 09 Nov 2025

https://github.com/gaaniruddha/fit5196-a1

This repository contains assignment #1, which was completed as part of "FIT5196 Data Wrangling", taught at Monash Uni in S2 2020.

bigrams count-vectorizer langid parsing-text python regular-expressions text-preprocessing unigrams

Last synced: 26 Feb 2025

https://github.com/mrqadeer/internet_words_remover

Python module designed to replace common internet slang and abbreviations with their full forms, enhancing the readability of informal text. It efficiently cleans text data from chats, social media, and online communication. The module also supports tokenization and integrates seamlessly with pandas for batch processing of text in DataFrames.

pandas python3 text-preprocessing

Last synced: 23 Mar 2025

https://github.com/michel-nemo/yelp-sentiment-analysis-with-python-bs4-and-llm

Explore the Yelp-Sentiment-Analysis-with-Python-BS4-and-LLM repository to extract and analyze customer reviews efficiently. This project utilizes powerful Python libraries for scraping, processing, and visualizing sentiment data. 🚀🐙

beautifulsoup beautifulsoup4 business-analytics cuda data-analysis nltk opinion-mining pandas python requests-library-python sentiment-analysis text-preprocessing textblob torch web-scraping yelp-reviews

Last synced: 15 Jun 2025

https://github.com/mrqadeer/text_prettifier

Python library designed to clean and preprocess text data by removing unwanted elements such as HTML tags, URLs, numbers, special characters, emojis, contractions, and stopwords. It offers flexible functionality, including options to return text in lowercase and as a list of tokens.

nltk-library python3 regular-expressions text-cleaning text-preprocessing

Last synced: 23 Mar 2025
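
The kind of cleaning described in the entry above (removing HTML tags, URLs, numbers, special characters and stopwords, with optional lowercasing and token output) can be approximated with a few regular expressions. The sketch below is only an illustration under stated assumptions: `clean_text`, its parameters and the tiny `STOPWORDS` set are hypothetical names for this example and are not the API of the listed library, and contraction expansion is not covered.

```python
import re

# A tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = frozenset({"a", "an", "and", "the", "is", "of", "to", "in"})


def clean_text(text, lowercase=True, as_tokens=False):
    """Strip common noise from text (hypothetical helper for illustration)."""
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\d+", " ", text)                    # numbers
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # punctuation, emojis, other symbols
    if lowercase:
        text = text.lower()
    # Stopword check is case-insensitive so it works with lowercase=False too.
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return tokens if as_tokens else " ".join(tokens)


sample = "<p>Visit https://example.com now! The 3 cats</p>"
cleaned = clean_text(sample)
tokens = clean_text(sample, as_tokens=True)
```

Tags are removed before URLs so that a closing tag directly following a link is not swallowed by the URL pattern.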

https://github.com/sayamalt/english-to-german-translation-using-seq2seq

Successfully established a neural machine translation model using sequence to sequence modeling which can successfully translate English sentences to their corresponding German translations.

natural-language-processing neural-language-translation sequence-to-sequence-models text-generation-using-lstm text-preprocessing

Last synced: 02 Jul 2025