Projects in Awesome Lists tagged with text-preprocessing

https://github.com/adbar/trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

article-extractor corpus corpus-builder corpus-tools crawler html-to-markdown html2text news news-aggregator news-crawler nlp readability rss-feed scraping tei text-cleaning text-extraction text-mining text-preprocessing web-scraping

Last synced: 24 Dec 2025

https://github.com/jbesomi/texthero

Text preprocessing, representation and visualization from zero to hero.

machine-learning nlp nlp-pipeline text-clustering text-mining text-preprocessing text-representation text-visualization texthero word-embeddings

Last synced: 14 May 2025

https://github.com/jfilter/clean-text

🧹 Python package for text cleaning

natural-language-processing nlp python python-package scraping text-cleaning text-normalization text-preprocessing user-generated-content

Last synced: 28 Jan 2026

https://github.com/lyeoni/prenlp

Preprocessing Library for Natural Language Processing

natural-language-processing nlp preprocessing-library text-preprocessing text-processing

Last synced: 10 Apr 2025

https://github.com/ezgisubasi/turkish-tweets-sentiment-analysis

This sentiment analysis project determines whether the tweets posted in the Turkish language on Twitter are positive or negative.

data-visualization deep-learning glove glove-embeddings keras n-grams nlp sentiment-analysis text-preprocessing turkish-language turkish-nlp tweets twitter-sentiment-analysis zemberek zemberek-nlp

Last synced: 25 Oct 2025

https://github.com/Lipairui/textgo

Text preprocessing, representation, similarity calculation, text search and classification. Let's go and play with text!

bert nlp text-classification text-preprocessing text-representation text-search text-similarity

Last synced: 18 Jul 2025

https://github.com/CDSoft/panda

Panda is a Pandoc Lua filter that works on Pandoc's internal AST. Panda is heavily inspired by [abp](http://cdelord.fr/abp), reimplemented as a Pandoc Lua filter.

lua pandoc pandoc-filter text-preprocessing

Last synced: 10 May 2025

https://github.com/jangedoo/jange

Easy NLP in Python

clustering nlp nlp-library python3 text text-classification text-preprocessing topic-modeling visualization

Last synced: 14 Jan 2026

https://github.com/danielhaim1/titlecaser

A powerful utility for transforming text to title case with support for multiple style guides and extensive customization options.

acronym-identification apa-style case-conversion case-converter case-formatting headline-optimization javascript sentence-case string-manipulation string-utils style-guide text-parser text-preprocessing text-processing text-transformation text-utils title-case-converter titlecase titlecasing word-casing

Last synced: 14 Feb 2026

https://github.com/VivekChoudhary77/Textify-text-Preprocessing

A text preprocessing web application

text-generation text-preprocessing text-summarization text-summarizer

Last synced: 15 Apr 2025

https://github.com/tesserato/inscribe

Markdown preprocessor that runs code fences

markdown rust text-preprocessing

Last synced: 19 Oct 2025

https://github.com/venkat-0706/sentimental-analysis

Build a model to classify text as positive, negative, or neutral. Apply NLP techniques for preprocessing and machine learning for classification. Aim for accurate sentiment prediction on various text formats.

data-visualization feature-engineering machine-learning natural-language-processing numpy pandas python scikit-learn sentiment-detection supervised-learning text-classification text-preprocessing tokenizaiton wordcloud

Last synced: 10 Mar 2025

https://github.com/khuyentran1401/extract-text-from-article

data-science natural-language-processing newspaper3k nltk python text-preprocessing web-scraping

Last synced: 13 Apr 2025

https://github.com/lanl/t-elf

Tensor Extraction of Latent Features (T-ELF). Within T-ELF's arsenal are non-negative matrix and tensor factorization solutions, equipped with automatic model determination (also known as the estimation of latent factors - rank) for accurate data modeling. Our software suite encompasses cutting-edge data pre-processing and post-processing modules.

blind-source-separation dimensionality-reduction feature-extraction gpu high-performance-computing hpc latent-variables machine-learning matrix matrix-completion matrix-factorization non-negative-matrix-factorization pattern-extraction semi-supervised-learning tensor-decomposition tensor-factorization tensors text-preprocessing unsupervised-learning

Last synced: 12 Apr 2025

https://github.com/giocoal/reddit-tldr-summarizer-and-topic-modeling

Extreme Extractive Text Summarization and Topic Modeling (using LSA and LDA techniques) over Reddit Posts from TLDRHQ dataset.

extreme-summarization latent-dirichlet-allocation latent-semantic-analysis lda lda-model lsa lsa-model nlp part-of-speech-tagging reddit reddit-bot reddit-dataset social-media summarization text-analysis text-preprocessing text-summarization tldr tldr9 topic-modeling

Last synced: 11 Mar 2025

https://github.com/sayamalt/resume-classification-using-fine-tuned-bert

Successfully developed a resume classification model which can accurately classify the resume of any person into its corresponding job with a tremendously high accuracy of more than 99%.

bert-model exploratory-data-analysis fine-tuning-bert model-evaluation nlp text-preprocessing text-tokenization word-embeddings

Last synced: 31 Aug 2025

https://github.com/andythefactory/article-extraction-dataset

Article title, authors, date and body extraction dataset.

article-extractor corpus corpus-builder corpus-tools dataset datasets html-to-markdown html2text news news-aggregator news-crawler readability scraping scraping-websites text-cleaning text-extraction text-mining text-preprocessing web-scraping

Last synced: 27 Jan 2026

https://github.com/byam/mnlp

MNLP: Mongolian Natural Language Processing.

hacktoberfest mongolian mongolian-text-classification nlp text-preprocessing

Last synced: 13 Sep 2025

https://github.com/sayamalt/language-detection-using-fine-tuned-xlm-roberta-base-transformer-model

Successfully developed a language detection transformer model that can accurately recognize the language in which any given text is written.

bert-fine-tuning feature-engineering fine-tuning model-evaluation model-evaluation-metrics nlp text-classification text-preprocessing xlm-roberta

Last synced: 31 Aug 2025

https://github.com/ailln/proces

🐨 text preprocess.

python-package python3 text-preprocessing text-processing

Last synced: 08 May 2025

https://github.com/sayamalt/emotion-detection-using-fine-tuned-bert-transformer

Successfully developed a fine-tuned BERT transformer model which can effectively perform emotion classification on any given piece of text to identify a suitable human emotion based on the semantic meaning of the text.

bert-transformer emotion-classification feature-engineering fine-tuning-bert natural-language-processing natural-language-understanding text-preprocessing

Last synced: 10 Jan 2026

https://github.com/bhattbhavesh91/texthero-demo

Tutorial to demonstrate the power of Texthero which is a library used for Text preprocessing, representation and visualization from zero to hero.

nlp nlp-pipeline text-clustering text-mining text-preprocessing text-representation text-visualization texthero texthero-tutorial word-embeddings

Last synced: 26 Oct 2025

https://github.com/sayamalt/sms-spam-classification-using-fine-tuned-roberta-base-transformer

Successfully developed a fine-tuned RoBERTa transformer model which can almost perfectly classify whether any given SMS is spam or not.

bert-embeddings bert-fine-tuning feature-engineering natural-language-processing natural-language-understanding roberta text-classification text-preprocessing transformers

Last synced: 06 Aug 2025

https://github.com/bhattbhavesh91/clean-text-demo

Tutorial on Clean-Text which is a Python package for text cleaning

machine-learning natural-language-processing nlp python text-cleaning text-preprocessing tutorial user-generated-content

Last synced: 19 May 2026

https://github.com/prashver/movie-recommendation-system

This recommendation system employs content-based filtering and NLP preprocessing to suggest similar movies based on user preferences and movie data. It fetches movie posters via APIs and is deployed on Streamlit for easy access.

api-request natural-language-processing nltk-python numpy pandas recommender-system streamlit-deployment text-preprocessing

Last synced: 06 May 2026

https://github.com/bilalhameed248/faq-chat-bot-using-vertexai

A generative AI-based FAQ Chat-Bot with a Flask Back-End, designed to operate within an organization's internal domain (Jul 2023 - Oct 2023).

csv embeddings flask gecko html java jquery natural-language-processing nlp python python3 pytorch text-bison text-embedding text-preprocessing vertex-ai

Last synced: 06 Apr 2026

https://github.com/sayamalt/abstractive-text-summarization-of-news-articles

Successfully developed an encoder-decoder based sequence to sequence (Seq2Seq) model which can summarize the entire text of an Indian news article into a short paragraph with a limited number of words.

attention-is-all-you-need attention-mechanism lstm-neural-networks natural-language-processing sequence-to-sequence-models text-generation text-preprocessing

Last synced: 09 Nov 2025

https://github.com/sd7campeon/yelp-sentiment-analysis-with-python-bs4-and-llm

A scalable pipeline for automated extraction, preprocessing, and sentiment analysis of Yelp reviews. Uses advanced HTTP requests, HTML parsing, and text normalization (tokenization, stopword removal, lemmatization) to enable precise polarity and subjectivity analysis for consumer insights and business analytics.

beautifulsoup beautifulsoup4 business-analytics cuda data-analysis nlp-machine-learning nltk opinion-mining pandas python python3 requests-library-python sentiment-analysis text-preprocessing textblob torch web-scraping yelp-reviews

Last synced: 06 May 2026

https://github.com/ajaykumar095/natural_language_processing

Explore cutting-edge Natural Language Processing (NLP) techniques in this GitHub repository. Includes pre-trained models, custom NLP pipelines, text preprocessing tools, sentiment analysis, text classification, and more. Ideal for research, learning, and deploying NLP solutions in Python.

ann nltk-python python rnn spacy tensorflow text-preprocessing textblob

Last synced: 07 May 2026

https://github.com/ssciwr/mailcom

Recognize and pseudonymize named entities in emails

anonymization data-privacy llm-inference pseudonymization text-preprocessing

Last synced: 21 Apr 2025

https://github.com/jesly-joji/spam-ham-classifier

Used Naive Bayes Algorithm, NLP Text Preprocessing Techniques

naive-bayes-classifier nlp scikit-learn streamlit text-preprocessing

Last synced: 03 May 2026

https://github.com/sayamalt/news-category-classification

Successfully developed a news category classification model using fine-tuned BERT which can accurately classify any news text into its respective category i.e. Politics, Business, Technology and Entertainment.

bert-embeddings exploratory-data-analysis feature-engineering fine-tuning-bert model-evaluation nlp text-classification text-cleaning text-preprocessing text-tokenization

Last synced: 15 Jun 2025

https://github.com/adilrasheed139/ai-powered-resume-screening-using-bert

Successfully developed a resume classification model which can accurately classify the resume of any person into its corresponding job with a tremendously high accuracy of more than 99%.

bert-model deep-learning exploratory-data-analysis-eda fine-tuning-bert model-evaluation nlp nlp-machine-learning text-preprocessing text-tokenization word-embeddings word-embeddings-for-nlp

Last synced: 03 Apr 2025

https://github.com/shakilgithub20/text-preprocessing

contractions corpus-processing matplotlib-venn nltk preprocessing text-preprocessing

Last synced: 27 Feb 2025

https://github.com/hariprasath-v/machinehack-sentiment_analysis_weekend_hackathon_edition_2

Sentiment classification of reviews/tweets

nlp-machine-learning regular-expression seaborn text-preprocessing

Last synced: 04 Jun 2026

https://github.com/ismielabir/txtcleanen

txtcleanen

nlp python-package text-cleaning text-preprocessing

Last synced: 25 Apr 2026

https://github.com/arnab-0053/song-identifier

It identifies songs and artists from lyric snippets using two distinct methods: a simple NLP-based approach and a BM25 (Best Match 25) approach.

bm25 nlp nltk python rank-bm25 scikit-learn song-lyrics spotify-dataset text-preprocessing

Last synced: 28 Apr 2026

https://github.com/farhad-here/textprepx

A Multilingual Text Preprocessing Tool for English and Persian.

cleantext contractions data-analysis deep-learning emoji nlp nltk opp parsivar regex streamlit text-preprocessing textblob

Last synced: 29 Apr 2026

https://github.com/sayamalt/detection-of-disaster-from-tweets

Successfully established a machine learning model for detecting whether a given tweet is about a real disaster or not.

data-cleaning eda feature-engineering feature-extraction machine-learning natural-language-processing sklearn text-preprocessing

Last synced: 30 Apr 2026

https://github.com/mrqadeer/internet_words_remover

Python module designed to replace common internet slang and abbreviations with their full forms, enhancing the readability of informal text. It efficiently cleans text data from chats, social media, and online communication. The module also supports tokenization and integrates seamlessly with pandas for batch processing of text in DataFrames.

pandas python3 text-preprocessing

Last synced: 03 May 2026
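The slang-expansion idea described in the entry above can be illustrated with a minimal sketch. The `SLANG` table and the `expand_slang` function are hypothetical names invented for this example and are not taken from the internet_words_remover package; the sketch only shows a dictionary-based replacement that matches whole words.

```python
import re

# Hypothetical slang table for illustration; not the package's actual data.
SLANG = {
    "brb": "be right back",
    "idk": "I do not know",
    "imo": "in my opinion",
}

# One pattern that matches any known abbreviation as a whole word,
# so "imo" inside "kimono" is left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SLANG)) + r")\b", re.IGNORECASE
)

def expand_slang(text: str) -> str:
    """Replace each known abbreviation with its full form."""
    return _PATTERN.sub(lambda m: SLANG[m.group(0).lower()], text)

print(expand_slang("IDK, brb"))  # prints "I do not know, be right back"
```

Matching on word boundaries keeps the replacement from touching longer words that happen to contain an abbreviation. For batch processing, a function like this could be passed to a pandas column via `Series.apply`.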

https://github.com/kunalpisolkar24/ir_lab

Collection of practical codes for Savitribai Phule Pune University's Information Retrieval Lab (410247).

Last synced: 09 Jun 2026

https://github.com/antoniskl/un-general-debate-corpus-classification

The aim of this project is to classify UNGDC speeches with regard to climate change. As a secondary objective, it examines the correlation between these speeches, the forestation and the happiness index of the countries.

classification data-science jupyter-notebook machine-learning nlp python regression scikit-learn text-classification text-preprocessing

Last synced: 05 May 2026

https://github.com/fardinhash/bert--text-preprocessing

bert natural-language-processing python text-preprocessing tokenization

Last synced: 09 May 2026

https://github.com/pngo1997/n-gram-language-models

Builds N-gram language models and applies text generation.

bigrams cfd conditional-frequency-distribution greedy-algorithms laplace-smoothing natural-language-processing ngrams nltk nucleus-sampling perplexity python random-sampling text-generation text-preprocessing trigrams unigram

Last synced: 14 May 2026
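As a rough illustration of the Laplace smoothing named in the tags above, a bigram model can be sketched with the standard library alone. This is not the repository's code (its tags suggest it uses NLTK); `train_bigram` and `laplace_prob` are hypothetical helper names, and the vocabulary is assumed to be the set of tokens seen in training.

```python
from collections import Counter

def train_bigram(tokens):
    """Count unigrams and adjacent token pairs in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def laplace_prob(prev, word, unigrams, bigrams):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    vocab_size = len(unigrams)  # assumption: vocabulary = observed tokens
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

tokens = "the cat sat on the mat".split()
unigrams, bigrams = train_bigram(tokens)

seen = laplace_prob("the", "cat", unigrams, bigrams)    # (1 + 1) / (2 + 5)
unseen = laplace_prob("the", "dog", unigrams, bigrams)  # (0 + 1) / (2 + 5)
```

Add-one smoothing gives unseen pairs such as ("the", "dog") a small non-zero probability, which keeps perplexity finite on held-out text.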

https://github.com/pngo1997/text-processing-tokenization

Simple text analysis and tokenization.

nltk python sentence-segmentation text-preprocessing tokenization word-frequency zipfs-law

Last synced: 14 May 2026

https://github.com/imdeepmind/textpreprocessingscript

Text Preprocessing Script: This is a simple Python script that I use for preprocessing text using NLTK.

deep-learning machine-learning natural-language-processing nltk python3 text-preprocessing

Last synced: 16 May 2026

https://github.com/abinashsahoo007/project-resume-classification

The document classification solution should significantly reduce the manual human effort in the HRM. It should achieve a higher level of accuracy and automation with minimal human intervention.

corpus count-vectorizer label-encoding lemmitization machine-learning nltk part-of-speech-tagging resume-classification spacy stemming text-mining text-preprocessing textract tfidf-vectorizer tokenization wordcloud

Last synced: 02 Feb 2026

https://github.com/vlada-pv/prediction-sociolinguistic-data-based-on-the-diaries-texts-of-the-prozhito-project

The repository contains notebooks created for collecting and preprocessing the corpus of diary entries and for experiments on creating models for predicting gender, age groups of authors and the time period of text creation.

author-profiling bag-of-words bilstm convol convolutional-neural-networks deep-learning diary-entries logistic-regression naive-bayes-classifier neural-networks recurrent-neural-networks sociolinguistics text-preprocessing text-vectorization tf-idf-vectorizer word-embeddings

Last synced: 13 Jul 2025

https://github.com/theveryhim/massive-text-processing

Cleaning, processing and analysis of a papers dataset using the PySpark (RDD) framework

big-data data-analysis frequent-itemsets massive-datasets pyspark text-preprocessing

Last synced: 18 Jul 2025

https://github.com/pantpujan017/nenglish-stopwords-chat-analysis

Nepali stop words

chat-analysis code-mixed-language messenger-nlp nenglish nepali-nlp social-media-nlp stopwords text-preprocessing viber-chat whatsapp-chat

Last synced: 20 Jul 2025

https://github.com/oya163/datascience101

Pushing things as I do data science stuff

data-science dataset embeddings quora text-preprocessing traditional-machine-learning

Last synced: 08 Jun 2026

https://github.com/vishnun0027/sentiment-analysis

Several ways to perform sentiment analysis on text data, with varying degrees of complexity and accuracy

bag-of-words deep-learning deep-neural-networks deeplearning gru logistic-regression lstm machine-learning nltk-python rnn rnn-model sentiment-analysis svm-model tensorflow tensorflow2 text-classification text-preprocessing tf-idf tokenizer

Last synced: 20 May 2026

https://github.com/sayamalt/fake-news-classification-using-fine-tuned-bert

Successfully developed a text classification model to predict whether a given news text is fake or not by fine-tuning a pretrained BERT transformer model imported from Hugging Face.

bert-embeddings bert-model data-analysis data-visualization deep-learning fine-tuning-bert model-evaluation model-training-and-evaluation text-classification text-preprocessing text-tokenization tokenizer-nlp wordcloud-visualization

Last synced: 05 Apr 2025

https://github.com/farshad-hasanpour/textfeature

Transforms unstructured text into a feature vector using word2vec, lexicon and ...

bag-of-words python text-preprocessing text2vec word2vec

Last synced: 26 Feb 2025

https://github.com/sayamalt/symptoms-disease-text-classification

Successfully developed a fine-tuned BERT transformer model which can classify symptoms to their corresponding diseases with an accuracy of up to 89%.

bert-fine-tuning data-exploration-and-preprocessing exploratory-data-analysis fine-tune-bert-tensorflow hugging-face-transformers model-architecture-and-implementation model-inference model-training-and-evaluation multiclass-classification natural-language-processing text-classification text-preprocessing text-tokenization

Last synced: 09 Nov 2025

https://github.com/sayamalt/english-to-spanish-language-translation-using-seq2seq-and-attention

Successfully established a Seq2Seq model with attention which can perform English to Spanish language translation with an accuracy of almost 97%.

attention-is-all-you-need attention-model bert-transformer exploratory-data-analysis fine-tuning-bert hugging-face-transformers language-translation luong-attention model-architecture-and-implementation model-inference model-training-and-evaluation natural-language-processing neural-machine-translation seq2seq-modeling text-generation text-preprocessing text-tokenization

Last synced: 09 Nov 2025

https://github.com/sayamalt/cyberbullying-classification-using-fine-tuned-distilbert

Successfully fine-tuned a pretrained DistilBERT transformer model that can classify social media text data into one of 4 cyberbullying labels i.e. ethnicity/race, gender/sexual, religion and not cyberbullying with a remarkable accuracy of 99%.

cyberbullying-detection data-exploration distilbert-model exploratory-data-analysis fine-tune-bert-tensorflow llm model-inference model-training-and-evaluation multiclass-classification natural-language-processing text-classification text-preprocessing text-tokenization

Last synced: 09 Nov 2025

https://github.com/sayamalt/quora-duplicate-question-pairs-identification

Successfully developed a machine learning model which can accurately detect whether any given pair of Quora questions are duplicate or not.

data-visualization machine-learning natural-language-processing nltk paraphrase-detection text-preprocessing

Last synced: 09 Nov 2025

https://github.com/sayamalt/e-commerce-text-classification

Successfully established a machine learning model that can accurately classify an e-commerce product into one of four categories, namely "Books", "Clothing & Accessories", "Household" and "Electronics", based on the product's description.

categorical-encoding cross-validation exploratory-data-analysis hyperparameter-optimization machine-learning model-deployment model-training-and-evaluation text-classification text-preprocessing text-vectorization

Last synced: 09 Nov 2025

https://github.com/atheeralzhrani/arabic_nlp

This repository contains projects focused on Arabic Natural Language Processing (NLP)

arabic-dataset arabic-language arabic-language-dataset arabic-nlp arabic-text-classification arabic-text-detection arabic-text-recognition huggingface spacy-nlp stemming text-preprocessing tokenization

Last synced: 16 Oct 2025

https://github.com/mrqadeer/text_prettifier

Python library designed to clean and preprocess text data by removing unwanted elements such as HTML tags, URLs, numbers, special characters, emojis, contractions, and stopwords. It offers flexible functionality, including options to return text in lowercase and as a list of tokens.

nltk-library python3 regular-expressions text-cleaning text-preprocessing

Last synced: 23 Mar 2025
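The cleaning steps listed in the entry above (HTML tags, URLs, numbers, special characters, stopwords, optional lowercasing and token output) can be sketched with regular expressions. This is an illustrative approximation and not the text_prettifier API: `clean_text`, its parameters and the tiny `STOPWORDS` set are assumptions made for the example, and contraction handling is left out.

```python
import re

# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "at", "of", "and", "to"}

def clean_text(text: str, lowercase: bool = True, as_tokens: bool = False):
    """Strip markup and noise, then drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    # Keep ASCII letters and whitespace only; this drops digits, punctuation
    # and emojis, and (as a limitation) non-English letters too.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    if lowercase:
        text = text.lower()
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return tokens if as_tokens else " ".join(tokens)

raw = "<b>The price is 42 dollars</b> see https://example.com/x?y=1 now!"
print(clean_text(raw))  # prints "price dollars see now"
```

Removing tags and URLs before the character filter matters: once punctuation is stripped, a URL would fall apart into stray word fragments that are harder to detect.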

https://github.com/sayamalt/english-to-german-translation-using-seq2seq

Successfully established a neural machine translation model using sequence to sequence modeling which can translate English sentences to their corresponding German translations.

natural-language-processing neural-language-translation sequence-to-sequence-models text-generation-using-lstm text-preprocessing

Last synced: 02 Jul 2025

https://github.com/pngo1997/word-embeddings-co-occurrence-svd-glove

Explores word embeddings.

co-occurence-matrix embeddings glove glove-embeddings nltk python singular-value-decomposition text-preprocessing

Last synced: 16 May 2026

https://github.com/theveryhim/massive-text-processing-1

Cleaning, processing and analysis of a papers dataset using the PySpark (RDD) framework

big-data data-analysis frequent-itemsets massive-datasets pyspark text-preprocessing

Last synced: 03 Jul 2025

https://github.com/bhargav-joshi/nlp-practicals

Natural Language Processing practicals on different concepts, to analyze and understand their practical implementation and actual use.

chunking morphological-analysis n-grams name-entity-recognition natural-language-processing nlp nlp-machine-learning pos-tagging text-classification text-preprocessing topic-modeling

Last synced: 09 Apr 2025

https://github.com/nurfawaiq/nlp-text-preprocessing

Natural Language Processing - Text Preprocessing

crawling nlp python3 text-preprocessing twitter

Last synced: 12 Apr 2026

https://github.com/evanch98/natural-language-processing-python

Natural Language Processing

jupyter-notebook n-grams natural-language-processing python stemming-and-lemmatization text-preprocessing tokenization

Last synced: 13 May 2026

https://github.com/jasoncobra3/natural_language_processing

Natural Language Processing (NLP) is a captivating field at the intersection of computer science and linguistics. It enables machines to understand, interpret, and respond to human language in a way that is both meaningful and useful. From chatbots to sentiment analysis, NLP applications are transforming industries and enhancing user experiences.

artificial-intelligence artificial-neural-networks data-science deep-learning google-news-scraper machine-learning natural-language-processing nlp nlp-pipeline parts-of-speech pos python text text-analysis text-classification text-preprocessing text-representation word2vec-embeddinngs word2vec-model

Last synced: 22 May 2026

https://github.com/jaimeteb/templatext

Text preprocessing template for NLP.

language nlp python python3 text text-preprocessing

Last synced: 14 Jan 2026

https://github.com/sayamalt/mental-health-classification-using-fine-tuned-distilbert

Successfully established a multiclass text classification model by fine-tuning a pretrained DistilBERT transformer model to classify several distinct types of mental health statuses, such as anxiety, stress and personality disorder, with an accuracy of 77%.

data-visualization deep-learning distilbert-fine-tuning distilbert-model model-evaluation model-inference model-training-and-evaluation multiclass-text-classification natural-language-processing text-classification text-preprocessing text-tokenization

Last synced: 08 Oct 2025

https://github.com/sayamalt/luxury-apparel-product-category-classification-using-fine-tuned-distilbert

Successfully developed a multiclass text classification model by fine-tuning a pretrained DistilBERT transformer model to classify various distinct types of luxury apparel into their respective categories, e.g. pants, accessories, underwear and shoes.

deep-learning distilbert-fine-tuning distilbert-model exploratory-data-analysis fine-tuning-bert model-inference model-training-and-evaluation multiclass-classification natural-language-processing text-classification text-preprocessing text-tokenization

Last synced: 08 Oct 2025

https://github.com/bragdond/naive-bayes-ngram-text-classifier-nlp

Basic Naive Bayes classifier for text classification using n-grams

naive-bayes-algorithm naive-bayes-classifier ngram nlp text-classification text-preprocessing

Last synced: 19 Oct 2025

https://github.com/sayamalt/text-similarity-quantifier

Successfully developed a machine learning model for computing the similarity score between two text paragraphs taken as input from a webpage.

bag-of-words cosine-similarity cosine-similarity-scores countvectorizer flask machine-learning nlp pandas python text-preprocessing tfidf

Last synced: 14 Apr 2026
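The bag-of-words and cosine-similarity combination named in the tags above can be sketched in a few lines. This is a standard-library approximation, not the repository's implementation (which, judging by its tags, likely relies on scikit-learn vectorizers); `bag_of_words` and `cosine_similarity` are hypothetical names chosen for the example.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two word-count vectors."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)

score = cosine_similarity("the cat sat on the mat", "the cat lay on the rug")
```

The score ranges from 0.0 for texts with no words in common up to 1.0 for texts with identical word counts. Because word order is ignored, two paragraphs that use the same words in a different order score as identical.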

https://github.com/sayamalt/financial-news-sentiment-analysis

Successfully developed a fine-tuned DistilBERT transformer model which can predict the overall sentiment of a piece of financial news with an accuracy of nearly 81.5%.

data-exploration-and-preprocessing distilbert-model fine-tune-bert-tensorflow hugging-face-transformers model-architecture-and-implementation model-inference model-training-and-evaluation multiclass-classification natural-language-processing sentiment-analysis text-preprocessing text-tokenization

Last synced: 17 Oct 2025

https://github.com/mevlutayilmaz/text-summarization

Text summarization in Python

docx matplotlib networkx nltk pyqt5 python sklearn text-preprocessing text-summarization tf-idf

Last synced: 30 Jan 2026

https://github.com/christoph/tagrefinery-releases

Guided Label and Tag Preprocessing

data-wrangling spell-correction tags text-preprocessing text-processing

Last synced: 04 Mar 2026

https://github.com/moustafamohamed01/web-summarizer-ai

A Python tool to scrape and summarize website content using AI. Built with Selenium, BeautifulSoup, LLaMA 3.2, and Google's Gemini AI, this project extracts the main text from any website and generates a concise summary in markdown format. Perfect for quickly understanding long articles, blogs, or news pages.

ai beautifulsoup gemini-ai llama python selenium text-preprocessing web-scraping

Last synced: 17 Apr 2026

https://github.com/somjit101/nlp-casestudy-amazon-fine-foods-review

Efficient sentence encoding and vectorization techniques applied to customer reviews from product pages of the popular e-commerce website Amazon, using proven NLP techniques for the purpose of sentiment analysis.

amazon-fine-food-reviews amazon-fine-food-reviews-dataset featurization natural-language-processing nlp text-classification text-preprocessing tfidf-vectorizer vectorization word2vec

Last synced: 20 Apr 2026