Projects in Awesome Lists tagged with text-tokenization

https://github.com/alasdairforsythe/tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript

text-tokenization tokenisation tokenization tokenize tokenizer tokenizing vocabulary vocabulary-builder vocabulary-generator

Last synced: 16 Jan 2026

https://github.com/twardoch/split-markdown4gpt

A Python tool for splitting large Markdown files into smaller sections based on a specified token limit. This is particularly useful for processing large Markdown files with GPT models, as it allows the models to handle the data in manageable chunks.

data-preprocessing gpt gpt-3 gpt-35-turbo gpt-35-turbo-16k gpt-4 markdown markdown-processing mistletoe natural-language-processing nlp openai openai-gpt python split-text summarization text-analysis text-processing text-summarization text-tokenization

Last synced: 08 Jul 2025

https://github.com/sayamalt/resume-classification-using-fine-tuned-bert

Successfully developed a resume classification model that classifies a person's resume into its corresponding job category, with a reported accuracy of more than 99%.

bert-model exploratory-data-analysis fine-tuning-bert model-evaluation nlp text-preprocessing text-tokenization word-embeddings

Last synced: 31 Aug 2025

https://github.com/katanabana/nihotip

Nihotip is a web app that lets users explore Japanese text through interactive tokenization and detailed insights. Built with React and Python, it offers a dynamic way to analyze words and symbols with tooltips for deeper understanding.

japanese japanese-characters japanese-language japanese-learning jmdictfurigana language mecab nlp python react sudachipy text-analysis text-tokenization tokenization tooltips wanakana webapp

Last synced: 08 May 2026

https://github.com/adilrasheed139/ai-powered-resume-screening-using-bert

Successfully developed a resume classification model that classifies a person's resume into its corresponding job category, with a reported accuracy of more than 99%.

bert-model deep-learning exploratory-data-analysis-eda fine-tuning-bert model-evaluation nlp nlp-machine-learning text-preprocessing text-tokenization word-embeddings word-embeddings-for-nlp

Last synced: 03 Apr 2025

https://github.com/markiskorova/machine-learning-nlp-predict-author

Machine Learning & Natural Language Processing: Predict the author of literary text snippets. Built with TensorFlow and Keras, this project trains an LSTM model on classic literature to identify writing style and authorship.

keras machine-learning natural-language-processing python tensorflow text-tokenization text-vectorization

Last synced: 23 Jan 2026

https://github.com/sayamalt/news-category-classification

Successfully developed a news category classification model using fine-tuned BERT that classifies news text into one of four categories: Politics, Business, Technology, and Entertainment.

bert-embeddings exploratory-data-analysis feature-engineering fine-tuning-bert model-evaluation nlp text-classification text-cleaning text-preprocessing text-tokenization

Last synced: 15 Jun 2025

https://github.com/sayamalt/cyberbullying-classification-using-fine-tuned-distilbert

Successfully fine-tuned a pretrained DistilBERT transformer model that classifies social media text into one of four cyberbullying labels (ethnicity/race, gender/sexual, religion, and not cyberbullying) with a reported accuracy of 99%.

cyberbullying-detection data-exploration distilbert-model exploratory-data-analysis fine-tune-bert-tensorflow llm model-inference model-training-and-evaluation multiclass-classification natural-language-processing text-classification text-preprocessing text-tokenization

Last synced: 09 Nov 2025

https://github.com/sayamalt/fake-news-classification-using-fine-tuned-bert

Successfully developed a text classification model that predicts whether a given news text is fake by fine-tuning a pretrained BERT transformer model imported from Hugging Face.

bert-embeddings bert-model data-analysis data-visualization deep-learning fine-tuning-bert model-evaluation model-training-and-evaluation text-classification text-preprocessing text-tokenization tokenizer-nlp wordcloud-visualization

Last synced: 05 Apr 2025

https://github.com/mecanik/modern-text-tokenizer

Modern UTF-8 aware C++ tokenizer with vocabulary support, ideal for NLP and transformer models. Header-only and zero-dependency.

ai artificial-intelligence bert deep-learning distilbert header-only high-performance machine-learning modern-cpp natural-language-processing nlp preprocessing text-analysis text-encoding text-processing text-tokenization tokenizer transformer vocabulary

Last synced: 15 Sep 2025

https://github.com/sayamalt/financial-news-sentiment-analysis

Successfully developed a fine-tuned DistilBERT transformer model that predicts the overall sentiment of a piece of financial news with an accuracy of nearly 81.5%.

data-exploration-and-preprocessing distilbert-model fine-tune-bert-tensorflow hugging-face-transformers model-architecture-and-implementation model-inference model-training-and-evaluation multiclass-classification natural-language-processing sentiment-analysis text-preprocessing text-tokenization

Last synced: 17 Oct 2025

https://github.com/sayamalt/mental-health-classification-using-fine-tuned-distilbert

Successfully established a multiclass text classification model by fine-tuning a pretrained DistilBERT transformer model to classify several distinct mental health statuses, such as anxiety, stress, and personality disorder, with an accuracy of 77%.

data-visualization deep-learning distilbert-fine-tuning distilbert-model model-evaluation model-inference model-training-and-evaluation multiclass-text-classification natural-language-processing text-classification text-preprocessing text-tokenization

Last synced: 08 Oct 2025

https://github.com/sayamalt/luxury-apparel-product-category-classification-using-fine-tuned-distilbert

Successfully developed a multiclass text classification model by fine-tuning a pretrained DistilBERT transformer model to classify luxury apparel items into their respective categories, such as pants, accessories, underwear, and shoes.

deep-learning distilbert-fine-tuning distilbert-model exploratory-data-analysis fine-tuning-bert model-inference model-training-and-evaluation multiclass-classification natural-language-processing text-classification text-preprocessing text-tokenization

Last synced: 08 Oct 2025

https://github.com/sayamalt/global-news-headlines-text-summarization

Successfully established a text summarization model using Seq2Seq modeling with Luong attention, which produces short, concise summaries of global news headlines.

attention-mechanism data-exploration-and-preprocessing luong-attention model-architecture-and-implementation model-inference natural-language-processing seq2seq-model text-generation text-summarization text-tokenization

Last synced: 09 Nov 2025

https://github.com/sayamalt/symptoms-disease-text-classification

Successfully developed a fine-tuned BERT transformer model that classifies symptoms to their corresponding diseases with an accuracy of up to 89%.

bert-fine-tuning data-exploration-and-preprocessing exploratory-data-analysis fine-tune-bert-tensorflow hugging-face-transformers model-architecture-and-implementation model-inference model-training-and-evaluation multiclass-classification natural-language-processing text-classification text-preprocessing text-tokenization

Last synced: 09 Nov 2025

https://github.com/sayamalt/english-to-spanish-language-translation-using-seq2seq-and-attention

Successfully established a Seq2Seq model with attention that performs English-to-Spanish translation with a reported accuracy of almost 97%.

attention-is-all-you-need attention-model bert-transformer exploratory-data-analysis fine-tuning-bert hugging-face-transformers language-translation luong-attention model-architecture-and-implementation model-inference model-training-and-evaluation natural-language-processing neural-machine-translation seq2seq-modeling text-generation text-preprocessing text-tokenization

Last synced: 09 Nov 2025
