{"id":14111328,"url":"https://github.com/oskar-j/awesome-text-ml","last_synced_at":"2026-03-06T07:03:04.847Z","repository":{"id":71666260,"uuid":"231577607","full_name":"oskar-j/awesome-text-ml","owner":"oskar-j","description":"A curated list of ML awesome frameworks \u0026 libraries for text data","archived":false,"fork":false,"pushed_at":"2023-03-14T06:18:08.000Z","size":91,"stargazers_count":17,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-02-28T11:50:57.317Z","etag":null,"topics":["awesome-list","awesome-lists","deep-learning","machine-learning","ml","natural-language","practical-machine-learning","python","text-analysis","text-classification","text-mining"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oskar-j.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-01-03T11:52:48.000Z","updated_at":"2026-02-19T01:09:33.000Z","dependencies_parsed_at":"2024-01-11T03:00:28.804Z","dependency_job_id":"8f879de7-95df-4d5f-ae24-44a5f143c035","html_url":"https://github.com/oskar-j/awesome-text-ml","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/oskar-j/awesome-text-ml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oskar-j%2Fawesome-text-ml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oskar-j%2Fawesome-text-ml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oskar-j%2Fawesome-text-ml/releases","manifests
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oskar-j%2Fawesome-text-ml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oskar-j","download_url":"https://codeload.github.com/oskar-j/awesome-text-ml/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oskar-j%2Fawesome-text-ml/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30164901,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T04:43:31.446Z","status":"ssl_error","status_checked_at":"2026-03-06T04:40:30.133Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["awesome-list","awesome-lists","deep-learning","machine-learning","ml","natural-language","practical-machine-learning","python","text-analysis","text-classification","text-mining"],"created_at":"2024-08-14T10:03:15.091Z","updated_at":"2026-03-06T07:03:04.819Z","avatar_url":"https://github.com/oskar-j.png","language":null,"funding_links":[],"categories":["Other Lists"],"sub_categories":["TeX Lists"],"readme":"# Awesome software for Text ML [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)\n\nA curated list of awesome ML frameworks and text embeddings. 
Focused on SOTA libraries which are actively maintained on GitHub.\n\n## Frameworks and libraries\n\n### :snake: Python\n\n#### Text processing\n\n* [HanLP](https://github.com/hankcs/HanLP) - Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic \u0026 Semantic Dependency Parsing, Document Classification via one unified interface. https://bbs.hankcs.com/\n\n* [flair](https://github.com/flairNLP/flair) - A powerful NLP library for state-of-the-art natural language processing (NLP) models, such as named entity recognition (NER), part-of-speech tagging (PoS), special support for biomedical data, sense disambiguation and classification.\n\n* [sentencepiece](https://github.com/google/sentencepiece) - Unsupervised text tokenizer for Neural Network-based text generation. \n\n* [stanza](https://github.com/stanfordnlp/stanza) - Official Stanford NLP Python Library for Many Human Languages. https://stanfordnlp.github.io/stanza/\n\n#### Pipelines / block-programming\n\n* [texthero](https://github.com/jbesomi/texthero) - Text preprocessing, representation and visualization from zero to hero. https://texthero.org/\n\n#### Distributed computing\n\n* [spark-nlp](https://github.com/JohnSnowLabs/spark-nlp) - Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. https://nlp.johnsnowlabs.com/\n\n#### Machine Learning\n\n* [sklearn](https://github.com/scikit-learn/scikit-learn) - Scikit-learn is a Python module for machine learning built on top of SciPy, including tools for text vectorization and vector space compression. https://scikit-learn.org/stable/\n\n* [gensim](https://github.com/RaRe-Technologies/gensim) - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. 
https://radimrehurek.com/gensim/\n\n* [nlpaug](https://github.com/makcedward/nlpaug) - Augmenting NLP for your machine learning projects.\n\n* [AugLy](https://github.com/facebookresearch/AugLy) - A data augmentation library from Facebook Research for audio, image, text, and video.\n\n#### Deep Learning\n\n* [Transformers](https://github.com/huggingface/transformers) - Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. https://huggingface.co/transformers\n\n* [fairseq](https://github.com/facebookresearch/fairseq) - Facebook AI Research Sequence-to-Sequence Toolkit written in Python. https://fairseq.readthedocs.io/en/latest/\n\n* [bert-as-service](https://github.com/hanxiao/bert-as-service) - Mapping a variable-length sentence to a fixed-length vector using a BERT model. https://bert-as-service.readthedocs.io\n\n* [Kashgari](https://github.com/BrikerMan/Kashgari) - Kashgari is a production-ready NLP transfer learning framework for text labeling and text classification, including Word2Vec, BERT, and GPT2 language embeddings.\n\n#### Natural Language Understanding\n\n* [Snips NLU](https://github.com/snipsco/snips-nlu) - Snips Python library to extract meaning from text. https://snips-nlu.readthedocs.io\n\n* [IKY](https://github.com/alfredfrancis/ai-chatbot-framework) - A Python chatbot framework with Natural Language Understanding and Artificial Intelligence.\n\n* [rasa](https://github.com/RasaHQ/rasa) - Framework to automate text- and voice-based conversations: NLU, dialogue management, chatbots. https://rasa.com/docs/rasa/\n\n* [ParlAI](https://github.com/facebookresearch/ParlAI) - A framework for training and evaluating AI models on a variety of openly available dialogue datasets. https://parl.ai/\n\n* [DeepPavlov](https://github.com/deeppavlov/DeepPavlov) - An open source library for deep learning end-to-end dialog systems and chatbots. 
https://deeppavlov.ai/\n\n* [Rhino](https://github.com/Picovoice/rhino) - On-device speech-to-intent engine powered by deep learning. https://picovoice.ai/\n\n* [langchain](https://github.com/hwchase17/langchain) - Building applications with LLMs (large language models) through composability. https://langchain.readthedocs.io\n\n* [NeMo](https://github.com/NVIDIA/NeMo) - NeMo: a toolkit for conversational AI. https://nvidia.github.io/NeMo/\n\n#### Text mining\n\n* [dedupe](https://github.com/dedupeio/dedupe) - A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.\n\n#### Visualizations\n\n* [Scattertext](https://github.com/JasonKessler/scattertext) - Beautiful visualizations of how language differs among document types.\n\n#### Big language models\n\n* [BIG-bench](https://github.com/google/BIG-bench) - Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.\n\n### C++\n\n#### Text processing\n\nCurrently empty 🪹\n\n## Knowledge 📚\n\n### Learning 101\n\n* [Virgilio](https://github.com/virgili0/Virgilio) - Virgilio is an open-source initiative, aiming to mentor and guide anyone in the world of Data Science.\n\n### Multiple languages\n\n* [Awesome Sentiment Analysis](https://github.com/laugustyniak/awesome-sentiment-analysis) - Repository with everything necessary for sentiment analysis and related areas.\n\n### Python (and Python Notebooks)\n\n* [practicalAI](https://github.com/practicalAI/practicalAI) - A practical approach to machine learning to enable everyone to learn, explore and build. 
https://practicalai.me\n\n* [nlp-recipes](https://github.com/microsoft/nlp-recipes) - Comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.\n\n## No longer maintained\n\n* [NeuronBlocks](https://github.com/microsoft/NeuronBlocks) - NLP DNN Toolkit - Building Your NLP DNN Models Like Playing Lego.\n\n* [artificial-adversary](https://github.com/airbnb/artificial-adversary) - Tool to generate adversarial text examples and test machine learning models against them.\n\n* [DELTA](https://github.com/didi/delta) - DELTA is a deep learning based natural language and speech processing platform. https://delta-didi.readthedocs.io/\n\n* [EventForecast](https://github.com/moment-of-peace/EventForecast) - Time series prediction and text analysis using Keras LSTM, plus clustering, association rules mining.\n\n* [lazynlp](https://github.com/chiphuyen/lazynlp) - Library to scrape and clean web pages to create massive datasets.\n\n* [MeTA: ModErn Text Analysis](https://github.com/meta-toolkit/meta) - A Modern C++ Data Sciences Toolkit. https://meta-toolkit.org\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foskar-j%2Fawesome-text-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foskar-j%2Fawesome-text-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foskar-j%2Fawesome-text-ml/lists"}