Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/oskar-j/awesome-text-ml
A curated list of ML awesome frameworks & libraries for text data
List: awesome-text-ml
awesome-list awesome-lists deep-learning machine-learning ml natural-language practical-machine-learning python text-analysis text-classification text-mining
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/oskar-j/awesome-text-ml
- Owner: oskar-j
- License: cc0-1.0
- Created: 2020-01-03T11:52:48.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-03-14T06:18:08.000Z (over 1 year ago)
- Last Synced: 2024-05-19T22:40:10.497Z (6 months ago)
- Topics: awesome-list, awesome-lists, deep-learning, machine-learning, ml, natural-language, practical-machine-learning, python, text-analysis, text-classification, text-mining
- Size: 88.9 KB
- Stars: 15
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- ultimate-awesome - awesome-text-ml - A curated list of ML awesome frameworks & libraries for text data. (Other Lists / PowerShell Lists)
README
# Awesome software for Text ML [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
A curated list of awesome ML frameworks and text embeddings, focused on state-of-the-art (SOTA) libraries that are actively maintained on GitHub.
## Frameworks and libraries
### :snake: Python
#### Text processing
* [HanLP](https://github.com/hankcs/HanLP) - Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification via one unified interface. https://bbs.hankcs.com/
* [flair](https://github.com/flairNLP/flair) - A powerful NLP library with state-of-the-art models for named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and classification, with special support for biomedical data.
* [sentencepiece](https://github.com/google/sentencepiece) - Unsupervised text tokenizer for Neural Network-based text generation.
* [stanza](https://github.com/stanfordnlp/stanza) - Official Stanford NLP Python Library for Many Human Languages. https://stanfordnlp.github.io/stanza/
#### Pipelines / block-programming
* [texthero](https://github.com/jbesomi/texthero) - Text preprocessing, representation and visualization from zero to hero. https://texthero.org/
#### Distributed computing
* [spark-nlp](https://github.com/JohnSnowLabs/spark-nlp) - Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. https://nlp.johnsnowlabs.com/
#### Machine Learning
* [sklearn](https://github.com/scikit-learn/scikit-learn) - Scikit-learn is a Python module for machine learning built on top of SciPy, including tools for text vectorization and vector space compression. https://scikit-learn.org/stable/
* [gensim](https://github.com/RaRe-Technologies/gensim) - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. https://radimrehurek.com/gensim/
* [nlpaug](https://github.com/makcedward/nlpaug) - Data augmentation for NLP in your machine learning projects.
* [AugLy](https://github.com/facebookresearch/AugLy) - A data augmentations library from Facebook research for audio, image, text, and video.
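
The text vectorization mentioned in the scikit-learn entry above can be illustrated with a minimal TF-IDF sketch. It uses only the Python standard library and is not scikit-learn's API; the function names (`tokenize`, `tfidf_vectors`), the whitespace tokenizer, and the sample documents are assumptions made for illustration. It uses a common smoothed IDF formula and skips vector normalization.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, rows); each row is the TF-IDF vector of one document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # Raw term count weighted by IDF; absent terms get 0.
        rows.append([counts[t] * idf[t] for t in vocab])
    return vocab, rows


# Hypothetical sample corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, rows = tfidf_vectors(docs)
```

Terms that appear in every document (here `the`) get the lowest weight, while terms unique to one document (here `dog` or `ran`) get the highest.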
#### Deep Learning
* [Transformers](https://github.com/huggingface/transformers) - Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. https://huggingface.co/transformers
* [fairseq](https://github.com/facebookresearch/fairseq) - Facebook AI Research Sequence-to-Sequence Toolkit written in Python. https://fairseq.readthedocs.io/en/latest/
* [bert-as-service](https://github.com/hanxiao/bert-as-service) - Mapping a variable-length sentence to a fixed-length vector using BERT model. https://bert-as-service.readthedocs.io
* [Kashgari](https://github.com/BrikerMan/Kashgari) - Kashgari is a production-ready NLP transfer learning framework for text labeling and text classification, including Word2Vec, BERT, and GPT2 language embeddings.
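
Mapping a variable-length sentence to a fixed-length vector, as described in the bert-as-service entry above, is often done by pooling per-token vectors. The sketch below shows mean pooling in plain Python. It does not call BERT or any library from this list; the `toy_embeddings` table and the function names are hypothetical stand-ins for a real model's output.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one fixed-length vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


# Hypothetical toy embedding table; a real model would produce contextual vectors.
toy_embeddings = {
    "hello": [1.0, 0.0, 2.0],
    "world": [3.0, 2.0, 0.0],
    "again": [2.0, 4.0, 4.0],
}


def embed_sentence(sentence):
    """Look up each known token and mean-pool the vectors into one sentence vector."""
    tokens = sentence.lower().split()
    vectors = [toy_embeddings[tok] for tok in tokens if tok in toy_embeddings]
    return mean_pool(vectors)


short = embed_sentence("hello world")         # 2 tokens
longer = embed_sentence("hello world again")  # 3 tokens
```

Both sentences end up as vectors of the same dimension regardless of how many tokens they contain, which is what makes the result usable as input to a downstream classifier or similarity search.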
#### Natural Language Understanding
* [Snips NLU](https://github.com/snipsco/snips-nlu) - Snips Python library to extract meaning from text. https://snips-nlu.readthedocs.io
* [IKY](https://github.com/alfredfrancis/ai-chatbot-framework) - A Python chatbot framework with Natural Language Understanding and Artificial Intelligence.
* [rasa](https://github.com/RasaHQ/rasa) - Framework to automate text- and voice-based conversations: NLU, dialogue management, chatbots. https://rasa.com/docs/rasa/
* [ParlAI](https://github.com/facebookresearch/ParlAI) - A framework for training and evaluating AI models on a variety of openly available dialogue datasets. https://parl.ai/
* [DeepPavlov](https://github.com/deeppavlov/DeepPavlov) - An open source library for deep learning end-to-end dialog systems and chatbots. https://deeppavlov.ai/
* [Rhino](https://github.com/Picovoice/rhino) - On-device speech-to-intent engine powered by deep learning. https://picovoice.ai/
* [langchain](https://github.com/hwchase17/langchain) - Building applications with LLMs (large language models) through composability. https://langchain.readthedocs.io
* [NeMo](https://github.com/NVIDIA/NeMo) - NeMo: a toolkit for conversational AI. https://nvidia.github.io/NeMo/
#### Text mining
* [dedupe](https://github.com/dedupeio/dedupe) - A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
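
The idea of fuzzy matching for record deduplication can be sketched with the standard library's `difflib`. This is not the dedupe library's API or method; the greedy grouping strategy, the `0.85` threshold, the function names, and the sample records are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_duplicates(records, threshold=0.85):
    """Greedily group records whose similarity to a group's first member meets the threshold."""
    groups = []
    for rec in records:
        for group in groups:
            if similarity(rec, group[0]) >= threshold:
                group.append(rec)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([rec])
    return groups


# Hypothetical records with a casing variant and a typo.
records = ["Acme Corporation", "ACME Corporation", "Globex Inc", "Acme Corporatoin"]
groups = group_duplicates(records)
```

This compares every record against every group, so it only suits small inputs. Handling large datasets usually requires some form of candidate blocking before pairwise comparison.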
#### Visualizations
* [Scattertext](https://github.com/JasonKessler/scattertext) - Beautiful visualizations of how language differs among document types.
#### Big language models
* [BIG-bench](https://github.com/google/BIG-bench) - Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.
### C++
#### Text processing
Currently empty 🪹
## Knowledge
### Learning 101
* [Virgilio](https://github.com/virgili0/Virgilio) - Virgilio is an open-source initiative that aims to mentor and guide anyone in the world of Data Science.
### Multiple languages
* [Awesome Sentiment Analysis](https://github.com/laugustyniak/awesome-sentiment-analysis) - Repository with everything necessary for sentiment analysis and related areas.
### Python (and Python Notebooks)
* [practicalAI](https://github.com/practicalAI/practicalAI) - A practical approach to machine learning to enable everyone to learn, explore and build. https://practicalai.me
* [nlp-recipes](https://github.com/microsoft/nlp-recipes) - Comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.
## No longer maintained
* [NeuronBlocks](https://github.com/microsoft/NeuronBlocks) - NLP DNN Toolkit - Building Your NLP DNN Models Like Playing Lego.
* [artificial-adversary](https://github.com/airbnb/artificial-adversary) - Tool to generate adversarial text examples and test machine learning models against them.
* [DELTA](https://github.com/didi/delta) - DELTA is a deep learning based natural language and speech processing platform. https://delta-didi.readthedocs.io/
* [EventForecast](https://github.com/moment-of-peace/EventForecast) - Time series prediction and text analysis using Keras LSTM, plus clustering, association rules mining.
* [lazynlp](https://github.com/chiphuyen/lazynlp) - Library to scrape and clean web pages to create massive datasets.
* [MeTA: ModErn Text Analysis](https://github.com/meta-toolkit/meta) - A Modern C++ Data Sciences Toolkit. https://meta-toolkit.org