https://github.com/oskar-j/awesome-text-ml

A curated list of awesome ML frameworks & libraries for text data
awesome-list awesome-lists deep-learning machine-learning ml natural-language practical-machine-learning python text-analysis text-classification text-mining

Host: GitHub
URL: https://github.com/oskar-j/awesome-text-ml
Owner: oskar-j
License: cc0-1.0
Created: 2020-01-03T11:52:48.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2023-03-14T06:18:08.000Z (over 2 years ago)
Last Synced: 2024-05-19T22:40:10.497Z (over 1 year ago)
Topics: awesome-list, awesome-lists, deep-learning, machine-learning, ml, natural-language, practical-machine-learning, python, text-analysis, text-classification, text-mining
Size: 88.9 KB
Stars: 15
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

ultimate-awesome - awesome-text-ml - A curated list of ML awesome frameworks & libraries for text data. (Other Lists / TeX Lists)

README

# Awesome software for Text ML [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

A curated list of awesome ML frameworks and text embeddings, focused on state-of-the-art (SOTA) libraries that are actively maintained on GitHub.

## Frameworks and libraries

### :snake: Python

#### Text processing

* [HanLP](https://github.com/hankcs/HanLP) - Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification via one unified interface. https://bbs.hankcs.com/

* [flair](https://github.com/flairNLP/flair) - A library for applying state-of-the-art NLP models to text: named entity recognition (NER), part-of-speech (PoS) tagging, sense disambiguation and classification, with special support for biomedical data.

* [sentencepiece](https://github.com/google/sentencepiece) - Unsupervised text tokenizer for Neural Network-based text generation.

* [stanza](https://github.com/stanfordnlp/stanza) - Official Stanford NLP Python Library for Many Human Languages. https://stanfordnlp.github.io/stanza/

#### Pipelines / block-programming

* [texthero](https://github.com/jbesomi/texthero) - Text preprocessing, representation and visualization from zero to hero. https://texthero.org/

#### Distributed computing

* [spark-nlp](https://github.com/JohnSnowLabs/spark-nlp) - Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. https://nlp.johnsnowlabs.com/

#### Machine Learning

* [sklearn](https://github.com/scikit-learn/scikit-learn) - Scikit-learn is a Python module for machine learning built on top of SciPy, including tools for text vectorization and dimensionality reduction. https://scikit-learn.org/stable/

* [gensim](https://github.com/RaRe-Technologies/gensim) - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. https://radimrehurek.com/gensim/

* [nlpaug](https://github.com/makcedward/nlpaug) - Data augmentation for NLP in your machine learning projects.

* [AugLy](https://github.com/facebookresearch/AugLy) - A data augmentation library from Facebook Research for audio, image, text, and video.
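
To make "text vectorization" concrete, here is a minimal plain-Python sketch of TF-IDF weighting, the kind of transformation that vectorizers in libraries such as scikit-learn provide. It uses only the standard library. The function name `tfidf_vectors`, the whitespace tokenizer, and the unsmoothed `log(N / df)` weighting are assumptions chosen for illustration; they are not any listed library's API or exact formula.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(docs)

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / df[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, vecs = tfidf_vectors(docs)
```

Terms that occur in every document (here, "the") get a weight of zero, while rarer terms such as "dog" are weighted most heavily. For real corpora, prefer a maintained implementation from the libraries listed above.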

#### Deep Learning

* [Transformers](https://github.com/huggingface/transformers) - Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. https://huggingface.co/transformers

* [fairseq](https://github.com/facebookresearch/fairseq) - Facebook AI Research Sequence-to-Sequence Toolkit written in Python. https://fairseq.readthedocs.io/en/latest/

* [bert-as-service](https://github.com/hanxiao/bert-as-service) - Mapping a variable-length sentence to a fixed-length vector using BERT model. https://bert-as-service.readthedocs.io

* [Kashgari](https://github.com/BrikerMan/Kashgari) - A production-ready NLP transfer learning framework for text labeling and text classification; includes Word2Vec, BERT, and GPT-2 language embeddings.

#### Natural Language Understanding

* [Snips NLU](https://github.com/snipsco/snips-nlu) - Snips Python library to extract meaning from text. https://snips-nlu.readthedocs.io

* [IKY](https://github.com/alfredfrancis/ai-chatbot-framework) - A Python chatbot framework with Natural Language Understanding and Artificial Intelligence.

* [rasa](https://github.com/RasaHQ/rasa) - Framework to automate text- and voice-based conversations: NLU, dialogue management, chatbots. https://rasa.com/docs/rasa/

* [ParlAI](https://github.com/facebookresearch/ParlAI) - A framework for training and evaluating AI models on a variety of openly available dialogue datasets. https://parl.ai/

* [DeepPavlov](https://github.com/deeppavlov/DeepPavlov) - An open source library for deep learning end-to-end dialog systems and chatbots. https://deeppavlov.ai/

* [Rhino](https://github.com/Picovoice/rhino) - On-device speech-to-intent engine powered by deep learning. https://picovoice.ai/

* [langchain](https://github.com/hwchase17/langchain) - Building applications with LLMs (large language models) through composability. https://langchain.readthedocs.io

* [NeMo](https://github.com/NVIDIA/NeMo) - NeMo: a toolkit for conversational AI. https://nvidia.github.io/NeMo/

#### Text mining

* [dedupe](https://github.com/dedupeio/dedupe) - A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
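
As a rough illustration of what fuzzy matching and record deduplication mean, the sketch below uses `difflib.SequenceMatcher` from the Python standard library. This is not dedupe's API, and the greedy pairwise comparison does not scale to large datasets. The helper names `similarity` and `deduplicate` and the `0.85` threshold are hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1], ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def deduplicate(records, threshold=0.85):
    """Keep the first record of each group of near-duplicates (greedy, O(n^2))."""
    kept = []
    for record in records:
        # A record is new only if it is dissimilar to everything kept so far.
        if all(similarity(record, seen) < threshold for seen in kept):
            kept.append(record)
    return kept


names = ["Acme Corporation", "ACME Corporation ", "Globex Inc."]
unique = deduplicate(names)
```

Here the second name differs from the first only in case and trailing whitespace, so it is treated as a duplicate and dropped.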

#### Visualizations

* [Scattertext](https://github.com/JasonKessler/scattertext) - Beautiful visualizations of how language differs among document types.

#### Big language models

* [BIG-bench](https://github.com/google/BIG-bench) - Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.

### C++

#### Text processing

Currently empty 🪹

## Knowledge 📚

### Learning 101

* [Virgilio](https://github.com/virgili0/Virgilio) - Virgilio is an open-source initiative that aims to mentor and guide anyone in the world of data science.

### Multiple languages

* [Awesome Sentiment Analysis](https://github.com/laugustyniak/awesome-sentiment-analysis) - Repository with everything necessary for sentiment analysis and related areas.

### Python (and Python Notebooks)

* [practicalAI](https://github.com/practicalAI/practicalAI) - A practical approach to machine learning to enable everyone to learn, explore and build. https://practicalai.me

* [nlp-recipes](https://github.com/microsoft/nlp-recipes) - Comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.

## No longer maintained

* [NeuronBlocks](https://github.com/microsoft/NeuronBlocks) - NLP DNN Toolkit - Building Your NLP DNN Models Like Playing Lego.

* [artificial-adversary](https://github.com/airbnb/artificial-adversary) - Tool to generate adversarial text examples and test machine learning models against them.

* [DELTA](https://github.com/didi/delta) - DELTA is a deep learning based natural language and speech processing platform. https://delta-didi.readthedocs.io/

* [EventForecast](https://github.com/moment-of-peace/EventForecast) - Time series prediction and text analysis using Keras LSTM, plus clustering and association rule mining.

* [lazynlp](https://github.com/chiphuyen/lazynlp) - Library to scrape and clean web pages to create massive datasets.

* [MeTA: ModErn Text Analysis](https://github.com/meta-toolkit/meta) - A Modern C++ Data Sciences Toolkit. https://meta-toolkit.org
