https://github.com/ivan-bilan/The-NLP-Pandect

A comprehensive reference for all topics related to Natural Language Processing

List: The-NLP-Pandect

awesome-list deeplearning natural-language-processing naturallanguageprocessing nlp pandect


![The-NLP-Pandect](./Resources/Images/pandect.png)


This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.


> __Note__
> Quick legend on available resource types:
>
> ⭐ - open source project, usually a GitHub repository with its number of stars
>
> 📙 - resource you can read, usually a blog post or a paper
>
> 🗂️ - a collection of additional resources
>
> 🔱 - non-open source tool, framework or paid service
>
> 🎥️ - a resource you can watch
>
> 🎙️ - a resource you can listen to
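
The legend above can also be read programmatically. The following is a minimal sketch, assuming each entry has the shape `* <marker> [title](url) ... [GitHub, N stars]` used throughout this list; the `parse_entry` helper and the short type labels are illustrative names and not part of the Pandect itself.

```python
import re

# Resource-type markers from the legend, written as escapes so that an
# optional emoji variation selector cannot sneak into the keys.
LEGEND = {
    "\u2b50": "open source project",                       # star
    "\U0001f4d9": "reading material",                      # orange book
    "\U0001f5c2": "collection of resources",               # card index dividers
    "\U0001f531": "non-open source tool or paid service",  # trident
    "\U0001f3a5": "video",                                 # movie camera
    "\U0001f399": "audio",                                 # studio microphone
}

# Assumed entry shape: "* <marker> [title](url) ... [GitHub, N stars]".
ENTRY = re.compile(r"^\*\s+(?P<marker>\S+)\s+\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)")
STARS = re.compile(r"\[GitHub,\s*(?P<stars>\d+)\s+stars\]")


def parse_entry(line):
    """Return a dict describing one list entry, or None if the line is not an entry."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    # Drop the emoji variation selector (U+FE0F) before the legend lookup.
    marker = match.group("marker").replace("\ufe0f", "")
    stars = STARS.search(line)
    return {
        "type": LEGEND.get(marker, "unknown"),
        "title": match.group("title"),
        "url": match.group("url"),
        "stars": int(stars.group("stars")) if stars else None,
    }
```

Entries without a star count, such as podcasts or blog posts, come back with `stars` set to `None`, and URLs that themselves contain parentheses are not handled by this simple pattern.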

### Table of Contents

| 📇 Main Section | 🗃️ Sub-sections Sample |
| ------------- | ------------- |
| [NLP Resources](https://github.com/ivan-bilan/The-NLP-Pandect#) | [Paper Summaries](https://github.com/ivan-bilan/The-NLP-Pandect#papers-and-paper-summaries), [Conference Summaries](https://github.com/ivan-bilan/The-NLP-Pandect#conference-summaries), [NLP Datasets](https://github.com/ivan-bilan/The-NLP-Pandect#nlp-datasets) |
| [NLP Podcasts](https://github.com/ivan-bilan/The-NLP-Pandect#-1) | [NLP-only Podcasts](https://github.com/ivan-bilan/The-NLP-Pandect#nlp-only-podcasts), [Podcasts with many NLP Episodes](https://github.com/ivan-bilan/The-NLP-Pandect#many-nlp-episodes) |
| [NLP Newsletters](https://github.com/ivan-bilan/The-NLP-Pandect#-2) | - |
| [NLP Meetups](https://github.com/ivan-bilan/The-NLP-Pandect#-3) | - |
| [NLP YouTube Channels](https://github.com/ivan-bilan/The-NLP-Pandect#-4) | - |
| [NLP Benchmarks](https://github.com/ivan-bilan/The-NLP-Pandect#-5) | [General NLU](https://github.com/ivan-bilan/The-NLP-Pandect#general-nlu), [Question Answering](https://github.com/ivan-bilan/The-NLP-Pandect#question-answering), [Multilingual](https://github.com/ivan-bilan/The-NLP-Pandect#multilingual-and-non-english-benchmarks) |
| [Research Resources](https://github.com/ivan-bilan/The-NLP-Pandect#-6) | [Resource on Transformer Models](https://github.com/ivan-bilan/The-NLP-Pandect#transformer-based-architectures), [Distillation and Pruning](https://github.com/ivan-bilan/The-NLP-Pandect#distillation-pruning-and-quantization), [Automated Summarization](https://github.com/ivan-bilan/The-NLP-Pandect#automated-summarization) |
| [Industry Resources](https://github.com/ivan-bilan/The-NLP-Pandect#-7) | [Best Practices for NLP Systems](https://github.com/ivan-bilan/The-NLP-Pandect#best-practices-for-nlp), [MLOps for NLP](https://github.com/ivan-bilan/The-NLP-Pandect#mlops-for-nlp) |
| [Speech Recognition](https://github.com/ivan-bilan/The-NLP-Pandect#-8) | [General Resources](https://github.com/ivan-bilan/The-NLP-Pandect#general-speech-recognition), [Text to Speech](https://github.com/ivan-bilan/The-NLP-Pandect#text-to-speech), [Speech to Text](https://github.com/ivan-bilan/The-NLP-Pandect#speech-to-text), [Datasets](https://github.com/ivan-bilan/The-NLP-Pandect#datasets) |
| [Topic Modeling](https://github.com/ivan-bilan/The-NLP-Pandect#-9) | [Blogs](https://github.com/ivan-bilan/The-NLP-Pandect#blogs-1), [Frameworks](https://github.com/ivan-bilan/The-NLP-Pandect#frameworks-for-topic-modeling), [Repositories and Projects](https://github.com/ivan-bilan/The-NLP-Pandect#repositories-1) |
| [Keyword Extraction](https://github.com/ivan-bilan/The-NLP-Pandect#-10) | [Text Rank](https://github.com/ivan-bilan/The-NLP-Pandect#text-rank), [Rake](https://github.com/ivan-bilan/The-NLP-Pandect#rake---rapid-automatic-keyword-extraction), [Other Approaches](https://github.com/ivan-bilan/The-NLP-Pandect#other-approaches) |
| [Responsible NLP](https://github.com/ivan-bilan/The-NLP-Pandect#-11) | [NLP and ML Interpretability](https://github.com/ivan-bilan/The-NLP-Pandect#nlp-and-ml-interpretability), [Ethics, Bias, and Equality in NLP](https://github.com/ivan-bilan/The-NLP-Pandect#ethics-bias-and-equality-in-nlp), [Adversarial Attacks for NLP](https://github.com/ivan-bilan/The-NLP-Pandect#adversarial-attacks-for-nlp) |
| [NLP Frameworks](https://github.com/ivan-bilan/The-NLP-Pandect#-12) | [General Purpose](https://github.com/ivan-bilan/The-NLP-Pandect#general-purpose), [Data Augmentation](https://github.com/ivan-bilan/The-NLP-Pandect#data-augmentation), [Machine Translation](https://github.com/ivan-bilan/The-NLP-Pandect#machine-translation), [Adversarial Attacks](https://github.com/ivan-bilan/The-NLP-Pandect#adversarial-nlp-attacks--behavioral-testing), [Dialog Systems & Speech](https://github.com/ivan-bilan/The-NLP-Pandect#dialog-systems-and-speech), [Entity and String Matching](https://github.com/ivan-bilan/The-NLP-Pandect#entity-and-string-matching), [Non-English Frameworks](https://github.com/ivan-bilan/The-NLP-Pandect#non-english-oriented), [Text Annotation](https://github.com/ivan-bilan/The-NLP-Pandect#text-data-labelling) |
| [Learning NLP](https://github.com/ivan-bilan/The-NLP-Pandect#-13) | [Courses](https://github.com/ivan-bilan/The-NLP-Pandect#courses), [Books](https://github.com/ivan-bilan/The-NLP-Pandect#books), [Tutorials](https://github.com/ivan-bilan/The-NLP-Pandect#tutorials) |
| [NLP Communities](https://github.com/ivan-bilan/The-NLP-Pandect#-14) | - |
| [Other NLP Topics](https://github.com/ivan-bilan/The-NLP-Pandect#-15) | [Tokenization](https://github.com/ivan-bilan/The-NLP-Pandect#tokenization), [Data Augmentation](https://github.com/ivan-bilan/The-NLP-Pandect#data-augmentation-and-weak-supervision), [Named Entity Recognition](https://github.com/ivan-bilan/The-NLP-Pandect#named-entity-recognition-ner), [Error Correction](https://github.com/ivan-bilan/The-NLP-Pandect#spell-correction--error-correction), [AutoML/AutoNLP](https://github.com/ivan-bilan/The-NLP-Pandect#automl--autonlp), [Text Generation](https://github.com/ivan-bilan/The-NLP-Pandect#text-generation) |

![The-NLP-Resources](./Resources/Images/pandect_resources.png)
-----
> __Note__
> Section keywords: paper summaries, compendium, awesome list

#### Compendiums and awesome lists on the topic of NLP:
* 🗂️ [The NLP Index](https://index.quantumstat.com) - Searchable Index of NLP Papers by Quantum Stat / NLP Cypher
* ⭐ [Awesome NLP](https://github.com/keon/awesome-nlp) by [keon](https://github.com/keon) [GitHub, 13963 stars]
* ⭐ [Speech and Natural Language Processing Awesome List](https://github.com/edobashira/speech-language-processing#readme) by [edobashira](https://github.com/edobashira) [GitHub, 2121 stars]
* ⭐ [Awesome Deep Learning for Natural Language Processing (NLP)](https://github.com/brianspiering/awesome-dl4nlp) [GitHub, 1094 stars]
* ⭐ [Text Mining and Natural Language Processing Resources](https://github.com/stepthom/text_mining_resources) by [stepthom](https://github.com/stepthom) [GitHub, 505 stars]
* 🗂️ [Made with ML List](https://madewithml.com/topics/#nlp) by [madewithml.com](https://madewithml.com)
* 🗂️ [Brainsources for #NLP enthusiasts](https://www.notion.so/634eba1a37d34e2baec1bb574a8a5482) by [Philip Vollet](https://www.linkedin.com/in/philipvollet/)
* ⭐ [Awesome AI/ML/DL - NLP Section](https://github.com/neomatrix369/awesome-ai-ml-dl/tree/master/natural-language-processing#natural-language-processing-nlp) [GitHub, 1142 stars]
* 🗂️ [Resources on various machine learning topics](https://www.backprop.org) by Backprop
* 🗂️ [NLP articles](https://devopedia.org/site-map/browse-articles/natural+language+processing) by [Devopedia](https://devopedia.org)

#### NLP Conferences, Paper Summaries and Paper Compendiums:
##### Papers and Paper Summaries
* ⭐ [100 Must-Read NLP Papers](https://github.com/mhagiwara/100-nlp-papers) [GitHub, 3446 stars]
* ⭐ [NLP Paper Summaries](https://github.com/dair-ai/nlp_paper_summaries) by [dair-ai](https://github.com/dair-ai) [GitHub, 1431 stars]
* ⭐ [Curated collection of papers for the NLP practitioner](https://github.com/mihail911/nlp-library) [GitHub, 1059 stars]
* ⭐ [Papers on Textual Adversarial Attack and Defense](https://github.com/thunlp/TAADpapers) [GitHub, 1182 stars]
* ⭐ [Recent Deep Learning papers in NLU and RL](https://github.com/madrugado/deep-learning-nlp-rl-papers) by Valentin Malykh [GitHub, 291 stars]
* ⭐ [A Survey of Surveys (NLP & ML): Collection of NLP Survey Papers](https://github.com/NiuTrans/ABigSurvey) [GitHub, 1713 stars]
* ⭐ [A Paper List for Style Transfer in Text](https://github.com/fuzhenxin/Style-Transfer-in-Text) [GitHub, 1456 stars]
* 🎥 [Video recordings index for papers](https://papertalk.org/index)

##### Conference Summaries
* ⭐ [NLP top 10 conferences Compendium](https://github.com/soulbliss/NLP-conference-compendium) by [soulbliss](https://github.com/soulbliss) [GitHub, 439 stars]
* 📙 [ICLR 2020 Trends](https://gsarti.com/post/iclr2020-transformers/)
* 📙 [SpacyIRL 2019 Conference in Overview](https://www.linkedin.com/pulse/spacyirl-2019-conference-overview-ivan-bilan/)
* 📙 [Paper Digest](https://www.paperdigest.org/category/nlp/) - Conferences and Papers in Overview
* 🎥 [Video Recordings from Conferences](https://crossminds.ai/explore/)

#### NLP Progress and NLP Tasks:
* ⭐ [NLP Progress](https://github.com/sebastianruder/NLP-progress) by [sebastianruder](https://github.com/sebastianruder) [GitHub, 21123 stars]
* ⭐ [NLP Tasks](https://github.com/Kyubyong/nlp_tasks) by [Kyubyong](https://github.com/Kyubyong) [GitHub, 2984 stars]

#### NLP Datasets:
* ⭐ [NLP Datasets](https://github.com/niderhoff/nlp-datasets) by [niderhoff](https://github.com/niderhoff) [GitHub, 5225 stars]
* ⭐ [Datasets](https://github.com/huggingface/datasets) by Huggingface [GitHub, 14838 stars]
* 🗂️ [Big Bad NLP Database](https://datasets.quantumstat.com)
* ⭐ [UWA Unambiguous Word Annotations](http://danlou.github.io/uwa/) - Word Sense Disambiguation Dataset
* ⭐ [MLDoc](https://github.com/facebookresearch/MLDoc) - Corpus for Multilingual Document Classification in Eight Languages [GitHub, 145 stars]

#### Word and Sentence embeddings:
* ⭐ [Awesome Embedding Models](https://github.com/Hironsan/awesome-embedding-models) by [Hironsan](https://github.com/Hironsan) [GitHub, 1544 stars]
* ⭐ [Awesome list of Sentence Embeddings](https://github.com/Separius/awesome-sentence-embedding) by [Separius](https://github.com/Separius) [GitHub, 2086 stars]
* ⭐ [Awesome BERT](https://github.com/Jiakui/awesome-bert) by [Jiakui](https://github.com/Jiakui) [GitHub, 1797 stars]

#### Notebooks, Scripts and Repositories
* ⭐ [The Super Duper NLP Repo](https://notebooks.quantumstat.com) [Website, 2020]

#### Non-English resources and Compendiums
* ⭐ [NLP Resources for Bahasa Indonesian](https://github.com/louisowen6/NLP_bahasa_resources) [GitHub, 329 stars]
* ⭐ [Indic NLP Catalog](https://github.com/AI4Bharat/indicnlp_catalog) [GitHub, 381 stars]
* ⭐ [Pre-trained language models for Vietnamese](https://github.com/VinAIResearch/PhoBERT) [GitHub, 491 stars]
* ⭐ [Natural Language Toolkit for Indic Languages (iNLTK)](https://github.com/goru001/inltk) [GitHub, 773 stars]
* ⭐ [Indic NLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) [GitHub, 448 stars]
* ⭐ [AI4Bharat-IndicNLP Portal](https://indicnlp.ai4bharat.org)
* ⭐ [ARBML](https://github.com/ARBML/ARBML) - Implementation of many Arabic NLP and ML projects [GitHub, 284 stars]
* ⭐ [zemberek-nlp](https://github.com/ahmetaa/zemberek-nlp) - NLP tools for Turkish [GitHub, 1021 stars]
* ⭐ [TDD AI](https://tdd.ai) - An open-source platform for all Turkish datasets, language models, and NLP tools.
* ⭐ [KLUE](https://github.com/KLUE-benchmark/KLUE) - Korean Language Understanding Evaluation [GitHub, 468 stars]
* ⭐ [Persian NLP Benchmark](https://github.com/Mofid-AI/persian-nlp-benchmark) - benchmark for evaluation and comparison of various NLP tasks in Persian language [GitHub, 69 stars]
* ⭐ [nlp-greek](https://github.com/Yuliya-HV/nlp-greek) - Greek language sources [GitHub, 5 stars]
* ⭐ [Awesome NLP Resources for Hungarian](https://github.com/oroszgy/awesome-hungarian-nlp) [GitHub, 160 stars]

#### Pre-trained NLP models
* ⭐ [List of pre-trained NLP models](https://github.com/balavenkatesh3322/NLP-pretrained-model) [GitHub, 163 stars]
* 📙 [General Pretrained Language Models](https://mr-nlp.github.io/posts/2022/07/general-tptlms-list/) [Blog, July 2022]
* ⭐ [Pretrained language models developed by Huawei Noah's Ark Lab](https://github.com/huawei-noah/Pretrained-Language-Model) [GitHub, 2547 stars]
* ⭐ [Spanish Language Models and resources](https://github.com/PlanTL-GOB-ES/lm-spanish) [GitHub, 202 stars]
* 🗂 [Monolingual Pretrained Language Models](https://mr-nlp.github.io/posts/2022/07/monolingual-tptlms-list/) - collection of available pre-trained models [Blog, July 2022]

#### NLP History
##### General
* ⭐ [Modern Deep Learning Techniques Applied to Natural Language Processing](https://github.com/omarsar/nlp_overview) [GitHub, 1269 stars]
* 📙 [A Review of the Neural History of Natural Language Processing](https://aylien.com/blog/a-review-of-the-recent-history-of-natural-language-processing) [Blog, October 2018]
##### 2020 Year in Review
* 📙 [Natural Language Processing in 2020: The Year In Review](https://www.linkedin.com/pulse/natural-language-processing-2020-year-review-ivan-bilan/) [Blog, December 2020]
* 📙 [ML and NLP Research Highlights of 2020](https://ruder.io/research-highlights-2020/) [Blog, January 2021]

![The-NLP-Podcasts](./Resources/Images/pandect_lyra.png)
-----
[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)
#### NLP-only podcasts
* 🎙️ [NLP Highlights](https://soundcloud.com/nlp-highlights) [Years: 2017 - now, Status: active]
* 🎙️ [The NLP Zone](https://de.player.fm/series/the-nlp-zone) [Episodes](https://player.captivate.fm/episode/e2f87641-1421-4729-a2b5-d64951c845c6) [Years: 2021 - now, Status: active]

#### Many NLP episodes
* 🎙️ [TWIML AI](https://twimlai.com) [Years: 2016 - now, Status: active]
* 🎙️ [Practical AI](https://changelog.com/practicalai) [Years: 2018 - now, Status: active]
* 🎙️ [The Data Exchange](https://thedataexchange.media) [Years: 2019 - now, Status: active]
* 🎙️ [Gradient Dissent](https://www.wandb.com/podcast) [Years: 2020 - now, Status: active]
* 🎙️ [Machine Learning Street Talk](https://open.spotify.com/show/02e6PZeIOdpmBGT9THuzwR) [Years: 2020 - now, Status: active]
* 🎙️ [DataFramed](https://www.datacamp.com/community/podcast) - latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active]

#### Some NLP episodes
* 🎙️ [The Super Data Science Podcast](https://www.superdatascience.com/podcast) [Years: 2016 - now, Status: active]
* 🎙️ [Data Hack Radio](https://soundcloud.com/datahack-radio) [Years: 2018 - now, Status: active]
* 🎙️ [AI Game Changers](https://podcasts.apple.com/de/podcast/ai-game-changers/id1512574291) [Years: 2020 - now, Status: active]
* 🎙️ [The Analytics Show](https://anchor.fm/analyticsshow) [Years: 2019 - now, Status: active]

![The-NLP-Newsletter](./Resources/Images/pandect_scroll.png)
-----

* 📙 [NLP News](https://ruder.io/nlp-news/) by [Sebastian Ruder](https://ruder.io)
* 📙 [dair.ai Newsletter](https://dair.ai/newsletter/) by [dair.ai](https://dair.ai)
* 📙 [This Week in NLP by Robert Dale](https://www.language-technology.com/twin)
* 📙 [Papers with Code](https://paperswithcode.com)
* 📙 [The Batch](https://www.deeplearning.ai/thebatch/) by [deeplearning.ai](https://www.deeplearning.ai/thebatch/)
* 📙 [Paper Digest](https://www.paperdigest.org/2020/04/recent-papers-on-question-answering/) by [PaperDigest](https://www.paperdigest.org/daily-paper-digest/)
* 📙 [NLP Cypher](https://medium.com/@quantumstat) by [QuantumStat](https://quantumstat.com)

![The-NLP-Meetups](./Resources/Images/pandect_meetups.png)
-----

* 🎥 [NLP Zurich](https://www.linkedin.com/company/nlp-zurich/) [[YouTube Recordings](https://www.youtube.com/channel/UCLLX-5j9UNYassOwS0nveDQ)]
* 🎥 [Hacking-Machine-Learning](https://www.meetup.com/Hacking-Machine-Learning) [[YouTube Recordings](https://www.youtube.com/channel/UCt5RvrC-_3X7FNAWhORVn7Q)]
* 🎥 [NY-NLP (New York)](https://www.meetup.com/NY-NLP/)

![The-NLP-Youtube](./Resources/Images/pandect_youtube.png)
-----

* 🎥 [Yannic Kilcher](https://www.youtube.com/channel/UCZHmQk67mSJgfCCTn7xBfew)
* 🎥 [HuggingFace](https://www.youtube.com/channel/UCHlNU7kIZhRgSbhHvFoy72w)
* 🎥 [Kaggle Reading Group](https://www.youtube.com/watch?v=PhTF7yJNR70&list=PLqFaTIg4myu8t5ycqvp7I07jTjol3RCl9)
* 🎥 [Rasa Paper Reading](https://www.youtube.com/channel/UCJ0V6493mLvqdiVwOKWBODQ/playlists)
* 🎥 [Stanford CS224N: NLP with Deep Learning](https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z)
* 🎥 [NLPxing](https://www.youtube.com/channel/UCuGC1JusVvbOGa__qMtH3QA/videos)
* 🎥 [ML Explained - A.I. Socratic Circles - AISC](https://www.youtube.com/channel/UCfk3pS8cCPxOgoleriIufyg)
* 🎥 [Deeplearning.ai](https://www.youtube.com/channel/UCcIXc5mJsHVYTZR1maL5l9w/featured)
* 🎥 [Machine Learning Street Talk](https://www.youtube.com/channel/UCMLtBahI5DMrt0NPvDSoIRQ/featured)

![The-NLP-Benchmarks](./Resources/Images/pandect_benchmark.png)
-----
[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### General NLU
* ⭐ [GLUE](https://gluebenchmark.com) - General Language Understanding Evaluation (GLUE) benchmark
* ⭐ [SuperGLUE](https://super.gluebenchmark.com) - benchmark styled after GLUE with a new set of more difficult language understanding tasks
* ⭐ [decaNLP](https://decanlp.com) - The Natural Language Decathlon (decaNLP) for studying general NLP models
* ⭐ [dialoglue](https://github.com/alexa/dialoglue) - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [GitHub, 235 stars]
* ⭐ [DynaBench](https://dynabench.org/) - Dynabench is a research platform for dynamic data collection and benchmarking
* ⭐ [Big-Bench](https://github.com/google/BIG-bench) - collaborative benchmark for measuring and extrapolating the capabilities of language models [GitHub, 1228 stars]

### Summarization
* ⭐ [WikiAsp](https://github.com/neulab/wikiasp) - WikiAsp: Multi-document aspect-based summarization Dataset
* ⭐ [WikiLingua](https://github.com/esdurmus/Wikilingua) - A Multilingual Abstractive Summarization Dataset

### Question Answering
* ⭐ [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) - Stanford Question Answering Dataset (SQuAD)
* ⭐ [XQuad](https://github.com/deepmind/xquad) - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
* ⭐ [GrailQA](https://dki-lab.github.io/GrailQA/) - Strongly Generalizable Question Answering (GrailQA)
* ⭐ [CSQA](https://amritasaha1812.github.io/CSQA/) - Complex Sequential Question Answering

### Multilingual and Non-English Benchmarks
* 📙 [XTREME](https://arxiv.org/abs/2003.11080) - Massively Multilingual Multi-task Benchmark
* ⭐ [GLUECoS](https://github.com/microsoft/GLUECoS) - A benchmark for code-switched NLP
* ⭐ [IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) - Natural Language Understanding Benchmark for Indic Languages
* ⭐ [LinCE](https://ritual.uh.edu/lince/) - Linguistic Code-Switching Evaluation Benchmark
* ⭐ [Russian SuperGlue](https://russiansuperglue.com) - Russian SuperGlue Benchmark

### Bio, Law, and other scientific domains
* ⭐ [BLURB](https://microsoft.github.io/BLURB/) - Biomedical Language Understanding and Reasoning Benchmark
* ⭐ [BLUE](https://github.com/ncbi-nlp/BLUE_Benchmark) - Biomedical Language Understanding Evaluation benchmark
* ⭐ [LexGLUE](https://github.com/coastalcph/lex-glue) - A Benchmark Dataset for Legal Language Understanding in English

### Transformer Efficiency
* ⭐ [Long-Range Arena](https://github.com/google-research/long-range-arena) - Long Range Arena for Benchmarking Efficient Transformers ([Pre-print](https://arxiv.org/abs/2011.04006)) [GitHub, 481 stars]

### Speech Processing
* ⭐ [SUPERB](http://superbbenchmark.org/) - Speech processing Universal PERformance Benchmark

### Other
* ⭐ [CodeXGLUE](https://www.microsoft.com/en-us/research/blog/codexglue-a-benchmark-dataset-and-open-challenge-for-code-intelligence/) - A benchmark dataset for code intelligence
* ⭐ [CrossNER](https://github.com/zliucr/CrossNER) - CrossNER: Evaluating Cross-Domain Named Entity Recognition
* ⭐ [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/) - Multi-Genre Natural Language Inference corpus
* ⭐ [iSarcasm: A Dataset of Intended Sarcasm](https://github.com/silviu-oprea/iSarcasm) - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic

![The-NLP-Research](./Resources/Images/pandect_quill.png)
-----
[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### General
* 📙 [A Recipe for Training Neural Networks](https://karpathy.github.io/2019/04/25/recipe/) by Andrej Karpathy [Keywords: research, training, 2019]
* 📙 [Recent Advances in NLP via Large Pre-Trained Language Models: A Survey](https://arxiv.org/abs/2111.01243) [Paper, November 2021]

### Embeddings
#### Repositories
* ⭐ [Pre-trained ELMo Representations for Many Languages](https://github.com/HIT-SCIR/ELMoForManyLangs) [GitHub, 1413 stars]
* ⭐ [sense2vec](https://github.com/explosion/sense2vec) - Contextually-keyed word vectors [GitHub, 1449 stars]
* ⭐ [wikipedia2vec](https://github.com/wikipedia2vec/wikipedia2vec) [GitHub, 831 stars]
* ⭐ [StarSpace](https://github.com/facebookresearch/StarSpace) [GitHub, 3809 stars]
* ⭐ [fastText](https://github.com/facebookresearch/fastText) [GitHub, 24067 stars]

#### Blogs
* 📙 [Language Models and Contextualised Word Embeddings](http://www.davidsbatista.net/blog/2018/12/06/Word_Embeddings/) by David S. Batista [Blog, 2018]
* 📙 [An Essential Guide to Pretrained Word Embeddings for NLP Practitioners](https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-embeddings-nlp/?utm_source=AVLinkedin&utm_medium=post&utm_campaign=22_may_new_article) by AnalyticsVidhya [Blog, 2020]
* 📙 [Polyglot Word Embeddings Discover Language Clusters](http://blog.shriphani.com/2020/02/03/polyglot-word-embeddings-discover-language-clusters/) [Blog, 2020]
* 📙 [The Illustrated Word2vec](https://jalammar.github.io/illustrated-word2vec/) by Jay Alammar [Blog, 2019]

#### Cross-lingual Word and Sentence Embeddings
* ⭐ [vecmap](https://github.com/artetxem/vecmap) - VecMap (cross-lingual word embedding mappings) [GitHub, 604 stars]
* ⭐ [sentence-transformers](https://github.com/UKPLab/sentence-transformers) - Multilingual Sentence & Image Embeddings with BERT [GitHub, 8944 stars]

#### Byte Pair Encoding
* ⭐ [bpemb](https://github.com/bheinzerling/bpemb) - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1081 stars]
* ⭐ [subword-nmt](https://github.com/rsennrich/subword-nmt) - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 1972 stars]
* ⭐ [python-bpe](https://github.com/soaxelbrooke/python-bpe) - Byte Pair Encoding for Python [GitHub, 188 stars]

### Transformer-based Architectures
#### General
* 📙 [The Transformer Family](https://lilianweng.github.io/lil-log/2020/04/07/the-transformer-family.html) by Lilian Weng [Blog, 2020]
* 📙 [Playing the lottery with rewards and multiple languages](https://arxiv.org/abs/1906.02768) - about the effect of random initialization [ICLR 2020 Paper]
* 📙 [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html) by Lilian Weng [Blog, 2018]
* 📙 [the transformer … “explained”?](https://nostalgebraist.tumblr.com/post/185326092369/the-transformer-explained) [Blog, 2019]
* 🎥️ [Attention is all you need; Attentional Neural Network Models](https://www.youtube.com/watch?v=rBCqOTEfxvg) by Łukasz Kaiser [Talk, 2017]
* 🎥️ [Understanding and Applying Self-Attention for NLP](https://www.youtube.com/watch?v=OYygPG4d9H0) [Talk, 2018]
* 📙 [The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures](https://arxiv.org/abs/2104.10640) [Paper, April 2021]
* 📙 [Pre-Trained Models: Past, Present and Future](https://arxiv.org/abs/2106.07139) [Paper, June 2021]
* 📙 [A Survey of Transformers](https://arxiv.org/abs/2106.04554) [Paper, June 2021]

#### Transformer
* 📙 [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html) by Harvard NLP [Blog, 2018]
* 📙 [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/) by Jay Alammar [Blog, 2018]
* 📙 [Illustrated Guide to Transformers](https://towardsdatascience.com/illustrated-guide-to-transformer-cf6969ffa067) by Hong Jing [Blog, 2020]
* 📙 [Sequential Transformer with Adaptive Attention Span](https://github.com/facebookresearch/adaptive-span) by Facebook [[Blog](https://ai.facebook.com/blog/making-transformer-networks-simpler-and-more-efficient/), 2019]
* 📙 [Evolution of Representations in the Transformer](https://lena-voita.github.io/posts/emnlp19_evolution.html) by Lena Voita [Blog, 2019]
* 📙 [Reformer: The Efficient Transformer](https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html) [Blog, 2020]
* 📙 [Longformer — The Long-Document Transformer](https://medium.com/dair-ai/longformer-what-bert-should-have-been-78f4cd595be9) by Viktor Karlsson [Blog, 2020]
* 📙 [TRANSFORMERS FROM SCRATCH](http://www.peterbloem.nl/blog/transformers) [Blog, 2019]
* 📙 [Universal Transformers](https://mostafadehghani.com/2019/05/05/universal-transformers/) by Mostafa Dehghani [Blog, 2019]
* 📙 [Transformers in Natural Language Processing — A Brief Survey](https://eigenfoo.xyz/transformers-in-nlp/) by George Ho [Blog, May 2020]
* ⭐ [Lite Transformer](https://github.com/mit-han-lab/lite-transformer) - Lite Transformer with Long-Short Range Attention [GitHub, 550 stars]
* 📙 [Transformers from Scratch](https://e2eml.school/transformers.html) [Blog, Oct 2021]

#### BERT
* 📙 [A Visual Guide to Using BERT for the First Time](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/) by Jay Alammar [Blog, 2019]
* 📙 [The Dark Secrets of BERT](https://text-machine-lab.github.io/blog/2020/bert-secrets/) by Anna Rogers [Blog, 2020]
* 📙 [Understanding searches better than ever before](https://www.blog.google/products/search/search-language-understanding-bert/) [Blog, 2019]
* 📙 [Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework](https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/) [Blog, 2019]
* ⭐ [SemBERT](https://github.com/cooelf/SemBERT) - Semantics-aware BERT for Language Understanding [GitHub, 278 stars]
* ⭐ [BERTweet](https://github.com/VinAIResearch/BERTweet) - BERTweet: A pre-trained language model for English Tweets [GitHub, 487 stars]
* ⭐ [Optimal Subarchitecture Extraction for BERT](https://github.com/alexa/bort) [GitHub, 461 stars]
* ⭐ [CharacterBERT: Reconciling ELMo and BERT](https://github.com/helboukkouri/character-bert) [GitHub, 163 stars]
* 📙 [When BERT Plays The Lottery, All Tickets Are Winning](https://thegradient.pub/when-bert-plays-the-lottery-all-tickets-are-winning/) [Blog, Dec 2020]
* ⭐ [BERT-related Papers](https://github.com/tomohideshibata/BERT-related-papers) - a list of BERT-related papers [GitHub, 1933 stars]

#### Other Transformer Variants
##### T5
* 📙 [T5 Understanding Transformer-Based Self-Supervised Architectures](https://medium.com/@rojagtap/t5-text-to-text-transfer-transformer-643f89e8905e) [Blog, August 2020]
* 📙 [T5: the Text-To-Text Transfer Transformer](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) [Blog, 2020]
* ⭐ [multilingual-t5](https://github.com/google-research/multilingual-t5) - Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model [GitHub, 956 stars]
##### BigBird
* 📙 [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) original paper by Google Research [Paper, July 2020]
##### Reformer / Linformer / Longformer / Performers
* 🎥️ [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) - [Paper, February 2020] [[Video](https://www.youtube.com/watch?v=xJrKIPwVwGM), October 2020]
* 🎥️ [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) - [Paper, April 2020] [[Video](https://www.youtube.com/watch?v=_8KNb5iqblE), April 2020]
* 🎥️ [Linformer: Self-Attention with Linear Complexity](https://arxiv.org/abs/2006.04768) - [Paper, June 2020] [[Video](https://www.youtube.com/watch?v=-_2AF9Lhweo), June 2020]
* 🎥️ [Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794) - [Paper, September 2020] [[Video](https://www.youtube.com/watch?v=0eTULzrOztQ), September 2020]
* ⭐ [performer-pytorch](https://github.com/lucidrains/performer-pytorch) - An implementation of Performer, a linear attention-based transformer, in Pytorch [GitHub, 898 stars]

##### Switch Transformer
* 📙 [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961) original paper by Google Research [Paper, January 2021]

#### GPT-family
##### General
* 📙 [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/) by Jay Alammar [Blog, 2019]
* 📙 [The Annotated GPT-2](https://amaarora.github.io/2020/02/18/annotatedGPT2.html) by Aman Arora
* 📙 [OpenAI’s GPT-2: the model, the hype, and the controversy](https://towardsdatascience.com/openais-gpt-2-the-model-the-hype-and-the-controversy-1109f4bfd5e8) by Ryan Lowe [Blog, 2019]
* 📙 [How to generate text](https://huggingface.co/blog/how-to-generate) by Patrick von Platen [Blog, 2020]

##### GPT-3
###### Learning Resources
* 📙 [Zero Shot Learning for Text Classification](https://amitness.com/2020/05/zero-shot-text-classification/) by Amit Chaudhary [Blog, 2020]
* 📙 [GPT-3 A Brief Summary](https://leogao.dev/2020/05/29/GPT-3-A-Brief-Summary/) by Leo Gao [Blog, 2020]
* 📙 [GPT-3, a Giant Step for Deep Learning And NLP](https://anotherdatum.com/gpt-3.html) by Yoel Zeldes [Blog, June 2020]
* 📙 [GPT-3 Language Model: A Technical Overview](https://lambdalabs.com/blog/demystifying-gpt-3/) by Chuan Li [Blog, June 2020]
* 📙 [Is it possible for language models to achieve language understanding?](https://medium.com/@ChrisGPotts/is-it-possible-for-language-models-to-achieve-language-understanding-81df45082ee2) by Christopher Potts
###### Applications
* ⭐ [Awesome GPT-3](https://github.com/elyase/awesome-gpt3) - list of all resources related to GPT-3 [GitHub, 3773 stars]
* 🗂️ [GPT-3 Projects](https://airtable.com/shrndwzEx01al2jHM/tblYMAiGeDLXe35jC) - a map of all GPT-3 start-ups and commercial projects
* 🗂️ [GPT-3 Demo Showcase](https://gpt3demo.com/) - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
* 🔱 [OpenAI API](https://beta.openai.com) - API Demo to use GPT-3 for commercial applications
###### Open-source Efforts
* 📙 [GPT-Neo](https://eleuther.ai/projects/gpt-neo/) - in-progress GPT-3 open source replication [HuggingFace Hub](https://huggingface.co/EleutherAI)
* ⭐ [GPT-J](https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b) - A 6 billion parameter, autoregressive text generation model trained on The Pile
* 📙 [Effectively using GPT-J with few-shot learning](https://nlpcloud.io/effectively-using-gpt-j-gpt-neo-gpt-3-alternatives-few-shot-learning.html) [Blog, July 2021]

#### Other
* 📙 [What is Two-Stream Self-Attention in XLNet](https://towardsdatascience.com/what-is-two-stream-self-attention-in-xlnet-ebfe013a0cf3) by Xu LIANG [Blog, 2019]
* 📙 [Visual Paper Summary: ALBERT (A Lite BERT)](https://amitness.com/2020/02/albert-visual-summary/) by Amit Chaudhary [Blog, 2020]
* 📙 [Turing NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/) by Microsoft
* 📙 [Multi-Label Text Classification with XLNet](https://towardsdatascience.com/multi-label-text-classification-with-xlnet-b5f5755302df) by Josh Xin Jie Lee [Blog, 2019]
* ⭐ [ELECTRA](https://github.com/google-research/electra) [GitHub, 2095 stars]
* ⭐ [Performer](https://github.com/lucidrains/performer-pytorch) - implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 898 stars]

#### Distillation, Pruning and Quantization
##### Reading Material
* 📙 [Distilling knowledge from Neural Networks to build smaller and faster models](https://blog.floydhub.com/knowledge-distillation/) by FloydHub [Blog, 2019]
* 📙 [Compression of Deep Learning Models for Text: A Survey](https://arxiv.org/abs/2008.05221) [Paper, April 2021]
##### Tools
* ⭐ [Bert-squeeze](https://github.com/JulesBelveze/bert-squeeze) - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 65 stars]
* ⭐ [XtremeDistil](https://github.com/microsoft/xtreme-distil-transformers) - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 122 stars]

### Automated Summarization
* 📙 [PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization](https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html) by Google AI [Blog, June 2020]
* ⭐ [CTRLsum](https://github.com/salesforce/ctrl-sum) - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 128 stars]
* ⭐ [XL-Sum](https://github.com/csebuetnlp/xl-sum) - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 186 stars]
* ⭐ [SummerTime](https://github.com/Yale-LILY/SummerTime) - an open-source text summarization toolkit for non-experts [GitHub, 211 stars]
* ⭐ [PRIMER](https://github.com/allenai/PRIMER) - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 107 stars]
* ⭐ [summarus](https://github.com/IlyaGusev/summarus) - Models for automatic abstractive summarization [GitHub, 145 stars]

### Knowledge Graphs and NLP
* 📙 [Fusing Knowledge into Language Model](https://drive.google.com/file/d/1Zgijg9RPxF-tIGWU9nt9rBcryOIB4lOk/view) [Presentation, Oct 2021]

![The-NLP-Industry](./Resources/Images/pandect_industry.png)
-----
> __Note__
> Section keywords: best practices, MLOps

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### Best Practices for building NLP Projects
* 🎥 [In Search of Best Practices for NLP Projects](https://www.youtube.com/watch?v=0S9iai4Ld4I) [[Slides](https://www.dropbox.com/s/4fymdzz4yh3mlyz/NLP_Best_Practices_Bilan.pdf?dl=0), Dec. 2020]
* 🎥 [EMNLP 2020: High Performance Natural Language Processing](https://slideslive.com/38940826) by Google Research [Recording, Nov. 2020]
* 📙 [Practical Natural Language Processing](https://www.amazon.com/Practical-Natural-Language-Processing-Pragmatic/dp/1492054054) - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
* 📙 [How to Structure and Manage NLP Projects](https://neptune.ai/blog/how-to-structure-and-manage-nlp-projects-templates) [Blog, May 2021]
* 📙 [Applied NLP Thinking](https://explosion.ai/blog/applied-nlp-thinking) - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]
* 🎥 [Introduction to NLP for Industry Use](https://www.youtube.com/watch?v=VRur3xey31s) - DataTalksClub presentation on Introduction to NLP for Industry Use [Recording, December 2021]
* 📙 [Measuring Embedding Drift](https://arize.com/blog/embedding-drift/) - Best practices for monitoring drift of NLP models [Blog, December 2022]

### MLOps for NLP
MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.

In general, MLOps for NLP includes having the following processes in place:
- **Data Versioning** - make sure your training, annotation and other types of data are versioned and tracked
- **Experiment Tracking** - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
- **Model Registry** - make sure any neural models you train are versioned and tracked and it is easy to roll back to any of them
- **Automated Testing and Behavioral Testing** - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks
- **Model Deployment and Serving** - automate model deployment, ideally also with zero-downtime deploys like Blue/Green, Canary deploys etc.
- **Data and Model Observability** - track data drift, model accuracy drift etc.
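
The behavioral testing item above can be illustrated with a minimal invariance check: the prediction should not change when only a person's name changes. This is a rough sketch, not the API of any tool listed in this document. `toy_sentiment` is a hypothetical keyword-based stand-in for a real model, and `invariance_failures` is an illustrative helper name.

```python
import re
from typing import Callable, List

# Hypothetical keyword lists used only by the toy model below.
POSITIVE = {"great", "good", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in for a real model: counts sentiment keywords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def invariance_failures(
    predict: Callable[[str], str],
    template: str,
    fillers: List[str],
) -> List[str]:
    """Return the filled-in texts whose prediction differs from the first filler's."""
    texts = [template.format(name=name) for name in fillers]
    expected = predict(texts[0])
    return [t for t in texts[1:] if predict(t) != expected]


if __name__ == "__main__":
    failures = invariance_failures(
        toy_sentiment,
        "{name} said the service was great.",
        ["Anna", "Mohammed", "Wei", "Oksana"],
    )
    print("invariance failures:", failures)
```

In a real pipeline, `predict` would wrap the deployed model, and a non-empty result would fail the test run alongside the regular unit and integration tests.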

Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:
- **Feature Store** - centralized storage of all features developed for ML models that can be easily reused by any other ML project
- **Metadata Management** - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.
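
As a rough illustration of the reproducibility goal behind data versioning and metadata management, the sketch below derives a stable run identifier from a hash of the training data and the hyperparameters. `RunRecord`, `fingerprint` and the field names are hypothetical and do not correspond to any of the tools listed in this document.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict


def fingerprint(payload: bytes) -> str:
    """Content hash: identical bytes always give the same digest."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical metadata record for one training run."""
    model_name: str
    data_sha256: str
    params: Dict[str, float]

    def run_id(self) -> str:
        # Serialize with sorted keys so the ID does not depend on dict ordering.
        canonical = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return fingerprint(canonical)[:12]


if __name__ == "__main__":
    data = b"text\tlabel\nservice was great\tpositive\n"
    record = RunRecord(
        model_name="sentiment-baseline",
        data_sha256=fingerprint(data),
        params={"learning_rate": 0.001, "epochs": 3.0},
    )
    print(record.run_id(), json.dumps(asdict(record), sort_keys=True))
```

Because the identifier depends only on the record's content, any change to the data or the parameters produces a different ID, which is the property that makes a deployed model traceable back to its inputs.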

#### MLOps Compilations & Awesome Lists
* ⭐ [awesome-mlops](https://github.com/visenger/awesome-mlops) [GitHub, 8929 stars]
* ⭐ [best-of-ml-python](https://github.com/ml-tooling/best-of-ml-python) [GitHub, 12011 stars]
* 🗂️ [MLOps.Toys](https://mlops.toys) - a curated list of MLOps projects

#### Reading Material
* 📙 [Machine Learning Operations (MLOps): Overview, Definition, and Architecture](https://arxiv.org/abs/2205.02302) [Paper, May 2022]
* 📙 [Requirements and Reference Architecture for MLOps: Insights from Industry](https://www.techrxiv.org/articles/preprint/Requirements_and_Reference_Architecture_for_MLOps_Insights_from_Industry/21397413) [Paper, Oct 2022]
* 📙 [MLOps: What It Is, Why it Matters, and How To Implement It](https://neptune.ai/blog/mlops-what-it-is-why-it-matters-and-how-to-implement-it-from-a-data-scientist-perspective) by Neptune AI [Blog, July 2021]
* 📙 [Best MLOps Tools You Need to Know as a Data Scientist](https://neptune.ai/blog/best-mlops-tools) by Neptune AI [Blog, July 2021]
* 📙 [Robust MLOps](https://blog.verta.ai/blog/robust-mlops-with-open-source-modeldb-docker-jenkins-and-prometheus) - Robust MLOps with Open-Source: ModelDB, Docker, Jenkins and Prometheus [Blog, May 2021]
* 📙 [State of MLOps 2021](https://valohai.com/state-of-mlops/#introduction) by Valohai [Blog, August 2021]
* 📙 [The MLOps Stack](https://valohai.com/blog/the-mlops-stack/) by Valohai [Blog, October 2020]
* 📙 [Data Version Control for Machine Learning Applications](https://megagon.ai/blog/data-version-control-for-machine-learning-applications/) by Megagon AI [Blog, July 2021]
* 📙 [The Rapid Evolution of the Canonical Stack for Machine Learning](https://medium.com/@ODSC/the-rapid-evolution-of-the-canonical-stack-for-machine-learning-21b37af9c3b5) [Blog, July 2021]
* 📙 [MLOps: Comprehensive Beginner’s Guide](https://medium.com/sciforce/mlops-comprehensive-beginners-guide-c235c77f407f) [Blog, March 2021]
* 📙 [What I’ve learned about MLOps from speaking with 100+ ML practitioners](https://veselinastaneva.medium.com/what-ive-learned-about-mlops-from-speaking-with-100-ml-practitioners-3025e33458ad) [Blog, May 2021]
* 📙 [DataRobot Challenger Models](https://www.datarobot.com/blog/introducing-mlops-champion-challenger-models) - MLOps Champion/Challenger Models
* 📙 [State of MLOps Blog](https://www.stateofmlops.com/) by Dr. Ori Cohen
* 📙 [MLOps Ecosystem Overview](https://arize.com/wp-content/uploads/2021/04/Arize-AI-Ecosystem-White-Paper.pdf) [Blog, 2021]

#### Learning Material
* 🗂 [MLOps course](https://madewithml.com/#mlops) by Made With ML
* 🗂 [GitHub MLOps](https://mlops.githubapp.com) - collection of resources on how to facilitate Machine Learning Ops with GitHub
* 🗂 [ML Observability Fundamentals Course](https://arize.com/ml-observability-fundamentals/) - learn how to monitor and root-cause issues with production NLP models

#### MLOps Communities
* [The MLOps Community](https://mlops.community/) - blogs, slack group, newsletter and more all about MLOps

#### Data Versioning
* ⭐ [DVC](https://dvc.org/) - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] [Link to GitHub](https://github.com/iterative/dvc)
* 🔱 [Weights & Biases](https://wandb.ai/site) - tools for experiment tracking and dataset versioning [Paid Service]
* 🔱 [Pachyderm](https://www.pachyderm.com/) - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]

#### Experiment Tracking
* ⭐ [mlflow](https://mlflow.org/) - open source platform for the machine learning lifecycle [Free and Open Source] [Link to GitHub](https://github.com/mlflow/mlflow/)
* 🔱 [Weights & Biases](https://wandb.ai/site) - tools for experiment tracking and dataset versioning [Paid Service]
* 🔱 [Neptune AI](https://neptune.ai/) - experiment tracking and model registry built for research and production teams [Paid Service]
* 🔱 [Comet ML](https://www.comet.ml/site/) - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
* 🔱 [SigOpt](https://sigopt.com/) - automate training & tuning, visualize & compare runs [Paid Service]
* ⭐ [Optuna](https://github.com/optuna/optuna) - hyperparameter optimization framework [GitHub, 7255 stars]
* ⭐ [Clear ML](https://clear.ml/) - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] [Link to GitHub](https://github.com/allegroai/clearml/)
* ⭐ [Metaflow](https://github.com/Netflix/metaflow) - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 6187 stars]

#### Model Registry
* ⭐ [DVC](https://dvc.org/) - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] [Link to GitHub](https://github.com/iterative/dvc)
* ⭐ [mlflow](https://mlflow.org/) - open source platform for the machine learning lifecycle [Free and Open Source] [Link to GitHub](https://github.com/mlflow/mlflow/)
* ⭐ [ModelDB](https://github.com/VertaAI/modeldb) - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1530 stars]
* 🔱 [Neptune AI](https://neptune.ai/) - experiment tracking and model registry built for research and production teams [Paid Service]
* 🔱 [Valohai](https://valohai.com/) - End-to-end ML pipelines [Paid Service]
* 🔱 [Pachyderm](https://www.pachyderm.com/) - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
* 🔱 [polyaxon](https://polyaxon.com/) - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
* 🔱 [Comet ML](https://www.comet.ml/site/) - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]

#### Automated Testing and Behavioral Testing
* ⭐ [CheckList](https://github.com/marcotcr/checklist) - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1806 stars]
* ⭐ [TextAttack](https://github.com/QData/TextAttack) - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2161 stars]
* ⭐ [WildNLP](https://github.com/MI2DataLab/WildNLP) - Corrupt an input text to test NLP models' robustness [GitHub, 74 stars]
* ⭐ [Great Expectations](https://github.com/great-expectations/great_expectations) - Write tests for your data [GitHub, 7703 stars]
* ⭐ [Deepchecks](https://github.com/deepchecks/deepchecks) - Python package for comprehensively validating your machine learning models and data [GitHub, 2254 stars]

#### Model Deployability and Serving
* ⭐ [mlflow](https://mlflow.org/) - open source platform for the machine learning lifecycle [Free and Open Source] [Link to GitHub](https://github.com/mlflow/mlflow/)
* 🔱 [Amazon SageMaker](https://aws.amazon.com/de/sagemaker/) [Paid Service]
* 🔱 [Valohai](https://valohai.com/) - End-to-end ML pipelines [Paid Service]
* 🔱 [NLP Cloud](https://nlpcloud.io/) - Production-ready NLP API [Paid Service]
* 🔱 [Saturn Cloud](https://saturncloud.io/) [Paid Service]
* 🔱 [SELDON](https://www.seldon.io/tech/) - machine learning deployment for enterprise [Paid Service]
* 🔱 [Comet ML](https://www.comet.ml/site/) - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
* 🔱 [polyaxon](https://polyaxon.com/) - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
* ⭐ [TorchServe](https://github.com/pytorch/serve) - flexible and easy to use tool for serving PyTorch models [GitHub, 3008 stars]
* ⭐ [Kubeflow](https://www.kubeflow.org/) - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
* ⭐ [KFServing](https://github.com/kubeflow/kfserving) - Serverless Inferencing on Kubernetes [GitHub, 1841 stars]
* ⭐ [TFX](https://www.tensorflow.org/tfx) - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Free and Open Source]
* 🔱 [Pachyderm](https://www.pachyderm.com/) - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
* 🔱 [Cortex](https://www.cortex.dev/) - containers as a service on AWS [Paid Service]
* 🔱 [Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning/#features) - end-to-end machine learning lifecycle [Paid Service]
* ⭐ [End2End Serverless Transformers On AWS Lambda](https://github.com/bhavsarpratik/serverless-transformers-on-aws-lambda) [GitHub, 110 stars]
* ⭐ [NLP-Service](https://github.com/karndeb/NLP-Service) - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
* ⭐ [Dagster](https://dagster.io/) - data orchestrator for machine learning [Free and Open Source]
* 🔱 [Verta](https://www.verta.ai/) - AI and machine learning deployment and operations [Paid Service]
* ⭐ [Metaflow](https://github.com/Netflix/metaflow) - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 6187 stars]
* ⭐ [flyte](https://github.com/flyteorg/flyte) - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 2887 stars]
* ⭐ [MLRun](https://github.com/mlrun/mlrun) - Machine Learning automation and tracking [GitHub, 856 stars]
* 🔱 [DataRobot MLOps](https://www.datarobot.com/platform/mlops/) - DataRobot MLOps provides a center of excellence for your production AI

#### Model Debugging
* ⭐ [imodels](https://github.com/csinva/imodels) - package for concise, transparent, and accurate predictive modeling [GitHub, 971 stars]
* ⭐ [Cockpit](https://github.com/f-dangel/cockpit) - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 416 stars]

#### Model Accuracy Prediction
* ⭐ [WeightWatcher](https://github.com/CalculatedContent/WeightWatcher) - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 1028 stars]

#### Data and Model Observability

##### General
* ⭐ [Arize AI](https://arize.com/) - embedding drift monitoring for NLP models
* ⭐ [Arize-Phoenix](https://phoenix.arize.com/) - ML observability for LLMs, vision, language, and tabular models
* ⭐ [whylogs](https://github.com/whylabs/whylogs) - open source standard for data and ML logging [GitHub, 1907 stars]
* ⭐ [Rubrix](https://github.com/recognai/rubrix) - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 1450 stars]
* ⭐ [MLRun](https://github.com/mlrun/mlrun) - Machine Learning automation and tracking [GitHub, 856 stars]
* 🔱 [DataRobot MLOps](https://www.datarobot.com/platform/mlops/) - DataRobot MLOps provides a center of excellence for your production AI
* 🔱 [Cortex](https://www.cortex.dev/) - containers as a service on AWS [Paid Service]

##### Model Centric
* 🔱 [Algorithmia](https://algorithmia.com/) - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
* 🔱 [Dataiku](https://www.dataiku.com/) - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
* ⭐ [Evidently AI](https://evidentlyai.com/) - tools to analyze and monitor machine learning models [Free and Open Source] [Link to GitHub](https://github.com/evidentlyai/evidently)
* 🔱 [Fiddler](https://www.fiddler.ai/) - ML Model Performance Management Tool [Paid Service]
* 🔱 [Hydrosphere](https://hydrosphere.io/) - open-source platform for managing ML models [Paid Service]
* 🔱 [Verta](https://www.verta.ai/) - AI and machine learning deployment and operations [Paid Service]
* 🔱 [Domino Model Ops](https://www.dominodatalab.com/product/model-ops/) - Deploy and Manage Models to Drive Business Impact [Paid Service]
* 🔱 [iguazio](https://www.iguazio.com/) - deployment and management of your AI applications with MLOps and end-to-end automation of machine learning pipelines [Paid Service]

##### Data Centric
* 🔱 [Datafold](https://www.datafold.com/) - data quality through diffs, profiling, and anomaly detection [Paid Service]
* 🔱 [acceldata](https://www.acceldata.io/) - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
* 🔱 [Bigeye](https://www.bigeye.com/) - monitoring and alerting for your datasets in minutes [Paid Service]
* 🔱 [datakin](https://datakin.com/product/) - end-to-end, real-time data lineage solution [Paid Service]
* 🔱 [Monte Carlo](https://www.montecarlodata.com/) - data integrity, drifts, schema, lineage [Paid Service]
* 🔱 [SODA](https://www.soda.io/) - data monitoring, testing and validation [Paid Service]
* 🔱 [whatify](https://whatify.ai/) - data quality and action recommendation on it [Paid Service]

#### Feature Stores
* 🔱 [Tecton](https://www.tecton.ai/) - enterprise feature store for machine learning [Paid Service]
* ⭐ [FEAST](https://github.com/feast-dev/feast) - open source feature store for machine learning [Website](https://feast.dev/) [GitHub, 3792 stars]
* 🔱 [Hopsworks Feature Store](https://www.hopsworks.ai/feature-store) - data management system for managing machine learning features [Paid Service]

#### Metadata Management
* ⭐ [ML Metadata](https://github.com/google/ml-metadata) - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 500 stars]
* 🔱 [Neptune AI](https://neptune.ai/) - experiment tracking and model registry built for research and production teams [Paid Service]

#### MLOps Frameworks
* ⭐ [Metaflow](https://github.com/Netflix/metaflow) - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 6187 stars]
* ⭐ [kedro](https://github.com/quantumblacklabs/kedro) - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 7865 stars]
* ⭐ [Seldon Core](https://github.com/SeldonIO/seldon-core) - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 3503 stars]
* ⭐ [ZenML](https://github.com/maiot-io/zenml) - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 2549 stars]
* 🔱 [Google Vertex AI](https://cloud.google.com/vertex-ai) - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
* ⭐ [Diffgram](https://github.com/diffgram/diffgram) - Complete training data platform for machine learning delivered as a single application [GitHub, 1583 stars]
* 🔱 [Continual.ai](https://continual.ai/) - build, deploy, and operationalize ML models easier and faster with a declarative interface on cloud data warehouses like Snowflake, BigQuery, RedShift, and Databricks. [Paid Service]

### Transformer-based Architectures
[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

#### General
* 📙 [Why BERT Fails in Commercial Environments](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/bert-commercial-environments.html) by Intel AI [Blog, 2020]
* 📙 [Fine Tuning BERT for Text Classification with FARM](https://towardsdatascience.com/fine-tuning-bert-for-text-classification-with-farm-2880665065e2) by Sebastian Guggisberg [Blog, 2020]
* ⭐ [Pretrain Transformers Models in PyTorch using Hugging Face Transformers](https://github.com/gmihaila/ml_things/blob/master/notebooks/pytorch/pretrain_transformers_pytorch.ipynb) [GitHub, 186 stars]
* 🎥️ [Practical NLP for the Real World](https://www.infoq.com/presentations/practical-nlp/) [Presentation, 2019]
* 🎥️ [From Paper to Product – How we implemented BERT](https://www.youtube.com/watch?v=VnmKDPBQjJk) by Christoph Henkelmann [Talk, 2020]

##### Multi-GPU Transformers
* ⭐ [Parallelformers: An Efficient Model Parallelization Toolkit for Deployment](https://github.com/tunib-ai/parallelformers) [GitHub, 548 stars]

##### Training Transformers Effectively
* ⭐ [Training BERT with Compute/Time (Academic) Budget](https://github.com/IntelLabs/academic-budget-bert) [GitHub, 256 stars]

### Embeddings as a Service
* ⭐ [embedding-as-service](https://github.com/amansrivastava17/embedding-as-service) [GitHub, 176 stars]
* ⭐ [Bert-as-service](https://github.com/hanxiao/bert-as-service) [GitHub, 11035 stars]

### NLP Recipes Industrial Applications:
* ⭐ [NLP Recipes](https://github.com/microsoft/nlp-recipes) by [microsoft](https://github.com/microsoft) [GitHub, 6048 stars]
* ⭐ [NLP with Python](https://github.com/susanli2016/NLP-with-Python) by [susanli2016](https://github.com/susanli2016) [GitHub, 2454 stars]
* ⭐ [Basic Utilities for PyTorch NLP](https://github.com/PetrochukM/PyTorch-NLP) by [PetrochukM](https://github.com/PetrochukM) [GitHub, 2127 stars]

### NLP Applications in Bio, Finance, Legal and other industries
* ⭐ [Blackstone](https://github.com/ICLRandD/Blackstone) - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 573 stars]
* ⭐ [Sci spaCy](https://github.com/allenai/scispacy) - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1279 stars]
* ⭐ [FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks](https://github.com/psnonis/FinBERT) [GitHub, 165 stars]
* ⭐ [LexNLP](https://github.com/LexPredict/lexpredict-lexnlp) - Information retrieval and extraction for real, unstructured legal text [GitHub, 555 stars]
* ⭐ [NerDL and NerCRF](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/blogposts/data_prep.ipynb) - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
* ⭐ [Legal Text Analytics](https://github.com/Liquid-Legal-Institute/Legal-Text-Analytics) - A list of selected resources dedicated to Legal Text Analytics [GitHub, 410 stars]
* ⭐ [BioIE](https://github.com/caufieldjh/awesome-bioie) - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 222 stars]

![The-NLP-Speech](./Resources/Images/pandect_speech.png)
-----
> __Note__
> Section keywords: speech recognition

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### General Speech Recognition
* ⭐ [wav2letter](https://github.com/facebookresearch/wav2letter) - Automatic Speech Recognition Toolkit [GitHub, 6149 stars]
* ⭐ [DeepSpeech](https://github.com/mozilla/DeepSpeech) - Baidu's DeepSpeech architecture [GitHub, 20639 stars]
* 📙 [Acoustic Word Embeddings](https://medium.com/@maobedkova/acoustic-word-embeddings-fc3f1a8f0519) by Maria Obedkova [Blog, 2020]
* ⭐ [kaldi](https://github.com/kaldi-asr/kaldi) - Kaldi is a toolkit for speech recognition [GitHub, 12177 stars]
* ⭐ [awesome-kaldi](https://github.com/YoavRamon/awesome-kaldi) - resources for using Kaldi [GitHub, 510 stars]
* ⭐ [ESPnet](https://github.com/espnet/espnet) - End-to-End Speech Processing Toolkit [GitHub, 5791 stars]
* 📙 [HuBERT](https://ai.facebook.com/blog/hubert-self-supervised-representation-learning-for-speech-recognition-generation-and-compression) - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]

### Text to Speech
* ⭐ [FastSpeech](https://github.com/xcmyz/FastSpeech) - implementation of FastSpeech based on PyTorch [GitHub, 746 stars]
* ⭐ [TTS](https://github.com/coqui-ai/TTS) - a deep learning toolkit for Text-to-Speech [GitHub, 7214 stars]

### Speech to Text
* ⭐ [whisper](https://github.com/openai/whisper) - Robust Speech Recognition via Large-Scale Weak Supervision, by OpenAI [GitHub, 17097 stars]

### Datasets
* ⭐ [VoxPopuli](https://github.com/facebookresearch/voxpopuli) - large-scale multilingual speech corpus for representation learning [GitHub, 392 stars]

![The-NLP-Topics](./Resources/Images/pandect_topics.png)
-----
> __Note__
> Section keywords: topic modeling

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### Blogs
* 📙 [Topic Modelling with PySpark and Spark NLP](https://medium.com/trustyou-engineering/topic-modelling-with-pyspark-and-spark-nlp-a99d063f1a6e) by Maria Obedkova [Spark, Blog, 2020]
* 📙 [A Unique Approach to Short Text Clustering (Algorithmic Theory)](https://towardsdatascience.com/a-unique-approach-to-short-text-clustering-part-1-algorithmic-theory-4d4fad0882e1) by Brittany Bowers [Blog, 2020]

### Frameworks for Topic Modeling
* ⭐ [gensim](https://github.com/RaRe-Technologies/gensim) - framework for topic modeling [GitHub, 13760 stars]
* ⭐ [Spark NLP](https://github.com/JohnSnowLabs/spark-nlp) [GitHub, 3018 stars]

### Repositories
* ⭐ [Top2Vec](https://github.com/ddangelov/Top2Vec) [GitHub, 2325 stars]
* ⭐ [Anchored Correlation Explanation Topic Modeling](https://github.com/gregversteeg/CorEx) [GitHub, 289 stars]
* ⭐ [Topic Modeling in Embedding Spaces](https://github.com/adjidieng/ETM) [GitHub, 480 stars] [Paper](https://arxiv.org/abs/1907.04907)
* ⭐ [TopicNet](https://github.com/machine-intelligence-laboratory/TopicNet) - A high-level interface for BigARTM library [GitHub, 128 stars]
* ⭐ [BERTopic](https://github.com/MaartenGr/BERTopic) - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 3426 stars]
* ⭐ [OCTIS](https://github.com/MIND-Lab/OCTIS) - A python package to optimize and evaluate topic models [GitHub, 457 stars]
* ⭐ [Contextualized Topic Models](https://github.com/MilaNLProc/contextualized-topic-models) [GitHub, 968 stars]
* ⭐ [GSDMM](https://github.com/rwalk/gsdmm) - GSDMM: Short text clustering [GitHub, 305 stars]

![Keyword-Extraction](./Resources/Images/pandect_papyrus2.png)
-----
> __Note__
> Section keywords: keyword extraction

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### Text Rank
* ⭐ [PyTextRank](https://github.com/DerwenAI/pytextrank) - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 1933 stars]
* ⭐ [textrank](https://github.com/summanlp/textrank) - TextRank implementation for Python 3 [GitHub, 1158 stars]

### RAKE - Rapid Automatic Keyword Extraction
* ⭐ [rake-nltk](https://github.com/csurfer/rake-nltk) - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 956 stars]
* ⭐ [yake](https://github.com/LIAAD/yake) - Single-document unsupervised keyword extraction [GitHub, 1238 stars]
* ⭐ [RAKE-tutorial](https://github.com/zelandiya/RAKE-tutorial) - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 369 stars]

### Other Approaches
* ⭐ [flashtext](https://github.com/vi3k6i5/flashtext) - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5318 stars]
* ⭐ [BERT-Keyword-Extractor](https://github.com/ibatra/BERT-Keyword-Extractor) - Deep Keyphrase Extraction using BERT [GitHub, 225 stars]
* ⭐ [keyBERT](https://github.com/MaartenGr/KeyBERT) - Minimal keyword extraction with BERT [GitHub, 1998 stars]
* ⭐ [KeyphraseVectorizers](https://github.com/TimSchopf/KeyphraseVectorizers) - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 117 stars]

### Further Reading
* 📙 [Adding a custom tokenizer to spaCy and extracting keywords from Chinese texts](https://howard-haowen.github.io/blog.ai/keyword-extraction/spacy/textacy/ckip-transformers/jieba/textrank/rake/2021/02/16/Adding-a-custom-tokenizer-to-spaCy-and-extracting-keywords.html) by Haowen Jiang [Blog, Feb 2021]
* 📙 [How to Extract Relevant Keywords with KeyBERT](https://towardsdatascience.com/how-to-extract-relevant-keywords-with-keybert-6e7b3cf889ae) [Blog, June 2021]

![Responsible-NLP](./Resources/Images/pandect_pegasus.png)
-----
> __Note__
> Section keywords: ethics, responsible NLP

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### NLP and ML Interpretability

#### NLP-centric
* 🎥️ [Explainability for Natural Language Processing - KDD'2021 Tutorial](https://www.youtube.com/watch?v=PvKOSYGclPk&t=2s) [[Slides](https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249992241), Presentation, August 2021]
* ⭐ [ecco](https://github.com/jalammar/ecco) - Tools to visualize and explore NLP language models [GitHub, 1548 stars]
* ⭐ [NLP Profiler](https://github.com/neomatrix369/nlp_profiler) - A simple NLP library that allows profiling datasets with text columns [GitHub, 223 stars]
* ⭐ [transformers-interpret](https://github.com/cdpierse/transformers-interpret) - Model explainability that works seamlessly with transformers [GitHub, 905 stars]
* ⭐ [Awesome-explainable-AI](https://github.com/wangyongjie-ntu/Awesome-explainable-AI) - collection of research materials on explainable AI/ML [GitHub, 780 stars]
* ⭐ [LAMA](https://github.com/facebookresearch/LAMA) - LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models [GitHub, 956 stars]

#### General
* ⭐ [Language Interpretability Tool (LIT)](https://github.com/PAIR-code/lit) [GitHub, 3020 stars]
* ⭐ [WhatLies](https://github.com/RasaHQ/whatlies) - Toolkit to help visualise - what lies in word embeddings [GitHub, 435 stars]
* ⭐ [Interpret-Text](https://github.com/interpretml/interpret-text) - Interpretability techniques and visualization dashboards for NLP models [GitHub, 340 stars]
* ⭐ [InterpretML](https://github.com/interpretml/interpret) - Fit interpretable models. Explain blackbox machine learning [GitHub, 5155 stars]
* ⭐ [thermostat](https://github.com/DFKI-NLP/thermostat) - Collection of NLP model explanations and accompanying analysis tools [GitHub, 126 stars]
* ⭐ [Dodrio](https://github.com/poloclub/dodrio) - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 245 stars]
* ⭐ [imodels](https://github.com/csinva/imodels) - package for concise, transparent, and accurate predictive modeling [GitHub, 971 stars]

### Ethics, Bias, and Equality in NLP
* 📙 [Bias in Natural Language Processing @EMNLP 2020](https://gaurav-maheshwari.medium.com/bias-in-natural-language-processing-emnlp-2020-8f1cb2806fcc#cc1a) [Blog, Nov 2020]
* 🎥️ [Machine Learning as a Software Engineering Enterprise](https://nips.cc/virtual/2020/public/invited_16166.html) - NeurIPS 2020 Keynote [Presentation, Dec 2020]
* 📙 [Computational Ethics for NLP](http://demo.clab.cs.cmu.edu/ethical_nlp/) - course resources from Carnegie Mellon University [Lecture Notes, Spring 2020]
* 🗂️ [Ethics in NLP](https://aclweb.org/aclwiki/Ethics_in_NLP) - resources from ACL's Ethics in NLP track
* 🗂️ [The Institute for Ethical AI & Machine Learning](https://ethical.institute)
* 📙 [Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models](https://arxiv.org/abs/2102.02503) [Paper, Feb 2021]
* ⭐ [Fairness-in-AI](https://github.com/dreji18/Fairness-in-AI) - this package is used to detect and mitigate biases in NLP tasks [GitHub, 24 stars]
* ⭐ [nlg-bias](https://github.com/ewsheng/nlg-bias) - dataset + classifier tools to study social perception biases in natural language generation [GitHub, 46 stars]
* 🗂️ [bias-in-nlp](https://github.com/cisnlp/bias-in-nlp) - list of papers related to bias in NLP [GitHub, 9 stars]

### Adversarial Attacks for NLP
* 📙 [Privacy Considerations in Large Language Models](https://ai.googleblog.com/2020/12/privacy-considerations-in-large.html?m=1) [Blog, Dec 2020]
* ⭐ [DeepWordBug](https://github.com/QData/deepWordBug) - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 57 stars]
* ⭐ [Adversarial-Misspellings](https://github.com/danishpruthi/Adversarial-Misspellings) - Combating Adversarial Misspellings with Robust Word Recognition [GitHub, 57 stars]

### Hate Speech Analysis
* ⭐ [HateXplain](https://github.com/hate-alert/HateXplain) - BERT for detecting abusive language [GitHub, 135 stars]

![The-NLP-Frameworks](./Resources/Images/pandect_frameworks.png)
-----
> __Note__
> Section keywords: frameworks

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### General Purpose
* ⭐ [spaCy](https://github.com/explosion/spaCy) by Explosion AI [GitHub, 24708 stars]
* ⭐ [flair](https://github.com/flairNLP/flair) by Zalando [GitHub, 12278 stars]
* ⭐ [AllenNLP](https://github.com/allenai/allennlp) by AI2 [GitHub, 11314 stars]
* ⭐ [stanza](https://github.com/stanfordnlp/stanza) (former Stanford NLP) [GitHub, 6413 stars]
* ⭐ [spaCy stanza](https://github.com/explosion/spacy-stanza) [GitHub, 660 stars]
* ⭐ [nltk](https://github.com/nltk/nltk) [GitHub, 11280 stars]
* ⭐ [gensim](https://github.com/RaRe-Technologies/gensim) - framework for topic modeling [GitHub, 13760 stars]
* ⭐ [pororo](https://github.com/kakaobrain/pororo) - Platform of neural models for natural language processing [GitHub, 1164 stars]
* ⭐ [NLP Architect](https://github.com/NervanaSystems/nlp-architect) - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2883 stars]
* ⭐ [FARM](https://github.com/deepset-ai/FARM) [GitHub, 1597 stars]
* ⭐ [gobbli](https://github.com/RTIInternational/gobbli) by RTI International [GitHub, 268 stars]
* ⭐ [headliner](https://github.com/as-ideas/headliner) - training and deployment of seq2seq models [GitHub, 231 stars]
* ⭐ [SyferText](https://github.com/OpenMined/SyferText) - A privacy preserving NLP framework [GitHub, 190 stars]
* ⭐ [DeText](https://github.com/linkedin/detext) - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1230 stars]
* ⭐ [TextHero](https://github.com/jbesomi/texthero) - Text preprocessing, representation and visualization [GitHub, 2635 stars]
* ⭐ [textblob](https://github.com/sloria/textblob) - TextBlob: Simplified Text Processing [GitHub, 8373 stars]
* ⭐ [AdaptNLP](https://github.com/Novetta/adaptnlp) - A high level framework and library for NLP [GitHub, 407 stars]
* ⭐ [textacy](https://github.com/chartbeat-labs/textacy) - NLP, before and after spaCy [GitHub, 1999 stars]
* ⭐ [texar](https://github.com/asyml/texar) - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2323 stars]
* ⭐ [jiant](https://github.com/nyu-mll/jiant) - jiant is an NLP toolkit [GitHub, 1449 stars]

### Data Augmentation
* ⭐ [WildNLP](https://github.com/MI2DataLab/WildNLP) - Text manipulation library to test NLP models [GitHub, 74 stars]
* ⭐ [snorkel](https://github.com/snorkel-team/snorkel) - Framework to generate training data [GitHub, 5338 stars]
* ⭐ [NLPAug](https://github.com/makcedward/nlpaug) - Data augmentation for NLP [GitHub, 3665 stars]
* ⭐ [SentAugment](https://github.com/facebookresearch/SentAugment) - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 361 stars]
* ⭐ [faker](https://github.com/joke2k/faker) - Python package that generates fake data for you [GitHub, 15129 stars]
* ⭐ [textflint](https://github.com/textflint/textflint) - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 585 stars]
* ⭐ [Parrot](https://github.com/PrithivirajDamodaran/Parrot_Paraphraser) - Practical and feature-rich paraphrasing framework [GitHub, 636 stars]
* ⭐ [AugLy](https://github.com/facebookresearch/AugLy) - data augmentations library for audio, image, text, and video [GitHub, 4616 stars]
* ⭐ [TextAugment](https://github.com/dsfsi/textaugment) - Python 3 library for augmenting text for natural language processing applications [GitHub, 290 stars]

### Adversarial NLP Attacks & Behavioral Testing
* ⭐ [TextAttack](https://github.com/QData/TextAttack) - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2161 stars]
* ⭐ [CleverHans](https://github.com/tensorflow/cleverhans) - adversarial example library for constructing NLP attacks and building defenses [GitHub, 5660 stars]
* ⭐ [CheckList](https://github.com/marcotcr/checklist) - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1806 stars]

### Transformer-oriented
* ⭐ [transformers](https://github.com/huggingface/transformers) by HuggingFace [GitHub, 75428 stars]
* ⭐ [Adapter Hub](https://github.com/Adapter-Hub/adapter-transformers) and its [documentation](https://docs.adapterhub.ml/index.html) - Adapter modules for Transformers [GitHub, 1110 stars]
* ⭐ [haystack](https://github.com/deepset-ai/haystack) - Transformers at scale for question answering & neural search. [GitHub, 6147 stars]

### Dialog Systems and Speech
* ⭐ [DeepPavlov](https://github.com/deepmipt/DeepPavlov) by MIPT [GitHub, 5933 stars]
* ⭐ [ParlAI](https://github.com/facebookresearch/ParlAI) by FAIR [GitHub, 9640 stars]
* ⭐ [rasa](https://github.com/RasaHQ/rasa) - Framework for Conversational Agents [GitHub, 15150 stars]
* ⭐ [wav2letter](https://github.com/facebookresearch/wav2letter) - Automatic Speech Recognition Toolkit [GitHub, 6149 stars]
* ⭐ [ChatterBot](https://github.com/gunthercox/ChatterBot) - conversational dialog engine for creating chat bots [GitHub, 12696 stars]
* ⭐ [SpeechBrain](https://github.com/speechbrain/speechbrain) - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 4935 stars]

### Word/Sentence-embeddings oriented
* ⭐ [MUSE](https://github.com/facebookresearch/MUSE) - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 3021 stars]
* ⭐ [vecmap](https://github.com/artetxem/vecmap) - A framework to learn cross-lingual word embedding mappings [GitHub, 604 stars]
* ⭐ [sentence-transformers](https://github.com/UKPLab/sentence-transformers) - Multilingual Sentence & Image Embeddings with BERT [GitHub, 8944 stars]

### Social Media Oriented
* ⭐ [Ekphrasis](https://github.com/cbaziotis/ekphrasis) - text processing tool, geared towards text from social networks [GitHub, 592 stars]

### Phonetics
* ⭐ [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer) - grapheme to phoneme conversion with deep learning [GitHub, 197 stars]

### Morphology
* ⭐ [LemmInflect](https://github.com/bjascob/LemmInflect) - python module for English lemmatization and inflection [GitHub, 186 stars]
* ⭐ [Inflect](https://github.com/jaraco/inflect) - generate plurals, ordinals, indefinite articles [GitHub, 757 stars]
* ⭐ [simplemma](https://github.com/adbar/simplemma) - simple multilingual lemmatizer for Python [GitHub]

### Multi-lingual tools
* ⭐ [polyglot](https://github.com/aboSamoor/polyglot) - Multi-lingual NLP Framework [GitHub, 2086 stars]
* ⭐ [trankit](https://github.com/nlp-uoregon/trankit) - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 649 stars]

### Distributed NLP / Multi-GPU NLP
* ⭐ [Spark NLP](https://github.com/JohnSnowLabs/spark-nlp) [GitHub, 3018 stars]
* ⭐ [Parallelformers: An Efficient Model Parallelization Toolkit for Deployment](https://github.com/tunib-ai/parallelformers) [GitHub, 548 stars]

### Machine Translation
* ⭐ [COMET](https://github.com/Unbabel/COMET) - A Neural Framework for MT Evaluation [GitHub, 191 stars]
* ⭐ [marian-nmt](https://github.com/marian-nmt/marian) - Fast Neural Machine Translation in C++ [GitHub, 974 stars]
* ⭐ [argos-translate](https://github.com/argosopentech/argos-translate) - Open source neural machine translation in Python [GitHub, 1535 stars]
* ⭐ [Opus-MT](https://github.com/Helsinki-NLP/Opus-MT) - Open neural machine translation models and web services [GitHub, 257 stars]
* ⭐ [dl-translate](https://github.com/xhlulu/dl-translate) - A deep learning-based translation library built on Huggingface transformers [GitHub, 241 stars]

### Entity and String Matching
* ⭐ [PolyFuzz](https://github.com/MaartenGr/PolyFuzz) - Fuzzy string matching, grouping, and evaluation [GitHub, 589 stars]
* ⭐ [pyahocorasick](https://github.com/WojciechMula/pyahocorasick) - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 757 stars]
* ⭐ [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy String Matching in Python [GitHub, 8776 stars]
* ⭐ [jellyfish](https://github.com/jamesturk/jellyfish) - approximate and phonetic matching of strings [GitHub, 1759 stars]
* ⭐ [textdistance](https://github.com/life4/textdistance) - Compute distance between sequences [GitHub, 3000 stars]
* ⭐ [DeepMatcher](https://github.com/anhaidgroup/deepmatcher) - Python package for performing entity and text matching using deep learning [GitHub, 457 stars]
* ⭐ [RE2](https://github.com/alibaba-edu/simple-effective-text-matching) - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 336 stars]
* ⭐ [Machamp](https://github.com/megagonlabs/machamp) - Machamp: A Generalized Entity Matching Benchmark [GitHub, 9 stars]

### Discourse Analysis
* ⭐ [ConvoKit](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit) - Cornell Conversational Analysis Toolkit [GitHub, 399 stars]

### PII scrubbing
* ⭐ [scrubadub](https://github.com/LeapBeyond/scrubadub) - Clean personally identifiable information from dirty dirty text [GitHub, 309 stars]

### Hashtag Segmentation
* ⭐ [hashformers](https://github.com/ruanchaves/hashformers) - automatically inserting the missing spaces between the words in a hashtag [GitHub, 41 stars]

### Books Analysis / Literary Analysis
* ⭐ [booknlp](https://github.com/booknlp/booknlp) - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 647 stars]
* ⭐ [bookworm](https://github.com/harrisonpim/bookworm) - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 73 stars]

### Non-English oriented
#### Japanese
* ⭐ [fugashi](https://github.com/polm/fugashi) - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 268 stars]
* ⭐ [SudachiPy](https://github.com/WorksApplications/SudachiPy) - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 330 stars]
* ⭐ [Konoha](https://github.com/himkt/konoha) - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 182 stars]
* ⭐ [jProcessing](https://github.com/kevincobain2000/jProcessing) - Japanese Natural Language Processing Libraries [GitHub, 142 stars]
* ⭐ [Ginza](https://github.com/megagonlabs/ginza) - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 620 stars]
* ⭐ [kuromoji](https://github.com/atilika/kuromoji) - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 847 stars]
* ⭐ [nagisa](https://github.com/taishi-i/nagisa) - Japanese tokenizer based on recurrent neural networks [GitHub, 321 stars]
* ⭐ [KyTea](https://github.com/neubig/kytea) - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 190 stars]
* ⭐ [Jigg](https://github.com/mynlp/jigg) - Pipeline framework for easy natural language processing [GitHub, 72 stars]
* ⭐ [Juman++](https://github.com/ku-nlp/jumanpp) - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 321 stars]
* ⭐ [RakutenMA](https://github.com/rakuten-nlp/rakutenma) - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 447 stars]
* ⭐ [toiro](https://github.com/taishi-i/toiro) - a comparison tool of Japanese tokenizers [GitHub, 105 stars]

#### Thai
* ⭐ [AttaCut](https://github.com/PyThaiNLP/attacut) - Fast and Reasonably Accurate Word Tokenizer for Thai [GitHub, 68 stars]
* ⭐ [ThaiLMCut](https://github.com/meanna/ThaiLMCUT) - Word Tokenizer for Thai Language [GitHub, 15 stars]

#### Chinese
* ⭐ [Spacy-pkuseg](https://github.com/explosion/spacy-pkuseg) - The pkuseg toolkit for multi-domain Chinese word segmentation [GitHub, 20 stars]

#### Other
* ⭐ [textblob-de](https://github.com/markuskiller/textblob-de) - TextBlob: Simplified Text Processing for German [GitHub, 95 stars]
* ⭐ [Kashgari](https://github.com/BrikerMan/Kashgari) - Transfer Learning with focus on Chinese [GitHub, 2333 stars]
* ⭐ [Underthesea](https://github.com/undertheseanlp/underthesea) - Vietnamese NLP Toolkit [GitHub, 1057 stars]
* ⭐ [PTT5](https://github.com/unicamp-dl/PTT5) - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 62 stars]

### Text Data Labelling
* ⭐ [Small-Text](https://github.com/webis-de/small-text) - Active Learning for Text Classification in Python [GitHub, 369 stars]
* ⭐ [Doccano](https://github.com/doccano/doccano) - open source annotation tool for machine learning practitioners [GitHub, 7005 stars]
* 🔱 [Prodigy](https://prodi.gy/) - annotation tool powered by active learning [Paid Service]

![The-NLP-Learning](./Resources/Images/pandect_learning.png)
-----
> __Note__
> Section keywords: learn NLP

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

#### General
* 📙 [Learn NLP the practical way](https://towardsdatascience.com/learn-nlp-the-practical-way-b854ce1035c4) [Blog, Nov. 2019]
* 📙 [Learn NLP the Stanford way](https://towardsdatascience.com/learn-nlp-the-stanford-way-lesson-1-3f1844265760) ([+Part 2](https://towardsdatascience.com/learn-nlp-the-stanford-way-lesson-2-7447f2c12b36)) [Blog, Nov 2020]
* 📙 [Choosing the right course for a Practical NLP Engineer](https://airev.us/ultimate-guide-to-natural-language-processing-courses/)
* 📙 [12 Best Natural Language Processing Courses & Tutorials to Learn Online](https://blog.coursesity.com/best-natural-language-processing-courses/)
* ⭐ [Treasure of Transformers](https://github.com/ashishpatel26/Treasure-of-Transformers) - Natural Language processing papers, videos, blogs, official repos along with colab Notebooks [GitHub, 563 stars]
* 🎥️ [Rasa Algorithm Whiteboard](https://www.youtube.com/playlist?list=PL75e0qA87dlG-za8eLI6t0_Pbxafk-cxb) - YouTube series by Rasa explaining various Data Science and NLP Algorithms
* 🎥️ [ExplosionAI Videos](https://www.youtube.com/c/ExplosionAI/videos) - YouTube series by ExplosionAI teaching you how to use spaCy and apply it to NLP

#### Courses
* 🎥️ [CS25: Transformers United Stanford - Fall 2021](https://web.stanford.edu/class/cs25/) [Course, Fall 2021]
* 📙 [NLP Course | For You](https://lena-voita.github.io/nlp_course.html) - Great and interactive course on NLP
* 📙 [OpenClass NLP](https://openclass.ai/catalog/nlp) - Natural language processing (NLP) assignments
* 📙 [Advanced NLP with spaCy](https://course.spacy.io/en/) - how to use spaCy to build advanced natural language understanding systems
* 📙 [Transformer models for NLP](https://huggingface.co/course/chapter1) by HuggingFace
* 🎥️ [Stanford NLP Seminar](https://nlp.stanford.edu/seminar/) - slides from the Stanford NLP course

#### Books
* 📙 [Natural Language Processing with Transformers](https://www.buecher.de/shop/maschinelles-lernen/natural-language-processing-with-transformers/tunstall-lewis-von-werra-leandro-wolf-thomas/products_products/detail/prod_id/64140211/) [Book, February 2022]
* 📙 [Applied Natural Language Processing in the Enterprise](https://www.oreilly.com/library/view/applied-natural-language/9781492062561/) [Book, May 2021]
* 📙 [Practical Natural Language Processing](https://www.oreilly.com/library/view/practical-natural-language/9781492054047/) [Book, June 2020]
* 📙 [Dive into Deep Learning](https://d2l.ai/index.html) - An interactive deep learning book with code, math, and discussions
* 📙 [Natural Language Processing and Computational Linguistics](https://www.amazon.de/Natural-Language-Processing-Computational-Linguistics/dp/1848218486) - Speech, Morphology and Syntax (Cognitive Science)
* 📙 [Top NLP Books to Read 2020](https://towardsdatascience.com/top-nlp-books-to-read-2020-12012ef41dc1) - Blog post by Raymond Cheng [Blog, Sep 2020]

#### Tutorials
* ⭐ [nlp-tutorial](https://github.com/lyeoni/nlp-tutorial) - A list of NLP (Natural Language Processing) tutorials built on PyTorch [GitHub, 1324 stars]
* ⭐ [nlp-tutorial](https://github.com/graykode/nlp-tutorial) - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 11796 stars]
* ⭐ [Hands-On NLTK Tutorial](https://github.com/hb20007/hands-on-nltk-tutorial) [GitHub, 506 stars]
* ⭐ [Modern Practical Natural Language Processing](https://github.com/jmugan/modern_practical_nlp) [GitHub, 260 stars]
* ⭐ [Transformers-Tutorials](https://github.com/NielsRogge/Transformers-Tutorials) - demos with the Transformers library by HuggingFace [GitHub, 3408 stars]
* 🗂️ [CalmCode Tutorials](https://calmcode.io/#science) - Set of Python Data Science Tutorials

![The-NLP-Communities](./Resources/Images/pandect_communities.png)
-----
* [r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/) - NLP Reddit forum

![Other-NLP-Topics](Resources/Images/pandect_papyrus_other.png)
-----

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

#### Tokenization
* ⭐ [tokenizers](https://github.com/huggingface/tokenizers) - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 6064 stars]
* ⭐ [SentencePiece](https://github.com/google/sentencepiece) - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 6316 stars]
* ⭐ [SoMaJo](https://github.com/tsproisl/SoMaJo) - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 108 stars]

#### Data Augmentation and Weak Supervision
##### Libraries and Frameworks
* ⭐ [WildNLP](https://github.com/MI2DataLab/WildNLP) - Text manipulation library to test NLP models [GitHub, 74 stars]
* ⭐ [NLPAug](https://github.com/makcedward/nlpaug) - Data augmentation for NLP [GitHub, 3665 stars]
* ⭐ [SentAugment](https://github.com/facebookresearch/SentAugment) - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 361 stars]
* ⭐ [TextAttack](https://github.com/QData/TextAttack) - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2161 stars]
* ⭐ [skweak](https://github.com/NorskRegnesentral/skweak) - software toolkit for weak supervision applied to NLP tasks [GitHub, 843 stars]
* ⭐ [NL-Augmenter](https://github.com/GEM-benchmark/NL-Augmenter) - Collaborative Repository of Natural Language Transformations [GitHub, 679 stars]
* ⭐ [EDA](https://github.com/jasonwei20/eda_nlp) - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1356 stars]
* ⭐ [snorkel](https://github.com/snorkel-team/snorkel) - Framework to generate training data [GitHub, 5338 stars]

##### Reading Material and Tutorials
* ⭐ [A Survey of Data Augmentation Approaches for NLP](https://arxiv.org/abs/2105.03075) [Paper, May 2021] [GitHub Link](https://github.com/styfeng/DataAug4NLP)
* 📙 [A Visual Survey of Data Augmentation in NLP](https://amitness.com/2020/05/data-augmentation-for-nlp/) [Blog, 2020]
* 📙 [Weak Supervision: A New Programming Paradigm for Machine Learning](http://ai.stanford.edu/blog/weak-supervision/) [Blog, March 2019]

#### Named Entity Recognition (NER)
* ⭐ [Datasets for Entity Recognition](https://github.com/juand-r/entity-recognition-datasets) [GitHub, 1255 stars]
* ⭐ [Datasets to train supervised classifiers for Named-Entity Recognition](https://github.com/davidsbatista/NER-datasets) [GitHub, 297 stars]
* ⭐ [Bootleg](https://github.com/HazyResearch/bootleg) - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 189 stars]
* ⭐ [Few-NERD](https://github.com/thunlp/Few-NERD) - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 318 stars]

#### Relation Extraction
* ⭐ [tacred-relation](https://github.com/yuhaozhang/tacred-relation) - TACRED: position-aware attention model for relation extraction [GitHub, 336 stars]
* ⭐ [tacrev](https://github.com/DFKI-NLP/tacrev) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 55 stars]
* ⭐ [tac-self-attention](https://github.com/ivan-bilan/tac-self-attention) - Relation extraction with position-aware self-attention [GitHub, 64 stars]
* ⭐ [Re-TACRED](https://github.com/gstoica27/Re-TACRED) - Re-TACRED: Addressing Shortcomings of the TACRED Dataset [GitHub, 39 stars]

#### Coreference Resolution
* ⭐ [NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks](https://github.com/huggingface/neuralcoref) by HuggingFace [GitHub, 2627 stars]
* ⭐ [coref](https://github.com/mandarjoshi90/coref) - BERT and SpanBERT for Coreference Resolution [GitHub, 399 stars]

#### Sentiment Analysis
* ⭐ [Reading list for Awesome Sentiment Analysis papers](https://github.com/declare-lab/awesome-sentiment-analysis) by [declare-lab](https://github.com/declare-lab) [GitHub, 475 stars]
* ⭐ [Awesome Sentiment Analysis](https://github.com/xiamx/awesome-sentiment-analysis) by [xiamx](https://github.com/xiamx) [GitHub, 884 stars]

#### Domain Adaptation
* ⭐ [Neural Adaptation in Natural Language Processing - curated list](https://github.com/bplank/awesome-neural-adaptation-in-NLP) [GitHub, 246 stars]

#### Low Resource NLP
* ⭐ [CMU LTI Low Resource NLP Bootcamp 2020](https://github.com/neubig/lowresource-nlp-bootcamp-2020) - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 555 stars]

#### Spell Correction / Error Correction
* ⭐ [Gramformer](https://github.com/PrithivirajDamodaran/Gramformer) - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1244 stars]
* ⭐ [NeuSpell](https://github.com/neuspell/neuspell) - A Neural Spelling Correction Toolkit [GitHub, 515 stars]
* ⭐ [SymSpellPy](https://github.com/mammothb/symspellpy) - Python port of SymSpell [GitHub, 641 stars]
* 📙 [Speller100](https://www.microsoft.com/en-us/research/blog/speller100-zero-shot-spelling-correction-at-scale-for-100-plus-languages/) by Microsoft [Blog, Feb 2021]
* ⭐ [JamSpell](https://github.com/bakwc/JamSpell) - spell checking library - accurate, fast, multi-language [GitHub, 527 stars]
* ⭐ [pycorrector](https://github.com/shibing624/pycorrector) - spell correction for Chinese [GitHub, 3714 stars]
* ⭐ [contractions](https://github.com/kootenpv/contractions) - Fixes contractions such as `you're` to `you are` [GitHub, 262 stars]

#### Style Transfer for NLP
* ⭐ [Styleformer](https://github.com/PrithivirajDamodaran/Styleformer) - Neural Language Style Transfer framework [GitHub, 427 stars]
* ⭐ [StylePTB](https://github.com/lvyiwei1/StylePTB) - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 51 stars]

#### Automata Theory for NLP
* ⭐ [pyahocorasick](https://github.com/WojciechMula/pyahocorasick) - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 757 stars]

#### Obscene words detection
* ⭐ [LDNOOBW](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 1988 stars]

#### Reddit Analysis
* ⭐ [Subreddit Analyzer](https://github.com/PhantomInsights/subreddit-analyzer) - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 483 stars]

#### Skill Detection
* ⭐ [SkillNER](https://github.com/AnasAito/SkillNER) - rule based NLP module to extract job skills from text [GitHub, 71 stars]

#### Reinforcement Learning for NLP
* ⭐ [nlp-gym](https://github.com/rajcscw/nlp-gym) - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 132 stars]

#### AutoML / AutoNLP
* ⭐ [AutoNLP](https://github.com/huggingface/autonlp) - Faster and easier training and deployments of SOTA NLP models [GitHub, 689 stars]
* ⭐ [TPOT](https://github.com/EpistasisLab/tpot) - Python Automated Machine Learning tool [GitHub, 8826 stars]
* ⭐ [Auto-PyTorch](https://github.com/automl/Auto-PyTorch) - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 1862 stars]
* ⭐ [HungaBunga](https://github.com/ypeleg/HungaBunga) - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 674 stars]
* 🔱 [AutoML Natural Language](https://cloud.google.com/natural-language/automl/docs) - Google's paid AutoML NLP service
* ⭐ [Optuna](https://github.com/optuna/optuna) - hyperparameter optimization framework [GitHub, 7255 stars]
* ⭐ [FLAML](https://github.com/microsoft/FLAML) - fast and lightweight AutoML library [GitHub, 2154 stars]
* ⭐ [Gradsflow](https://github.com/gradsflow/gradsflow) - open-source AutoML & PyTorch Model Training Library [GitHub, 289 stars]

#### OCR - Optical Character Recognition
* 📙 [A framework for designing document processing solutions](https://ljvmiranda921.github.io/notebook/2022/06/19/document-processing-framework/) [Blog, June 2022]

#### Document AI
* 📙 [Table Transformer](https://huggingface.co/docs/transformers/main/model_doc/table-transformer) + [HuggingFace Models](https://huggingface.co/models?other=table-transformer)

#### Text Generation
* ⭐ [keytotext](https://github.com/gagan3012/keytotext) - a model which will take keywords as inputs and generate sentences as outputs [GitHub, 353 stars]
* 📙 [Controllable Neural Text Generation](https://lilianweng.github.io/lil-log/2021/01/02/controllable-neural-text-generation.html) [Blog, Jan 2021]
* ⭐ [BARTScore](https://github.com/neulab/BARTScore) - Evaluating Generated Text as Text Generation [GitHub, 192 stars]

#### Title / Headlines Generation
* ⭐ [TitleStylist](https://github.com/jind11/TitleStylist) - Learning to Generate Headlines with Controlled Styles [GitHub, 72 stars]

#### NLP research reproducibility
* 📙 [A Systematic Review of Reproducibility Research in Natural Language Processing](https://arxiv.org/abs/2103.07929) [Paper, March 2021]

## License [CC0](./LICENSE)

## Attributions
#### Resources
* All linked resources belong to original authors

#### Icons
* [Akropolis](https://thenounproject.com/search/?q=ancient%20greek&i=403786) by parkjisun from the [Noun Project](https://thenounproject.com)
* [Book](https://thenounproject.com/icon/304884/) of Ester by Gilad Sotil from the [Noun Project](https://thenounproject.com)
* [quill](https://thenounproject.com/term/quill/17013/) by Juan Pablo Bravo from the [Noun Project](https://thenounproject.com)
* [acting](https://thenounproject.com/term/acting/2369397/) by Flatart from the [Noun Project](https://thenounproject.com)
* [olympic](https://thenounproject.com/term/olympic/1870751/) by supalerk laipawat from the [Noun Project](https://thenounproject.com)
* [aristocracy](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3156156) by Eucalyp from the [Noun Project](https://thenounproject.com)
* [Horn](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3156640) by Eucalyp from the [Noun Project](https://thenounproject.com)
* [temple](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3156638) by Eucalyp from the [Noun Project](https://thenounproject.com)
* [constellation](https://thenounproject.com/eucalyp/collection/ancient-greece-glyph/?i=3156142) by Eucalyp from the [Noun Project](https://thenounproject.com)
* [ancient greek round pattern](https://thenounproject.com/term/ancient-greek-round-pattern/2048889/) by Olena Panasovska from the [Noun Project](https://thenounproject.com)
* Harp by Vectors Point from the [Noun Project](https://thenounproject.com)
* [Atlas](https://thenounproject.com/naripuru/collection/ancient-gods/?i=2225785) by parkjisun from the [Noun Project](https://thenounproject.com)
* [Parthenon](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3158942) by Eucalyp from the [Noun Project](https://thenounproject.com)
* [papyrus](https://thenounproject.com/iconmark/collection/greek-mythology/?i=3515982) by IconMark from the [Noun Project](https://thenounproject.com)
* [papyrus](https://thenounproject.com/search/?q=papyrus&i=2239368) by Smalllike from the [Noun Project](https://thenounproject.com)
* [pegasus](https://thenounproject.com/search/?q=pegasus&i=2266449) by Saeful Muslim from the [Noun Project](https://thenounproject.com)

#### Fonts
* [Dalek Font](https://www.dafont.com/dalek.font)

-----

The Pandect Series also includes