{"id":13665729,"url":"https://github.com/ivan-bilan/The-NLP-Pandect","last_synced_at":"2025-04-26T08:33:08.155Z","repository":{"id":37239441,"uuid":"268292479","full_name":"ivan-bilan/The-NLP-Pandect","owner":"ivan-bilan","description":"A comprehensive reference for all topics related to Natural Language Processing","archived":false,"fork":false,"pushed_at":"2024-10-06T14:42:26.000Z","size":1209,"stargazers_count":2022,"open_issues_count":2,"forks_count":283,"subscribers_count":124,"default_branch":"master","last_synced_at":"2025-04-23T09:02:05.880Z","etag":null,"topics":["awesome-list","deeplearning","natural-language-processing","naturallanguageprocessing","nlp","pandect"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ivan-bilan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-31T14:04:29.000Z","updated_at":"2025-04-21T08:57:14.000Z","dependencies_parsed_at":"2023-02-17T06:15:54.365Z","dependency_job_id":"f5657ba4-62d4-4600-b92d-09bc931679d0","html_url":"https://github.com/ivan-bilan/The-NLP-Pandect","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivan-bilan%2FThe-NLP-Pandect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivan-bilan%2FThe-NLP-Pandect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivan-bilan%2FThe-NLP-Pandect/releases","manifests_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/ivan-bilan%2FThe-NLP-Pandect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ivan-bilan","download_url":"https://codeload.github.com/ivan-bilan/The-NLP-Pandect/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250960797,"owners_count":21514520,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["awesome-list","deeplearning","natural-language-processing","naturallanguageprocessing","nlp","pandect"],"created_at":"2024-08-02T06:00:48.704Z","updated_at":"2025-04-26T08:33:03.143Z","avatar_url":"https://github.com/ivan-bilan.png","language":"Python","funding_links":[],"categories":["Python","Uncategorized","📖 Natural Language Processing (NLP)","Deep Learning Repositories","Natural Language Processing (NLP)","Course"],"sub_categories":["Uncategorized","Resources","Web Hosting"],"readme":"![The-NLP-Pandect](./Resources/Images/pandect.png)\n\n\u003cp align=\"center\"\u003e\nThis pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e \u003cimg src=\"https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/120/google/313/flag-ukraine_1f1fa-1f1e6.png\" alt=\"Ukraine\" width=\"50\" height=\"50\"/\u003e\n\n\u003e __Note__\n\u003e Quick legend on available resource types:\n\u003e \n\u003e ⭐ - open source project, usually a GitHub repository with its number of stars\n\u003e \n\u003e 📙 - resource you can 
read, usually a blog post or a paper\n\u003e \n\u003e 🗂️ - a collection of additional resources\n\u003e \n\u003e 🔱 - non-open source tool, framework or paid service\n\u003e \n\u003e 🎥️ - a resource you can watch\n\u003e \n\u003e 🎙️ - a resource you can listen to\n\n### \u003cp align=\"center\"\u003e\u003cb\u003eTable of Contents\u003c/b\u003e\u003c/p\u003e\n\n| 📇 Main Section  | 🗃️ Sub-sections Sample |\n| ------------- | ------------- |\n| [NLP Resources](https://github.com/ivan-bilan/The-NLP-Pandect#)  | [Paper Summaries](https://github.com/ivan-bilan/The-NLP-Pandect#papers-and-paper-summaries), [Conference Summaries](https://github.com/ivan-bilan/The-NLP-Pandect#conference-summaries), [NLP Datasets](https://github.com/ivan-bilan/The-NLP-Pandect#nlp-datasets) |\n| [NLP Podcasts](https://github.com/ivan-bilan/The-NLP-Pandect#-1)  | [NLP-only Podcasts](https://github.com/ivan-bilan/The-NLP-Pandect#nlp-only-podcasts), [Podcasts with many NLP Episodes](https://github.com/ivan-bilan/The-NLP-Pandect#many-nlp-episodes) |\n| [NLP Newsletters](https://github.com/ivan-bilan/The-NLP-Pandect#-2)  | -  |\n| [NLP Meetups](https://github.com/ivan-bilan/The-NLP-Pandect#-3)  | -  |\n| [NLP YouTube Channels](https://github.com/ivan-bilan/The-NLP-Pandect#-4)  | -  |\n| [NLP Benchmarks](https://github.com/ivan-bilan/The-NLP-Pandect#-5)  | [General NLU](https://github.com/ivan-bilan/The-NLP-Pandect#general-nlu), [Question Answering](https://github.com/ivan-bilan/The-NLP-Pandect#question-answering), [Multilingual](https://github.com/ivan-bilan/The-NLP-Pandect#multilingual-and-non-english-benchmarks)  |\n| [Research Resources](https://github.com/ivan-bilan/The-NLP-Pandect#-6) | [Resource on Transformer Models](https://github.com/ivan-bilan/The-NLP-Pandect#transformer-based-architectures), [Distillation and Pruning](https://github.com/ivan-bilan/The-NLP-Pandect#distillation-pruning-and-quantization), [Automated 
Summarization](https://github.com/ivan-bilan/The-NLP-Pandect#automated-summarization)  |\n| [Industry Resources](https://github.com/ivan-bilan/The-NLP-Pandect#-7)  | [Best Practices for NLP Systems](https://github.com/ivan-bilan/The-NLP-Pandect#best-practices-for-nlp), [MLOps for NLP](https://github.com/ivan-bilan/The-NLP-Pandect#mlops-for-nlp) |\n| [Speech Recognition](https://github.com/ivan-bilan/The-NLP-Pandect#-8)  | [General Resources](https://github.com/ivan-bilan/The-NLP-Pandect#general-speech-recognition), [Text to Speech](https://github.com/ivan-bilan/The-NLP-Pandect#text-to-speech), [Speech to Text](https://github.com/ivan-bilan/The-NLP-Pandect#speech-to-text), [Datasets](https://github.com/ivan-bilan/The-NLP-Pandect#datasets)  |\n| [Topic Modeling](https://github.com/ivan-bilan/The-NLP-Pandect#-9)  | [Blogs](https://github.com/ivan-bilan/The-NLP-Pandect#blogs-1), [Frameworks](https://github.com/ivan-bilan/The-NLP-Pandect#frameworks-for-topic-modeling), [Repositories and Projects](https://github.com/ivan-bilan/The-NLP-Pandect#repositories-1)  |\n| [Keyword Extraction](https://github.com/ivan-bilan/The-NLP-Pandect#-10)  | [Text Rank](https://github.com/ivan-bilan/The-NLP-Pandect#text-rank), [Rake](https://github.com/ivan-bilan/The-NLP-Pandect#rake---rapid-automatic-keyword-extraction), [Other Approaches](https://github.com/ivan-bilan/The-NLP-Pandect#other-approaches) |\n| [Responsible NLP](https://github.com/ivan-bilan/The-NLP-Pandect#-11)  | [NLP and ML Interpretability](https://github.com/ivan-bilan/The-NLP-Pandect#nlp-and-ml-interpretability), [Ethics, Bias, and Equality in NLP](https://github.com/ivan-bilan/The-NLP-Pandect#ethics-bias-and-equality-in-nlp), [Adversarial Attacks for NLP](https://github.com/ivan-bilan/The-NLP-Pandect#adversarial-attacks-for-nlp) |\n| [NLP Frameworks](https://github.com/ivan-bilan/The-NLP-Pandect#-12)  | [General Purpose](https://github.com/ivan-bilan/The-NLP-Pandect#general-purpose), [Data 
Augmentation](https://github.com/ivan-bilan/The-NLP-Pandect#data-augmentation), [Machine Translation](https://github.com/ivan-bilan/The-NLP-Pandect#machine-translation), [Adversarial Attacks](https://github.com/ivan-bilan/The-NLP-Pandect#adversarial-nlp-attacks--behavioral-testing), [Dialog Systems \u0026 Speech](https://github.com/ivan-bilan/The-NLP-Pandect#dialog-systems-and-speech), [Entity and String Matching](https://github.com/ivan-bilan/The-NLP-Pandect#entity-and-string-matching), [Non-English Frameworks](https://github.com/ivan-bilan/The-NLP-Pandect#non-english-oriented), [Text Annotation](https://github.com/ivan-bilan/The-NLP-Pandect#text-data-labelling) |\n| [Learning NLP](https://github.com/ivan-bilan/The-NLP-Pandect#-13)  | [Courses](https://github.com/ivan-bilan/The-NLP-Pandect#courses), [Books](https://github.com/ivan-bilan/The-NLP-Pandect#books), [Tutorials](https://github.com/ivan-bilan/The-NLP-Pandect#tutorials)  |\n| [NLP Communities](https://github.com/ivan-bilan/The-NLP-Pandect#-14)  | -  |\n| [Other NLP Topics](https://github.com/ivan-bilan/The-NLP-Pandect#-15)  | [Tokenization](https://github.com/ivan-bilan/The-NLP-Pandect#tokenization), [Data Augmentation](https://github.com/ivan-bilan/The-NLP-Pandect#data-augmentation-and-weak-supervision), [Named Entity Recognition](https://github.com/ivan-bilan/The-NLP-Pandect#named-entity-recognition-ner), [Error Correction](https://github.com/ivan-bilan/The-NLP-Pandect#spell-correction--error-correction), [AutoML/AutoNLP](https://github.com/ivan-bilan/The-NLP-Pandect#automl--autonlp), [Text Generation](https://github.com/ivan-bilan/The-NLP-Pandect#text-generation) |\n\n\n![The-NLP-Resources](./Resources/Images/pandect_resources.png)\n-----\n\u003e __Note__\n\u003e Section keywords: paper summaries, compendium, awesome list\n\n#### Compendiums and awesome lists on the topic of NLP:\n* 🗂️ [The NLP Index](https://index.quantumstat.com) - Searchable Index of NLP Papers by Quantum Stat / NLP Cypher\n* ⭐ 
[Awesome NLP](https://github.com/keon/awesome-nlp) by [keon](https://github.com/keon) [GitHub, 13963 stars]\n* ⭐ [Speech and Natural Language Processing Awesome List](https://github.com/edobashira/speech-language-processing#readme) by [elaboshira](https://github.com/edobashira) [GitHub, 2121 stars]\n* ⭐ [Awesome Deep Learning for Natural Language Processing (NLP)](https://github.com/brianspiering/awesome-dl4nlp) [GitHub, 1094 stars]\n* ⭐ [Text Mining and Natural Language Processing Resources](https://github.com/stepthom/text_mining_resources) by [stepthom](https://github.com/stepthom) [GitHub, 505 stars]\n* 🗂️ [Made with ML List](https://madewithml.com/topics/#nlp) by [madewithml.com](https://madewithml.com)\n* 🗂️ [Brainsources for #NLP enthusiasts](https://www.notion.so/634eba1a37d34e2baec1bb574a8a5482) by [Philip Vollet](https://www.linkedin.com/in/philipvollet/)\n* ⭐ [Awesome AI/ML/DL - NLP Section](https://github.com/neomatrix369/awesome-ai-ml-dl/tree/master/natural-language-processing#natural-language-processing-nlp) [GitHub, 1142 stars]\n* 🗂️ [Resources on various machine learning topics](https://www.backprop.org) by Backprop\n* 🗂️ [NLP articles](https://devopedia.org/site-map/browse-articles/natural+language+processing) by [Devopedia](https://devopedia.org)\n\n#### NLP Conferences, Paper Summaries and Paper Compendiums:\n##### Papers and Paper Summaries\n* ⭐ [100 Must-Read NLP Papers](https://github.com/mhagiwara/100-nlp-papers) 100 Must-Read NLP Papers [GitHub, 3446 stars]\n* ⭐ [NLP Paper Summaries](https://github.com/dair-ai/nlp_paper_summaries) by [dair-ai](https://github.com/dair-ai) [GitHub, 1431 stars]\n* ⭐ [Curated collection of papers for the NLP practitioner](https://github.com/mihail911/nlp-library) [GitHub, 1059 stars]\n* ⭐ [Papers on Textual Adversarial Attack and Defense](https://github.com/thunlp/TAADpapers) [GitHub, 1182 stars]\n* ⭐ [Recent Deep Learning papers in NLU and RL](https://github.com/madrugado/deep-learning-nlp-rl-papers) by 
Valentin Malykh [GitHub, 291 stars]\n* ⭐ [A Survey of Surveys (NLP \u0026 ML): Collection of NLP Survey Papers](https://github.com/NiuTrans/ABigSurvey) [GitHub, 1713 stars]\n* ⭐ [A Paper List for Style Transfer in Text](https://github.com/fuzhenxin/Style-Transfer-in-Text) [GitHub, 1456 stars]\n* 🎥 [Video recordings index for papers](https://papertalk.org/index)\n\n##### Conference Summaries\n* ⭐ [NLP top 10 conferences Compendium](https://github.com/soulbliss/NLP-conference-compendium) by [soulbliss](https://github.com/soulbliss) [GitHub, 439 stars]\n* 📙 [ICLR 2020 Trends](https://gsarti.com/post/iclr2020-transformers/)\n* 📙 [SpacyIRL 2019 Conference in Overview](https://www.linkedin.com/pulse/spacyirl-2019-conference-overview-ivan-bilan/)\n* 📙 [Paper Digest](https://www.paperdigest.org/category/nlp/) - Conferences and Papers in Overview\n* 🎥 [Video Recordings from Conferences](https://crossminds.ai/explore/)\n\n#### NLP Progress and NLP Tasks:\n* ⭐ [NLP Progress](https://github.com/sebastianruder/NLP-progress) by [sebastianruder](https://github.com/sebastianruder) [GitHub, 21123 stars]\n* ⭐ [NLP Tasks](https://github.com/Kyubyong/nlp_tasks) by [Kyubyong](https://github.com/Kyubyong) [GitHub, 2984 stars]\n\n#### NLP Datasets:\n* ⭐ [NLP Datasets](https://github.com/niderhoff/nlp-datasets) by [niderhoff](https://github.com/niderhoff) [GitHub, 5225 stars]\n* ⭐ [Datasets](https://github.com/huggingface/datasets) by Huggingface [GitHub, 14838 stars]\n* 🗂️ [Big Bad NLP Database](https://datasets.quantumstat.com)\n* ⭐ [UWA Unambiguous Word Annotations](http://danlou.github.io/uwa/) - Word Sense Disambiguation Dataset\n* ⭐ [MLDoc](https://github.com/facebookresearch/MLDoc) - Corpus for Multilingual Document Classification in Eight Languages [GitHub, 145 stars]\n\n#### Word and Sentence embeddings:\n* ⭐ [Awesome Embedding Models](https://github.com/Hironsan/awesome-embedding-models) by [Hironsan](https://github.com/Hironsan) [GitHub, 1544 stars]\n* ⭐ [Awesome list of 
Sentence Embeddings](https://github.com/Separius/awesome-sentence-embedding) by [Separius](https://github.com/Separius) [GitHub, 2086 stars]\n* ⭐ [Awesome BERT](https://github.com/Jiakui/awesome-bert) by [Jiakui](https://github.com/Jiakui) [GitHub, 1797 stars]\n\n#### Notebooks, Scripts and Repositories\n* ⭐ [The Super Duper NLP Repo](https://notebooks.quantumstat.com) [Website, 2020]\n\n#### Non-English resources and Compendiums\n* ⭐ [NLP Resources for Bahasa Indonesian](https://github.com/louisowen6/NLP_bahasa_resources) [GitHub, 329 stars]\n* ⭐ [Indic NLP Catalog](https://github.com/AI4Bharat/indicnlp_catalog) [GitHub, 381 stars]\n* ⭐ [Pre-trained language models for Vietnamese](https://github.com/VinAIResearch/PhoBERT) [GitHub, 491 stars]\n* ⭐ [Natural Language Toolkit for Indic Languages (iNLTK)](https://github.com/goru001/inltk) [GitHub, 773 stars]\n* ⭐ [Indic NLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) [GitHub, 448 stars]\n* ⭐ [AI4Bharat-IndicNLP Portal](https://indicnlp.ai4bharat.org)\n* ⭐ [ARBML](https://github.com/ARBML/ARBML) - Implementation of many Arabic NLP and ML projects [GitHub, 284 stars]\n* ⭐ [zemberek-nlp](https://github.com/ahmetaa/zemberek-nlp) - NLP tools for Turkish [GitHub, 1021 stars]\n* ⭐ [TDD AI](https://tdd.ai) - An open-source platform for all Turkish datasets, language models, and NLP tools.\n* ⭐ [KLUE](https://github.com/KLUE-benchmark/KLUE) - Korean Language Understanding Evaluation [GitHub, 468 stars]\n* ⭐ [Persian NLP Benchmark](https://github.com/Mofid-AI/persian-nlp-benchmark) - benchmark for evaluation and comparison of various NLP tasks in Persian language [GitHub, 69 stars]\n* ⭐ [nlp-greek](https://github.com/Yuliya-HV/nlp-greek) -  Greek language sources [GitHub, 5 stars]\n* ⭐ [Awesome NLP Resources for Hungarian](https://github.com/oroszgy/awesome-hungarian-nlp) [GitHub, 160 stars]\n\n#### Pre-trained NLP models\n* ⭐ [List of pre-trained NLP 
models](https://github.com/balavenkatesh3322/NLP-pretrained-model) [GitHub, 163 stars]\n* 📙 [General Pretrained Language Models](https://mr-nlp.github.io/posts/2022/07/general-tptlms-list/) [Blog, July 2022]\n* ⭐ [Pretrained language models developed by Huawei Noah's Ark Lab](https://github.com/huawei-noah/Pretrained-Language-Model) [GitHub, 2547 stars]\n* ⭐ [Spanish Language Models and resources](https://github.com/PlanTL-GOB-ES/lm-spanish) [GitHub, 202 stars]\n* 🗂 [Monolingual Pretrained Language Models](https://mr-nlp.github.io/posts/2022/07/monolingual-tptlms-list/) - collection of available pre-trained models [Blog, July 2022]\n\n#### NLP History\n##### General\n* ⭐ [Modern Deep Learning Techniques Applied to Natural Language Processing](https://github.com/omarsar/nlp_overview) [GitHub, 1269 stars]\n* 📙 [A Review of the Neural History of Natural Language Processing](https://aylien.com/blog/a-review-of-the-recent-history-of-natural-language-processing) [Blog, October 2018]\n##### 2020 Year in Review\n* 📙 [Natural Language Processing in 2020: The Year In Review](https://www.linkedin.com/pulse/natural-language-processing-2020-year-review-ivan-bilan/) [Blog, December 2020]\n* 📙 [ML and NLP Research Highlights of 2020](https://ruder.io/research-highlights-2020/) [Blog, January 2021]\n\n\n![The-NLP-Podcasts](./Resources/Images/pandect_lyra.png)\n-----\n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n#### NLP-only podcasts\n* 🎙️ [NLP Highlights](https://soundcloud.com/nlp-highlights) [Years: 2017 - now, Status: active]\n* 🎙️ [The NLP Zone](https://de.player.fm/series/the-nlp-zone) [Episodes](https://player.captivate.fm/episode/e2f87641-1421-4729-a2b5-d64951c845c6) [Years: 2021 - now, Status: active]\n\n#### Many NLP episodes\n* 🎙️ [TWIML AI](https://twimlai.com) [Years: 2016 - now, Status: active]\n* 🎙️ [Practical AI](https://changelog.com/practicalai) [Years: 2018 - now, Status: active]\n* 🎙️ [The Data 
Exchange](https://thedataexchange.media) [Years: 2019 - now, Status: active]\n* 🎙️ [Gradient Dissent](https://www.wandb.com/podcast) [Years: 2020 - now, Status: active]\n* 🎙️ [Machine Learning Street Talk](https://open.spotify.com/show/02e6PZeIOdpmBGT9THuzwR) [Years: 2020 - now, Status: active]\n* 🎙️ [DataFramed](https://www.datacamp.com/community/podcast) -  latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active]\n\n#### Some NLP episodes\n* 🎙️ [The Super Data Science Podcast](https://www.superdatascience.com/podcast) [Years: 2016 - now, Status: active]\n* 🎙️ [Data Hack Radio](https://soundcloud.com/datahack-radio) [Years: 2018 - now, Status: active]\n* 🎙️ [AI Game Changers](https://podcasts.apple.com/de/podcast/ai-game-changers/id1512574291) [Years: 2020 - now, Status: active]\n* 🎙️ [The Analytics Show](https://anchor.fm/analyticsshow) [Years: 2019 - now, Status: active]\n\n![The-NLP-Newsletter](./Resources/Images/pandect_scroll.png)\n-----\n\n* 📙 [NLP News](https://ruder.io/nlp-news/) by [Sebastian Ruder](https://ruder.io)\n* 📙 [dair.ai Newsletter](https://dair.ai/newsletter/) by [dair.ai](dair.ai)\n* 📙 [This Week in NLP by Robert Dale](https://www.language-technology.com/twin)\n* 📙 [Papers with Code](https://paperswithcode.com)\n* 📙 [The Batch](https://www.deeplearning.ai/thebatch/) by [deeplearning.ai](https://www.deeplearning.ai/thebatch/)\n* 📙 [Paper Digest](https://www.paperdigest.org/2020/04/recent-papers-on-question-answering/) by [PaperDigest](https://www.paperdigest.org/daily-paper-digest/)\n* 📙 [NLP Cypher](https://medium.com/@quantumstat) by [QuantumStat](https://quantumstat.com)\n\n![The-NLP-Meetups](./Resources/Images/pandect_meetups.png)\n-----\n\n* 🎥 [NLP Zurich](https://www.linkedin.com/company/nlp-zurich/) [[YouTube Recordings](https://www.youtube.com/channel/UCLLX-5j9UNYassOwS0nveDQ)]\n* 🎥 [Hacking-Machine-Learning](https://www.meetup.com/Hacking-Machine-Learning) [[YouTube 
Recordings](https://www.youtube.com/channel/UCt5RvrC-_3X7FNAWhORVn7Q)]\n* 🎥 [NY-NLP (New York)](https://www.meetup.com/NY-NLP/)\n\n![The-NLP-Youtube](./Resources/Images/pandect_youtube.png)\n-----\n\n* 🎥 [Yannic Kilcher](https://www.youtube.com/channel/UCZHmQk67mSJgfCCTn7xBfew)\n* 🎥 [HuggingFace](https://www.youtube.com/channel/UCHlNU7kIZhRgSbhHvFoy72w)\n* 🎥 [Kaggle Reading Group](https://www.youtube.com/watch?v=PhTF7yJNR70\u0026list=PLqFaTIg4myu8t5ycqvp7I07jTjol3RCl9)\n* 🎥 [Rasa Paper Reading](https://www.youtube.com/channel/UCJ0V6493mLvqdiVwOKWBODQ/playlists)\n* 🎥 [Stanford CS224N: NLP with Deep Learning](https://www.youtube.com/watch?v=8rXD5-xhemo\u0026list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z)\n* 🎥 [NLPxing](https://www.youtube.com/channel/UCuGC1JusVvbOGa__qMtH3QA/videos)\n* 🎥 [ML Explained - A.I. Socratic Circles - AISC](https://www.youtube.com/channel/UCfk3pS8cCPxOgoleriIufyg)\n* 🎥 [Deeplearning.ai](https://www.youtube.com/channel/UCcIXc5mJsHVYTZR1maL5l9w/featured)\n* 🎥 [Machine Learning Street Talk](https://www.youtube.com/channel/UCMLtBahI5DMrt0NPvDSoIRQ/featured)\n\n![The-NLP-Benchmarks](./Resources/Images/pandect_benchmark.png)\n-----\n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n\n### General NLU\n* ⭐ [GLUE](https://gluebenchmark.com) - General Language Understanding Evaluation (GLUE) benchmark\n* ⭐ [SuperGLUE](https://super.gluebenchmark.com) - benchmark styled after GLUE with a new set of more difficult language understanding tasks\n* ⭐ [decaNLP](https://decanlp.com) - The Natural Language Decathlon (decaNLP) for studying general NLP models\n* ⭐ [dialoglue](https://github.com/alexa/dialoglue) - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue  [GitHub, 235 stars]\n* ⭐ [DynaBench](https://dynabench.org/) - Dynabench is a research platform for dynamic data collection and benchmarking\n* ⭐ [Big-Bench](https://github.com/google/BIG-bench) -  collaborative benchmark for 
measuring and extrapolating the capabilities of language models [GitHub, 1228 stars]\n\n### Summarization\n* ⭐ [WikiAsp](https://github.com/neulab/wikiasp) - WikiAsp: Multi-document aspect-based summarization Dataset\n* ⭐ [WikiLingua](https://github.com/esdurmus/Wikilingua) - A Multilingual Abstractive Summarization Dataset\n\n### Question Answering\n* ⭐ [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) - Stanford Question Answering Dataset (SQuAD)\n* ⭐ [XQuad](https://github.com/deepmind/xquad) - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering\n* ⭐ [GrailQA](https://dki-lab.github.io/GrailQA/) - Strongly Generalizable Question Answering (GrailQA)\n* ⭐ [CSQA](https://amritasaha1812.github.io/CSQA/) - Complex Sequential Question Answering\n\n### Multilingual and Non-English Benchmarks\n* 📙 [XTREME](https://arxiv.org/abs/2003.11080) -  Massively Multilingual Multi-task Benchmark\n* ⭐ [GLUECoS](https://github.com/microsoft/GLUECoS) - A benchmark for code-switched NLP\n* ⭐ [IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) - Natural Language Understanding Benchmark for Indic Languages\n* ⭐ [LinCE](https://ritual.uh.edu/lince/) - Linguistic Code-Switching Evaluation Benchmark\n* ⭐ [Russian SuperGlue](https://russiansuperglue.com) - Russian SuperGlue Benchmark\n\n### Bio, Law, and other scientific domains\n* ⭐ [BLURB](https://microsoft.github.io/BLURB/) - Biomedical Language Understanding and Reasoning Benchmark\n* ⭐ [BLUE](https://github.com/ncbi-nlp/BLUE_Benchmark) - Biomedical Language Understanding Evaluation benchmark\n* ⭐ [LexGLUE](https://github.com/coastalcph/lex-glue) - A Benchmark Dataset for Legal Language Understanding in English\n\n### Transformer Efficiency\n* ⭐ [Long-Range Arena](https://github.com/google-research/long-range-arena) - Long Range Arena for Benchmarking Efficient Transformers ([Pre-print](https://arxiv.org/abs/2011.04006)) [GitHub, 481 stars]\n\n### Speech Processing\n* ⭐ 
[SUPERB](http://superbbenchmark.org/) - Speech processing Universal PERformance Benchmark\n\n### Other\n* ⭐ [CodeXGLUE](https://www.microsoft.com/en-us/research/blog/codexglue-a-benchmark-dataset-and-open-challenge-for-code-intelligence/) - A benchmark dataset for code intelligence\n* ⭐ [CrossNER](https://github.com/zliucr/CrossNER) - CrossNER: Evaluating Cross-Domain Named Entity Recognition\n* ⭐ [MultiNLI](cims.nyu.edu/~sbowman/multinli/) - Multi-Genre Natural Language Inference corpus\n* ⭐ [iSarcasm: A Dataset of Intended Sarcasm](https://github.com/silviu-oprea/iSarcasm) - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic\n\n![The-NLP-Research](./Resources/Images/pandect_quill.png)\n-----\n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n\n### General\n* 📙 [A Recipe for Training Neural Networks](https://karpathy.github.io/2019/04/25/recipe/) by Andrej Karpathy [Keywords: research, training, 2019]\n* 📙 [Recent Advances in NLP via Large Pre-Trained Language Models: A Survey](https://arxiv.org/abs/2111.01243) [Paper, November 2021]\n\n### Embeddings\n#### Repositories\n* ⭐ [Pre-trained ELMo Representations for Many Languages](https://github.com/HIT-SCIR/ELMoForManyLangs) [GitHub, 1413 stars]\n* ⭐ [sense2vec](https://github.com/explosion/sense2vec) - Contextually-keyed word vectors [GitHub, 1449 stars]\n* ⭐ [wikipedia2vec](https://github.com/wikipedia2vec/wikipedia2vec) [GitHub, 831 stars]\n* ⭐ [StarSpace](https://github.com/facebookresearch/StarSpace) [GitHub, 3809 stars]\n* ⭐ [fastText](https://github.com/facebookresearch/fastText) [GitHub, 24067 stars]\n\n#### Blogs\n* 📙 [Language Models and Contextualised Word Embeddings](http://www.davidsbatista.net/blog/2018/12/06/Word_Embeddings/) by David S. 
Batista [Blog, 2018]\n* 📙 [An Essential Guide to Pretrained Word Embeddings for NLP Practitioners](https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-embeddings-nlp/?utm_source=AVLinkedin\u0026utm_medium=post\u0026utm_campaign=22_may_new_article) by AnalyticsVidhya [Blog, 2020]\n* 📙 [Polyglot Word Embeddings Discover Language Clusters](http://blog.shriphani.com/2020/02/03/polyglot-word-embeddings-discover-language-clusters/) [Blog, 2020]\n* 📙 [The Illustrated Word2vec](https://jalammar.github.io/illustrated-word2vec/) by Jay Alammar [Blog, 2019]\n\n#### Cross-lingual Word and Sentence Embeddings\n* ⭐ [vecmap](https://github.com/artetxem/vecmap) - VecMap (cross-lingual word embedding mappings) [GitHub, 604 stars]\n* ⭐ [sentence-transformers](https://github.com/UKPLab/sentence-transformers) - Multilingual Sentence \u0026 Image Embeddings with BERT [GitHub, 8944 stars]\n\n#### Byte Pair Encoding\n* ⭐ [bpemb](https://github.com/bheinzerling/bpemb) - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1081 stars]\n* ⭐ [subword-nmt](https://github.com/rsennrich/subword-nmt) - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 1972 stars]\n* ⭐ [python-bpe](https://github.com/soaxelbrooke/python-bpe) - Byte Pair Encoding for Python [GitHub, 188 stars]\n\n### Transformer-based Architectures\n#### General\n* 📙 [The Transformer Family](https://lilianweng.github.io/lil-log/2020/04/07/the-transformer-family.html) by Lilian Weng [Blog, 2020]\n* 📙 [Playing the lottery with rewards and multiple languages](https://arxiv.org/abs/1906.02768) - about the effect of random initialization [ICLR 2020 Paper]\n* 📙 [Attention? 
Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html) by Lilian Weng [Blog, 2018]\n* 📙 [the transformer … “explained”?](https://nostalgebraist.tumblr.com/post/185326092369/the-transformer-explained) [Blog, 2019]\n* 🎥️ [Attention is all you need; Attentional Neural Network Models](https://www.youtube.com/watch?v=rBCqOTEfxvg) by Łukasz Kaiser [Talk, 2017]\n* 🎥️ [Understanding and Applying Self-Attention for NLP](https://www.youtube.com/watch?v=OYygPG4d9H0) [Talk, 2018]\n* 📙 [The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures](https://arxiv.org/abs/2104.10640) [Paper, April 2021]\n* 📙 [Pre-Trained Models: Past, Present and Future](https://arxiv.org/abs/2106.07139) [Paper, June 2021]\n* 📙 [A Survey of Transformers](https://arxiv.org/abs/2106.04554) [Paper, June 2021]\n\n#### Transformer\n* 📙 [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html) by Harvard NLP [Blog, 2018]\n* 📙 [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/) by Jay Alammar [Blog, 2018]\n* 📙 [Illustrated Guide to Transformers](https://towardsdatascience.com/illustrated-guide-to-transformer-cf6969ffa067) by Hong Jing [Blog, 2020]\n* 📙 [Sequential Transformer with Adaptive Attention Span](https://github.com/facebookresearch/adaptive-span) by Facebook. 
[Blog](https://ai.facebook.com/blog/making-transformer-networks-simpler-and-more-efficient/) [Blog, 2019]\n* 📙 [Evolution of Representations in the Transformer](https://lena-voita.github.io/posts/emnlp19_evolution.html) by Lena Voita [Blog, 2019]\n* 📙 [Reformer: The Efficient Transformer](https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html) [Blog, 2020]\n* 📙 [Longformer — The Long-Document Transformer](https://medium.com/dair-ai/longformer-what-bert-should-have-been-78f4cd595be9) by Viktor Karlsson [Blog, 2020]\n* 📙 [TRANSFORMERS FROM SCRATCH](http://www.peterbloem.nl/blog/transformers) [Blog, 2019]\n* 📙 [Universal Transformers](https://mostafadehghani.com/2019/05/05/universal-transformers/) by Mostafa Dehghani [Blog, 2019]\n* 📙 [Transformers in Natural Language Processing — A Brief Survey](https://eigenfoo.xyz/transformers-in-nlp/) by George Ho [Blog, May 2020]\n* ⭐ [Lite Transformer](https://github.com/mit-han-lab/lite-transformer) - Lite Transformer with Long-Short Range Attention [GitHub, 550 stars]\n* 📙 [Transformers from Scratch](https://e2eml.school/transformers.html) [Blog, Oct 2021]\n\n#### BERT\n* 📙 [A Visual Guide to Using BERT for the First Time](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/) by Jay Alammar [Blog, 2019]\n* 📙 [The Dark Secrets of BERT](https://text-machine-lab.github.io/blog/2020/bert-secrets/) by Anna Rogers [Blog, 2020]\n* 📙 [Understanding searches better than ever before](https://www.blog.google/products/search/search-language-understanding-bert/) [Blog, 2019]\n* 📙 [Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework](https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/) [Blog, 2019]\n* ⭐ [SemBERT](https://github.com/cooelf/SemBERT) - Semantics-aware BERT for Language Understanding [GitHub, 278 stars]\n* ⭐ [BERTweet](https://github.com/VinAIResearch/BERTweet) - BERTweet: A pre-trained language model for English Tweets 
[GitHub, 487 stars]\n* ⭐ [Optimal Subarchitecture Extraction for BERT](https://github.com/alexa/bort) [GitHub, 461 stars]\n* ⭐ [CharacterBERT: Reconciling ELMo and BERT](https://github.com/helboukkouri/character-bert) [GitHub, 163 stars]\n* 📙 [When BERT Plays The Lottery, All Tickets Are Winning](https://thegradient.pub/when-bert-plays-the-lottery-all-tickets-are-winning/) [Blog, Dec 2020]\n* ⭐ [BERT-related Papers](https://github.com/tomohideshibata/BERT-related-papers) a list of BERT-related papers [GitHub, 1933 stars]\n\n#### Other Transformer Variants\n##### T5\n* 📙 [T5 Understanding Transformer-Based Self-Supervised Architectures](https://medium.com/@rojagtap/t5-text-to-text-transfer-transformer-643f89e8905e) [Blog, August 2020]\n* 📙 [T5: the Text-To-Text Transfer Transformer](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) [Blog, 2020]\n* ⭐ [multilingual-t5](https://github.com/google-research/multilingual-t5) - Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model [GitHub, 956 stars]\n##### BigBird\n* 📙 [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) original paper by Google Research [Paper, July 2020]\n##### Reformer / Linformer / Longformer / Performers\n* 🎥️ [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) - [Paper, February 2020] [[Video](https://www.youtube.com/watch?v=xJrKIPwVwGM), October 2020]\n* 🎥️ [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) - [Paper, April 2020] [[Video](https://www.youtube.com/watch?v=_8KNb5iqblE), April 2020]\n* 🎥️ [Linformer: Self-Attention with Linear Complexity](https://arxiv.org/abs/2006.04768) - [Paper, June 2020] [[Video](https://www.youtube.com/watch?v=-_2AF9Lhweo), June 2020]\n* 🎥️ [Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794) - [Paper, September 2020] [[Video](https://www.youtube.com/watch?v=0eTULzrOztQ), September 2020]\n* ⭐ 
[performer-pytorch](https://github.com/lucidrains/performer-pytorch) - An implementation of Performer, a linear attention-based transformer, in Pytorch [GitHub, 898 stars]\n\n##### Switch Transformer\n* 📙 [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961) original paper by Google Research [Paper, January 2021]\n\n#### GPT-family\n##### General\n* 📙 [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/) by Jay Alammar [Blog, 2019]\n* 📙 [The Annotated GPT-2](https://amaarora.github.io/2020/02/18/annotatedGPT2.html) by Aman Arora\n* 📙 [OpenAI’s GPT-2: the model, the hype, and the controversy](https://towardsdatascience.com/openais-gpt-2-the-model-the-hype-and-the-controversy-1109f4bfd5e8) by Ryan Lowe [Blog, 2019]\n* 📙 [How to generate text](https://huggingface.co/blog/how-to-generate) by Patrick von Platen [Blog, 2020]\n\n##### GPT-3\n###### Learning Resources\n* 📙 [Zero Shot Learning for Text Classification](https://amitness.com/2020/05/zero-shot-text-classification/) by Amit Chaudhary [Blog, 2020]\n* 📙 [GPT-3 A Brief Summary](https://leogao.dev/2020/05/29/GPT-3-A-Brief-Summary/) by Leo Gao [Blog, 2020]\n* 📙 [GPT-3, a Giant Step for Deep Learning And NLP](https://anotherdatum.com/gpt-3.html) by Yoel Zeldes [Blog, June 2020]\n* 📙 [GPT-3 Language Model: A Technical Overview](https://lambdalabs.com/blog/demystifying-gpt-3/) by Chuan Li [Blog, June 2020]\n* 📙 [Is it possible for language models to achieve language understanding?](https://medium.com/@ChrisGPotts/is-it-possible-for-language-models-to-achieve-language-understanding-81df45082ee2) by Christopher Potts\n###### Applications\n* ⭐ [Awesome GPT-3](https://github.com/elyase/awesome-gpt3) - list of all resources related to GPT-3 [GitHub, 3773 stars]\n* 🗂️ [GPT-3 Projects](https://airtable.com/shrndwzEx01al2jHM/tblYMAiGeDLXe35jC) - a map of all GPT-3 start-ups and commercial projects\n* 🗂️ [GPT-3 Demo Showcase](https://gpt3demo.com/) - GPT-3 Demo Showcase, 
180+ Apps, Examples, \u0026 Resources\n* 🔱 [OpenAI API](https://beta.openai.com) - API Demo to use GPT-3 for commercial applications\n###### Open-source Efforts\n* 📙 [GPT-Neo](https://eleuther.ai/projects/gpt-neo/) - in-progress GPT-3 open source replication [HuggingFace Hub](https://huggingface.co/EleutherAI)\n* ⭐ [GPT-J](https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b) - A 6 billion parameter, autoregressive text generation model trained on The Pile\n* 📙 [Effectively using GPT-J with few-shot learning](https://nlpcloud.io/effectively-using-gpt-j-gpt-neo-gpt-3-alternatives-few-shot-learning.html) [Blog, July 2021]\n\n#### Other\n* 📙 [What is Two-Stream Self-Attention in XLNet](https://towardsdatascience.com/what-is-two-stream-self-attention-in-xlnet-ebfe013a0cf3) by Xu LIANG [Blog, 2019]\n* 📙 [Visual Paper Summary: ALBERT (A Lite BERT)](https://amitness.com/2020/02/albert-visual-summary/) by Amit Chaudhary [Blog, 2020]\n* 📙 [Turing NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/) by Microsoft\n* 📙 [Multi-Label Text Classification with XLNet](https://towardsdatascience.com/multi-label-text-classification-with-xlnet-b5f5755302df) by Josh Xin Jie Lee [Blog, 2019]\n* ⭐ [ELECTRA](https://github.com/google-research/electra) [GitHub, 2095 stars]\n* ⭐ [Performer](https://github.com/lucidrains/performer-pytorch) implementation of Performer, a linear attention-based transformer, in Pytorch [GitHub, 898 stars]\n\n#### Distillation, Pruning and Quantization\n##### Reading Material\n* 📙 [Distilling knowledge from Neural Networks to build smaller and faster models](https://blog.floydhub.com/knowledge-distillation/) by FloydHub [Blog, 2019]\n* 📙 [Compression of Deep Learning Models for Text: A Survey](https://arxiv.org/abs/2008.05221) [Paper, April 2021]\n##### Tools\n* ⭐ [Bert-squeeze](https://github.com/JulesBelveze/bert-squeeze) - code to reduce the size of Transformer-based models or decrease 
their latency at inference time [GitHub, 65 stars]\n* ⭐ [XtremeDistil](https://github.com/microsoft/xtreme-distil-transformers) - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 122 stars]\n\n### Automated Summarization\n* 📙 [PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization](https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html) by Google AI [Blog, June 2020]\n* ⭐ [CTRLsum](https://github.com/salesforce/ctrl-sum) - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 128 stars]\n* ⭐ [XL-Sum](https://github.com/csebuetnlp/xl-sum) - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 186 stars]\n* ⭐ [SummerTime](https://github.com/Yale-LILY/SummerTime) - an open-source text summarization toolkit for non-experts [GitHub, 211 stars]\n* ⭐ [PRIMER](https://github.com/allenai/PRIMER) - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 107 stars]\n* ⭐ [summarus](https://github.com/IlyaGusev/summarus) - Models for automatic abstractive summarization [GitHub, 145 stars]\n\n### Knowledge Graphs and NLP\n* 📙 [Fusing Knowledge into Language Model](https://drive.google.com/file/d/1Zgijg9RPxF-tIGWU9nt9rBcryOIB4lOk/view) [Presentation, Oct 2021]\n\n\n![The-NLP-Industry](./Resources/Images/pandect_industry.png)\n-----\n\u003e __Note__\n\u003e Section keywords: best practices, MLOps\n \n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n\n### Best Practices for building NLP Projects\n* 🎥 [In Search of Best Practices for NLP Projects](https://www.youtube.com/watch?v=0S9iai4Ld4I) [[Slides](https://www.dropbox.com/s/4fymdzz4yh3mlyz/NLP_Best_Practices_Bilan.pdf?dl=0), Dec. 2020]\n* 🎥 [EMNLP 2020: High Performance Natural Language Processing](https://slideslive.com/38940826) by Google Research [[Recording](https://slideslive.com/38940826), Nov. 
2020]\n* 📙 [Practical Natural Language Processing](https://www.amazon.com/Practical-Natural-Language-Processing-Pragmatic/dp/1492054054) - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]\n* 📙 [How to Structure and Manage NLP Projects](https://neptune.ai/blog/how-to-structure-and-manage-nlp-projects-templates) [Blog, May 2021]\n* 📙 [Applied NLP Thinking](https://explosion.ai/blog/applied-nlp-thinking) - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]\n* 🎥 [Introduction to NLP for Industry Use](https://www.youtube.com/watch?v=VRur3xey31s) - DataTalksClub presentation on Introduction to NLP for Industry Use [Recording, December 2021]\n* 📙 [Measuring Embedding Drift](https://arize.com/blog/embedding-drift/) - Best practices for monitoring drift of NLP models [Blog, December 2022]\n\n### MLOps for NLP\nMLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.\n\nIn general, MLOps for NLP includes having the following processes in place:\n- **Data Versioning** - make sure your training, annotation and other types of data are versioned and tracked\n- **Experiment Tracking** - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced\n- **Model Registry** - make sure any neural models you train are versioned and tracked and it is easy to roll back to any of them\n- **Automated Testing and Behavioral Testing** - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks\n- **Model Deployment and Serving** - automate model deployment, ideally also with zero-downtime deploys like Blue/Green, Canary deploys etc.\n- **Data and Model Observability** - track data drift, model accuracy drift etc.\n\nAdditionally, there are two more components that are not as prevalent for NLP and are mostly 
used for Computer Vision and other sub-fields of AI:\n- **Feature Store** - centralized storage of all features developed for ML models that can be easily reused by any other ML project\n- **Metadata Management** - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.\n\n#### MLOps Compilations \u0026 Awesome Lists\n* ⭐ [awesome-mlops](https://github.com/visenger/awesome-mlops) [GitHub, 8929 stars]\n* ⭐ [best-of-ml-python](https://github.com/ml-tooling/best-of-ml-python) [GitHub, 12011 stars]\n* 🗂️ [MLOps.Toys](https://mlops.toys) - a curated list of MLOps projects\n\n#### Reading Material\n* 📙 [Machine Learning Operations (MLOps): Overview, Definition, and Architecture](https://arxiv.org/abs/2205.02302) [Paper, May 2022]\n* 📙 [Requirements and Reference Architecture for MLOps: Insights from Industry](https://www.techrxiv.org/articles/preprint/Requirements_and_Reference_Architecture_for_MLOps_Insights_from_Industry/21397413) [Paper, Oct 2022]\n* 📙 [MLOps: What It Is, Why it Matters, and How To Implement It](https://neptune.ai/blog/mlops-what-it-is-why-it-matters-and-how-to-implement-it-from-a-data-scientist-perspective) by Neptune AI [Blog, July 2021]\n* 📙 [Best MLOps Tools You Need to Know as a Data Scientist](https://neptune.ai/blog/best-mlops-tools) by Neptune AI [Blog, July 2021]\n* 📙 [Robust MLOps](https://blog.verta.ai/blog/robust-mlops-with-open-source-modeldb-docker-jenkins-and-prometheus) - Robust MLOps with Open-Source: ModelDB, Docker, Jenkins and Prometheus [Blog, May 2021]\n* 📙 [State of MLOps 2021](https://valohai.com/state-of-mlops/#introduction) by Valohai [Blog, August 2021]\n* 📙 [The MLOps Stack](https://valohai.com/blog/the-mlops-stack/) by Valohai [Blog, October 2020]\n* 📙 [Data Version Control for Machine Learning Applications](https://megagon.ai/blog/data-version-control-for-machine-learning-applications/) by Megagon AI [Blog, July 2021]\n* 📙 [The Rapid 
Evolution of the Canonical Stack for Machine Learning](https://medium.com/@ODSC/the-rapid-evolution-of-the-canonical-stack-for-machine-learning-21b37af9c3b5) [Blog, July 2021]\n* 📙 [MLOps: Comprehensive Beginner’s Guide](https://medium.com/sciforce/mlops-comprehensive-beginners-guide-c235c77f407f) [Blog, March 2021]\n* 📙 [What I’ve learned about MLOps from speaking with 100+ ML practitioners](https://veselinastaneva.medium.com/what-ive-learned-about-mlops-from-speaking-with-100-ml-practitioners-3025e33458ad) [Blog, May 2021]\n* 📙 [DataRobot Challenger Models](https://www.datarobot.com/blog/introducing-mlops-champion-challenger-models) - MLOps Champion/Challenger Models\n* 📙 [State of MLOps Blog](https://www.stateofmlops.com/) by Dr. Ori Cohen\n* 📙 [MLOps Ecosystem Overview](https://arize.com/wp-content/uploads/2021/04/Arize-AI-Ecosystem-White-Paper.pdf) [Blog, 2021]\n\n#### Learning Material\n* 🗂 [MLOps course](https://madewithml.com/#mlops) by Made With ML\n* 🗂 [GitHub MLOps](https://mlops.githubapp.com) - collection of resources on how to facilitate Machine Learning Ops with GitHub\n* 🗂 [ML Observability Fundamentals Course](https://arize.com/ml-observability-fundamentals/) - learn how to monitor and root-cause issues with production NLP models\n\n#### MLOps Communities\n* [The MLOps Community](https://mlops.community/) - blogs, Slack group, newsletter and more all about MLOps\n\n#### Data Versioning\n* ⭐ [DVC](https://dvc.org/) - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] [Link to GitHub](https://github.com/iterative/dvc)\n* 🔱 [Weights \u0026 Biases](https://wandb.ai/site) - tools for experiment tracking and dataset versioning [Paid Service]\n* 🔱 [Pachyderm](https://www.pachyderm.com/) - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]\n\n#### Experiment Tracking\n* ⭐ [mlflow](https://mlflow.org/) - open source platform for the machine learning lifecycle [Free 
and Open Source] [Link to GitHub](https://github.com/mlflow/mlflow/)\n* 🔱 [Weights \u0026 Biases](https://wandb.ai/site) - tools for experiment tracking and dataset versioning [Paid Service]\n* 🔱 [Neptune AI](https://neptune.ai/) - experiment tracking and model registry built for research and production teams [Paid Service]\n* 🔱 [Comet ML](https://www.comet.ml/site/) - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]\n* 🔱 [SigOpt](https://sigopt.com/) - automate training \u0026 tuning, visualize \u0026 compare runs [Paid Service]\n* ⭐ [Optuna](https://github.com/optuna/optuna) - hyperparameter optimization framework [GitHub, 7255 stars]\n* ⭐ [Clear ML](https://clear.ml/) - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] [Link to GitHub](https://github.com/allegroai/clearml/)\n* ⭐ [Metaflow](https://github.com/Netflix/metaflow) - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 6187 stars]\n\n#### Model Registry\n* ⭐ [DVC](https://dvc.org/) - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] [Link to GitHub](https://github.com/iterative/dvc)\n* ⭐ [mlflow](https://mlflow.org/) - open source platform for the machine learning lifecycle [Free and Open Source] [Link to GitHub](https://github.com/mlflow/mlflow/)\n* ⭐ [ModelDB](https://github.com/VertaAI/modeldb) - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1530 stars]\n* 🔱 [Neptune AI](https://neptune.ai/) - experiment tracking and model registry built for research and production teams [Paid Service]\n* 🔱 [Valohai](https://valohai.com/) - End-to-end ML pipelines [Paid Service]\n* 🔱 [Pachyderm](https://www.pachyderm.com/) - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]\n* 🔱 
[polyaxon](https://polyaxon.com/) - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]\n* 🔱 [Comet ML](https://www.comet.ml/site/) - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]\n\n#### Automated Testing and Behavioral Testing\n* ⭐ [CheckList](https://github.com/marcotcr/checklist) - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1806 stars]\n* ⭐ [TextAttack](https://github.com/QData/TextAttack) - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2161 stars]\n* ⭐ [WildNLP](https://github.com/MI2DataLab/WildNLP) - Corrupt an input text to test NLP models' robustness [GitHub, 74 stars]\n* ⭐ [Great Expectations](https://github.com/great-expectations/great_expectations) - Write tests for your data [GitHub, 7703 stars]\n* ⭐ [Deepchecks](https://github.com/deepchecks/deepchecks) - Python package for comprehensively validating your machine learning models and data [GitHub, 2254 stars]\n\n#### Model Deployability and Serving\n* ⭐ [mlflow](https://mlflow.org/) - open source platform for the machine learning lifecycle [Free and Open Source] [Link to GitHub](https://github.com/mlflow/mlflow/)\n* 🔱 [Amazon SageMaker](https://aws.amazon.com/de/sagemaker/) [Paid Service]\n* 🔱 [Valohai](https://valohai.com/) - End-to-end ML pipelines [Paid Service]\n* 🔱 [NLP Cloud](https://nlpcloud.io/) - Production-ready NLP API [Paid Service]\n* 🔱 [Saturn Cloud](https://saturncloud.io/) [Paid Service]\n* 🔱 [SELDON](https://www.seldon.io/tech/) - machine learning deployment for enterprise [Paid Service]\n* 🔱 [Comet ML](https://www.comet.ml/site/) - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]\n* 🔱 [polyaxon](https://polyaxon.com/) - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]\n* ⭐ 
[TorchServe](https://github.com/pytorch/serve) - flexible and easy to use tool for serving PyTorch models [GitHub, 3008 stars]\n* 🔱 [Kubeflow](https://www.kubeflow.org/) - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]\n* ⭐ [KFServing](https://github.com/kubeflow/kfserving) - Serverless Inferencing on Kubernetes [GitHub, 1841 stars]\n* 🔱 [TFX](https://www.tensorflow.org/tfx) - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Paid Service]\n* 🔱 [Pachyderm](https://www.pachyderm.com/) - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]\n* 🔱 [Cortex](https://www.cortex.dev/) - containers as a service on AWS [Paid Service]\n* 🔱 [Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning/#features) - end-to-end machine learning lifecycle [Paid Service]\n* ⭐ [End2End Serverless Transformers On AWS Lambda](https://github.com/bhavsarpratik/serverless-transformers-on-aws-lambda) [GitHub, 110 stars]\n* ⭐ [NLP-Service](https://github.com/karndeb/NLP-Service) - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]\n* 🔱 [Dagster](https://dagster.io/) - data orchestrator for machine learning [Free and Open Source]\n* 🔱 [Verta](https://www.verta.ai/) - AI and machine learning deployment and operations [Paid Service]\n* ⭐ [Metaflow](https://github.com/Netflix/metaflow) - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 6187 stars]\n* ⭐ [flyte](https://github.com/flyteorg/flyte) - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 2887 stars]\n* ⭐ [MLRun](https://github.com/mlrun/mlrun) - Machine Learning automation and tracking [GitHub, 856 stars]\n* 🔱 [DataRobot MLOps](https://www.datarobot.com/platform/mlops/) - DataRobot MLOps provides a center of excellence for your 
production AI\n\n#### Model Debugging\n* ⭐ [imodels](https://github.com/csinva/imodels) - package for concise, transparent, and accurate predictive modeling [GitHub, 971 stars]\n* ⭐ [Cockpit](https://github.com/f-dangel/cockpit) - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 416 stars]\n\n#### Model Accuracy Prediction\n* ⭐ [WeightWatcher](https://github.com/CalculatedContent/WeightWatcher) - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 1028 stars]\n\n#### Data and Model Observability\n\n##### General\n* ⭐ [Arize AI](https://arize.com/) - embedding drift monitoring for NLP models\n* ⭐ [Arize-Phoenix](https://phoenix.arize.com/) - ML observability for LLMs, vision, language, and tabular models\n* ⭐ [whylogs](https://github.com/whylabs/whylogs) - open source standard for data and ML logging [GitHub, 1907 stars]\n* ⭐ [Rubrix](https://github.com/recognai/rubrix) - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 1450 stars]\n* ⭐ [MLRun](https://github.com/mlrun/mlrun) - Machine Learning automation and tracking [GitHub, 856 stars]\n* 🔱 [DataRobot MLOps](https://www.datarobot.com/platform/mlops/) - DataRobot MLOps provides a center of excellence for your production AI\n* 🔱 [Cortex](https://www.cortex.dev/) - containers as a service on AWS [Paid Service]\n\n##### Model Centric\n* 🔱 [Algorithmia](https://algorithmia.com/) - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]\n* 🔱 [Dataiku](https://www.dataiku.com/) - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]\n* ⭐ [Evidently AI](https://evidentlyai.com/) - tools to analyze and monitor machine learning models [Free and Open Source] [Link to GitHub](https://github.com/evidentlyai/evidently)\n* 🔱 [Fiddler](https://www.fiddler.ai/) - ML Model Performance 
Management Tool [Paid Service]\n* 🔱 [Hydrosphere](https://hydrosphere.io/) - open-source platform for managing ML models [Paid Service]\n* 🔱 [Verta](https://www.verta.ai/) - AI and machine learning deployment and operations [Paid Service]\n* 🔱 [Domino Model Ops](https://www.dominodatalab.com/product/model-ops/) - Deploy and Manage Models to Drive Business Impact [Paid Service]\n* 🔱 [iguazio](https://www.iguazio.com/) - deployment and management of your AI applications with MLOps and end-to-end automation of machine learning pipelines [Paid Service]\n\n##### Data Centric\n* 🔱 [Datafold](https://www.datafold.com/) - data quality through diffs, profiling, and anomaly detection [Paid Service]\n* 🔱 [acceldata](https://www.acceldata.io/) - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]\n* 🔱 [Bigeye](https://www.bigeye.com/) - monitoring and alerting to your datasets in minutes [Paid Service]\n* 🔱 [datakin](https://datakin.com/product/) - end-to-end, real-time data lineage solution [Paid Service]\n* 🔱 [Monte Carlo](https://www.montecarlodata.com/) - data integrity, drifts, schema, lineage [Paid Service]\n* 🔱 [SODA](https://www.soda.io/) - data monitoring, testing and validation [Paid Service]\n* 🔱 [whatify](https://whatify.ai/) - data quality and action recommendation on it [Paid Service]\n\n#### Feature Stores\n* 🔱 [Tecton](https://www.tecton.ai/) - enterprise feature store for machine learning [Paid Service]\n* ⭐ [FEAST](https://github.com/feast-dev/feast) - open source feature store for machine learning [Website](https://feast.dev/) [GitHub, 3792 stars]\n* 🔱 [Hopsworks Feature Store](https://www.hopsworks.ai/feature-store) - data management system for managing machine learning features [Paid Service]\n\n#### Metadata Management\n* ⭐ [ML Metadata](https://github.com/google/ml-metadata) - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 500 stars]\n* 🔱 
[Neptune AI](https://neptune.ai/) - experiment tracking and model registry built for research and production teams [Paid Service]\n\n#### MLOps Frameworks\n* ⭐ [Metaflow](https://github.com/Netflix/metaflow) - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 6187 stars]\n* ⭐ [kedro](https://github.com/quantumblacklabs/kedro) - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 7865 stars]\n* ⭐ [Seldon Core](https://github.com/SeldonIO/seldon-core) - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 3503 stars]\n* ⭐ [ZenML](https://github.com/maiot-io/zenml) - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 2549 stars]\n* 🔱 [Google Vertex AI](https://cloud.google.com/vertex-ai) - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]\n* ⭐ [Diffgram](https://github.com/diffgram/diffgram) - Complete training data platform for machine learning delivered as a single application [GitHub, 1583 stars]\n* 🔱 [Continual.ai](https://continual.ai/) - build, deploy, and operationalize ML models easier and faster with a declarative interface on cloud data warehouses like Snowflake, BigQuery, RedShift, and Databricks. 
[Paid Service]\n\n### Transformer-based Architectures\n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n\n#### General\n* 📙 [Why BERT Fails in Commercial Environments](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/bert-commercial-environments.html) by Intel AI [Blog, 2020]\n* 📙 [Fine Tuning BERT for Text Classification with FARM](https://towardsdatascience.com/fine-tuning-bert-for-text-classification-with-farm-2880665065e2) by Sebastian Guggisberg [Blog, 2020]\n* ⭐ [Pretrain Transformers Models in PyTorch using Hugging Face Transformers](https://github.com/gmihaila/ml_things/blob/master/notebooks/pytorch/pretrain_transformers_pytorch.ipynb) [GitHub, 186 stars]\n* 🎥️ [Practical NLP for the Real World](https://www.infoq.com/presentations/practical-nlp/) [Presentation, 2019]\n* 🎥️ [From Paper to Product – How we implemented BERT](https://www.youtube.com/watch?v=VnmKDPBQjJk) by Christoph Henkelmann [Talk, 2020]\n\n##### Multi-GPU Transformers\n* ⭐ [Parallelformers: An Efficient Model Parallelization Toolkit for Deployment](https://github.com/tunib-ai/parallelformers) [GitHub, 548 stars]\n\n##### Training Transformers Effectively\n* ⭐ [Training BERT with Compute/Time (Academic) Budget](https://github.com/IntelLabs/academic-budget-bert) [GitHub, 256 stars]\n\n### Embeddings as a Service\n* ⭐ [embedding-as-service](https://github.com/amansrivastava17/embedding-as-service) [GitHub, 176 stars]\n* ⭐ [Bert-as-service](https://github.com/hanxiao/bert-as-service) [GitHub, 11035 stars]\n\n### NLP Recipes Industrial Applications:\n* ⭐ [NLP Recipes](https://github.com/microsoft/nlp-recipes) by [microsoft](https://github.com/microsoft) [GitHub, 6048 stars]\n* ⭐ [NLP with Python](https://github.com/susanli2016/NLP-with-Python) by [susanli2016](https://github.com/susanli2016) [GitHub, 2454 stars]\n* ⭐ [Basic Utilities for PyTorch NLP](https://github.com/PetrochukM/PyTorch-NLP) by 
[PetrochukM](https://github.com/PetrochukM) [GitHub, 2127 stars]\n\n### NLP Applications in Bio, Finance, Legal and other industries\n* ⭐ [Blackstone](https://github.com/ICLRandD/Blackstone) - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 573 stars]\n* ⭐ [Sci spaCy](https://github.com/allenai/scispacy) - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1279 stars]\n* ⭐ [FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks](https://github.com/psnonis/FinBERT) [GitHub, 165 stars]\n* ⭐ [LexNLP](https://github.com/LexPredict/lexpredict-lexnlp) - Information retrieval and extraction for real, unstructured legal text [GitHub, 555 stars]\n* ⭐ [NerDL and NerCRF](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/blogposts/data_prep.ipynb) - Tutorial on Named Entity Recognition for Healthcare with SparkNLP\n* ⭐ [Legal Text Analytics](https://github.com/Liquid-Legal-Institute/Legal-Text-Analytics) - A list of selected resources dedicated to Legal Text Analytics [GitHub, 410 stars]\n* ⭐ [BioIE](https://github.com/caufieldjh/awesome-bioie) - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 222 stars]\n\n\n\n![The-NLP-Speech](./Resources/Images/pandect_speech.png)\n-----\n\u003e __Note__\n\u003e Section keywords: speech recognition\n\n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n\n### General Speech Recognition\n* ⭐ [wav2letter](https://github.com/facebookresearch/wav2letter) - Automatic Speech Recognition Toolkit [GitHub, 6149 stars]\n* ⭐ [DeepSpeech](https://github.com/mozilla/DeepSpeech) - Baidu's DeepSpeech architecture [GitHub, 20639 stars]\n* 📙 [Acoustic Word Embeddings](https://medium.com/@maobedkova/acoustic-word-embeddings-fc3f1a8f0519) by Maria Obedkova [Blog, 2020]\n* ⭐ [kaldi](https://github.com/kaldi-asr/kaldi) - Kaldi is a toolkit for speech recognition [GitHub, 12177 stars]\n* ⭐ 
[awesome-kaldi](https://github.com/YoavRamon/awesome-kaldi) - resources for using Kaldi [GitHub, 510 stars]\n* ⭐ [ESPnet](https://github.com/espnet/espnet) - End-to-End Speech Processing Toolkit [GitHub, 5791 stars]\n* 📙 [HuBERT](https://ai.facebook.com/blog/hubert-self-supervised-representation-learning-for-speech-recognition-generation-and-compression) - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]\n\n### Text to Speech\n* ⭐ [FastSpeech](https://github.com/xcmyz/FastSpeech) - The Implementation of FastSpeech based on pytorch [GitHub, 746 stars]\n* ⭐ [TTS](https://github.com/coqui-ai/TTS) - a deep learning toolkit for Text-to-Speech [GitHub, 7214 stars]\n\n### Speech to Text\n* ⭐ [whisper](https://github.com/openai/whisper) - Robust Speech Recognition via Large-Scale Weak Supervision, by OpenAI [GitHub, 17097 stars]\n\n### Datasets\n* ⭐ [VoxPopuli](https://github.com/facebookresearch/voxpopuli) - large-scale multilingual speech corpus for representation learning [GitHub, 392 stars]\n\n![The-NLP-Topics](./Resources/Images/pandect_topics.png)\n-----\n\u003e __Note__\n\u003e Section keywords: topic modeling\n\n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n\n### Blogs\n* 📙 [Topic Modelling with PySpark and Spark NLP](https://medium.com/trustyou-engineering/topic-modelling-with-pyspark-and-spark-nlp-a99d063f1a6e) by Maria Obedkova [Spark, Blog, 2020]\n* 📙 [A Unique Approach to Short Text Clustering (Algorithmic Theory)](https://towardsdatascience.com/a-unique-approach-to-short-text-clustering-part-1-algorithmic-theory-4d4fad0882e1) by Brittany Bowers [Blog, 2020]\n\n### Frameworks for Topic Modeling\n* ⭐ [gensim](https://github.com/RaRe-Technologies/gensim) - framework for topic modeling [GitHub, 13760 stars]\n* ⭐ [Spark NLP](https://github.com/JohnSnowLabs/spark-nlp) [GitHub, 3018 stars]\n\n### Repositories\n* ⭐ 
[Top2Vec](https://github.com/ddangelov/Top2Vec) [GitHub, 2325 stars]\n* ⭐ [Anchored Correlation Explanation Topic Modeling](https://github.com/gregversteeg/CorEx) [GitHub, 289 stars]\n* ⭐ [Topic Modeling in Embedding Spaces](https://github.com/adjidieng/ETM) [GitHub, 480 stars] [Paper](https://arxiv.org/abs/1907.04907)\n* ⭐ [TopicNet](https://github.com/machine-intelligence-laboratory/TopicNet) - A high-level interface for BigARTM library [GitHub, 128 stars]\n* ⭐ [BERTopic](https://github.com/MaartenGr/BERTopic) - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 3426 stars]\n* ⭐ [OCTIS](https://github.com/MIND-Lab/OCTIS) - A Python package to optimize and evaluate topic models [GitHub, 457 stars]\n* ⭐ [Contextualized Topic Models](https://github.com/MilaNLProc/contextualized-topic-models) [GitHub, 968 stars]\n* ⭐ [GSDMM](https://github.com/rwalk/gsdmm) - GSDMM: Short text clustering [GitHub, 305 stars]\n\n![Keyword-Extraction](./Resources/Images/pandect_papyrus2.png)\n-----\n\u003e __Note__\n\u003e Section keywords: keyword extraction\n\n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n\n### Text Rank\n* ⭐ [PyTextRank](https://github.com/DerwenAI/pytextrank) - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 1933 stars]\n* ⭐ [textrank](https://github.com/summanlp/textrank) - TextRank implementation for Python 3 [GitHub, 1158 stars]\n\n### RAKE - Rapid Automatic Keyword Extraction\n* ⭐ [rake-nltk](https://github.com/csurfer/rake-nltk) - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 956 stars]\n* ⭐ [yake](https://github.com/LIAAD/yake) - Single-document unsupervised keyword extraction [GitHub, 1238 stars]\n* ⭐ [RAKE-tutorial](https://github.com/zelandiya/RAKE-tutorial) - A Python implementation of the Rapid Automatic Keyword Extraction [GitHub, 369 stars]
\n\n### Other Approaches\n* ⭐ [flashtext](https://github.com/vi3k6i5/flashtext) - Extract keywords from sentences or replace keywords in sentences [GitHub, 5318 stars]\n* ⭐ [BERT-Keyword-Extractor](https://github.com/ibatra/BERT-Keyword-Extractor) - Deep Keyphrase Extraction using BERT [GitHub, 225 stars]\n* ⭐ [keyBERT](https://github.com/MaartenGr/KeyBERT) - Minimal keyword extraction with BERT [GitHub, 1998 stars]\n* ⭐ [KeyphraseVectorizers](https://github.com/TimSchopf/KeyphraseVectorizers) - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 117 stars]\n\n### Further Reading\n* 📙 [Adding a custom tokenizer to spaCy and extracting keywords from Chinese texts](https://howard-haowen.github.io/blog.ai/keyword-extraction/spacy/textacy/ckip-transformers/jieba/textrank/rake/2021/02/16/Adding-a-custom-tokenizer-to-spaCy-and-extracting-keywords.html) by Haowen Jiang [Blog, Feb 2021]\n* 📙 [How to Extract Relevant Keywords with KeyBERT](https://towardsdatascience.com/how-to-extract-relevant-keywords-with-keybert-6e7b3cf889ae) [Blog, June 2021]\n\n![Responsible-NLP](./Resources/Images/pandect_pegasus.png)\n-----\n\u003e __Note__\n\u003e Section keywords: ethics, responsible NLP\n\n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n\n### NLP and ML Interpretability\n\n#### NLP-centric\n* [Explainability for Natural Language Processing - KDD'2021 Tutorial](https://www.youtube.com/watch?v=PvKOSYGclPk\u0026t=2s) [Slides](https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249992241) [Presentation, August 2021]\n* ⭐ [ecco](https://github.com/jalammar/ecco) - Tools to visualize and explore NLP language models [GitHub, 1548 stars]\n* ⭐ [NLP Profiler](https://github.com/neomatrix369/nlp_profiler) - A simple NLP library that allows profiling datasets with text columns [GitHub, 223 stars]\n* ⭐ 
[transformers-interpret](https://github.com/cdpierse/transformers-interpret) - Model explainability that works seamlessly with transformers [GitHub, 905 stars]\n* ⭐ [Awesome-explainable-AI](https://github.com/wangyongjie-ntu/Awesome-explainable-AI) - collection of research materials on explainable AI/ML [GitHub, 780 stars]\n* ⭐ [LAMA](https://github.com/facebookresearch/LAMA) - LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models [GitHub, 956 stars]\n\n#### General\n* ⭐ [Language Interpretability Tool (LIT)](https://github.com/PAIR-code/lit) [GitHub, 3020 stars]\n* ⭐ [WhatLies](https://github.com/RasaHQ/whatlies) - Toolkit to help visualise - what lies in word embeddings [GitHub, 435 stars]\n* ⭐ [Interpret-Text](https://github.com/interpretml/interpret-text) - Interpretability techniques and visualization dashboards for NLP models [GitHub, 340 stars]\n* ⭐ [InterpretML](https://github.com/interpretml/interpret) - Fit interpretable models. 
Explain blackbox machine learning [GitHub, 5155 stars]\n* ⭐ [thermostat](https://github.com/DFKI-NLP/thermostat) - Collection of NLP model explanations and accompanying analysis tools [GitHub, 126 stars]\n* ⭐ [Dodrio](https://github.com/poloclub/dodrio) - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 245 stars]\n* ⭐ [imodels](https://github.com/csinva/imodels) - package for concise, transparent, and accurate predictive modeling [GitHub, 971 stars]\n\n### Ethics, Bias, and Equality in NLP\n* 📙 [Bias in Natural Language Processing @EMNLP 2020](https://gaurav-maheshwari.medium.com/bias-in-natural-language-processing-emnlp-2020-8f1cb2806fcc#cc1a) [Blog, Nov 2020]\n* 🎥️ [Machine Learning as a Software Engineering Enterprise](https://nips.cc/virtual/2020/public/invited_16166.html) - NeurIPS 2020 Keynote [Presentation, Dec 2020]\n* 📙 [Computational Ethics for NLP](http://demo.clab.cs.cmu.edu/ethical_nlp/) - course resources from the Carnegie Mellon University [Lecture Notes, Spring 2020]\n* 🗂️ [Ethics in NLP](https://aclweb.org/aclwiki/Ethics_in_NLP) - resources from ACLs Ethics in NLP track\n* 🗂️ [The Institute for Ethical AI \u0026 Machine Learning](https://ethical.institute)\n* 📙 [Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models](https://arxiv.org/abs/2102.02503) [Paper, Feb 2021]\n* ⭐ [Fairness-in-AI](https://github.com/dreji18/Fairness-in-AI) - this package is used to detect and mitigate biases in NLP tasks [GitHub, 24 stars]\n* ⭐ [nlg-bias](https://github.com/ewsheng/nlg-bias) - dataset + classifier tools to study social perception biases in natural language generation [GitHub, 46 stars]\n* 🗂️ [bias-in-nlp](https://github.com/cisnlp/bias-in-nlp) - list of papers related to bias in NLP [GitHub, 9 stars]\n\n### Adversarial Attacks for NLP\n* 📙 [Privacy Considerations in Large Language Models](https://ai.googleblog.com/2020/12/privacy-considerations-in-large.html?m=1) [Blog, Dec 
2020]\n* ⭐ [DeepWordBug](https://github.com/QData/deepWordBug) - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 57 stars]\n* ⭐ [Adversarial-Misspellings](https://github.com/danishpruthi/Adversarial-Misspellings) - Combating Adversarial Misspellings with Robust Word Recognition [GitHub, 57 stars]\n\n### Hate Speech Analysis\n* ⭐ [HateXplain](https://github.com/hate-alert/HateXplain) - BERT for detecting abusive language [GitHub, 135 stars]\n\n![The-NLP-Frameworks](./Resources/Images/pandect_frameworks.png)\n-----\n\u003e __Note__\n\u003e Section keywords: frameworks\n\n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n\n### General Purpose\n* ⭐ [spaCy](https://github.com/explosion/spaCy) by Explosion AI [GitHub, 24708 stars]\n* ⭐ [flair](https://github.com/flairNLP/flair) by Zalando [GitHub, 12278 stars]\n* ⭐ [AllenNLP](https://github.com/allenai/allennlp) by AI2 [GitHub, 11314 stars]\n* ⭐ [stanza](https://github.com/stanfordnlp/stanza) (former Stanford NLP) [GitHub, 6413 stars]\n* ⭐ [spaCy stanza](https://github.com/explosion/spacy-stanza) [GitHub, 660 stars]\n* ⭐ [nltk](https://github.com/nltk/nltk) [GitHub, 11280 stars]\n* ⭐ [gensim](https://github.com/RaRe-Technologies/gensim) - framework for topic modeling [GitHub, 13760 stars]\n* ⭐ [pororo](https://github.com/kakaobrain/pororo) - Platform of neural models for natural language processing [GitHub, 1164 stars]\n* ⭐ [NLP Architect](https://github.com/NervanaSystems/nlp-architect) - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2883 stars]\n* ⭐ [FARM](https://github.com/deepset-ai/FARM) [GitHub, 1597 stars]\n* ⭐ [gobbli](https://github.com/RTIInternational/gobbli) by RTI International [GitHub, 268 stars]\n* ⭐ [headliner](https://github.com/as-ideas/headliner) - training and deployment of seq2seq models [GitHub, 231 stars]\n* ⭐ [SyferText](https://github.com/OpenMined/SyferText) - A privacy preserving NLP framework 
[GitHub, 190 stars]\n* ⭐ [DeText](https://github.com/linkedin/detext) - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1230 stars]\n* ⭐ [TextHero](https://github.com/jbesomi/texthero) - Text preprocessing, representation and visualization [GitHub, 2635 stars]\n* ⭐ [textblob](https://github.com/sloria/textblob) - TextBlob: Simplified Text Processing [GitHub, 8373 stars]\n* ⭐ [AdaptNLP](https://github.com/Novetta/adaptnlp) - A high level framework and library for NLP [GitHub, 407 stars]\n* ⭐ [textacy](https://github.com/chartbeat-labs/textacy) - NLP, before and after spaCy [GitHub, 1999 stars]\n* ⭐ [texar](https://github.com/asyml/texar) - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2323 stars]\n* ⭐ [jiant](https://github.com/nyu-mll/jiant) - jiant is an NLP toolkit [GitHub, 1449 stars]\n\n### Data Augmentation\n* ⭐ [WildNLP](https://github.com/MI2DataLab/WildNLP) Text manipulation library to test NLP models [GitHub, 74 stars]\n* ⭐ [snorkel](https://github.com/snorkel-team/snorkel) Framework to generate training data [GitHub, 5338 stars]\n* ⭐ [NLPAug](https://github.com/makcedward/nlpaug) Data augmentation for NLP [GitHub, 3665 stars]\n* ⭐ [SentAugment](https://github.com/facebookresearch/SentAugment) Data augmentation by retrieving similar sentences from larger datasets [GitHub, 361 stars]\n* ⭐ [faker](https://github.com/joke2k/faker) - Python package that generates fake data for you [GitHub, 15129 stars]\n* ⭐ [textflint](https://github.com/textflint/textflint) - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 585 stars]\n* ⭐ [Parrot](https://github.com/PrithivirajDamodaran/Parrot_Paraphraser) - Practical and feature-rich paraphrasing framework [GitHub, 636 stars]\n* ⭐ [AugLy](https://github.com/facebookresearch/AugLy) - data augmentations library for audio, image, text, and video [GitHub, 4616 stars]\n* ⭐ [TextAugment](https://github.com/dsfsi/textaugment) - 
Python 3 library for augmenting text for natural language processing applications [GitHub, 290 stars]\n\n### Adversarial NLP Attacks \u0026 Behavioral Testing\n* ⭐ [TextAttack](https://github.com/QData/TextAttack) - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2161 stars]\n* ⭐ [CleverHans](https://github.com/tensorflow/cleverhans) - adversarial example library for constructing NLP attacks and building defenses [GitHub, 5660 stars]\n* ⭐ [CheckList](https://github.com/marcotcr/checklist) - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1806 stars]\n\n### Transformer-oriented\n* ⭐ [transformers](https://github.com/huggingface/transformers) by HuggingFace [GitHub, 75428 stars]\n* ⭐ [Adapter Hub](https://github.com/Adapter-Hub/adapter-transformers) and its [documentation](https://docs.adapterhub.ml/index.html) - Adapter modules for Transformers [GitHub, 1110 stars]\n* ⭐ [haystack](https://github.com/deepset-ai/haystack) - Transformers at scale for question answering \u0026 neural search. 
[GitHub, 6147 stars]\n\n### Dialog Systems and Speech\n* ⭐ [DeepPavlov](https://github.com/deepmipt/DeepPavlov) by MIPT [GitHub, 5933 stars]\n* ⭐ [ParlAI](https://github.com/facebookresearch/ParlAI) by FAIR [GitHub, 9640 stars]\n* ⭐ [rasa](https://github.com/RasaHQ/rasa) - Framework for Conversational Agents [GitHub, 15150 stars]\n* ⭐ [wav2letter](https://github.com/facebookresearch/wav2letter) - Automatic Speech Recognition Toolkit [GitHub, 6149 stars]\n* ⭐ [ChatterBot](https://github.com/gunthercox/ChatterBot) - conversational dialog engine for creating chat bots [GitHub, 12696 stars]\n* ⭐ [SpeechBrain](https://github.com/speechbrain/speechbrain) - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 4935 stars]\n\n### Word/Sentence-embeddings oriented\n* ⭐ [MUSE](https://github.com/facebookresearch/MUSE) A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 3021 stars]\n* ⭐ [vecmap](https://github.com/artetxem/vecmap) A framework to learn cross-lingual word embedding mappings [GitHub, 604 stars]\n* ⭐ [sentence-transformers](https://github.com/UKPLab/sentence-transformers) - Multilingual Sentence \u0026 Image Embeddings with BERT [GitHub, 8944 stars]\n\n### Social Media Oriented\n* ⭐ [Ekphrasis](https://github.com/cbaziotis/ekphrasis) - text processing tool, geared towards text from social networks [GitHub, 592 stars]\n\n### Phonetics\n* ⭐ [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer) - grapheme to phoneme conversion with deep learning [GitHub, 197 stars]\n\n### Morphology\n* ⭐ [LemmInflect](https://github.com/bjascob/LemmInflect) - python module for English lemmatization and inflection [GitHub, 186 stars]\n* ⭐ [Inflect](https://github.com/jaraco/inflect) - generate plurals, ordinals, indefinite articles [GitHub, 757 stars]\n* ⭐ [simplemma](https://github.com/adbar/simplemma) - simple multilingual lemmatizer for Python [GitHub, 757 stars]\n\n### Multi-lingual tools\n* ⭐ 
[polyglot](https://github.com/aboSamoor/polyglot) - Multi-lingual NLP Framework [GitHub, 2086 stars]\n* ⭐ [trankit](https://github.com/nlp-uoregon/trankit) - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 649 stars]\n\n### Distributed NLP / Multi-GPU NLP\n* ⭐ [Spark NLP](https://github.com/JohnSnowLabs/spark-nlp) [GitHub, 3018 stars]\n* ⭐ [Parallelformers: An Efficient Model Parallelization Toolkit for Deployment](https://github.com/tunib-ai/parallelformers) [GitHub, 548 stars]\n\n### Machine Translation\n* ⭐ [COMET](https://github.com/Unbabel/COMET) - A Neural Framework for MT Evaluation [GitHub, 191 stars]\n* ⭐ [marian-nmt](https://github.com/marian-nmt/marian) - Fast Neural Machine Translation in C++ [GitHub, 974 stars]\n* ⭐ [argos-translate](https://github.com/argosopentech/argos-translate) - Open source neural machine translation in Python [GitHub, 1535 stars]\n* ⭐ [Opus-MT](https://github.com/Helsinki-NLP/Opus-MT) - Open neural machine translation models and web services [GitHub, 257 stars]\n* ⭐ [dl-translate](https://github.com/xhlulu/dl-translate) - A deep learning-based translation library built on Huggingface transformers [GitHub, 241 stars]\n\n### Entity and String Matching\n* ⭐ [PolyFuzz](https://github.com/MaartenGr/PolyFuzz) - Fuzzy string matching, grouping, and evaluation [GitHub, 589 stars]\n* ⭐ [pyahocorasick](https://github.com/WojciechMula/pyahocorasick) - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 757 stars]\n* ⭐ [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy String Matching in Python [GitHub, 8776 stars]\n* ⭐ [jellyfish](https://github.com/jamesturk/jellyfish) - approximate and phonetic matching of strings [GitHub, 1759 stars]\n* ⭐ [textdistance](https://github.com/life4/textdistance) - Compute distance between sequences [GitHub, 3000 stars]\n* ⭐ [DeepMatcher](https://github.com/anhaidgroup/deepmatcher) - Python package for entity and text matching using deep learning [GitHub, 457 stars]\n* ⭐ 
[RE2](https://github.com/alibaba-edu/simple-effective-text-matching) - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 336 stars]\n* ⭐ [Machamp](https://github.com/megagonlabs/machamp) - Machamp: A Generalized Entity Matching Benchmark [GitHub, 9 stars]\n\n### Discourse Analysis\n* ⭐ [ConvoKit](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit) - Cornell Conversational Analysis Toolkit [GitHub, 399 stars]\n\n### PII scrubbing\n* ⭐ [scrubadub](https://github.com/LeapBeyond/scrubadub) - Clean personally identifiable information from dirty dirty text [GitHub, 309 stars]\n\n### Hashtag Segmentation\n* ⭐ [hashformers](https://github.com/ruanchaves/hashformers) - automatically inserting the missing spaces between the words in a hashtag [GitHub, 41 stars]\n\n### Books Analysis / Literary Analysis\n* ⭐ [booknlp](https://github.com/booknlp/booknlp) - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 647 stars]\n* ⭐ [bookworm](https://github.com/harrisonpim/bookworm) - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 73 stars]\n\n### Non-English oriented\n#### Japanese\n* ⭐ [fugashi](https://github.com/polm/fugashi) - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 268 stars]\n* ⭐ [SudachiPy](https://github.com/WorksApplications/SudachiPy) - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 330 stars]\n* ⭐ [Konoha](https://github.com/himkt/konoha) - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 182 stars]\n* ⭐ [jProcessing](https://github.com/kevincobain2000/jProcessing) - Japanese Natural Language Processing Libraries [GitHub, 142 stars]\n* ⭐ [Ginza](https://github.com/megagonlabs/ginza) - Japanese NLP Library using spaCy as framework based on Universal Dependencies 
[GitHub, 620 stars]\n* ⭐ [kuromoji](https://github.com/atilika/kuromoji) - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 847 stars]\n* ⭐ [nagisa](https://github.com/taishi-i/nagisa) - Japanese tokenizer based on recurrent neural networks [GitHub, 321 stars]\n* ⭐ [KyTea](https://github.com/neubig/kytea) - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 190 stars]\n* ⭐ [Jigg](https://github.com/mynlp/jigg) - Pipeline framework for easy natural language processing [GitHub, 72 stars]\n* ⭐ [Juman++](https://github.com/ku-nlp/jumanpp) - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 321 stars]\n* ⭐ [RakutenMA](https://github.com/rakuten-nlp/rakutenma) - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 447 stars]\n* ⭐ [toiro](https://github.com/taishi-i/toiro) - a comparison tool of Japanese tokenizers [GitHub, 105 stars]\n\n#### Thai\n* ⭐ [AttaCut](https://github.com/PyThaiNLP/attacut) - Fast and Reasonably Accurate Word Tokenizer for Thai [GitHub, 68 stars] \n* ⭐ [ThaiLMCut](https://github.com/meanna/ThaiLMCUT) - Word Tokenizer for Thai Language [GitHub, 15 stars] \n\n#### Chinese\n* ⭐ [Spacy-pkuseg](https://github.com/explosion/spacy-pkuseg) - The pkuseg toolkit for multi-domain Chinese word segmentation [GitHub, 20 stars] \n\n#### Other\n* ⭐ [textblob-de](https://github.com/markuskiller/textblob-de) - TextBlob: Simplified Text Processing for German [GitHub, 95 stars]\n* ⭐ [Kashgari](https://github.com/BrikerMan/Kashgari) Transfer Learning with focus on Chinese [GitHub, 2333 stars]\n* ⭐ [Underthesea](https://github.com/undertheseanlp/underthesea) - Vietnamese NLP Toolkit [GitHub, 1057 stars]\n* ⭐ [PTT5](https://github.com/unicamp-dl/PTT5) - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 62 stars]\n\n### Text Data Labelling\n* ⭐ 
[Small-Text](https://github.com/webis-de/small-text) - Active Learning for Text Classification in Python [GitHub, 369 stars]\n* ⭐ [Doccano](https://github.com/doccano/doccano) - open source annotation tool for machine learning practitioners [GitHub, 7005 stars]\n* 🔱 [Prodigy](https://prodi.gy/) - annotation tool powered by active learning [Paid Service]\n\n![The-NLP-Learning](./Resources/Images/pandect_learning.png)\n-----\n\u003e __Note__\n\u003e Section keywords: learn NLP\n\n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n\n#### General\n* 📙 [Learn NLP the practical way](https://towardsdatascience.com/learn-nlp-the-practical-way-b854ce1035c4) [Blog, Nov. 2019]\n* 📙 [Learn NLP the Stanford way](https://towardsdatascience.com/learn-nlp-the-stanford-way-lesson-1-3f1844265760) ([+Part 2](https://towardsdatascience.com/learn-nlp-the-stanford-way-lesson-2-7447f2c12b36)) [Blog, Nov 2020]\n* 📙 [Choosing the right course for a Practical NLP Engineer](https://airev.us/ultimate-guide-to-natural-language-processing-courses/)\n* 📙 [12 Best Natural Language Processing Courses \u0026 Tutorials to Learn Online](https://blog.coursesity.com/best-natural-language-processing-courses/)\n* ⭐ [Treasure of Transformers](https://github.com/ashishpatel26/Treasure-of-Transformers) - Natural Language processing papers, videos, blogs, official repos along with colab Notebooks [GitHub, 563 stars]\n* 🎥️ [Rasa Algorithm Whiteboard](https://www.youtube.com/playlist?list=PL75e0qA87dlG-za8eLI6t0_Pbxafk-cxb) - YouTube series by Rasa explaining various Data Science and NLP Algorithms\n* 🎥️ [ExplosionAI Videos](https://www.youtube.com/c/ExplosionAI/videos) - YouTube series by ExplosionAI teaching you how to use spacy and apply it for NLP\n\n#### Courses\n* 🎥️ [CS25: Transformers United Stanford - Fall 2021](https://web.stanford.edu/class/cs25/) [Course, Fall 2021]\n* 📙 [NLP Course | For You](https://lena-voita.github.io/nlp_course.html) - Great and 
interactive course on NLP\n* 📙 [OpenClass NLP](https://openclass.ai/catalog/nlp) - Natural language processing (NLP) assignments\n* 📙 [Advanced NLP with spaCy](https://course.spacy.io/en/) - how to use spaCy to build advanced natural language understanding systems\n* 📙 [Transformer models for NLP](https://huggingface.co/course/chapter1) by HuggingFace\n* 🎥️ [Stanford NLP Seminar](https://nlp.stanford.edu/seminar/) - slides from the Stanford NLP course\n\n#### Books\n* 📙 [Natural Language Processing with Transformers](https://www.buecher.de/shop/maschinelles-lernen/natural-language-processing-with-transformers/tunstall-lewis-von-werra-leandro-wolf-thomas/products_products/detail/prod_id/64140211/) - [Book, February 2022]\n* 📙 [Applied Natural Language Processing in the Enterprise](https://www.oreilly.com/library/view/applied-natural-language/9781492062561/) - [Book, May 2021]\n* 📙 [Practical Natural Language Processing](https://www.oreilly.com/library/view/practical-natural-language/9781492054047/) - [Book, June 2020]\n* 📙 [Dive into Deep Learning](https://d2l.ai/index.html) - An interactive deep learning book with code, math, and discussions\n* 📙 [Natural Language Processing and Computational Linguistics](https://www.amazon.de/Natural-Language-Processing-Computational-Linguistics/dp/1848218486) - Speech, Morphology and Syntax (Cognitive Science)\n* 📙 [Top NLP Books to Read 2020](https://towardsdatascience.com/top-nlp-books-to-read-2020-12012ef41dc1) - Blog post by Raymong Cheng [Blog, Sep 2020]\n\n#### Tutorials\n* ⭐ [nlp-tutorial](https://github.com/lyeoni/nlp-tutorial) - A list of NLP(Natural Language Processing) tutorials built on PyTorch [GitHub, 1324 stars]\n* ⭐ [nlp-tutorial](https://github.com/graykode/nlp-tutorial) - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 11796 stars]\n* ⭐ [Hands-On NLTK Tutorial](https://github.com/hb20007/hands-on-nltk-tutorial) [GitHub, 506 stars]\n* ⭐ [Modern Practical Natural Language 
Processing](https://github.com/jmugan/modern_practical_nlp) [GitHub, 260 stars]\n* ⭐ [Transformers-Tutorials](https://github.com/NielsRogge/Transformers-Tutorials) - demos with the Transformers library by HuggingFace [GitHub, 3408 stars]\n* 🗂️ [CalmCode Tutorials](https://calmcode.io/#science) - Set of Python Data Science Tutorials\n\n![The-NLP-Communities](./Resources/Images/pandect_communities.png)\n-----\n* [r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/) - NLP Reddit forum\n\n![Other-NLP-Topics](Resources/Images/pandect_papyrus_other.png)\n-----\n\n[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)\n\n#### Tokenization\n* ⭐ [tokenizers](https://github.com/huggingface/tokenizers) - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 6064 stars]\n* ⭐ [SentencePiece](https://github.com/google/sentencepiece) - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 6316 stars]\n* ⭐ [SoMaJo](https://github.com/tsproisl/SoMaJo) - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 108 stars]\n\n#### Data Augmentation and Weak Supervision\n##### Libraries and Frameworks\n* ⭐ [WildNLP](https://github.com/MI2DataLab/WildNLP) Text manipulation library to test NLP models [GitHub, 74 stars]\n* ⭐ [NLPAug](https://github.com/makcedward/nlpaug) Data augmentation for NLP [GitHub, 3665 stars]\n* ⭐ [SentAugment](https://github.com/facebookresearch/SentAugment) Data augmentation by retrieving similar sentences from larger datasets [GitHub, 361 stars]\n* ⭐ [TextAttack](https://github.com/QData/TextAttack) - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2161 stars]\n* ⭐ [skweak](https://github.com/NorskRegnesentral/skweak) - software toolkit for weak supervision applied to NLP tasks [GitHub, 843 stars]\n* ⭐ [NL-Augmenter](https://github.com/GEM-benchmark/NL-Augmenter) - Collaborative 
Repository of Natural Language Transformations [GitHub, 679 stars]\n* ⭐ [EDA](https://github.com/jasonwei20/eda_nlp) - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1356 stars]\n* ⭐ [snorkel](https://github.com/snorkel-team/snorkel) Framework to generate training data [GitHub, 5338 stars]\n\n##### Reading Material and Tutorials\n* ⭐ [A Survey of Data Augmentation Approaches for NLP](https://arxiv.org/abs/2105.03075) [Paper, May 2021] [GitHub Link](https://github.com/styfeng/DataAug4NLP)\n* 📙 [A Visual Survey of Data Augmentation in NLP](https://amitness.com/2020/05/data-augmentation-for-nlp/) [Blog, 2020]\n* 📙 [Weak Supervision: A New Programming Paradigm for Machine Learning](http://ai.stanford.edu/blog/weak-supervision/) [Blog, March 2019]\n\n#### Named Entity Recognition (NER)\n* ⭐ [Datasets for Entity Recognition](https://github.com/juand-r/entity-recognition-datasets) [GitHub, 1255 stars]\n* ⭐ [Datasets to train supervised classifiers for Named-Entity Recognition](https://github.com/davidsbatista/NER-datasets) [GitHub, 297 stars]\n* ⭐ [Bootleg](https://github.com/HazyResearch/bootleg) - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 189 stars]\n* ⭐ [Few-NERD](https://github.com/thunlp/Few-NERD) - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 318 stars]\n\n#### Relation Extraction\n* ⭐ [tacred-relation](https://github.com/yuhaozhang/tacred-relation) TACRED: position-aware attention model for relation extraction [GitHub, 336 stars]\n* ⭐ [tacrev](https://github.com/DFKI-NLP/tacrev) TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 55 stars]\n* ⭐ [tac-self-attention](https://github.com/ivan-bilan/tac-self-attention) Relation extraction with position-aware self-attention [GitHub, 64 stars]\n* ⭐ [Re-TACRED](https://github.com/gstoica27/Re-TACRED) Re-TACRED: Addressing Shortcomings of the TACRED Dataset [GitHub, 39 
stars]\n\n#### Coreference Resolution\n* ⭐ [NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks](https://github.com/huggingface/neuralcoref) by HuggingFace [GitHub, 2627 stars]\n* ⭐ [coref](https://github.com/mandarjoshi90/coref) - BERT and SpanBERT for Coreference Resolution [GitHub, 399 stars]\n\n#### Sentiment Analysis\n* ⭐ [Reading list for Awesome Sentiment Analysis papers](https://github.com/declare-lab/awesome-sentiment-analysis) by [declare-lab](https://github.com/declare-lab) [GitHub, 475 stars]\n* ⭐ [Awesome Sentiment Analysis](https://github.com/xiamx/awesome-sentiment-analysis) by [xiamx](https://github.com/xiamx) [GitHub, 884 stars]\n\n#### Domain Adaptation\n* ⭐ [Neural Adaptation in Natural Language Processing - curated list](https://github.com/bplank/awesome-neural-adaptation-in-NLP) [GitHub, 246 stars]\n\n#### Low Resource NLP\n* ⭐ [CMU LTI Low Resource NLP Bootcamp 2020](https://github.com/neubig/lowresource-nlp-bootcamp-2020) - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 555 stars]\n\n#### Spell Correction / Error Correction\n* ⭐ [Gramformer](https://github.com/PrithivirajDamodaran/Gramformer) - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1244 stars]\n* ⭐ [NeuSpell](https://github.com/neuspell/neuspell) - A Neural Spelling Correction Toolkit [GitHub, 515 stars]\n* ⭐ [SymSpellPy](https://github.com/mammothb/symspellpy) - Python port of SymSpell [GitHub, 641 stars]\n* 📙 [Speller100](https://www.microsoft.com/en-us/research/blog/speller100-zero-shot-spelling-correction-at-scale-for-100-plus-languages/) by Microsoft [Blog, Feb 2021]\n* ⭐ [JamSpell](https://github.com/bakwc/JamSpell) - spell checking library - accurate, fast, multi-language [GitHub, 527 stars]\n* ⭐ [pycorrector](https://github.com/shibing624/pycorrector) - spell correction for Chinese [GitHub, 3714 stars]\n* ⭐ [contractions](https://github.com/kootenpv/contractions) - Fixes contractions such as 
`you're` to `you are` [GitHub, 262 stars]\n\n#### Style Transfer for NLP\n* ⭐ [Styleformer](https://github.com/PrithivirajDamodaran/Styleformer) - Neural Language Style Transfer framework [GitHub, 427 stars]\n* ⭐ [StylePTB](https://github.com/lvyiwei1/StylePTB) - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 51 stars]\n\n#### Automata Theory for NLP\n* ⭐ [pyahocorasick](https://github.com/WojciechMula/pyahocorasick) - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 757 stars]\n\n#### Obscene words detection\n* ⭐ [LDNOOBW](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 1988 stars]\n\n#### Reddit Analysis\n* ⭐ [Subreddit Analyzer](https://github.com/PhantomInsights/subreddit-analyzer) - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 483 stars]\n\n#### Skill Detection\n* ⭐ [SkillNER](https://github.com/AnasAito/SkillNER) - rule based NLP module to extract job skills from text [GitHub, 71 stars]\n\n#### Reinforcement Learning for NLP\n* ⭐ [nlp-gym](https://github.com/rajcscw/nlp-gym) - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 132 stars]\n\n#### AutoML / AutoNLP\n* ⭐ [AutoNLP](https://github.com/huggingface/autonlp) - Faster and easier training and deployments of SOTA NLP models [GitHub, 689 stars]\n* ⭐ [TPOT](https://github.com/EpistasisLab/tpot) - Python Automated Machine Learning tool [GitHub, 8826 stars]\n* ⭐ [Auto-PyTorch](https://github.com/automl/Auto-PyTorch) - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 1862 stars]\n* ⭐ [HungaBunga](https://github.com/ypeleg/HungaBunga) - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 674 stars]\n* 🔱 [AutoML Natural Language](https://cloud.google.com/natural-language/automl/docs) - Google's paid 
AutoML NLP service\n* ⭐ [Optuna](https://github.com/optuna/optuna) - hyperparameter optimization framework [GitHub, 7255 stars]\n* ⭐ [FLAML](https://github.com/microsoft/FLAML) - fast and lightweight AutoML library [GitHub, 2154 stars]\n* ⭐ [Gradsflow](https://github.com/gradsflow/gradsflow) - open-source AutoML \u0026 PyTorch Model Training Library [GitHub, 289 stars]\n\n#### OCR - Optical Character Recognition\n* 🎥️ [A framework for designing document processing solutions](https://ljvmiranda921.github.io/notebook/2022/06/19/document-processing-framework/) [Blog, June 2022]\n\n#### Document AI\n* 📙 [Table Transformer](https://huggingface.co/docs/transformers/main/model_doc/table-transformer) + [HuggingFace Models](https://huggingface.co/models?other=table-transformer)\n\n#### Text Generation\n* ⭐ [keytotext](https://github.com/gagan3012/keytotext) - a model which will take keywords as inputs and generate sentences as outputs [GitHub, 353 stars]\n* 📙 [Controllable Neural Text Generation](https://lilianweng.github.io/lil-log/2021/01/02/controllable-neural-text-generation.html) [Blog, Jan 2021]\n* ⭐ [BARTScore](https://github.com/neulab/BARTScore) Evaluating Generated Text as Text Generation [GitHub, 192 stars]\n\n#### Title / Headlines Generation\n* ⭐ [TitleStylist](https://github.com/jind11/TitleStylist) Learning to Generate Headlines with Controlled Styles [GitHub, 72 stars]\n\n#### NLP research reproducibility\n* 📙 [A Systematic Review of Reproducibility Research in Natural Language Processing](https://arxiv.org/abs/2103.07929) [Paper, March 2021]\n\n## License [CC0](./LICENSE)\n\n## Attributions\n#### Resources\n* All linked resources belong to original authors\n\n#### Icons\n* [Akropolis](https://thenounproject.com/search/?q=ancient%20greek\u0026i=403786) by parkjisun from the [Noun Project](https://thenounproject.com)\n* [Book](https://thenounproject.com/icon/304884/) of Ester by Gilad Sotil from the [Noun Project](https://thenounproject.com)\n* 
[quill](https://thenounproject.com/term/quill/17013/) by Juan Pablo Bravo from the [Noun Project](https://thenounproject.com)\n* [acting](https://thenounproject.com/term/acting/2369397/) by Flatart from the [Noun Project](https://thenounproject.com)\n* [olympic](https://thenounproject.com/term/olympic/1870751/) by supalerk laipawat from the [Noun Project](https://thenounproject.com)\n* [aristocracy](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3156156) by Eucalyp from the [Noun Project](https://thenounproject.com)\n* [Horn](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3156640) by Eucalyp from the [Noun Project](https://thenounproject.com)\n* [temple](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3156638) by Eucalyp from the [Noun Project](https://thenounproject.com)\n* [constellation](https://thenounproject.com/eucalyp/collection/ancient-greece-glyph/?i=3156142) by Eucalyp from the [Noun Project](https://thenounproject.com)\n* [ancient greek round pattern](https://thenounproject.com/term/ancient-greek-round-pattern/2048889/) by Olena Panasovska from the [Noun Project](https://thenounproject.com)\n* Harp by Vectors Point from the [Noun Project](https://thenounproject.com)\n* [Atlas](https://thenounproject.com/naripuru/collection/ancient-gods/?i=2225785) by parkjisun from the [Noun Project](https://thenounproject.com)\n* [Parthenon](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3158942) by Eucalyp from the [Noun Project](https://thenounproject.com)\n* [papyrus](https://thenounproject.com/iconmark/collection/greek-mythology/?i=3515982) by IconMark from the [Noun Project](https://thenounproject.com)\n* [papyrus](https://thenounproject.com/search/?q=papyrus\u0026i=2239368) by Smalllike from the [Noun Project](https://thenounproject.com)\n* [pegasus](https://thenounproject.com/search/?q=pegasus\u0026i=2266449) by Saeful Muslim from the [Noun 
Project](https://thenounproject.com)\n\n#### Fonts\n* [Dalek Font](https://www.dafont.com/dalek.font) \n\n-----\n\n\u003ch3 align=\"center\"\u003eThe Pandect Series also includes\u003c/h3\u003e\n\n\u003cp align=\"middle\"\u003e\n\u003ca href=\"https://github.com/ivan-bilan/The-Microservices-Pandect\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/ivan-bilan/The-Engineering-Manager-Pandect/main/Resources/Images/microservices_pandect_promo.png\" width=\"390\" /\u003e\n\u003c/a\u003e\n  \u0026nbsp; \u0026nbsp; \u0026nbsp;\n\u003ca href=\"https://github.com/ivan-bilan/The-Engineering-Manager-Pandect\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/ivan-bilan/The-Engineering-Manager-Pandect/main/Resources/Images/em_pandect_promo.png\" width=\"370\" /\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fivan-bilan%2FThe-NLP-Pandect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fivan-bilan%2FThe-NLP-Pandect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fivan-bilan%2FThe-NLP-Pandect/lists"}