https://github.com/irfnrdh/Awesome-Indonesia-NLP
Resource NLP & Bahasa
List: Awesome-Indonesia-NLP
- Host: GitHub
- URL: https://github.com/irfnrdh/Awesome-Indonesia-NLP
- Owner: irfnrdh
- Created: 2019-11-13T09:16:04.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-11-13T11:08:30.000Z (over 5 years ago)
- Last Synced: 2024-05-19T01:00:35.104Z (9 months ago)
- Topics: awesome, indonesian-language, nlp-datasets, nlp-resources
- Size: 52.7 KB
- Stars: 241
- Watchers: 7
- Forks: 60
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- ultimate-awesome - Awesome-Indonesia-NLP - Resource NLP & Bahasa . (Other Lists / Julia Lists)
README
Awesome Indonesia NLP
=====================

[Awesome](https://github.com/sindresorhus/awesome)

> A collection of datasets, theses, papers, and articles on NLP (_Natural Language Processing_) for the Indonesian language.
> Inspired by those who came before.

### Table of Contents
- [Indonesian NLP](#nlp-bahasa-indonesia)
- [Getting Started](#getting-started)
- [Books](#buku)
- [Tutorial](#tutorial)
- [Dataset](#dataset)
- [Automatic Summarization](#automatic-summarization)
- [Parsing](#parsing)
- [Part-of-speech Tagging](#part-of-speech-tagging)
- [Stemming](#stemming)
- [Word Sense Disambiguation](#word-sense-disambiguation)
- [Miscellaneous](#miscellaneous)
- [Software, Libraries, Dictionaries](#software-libraries-dictionaries)
- [Indonesian News Datasets](#Dataset-Berita-Indonesia)
- [Contributing](#contributing)

Getting Started
===============

### Introductory NLP Materials
- [Pengantar NLP](http://lecturer.eepis-its.edu/~kangedi/materi%20kuliah/Kecerdasan%20Buatan/Bab%205%20Natural%20Language%20Processing.pdf) - Kang Edi, PENS.
- [NLTK Book](http://nltk.org/book)
- [Text Mining with R - Julia Silge and David Robinson](https://www.tidytextmining.com/)

### Articles About NLP
- [Karena Data Gak Mungkin Bohong](https://medium.com/karena-x/karena-data-gak-mungkin-bohong-a17ff90cef87) - Jim Geovedi. 2014.
- [NLP Trend 2019](https://towardsdatascience.com/major-trends-in-nlp-a-review-of-20-years-of-acl-research-56f5520d473) - Janna, Towards Data Science.

### Journals
- [Indonesian News Classification using Support Vector Machine](https://zenodo.org/record/1074439)

Dataset & Language modeling
===========================

### Words dataset
1. [Word Sastrawi](https://github.com/sastrawi/sastrawi/tree/master/data)
1. [Word spaCy](https://github.com/explosion/spaCy/tree/master/spacy/lang/id) : id
1. [Word name](https://github.com/dominictarr/random-name) : random-name
1. [Word Indo name](https://github.com/seuriously/genderprediction/blob/master/namatraining.txt) : genderprediction
1. [Word Indo place](https://github.com/edwardsamuel/Wilayah-Administratif-Indonesia) : Wilayah-Administratif-Indonesia
1. [Word Indo place](https://github.com/pentagonal/Indonesia-Postal-Code) : Indonesia-Postal-Code
1. [Word Wiktionary](https://id.wiktionary.org/wiki) : word id
1. [Word sentiment](https://github.com/ramaprakoso/analisis-sentimen/tree/master/kamus) : analisis-sentimen
1. [Word sentiment](https://github.com/prasastoadi/ID-OpinionWords) : ID-OpinionWords
1. [Word sentiment](https://github.com/riochr17/Analisis-Sentimen-ID/tree/master/data) : Analisis-Sentimen-ID
1. [Word Acronyms](https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/acronym.txt)
1. [word](https://github.com/prasastoadi/serangkai/tree/master/serangkai/kamus/data) : serangkai

### Sentences Dataset
1. [Leipzig Indonesian sentence collection](http://wortschatz.uni-leipzig.de/en/download) news articles, web articles, Wikipedia data from 2008-2016
1. [wn-msa.sourceforge.net](https://sourceforge.net/p/wn-msa/tab/HEAD/tree/trunk/) Wordnet Bahasa
1. [Quran](http://tanzil.net/trans/id.indonesian) Indonesian Quran translation (id.muntakhab, id.jalalayn, id.indonesian)
1. [Kompas online collection](http://ilps.science.uva.nl/ilps/wp-content/uploads/sites/6/files/bahasaindonesia/kompas.zip).
This corpus contains [Kompas online](http://www.kompas.com/) news articles from 2001-2002. See
[here](http://ilps.science.uva.nl/resources/bahasa/) for more info and citations.
1. [Tempo online collection](http://ilps.science.uva.nl/ilps/wp-content/uploads/sites/6/files/bahasaindonesia/tempo.zip).
This corpus contains [Tempo online](https://www.tempo.co/) news articles from 2000-2002. See
[here](http://ilps.science.uva.nl/resources/bahasa/) for more info and citations.
1. [corpus-frog-storytelling](https://github.com/davidmoeljadi/corpus-frog-storytelling) spoken text story telling
1. [TED-Multilingual-Parallel-Corpus](https://raw.githubusercontent.com/ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus/master/Monolingual_data/Indonesian.txt) Monolingual_data/Indonesian
1. [Opus](http://opus.nlpl.eu/) Opus NLPL
1. [Sealang](http://sealang.net/indonesia/) Sealang dataset
1. [Indonesian News Corpus](https://data.mendeley.com/datasets/2zpbjs22k3/1)
1. [INDONESIAN HOAX NEWS DETECTION DATASET](https://data.mendeley.com/datasets/p3hfgr5j3m/1)
1. [Warta Berita Online Kompas dan Tempo](https://ilps.science.uva.nl/resources/bahasa/)
1. [Raw dataset of Indonesian news articles](https://github.com/feryandi/Dataset-Artikel)
1. [Amazon Reviews](https://snap.stanford.edu/data/web-Amazon.html)
1. [ArXiv](http://arxiv.org/help/bulk_data_s3)
1. [BimaNLP](https://github.com/drr3d/BimaNLP/tree/old_ver/dataset)

### Tagged dataset
1. [NER](https://github.com/yohanesgultom/nlp-experiments) : yohanesgultom/nlp-experiments 1700 sentences
1. [NER](https://github.com/yusufsyaifudin/indonesia-ner) : yusufsyaifudin/indonesia-ner 1835 sentences
1. [POS-TAG](https://github.com/famrashel/idn-tagged-corpus) : famrashel/idn-tagged-corpus
1. [POS-TAG](https://github.com/pebbie/pebahasa/blob/master/resource/Corpus.crp) : pebbie/pebahasa ~600 sentences
1. [POS-TAG Parser](https://github.com/UniversalDependencies/UD_Indonesian-GSD) : UniversalDependencies/UD_Indonesian-GSD ~4477 sentences
1. [Sentiment](https://github.com/riochr17/Analisis-Sentimen-ID/blob/master/data/training_all_random.csv) 1506 sentences
1. [panl10n](http://www.panl10n.net/english/OutputsIndonesia2.htm) Pan Localization

## Language modeling
## POS tagging
1. [PANL10N POS tagging](http://www.panl10n.net/english/outputs/Indonesia/UI/0802/UI-1M-tagged.zip).
This corpus has ~39K sentences and ~900K word tokens.
1. [IDN tagged corpus](https://github.com/famrashel/idn-tagged-corpus). This corpus contains
~10K sentences and ~250K word tokens. The POS tags are annotated manually.

## Syntactic parsing
1. [Indonesian Treebank](https://github.com/famrashel/idn-treebank). This corpus contains ~1K parsed
sentences. (constituency parsing)
1. [UD Indonesian](https://github.com/UniversalDependencies/UD_Indonesian-GSD). This corpus is
provided by [Universal Dependencies](http://universaldependencies.org/). Training, development,
and testing splits are already provided. (dependency parsing)

## Machine translation
1. [PANL10N EN-ID news parallel corpus](http://www.panl10n.net/english/outputs/Indonesia/BPPT/0902/BPPTIndToEngCorpusHalfM.zip).
This corpus has sentences from news articles from several categories: economy (6K sentences),
international (6K sentences), science (6K sentences), and sport (4K sentences).
1. [PANL10N Indonesian translation of Penn treebank](http://www.panl10n.net/english/outputs/Indonesia/UI/0802/Parallel%20Corpus.zip).
This corpus contains Indonesian translation of the Penn treebank. In total there are ~24K
sentences.

Automatic Summarization
=======================

- [Frequent Term based Text Summarization for Bahasa Indonesia](http://eprints.unsri.ac.id/3296/1/Frequent_Term_Based_Text_Summarization_For_Bahasa_Indonesia_-_E1213550.pdf)
M.Fachrurrozi, Novi Yusliani, and Rizky Utami Yoanita. International Conference on Innovations in Engineering and Technology (ICIET'2013) Dec. 25-26, 2013 Bangkok (Thailand).
Parsing
=======

- [Analisa Struktur Kalimat Bahasa Indonesia dengan Menggunakan Pengurai Kalimat Berbasis Linguistic String Analysis](http://staf.cs.ui.ac.id/WebKuliah/IKI40921/Shelly.doc)
Shavitri, Shelly. Undergraduate thesis, Computer Science, University of Indonesia, 1999.
- [INAGP : Pengurai Kalimat Bahasa Indonesia Sebagai Alat Bantu Untuk Pengembangan Aplikasi PBA](http://mail.informatika.org/~ayu/2009parser.pdf)
Rosalina Paramita N., Dwi H. Widyantoro, Ayu Purwarianti. Undergraduate thesis from JBPTITBPP, Institut Teknologi Bandung, 2007.
- [Penguraian Bahasa Indonesia dengan Menggunakan Pengurai Collins](http://digilib.itb.ac.id/gdl.php?mod=browse&op=read&id=jbptitbpp-gdl-rosaariani-25714)
Sukamto, Rosa Ariani. Master's thesis, Institut Teknologi Bandung, 2009.
Part-of-speech Tagging
======================

- [HMM Based Part-of-Speech Tagger for Bahasa Indonesia](http://mail.informatika.org/~ayu/2010postagger.pdf)
Wicaksono, A. Farizki and Purwarianti, Ayu. Proceedings of the 4th International Malindo (Malay and Indonesian Language) Workshop (2010).
- [Penggunaan Hidden Markov Model untuk Kompresi Kalimat](http://digilib.itb.ac.id/files/disk1/627/jbptitbpp-gdl-yudiwibiso-31314-1-2008ts-r.pdf)
Yudi Wibisono. Graduate Thesis. Institute of Technology Bandung. 2008.
- [Probabilistic Part Of Speech Tagging for Bahasa Indonesia](http://www.panl10n.net/english/outputs/Indonesia/UI/0901/UI-POSTAG.pdf)
Femphy Pisceldo, Mirna Adriani, Ruli Manurung. Third International MALINDO Workshop, co-located with ACL-IJCNLP 2009, Singapore, August 1, 2009.
Stemming
========

- [Effective Techniques for Indonesian Text Retrieval](http://researchbank.rmit.edu.au/eserv/rmit:6312/Asian.pdf)
Asian, J. (2007). PhD thesis, School of Computer Science and Information Technology, RMIT University, Australia.
- [Enhanced Confix Stripping Stemmer and Ants Algorithm for Classifying News Document in Indonesian Language](http://personal.its.ac.id/files/pub/2623-agusza-baru%2021%20d%20VIP%20enhanced-confix-stripping-stem.pdf)
Arifin, A.Z., I.P.A.K. Mahendra and H.T. Ciptaningtyas. 2009. Proceedings of the International Conference on Information & Communication Technology and Systems (ICTS).
- [Implementasi Modifikasi Enhanced Confix Stripping Stemmer Untuk Bahasa Indonesia dengan Metode Corpus Based Stemming](http://digilib.its.ac.id/public/ITS-Undergraduate-14255-paperpdf.pdf)
A. D. Tahitoe, D. Purwitasari. Institut Teknologi Sepuluh Nopember (ITS) – Surabaya.
Word Sense Disambiguation
=========================

- [Building an Indonesian WordNet](http://bahasa.cs.ui.ac.id/pub/malindo08wordnet.pdf)
Desmond Darma Putra, Abdul Arfan and Ruli Manurung. In Proceedings of the 2nd International MALINDO Workshop. 2008.
- [English-to-Indonesian Lexical Mapping using Latent Semantic Analysis](http://bahasa.cs.ui.ac.id/pub/malindo08lsa.pdf)
Eliza Margaretha, Franky, and Ruli Manurung. In Proceedings of the 2nd International MALINDO Workshop. 2008.
Miscellaneous
=============

- [A survey of bahasa Indonesia NLP research conducted at the University of Indonesia](http://staf.cs.ui.ac.id/~maruli/pub/malindo08nlp.pdf)
Mirna Adriani and Ruli Manurung. Faculty of Computer Science, University of Indonesia.
- [Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus](http://ufal.mff.cuni.cz/~larasati/papers/paper6.pdf)
Septina Dian Larasati, Vladislav Kuboň, and Daniel Zeman. Charles University in Prague.
- [Research Report on Local Language Computing: Development of Indonesian Language Resources and Translation System](http://www.panl10n.net/english/outputs/Indonesia/FinalReportID.pdf)
Adriani, Mirna. Riza, Hammam. 2008.
- [Towards a Semantic Analysis of Bahasa Indonesia for Question Answering](http://bahasa.cs.ui.ac.id/pub/pacling07.pdf)
Septina Dian Larasati and Ruli Manurung. Faculty of Computer Science. University of Indonesia. 2007.
Software, Libraries, Dictionaries
=================================

- [Kateglo](http://kateglo.com/) - Dictionary, thesaurus, and glossary of the Indonesian language.
- [Sastrawi](https://github.com/sastrawi/sastrawi) - PHP stemmer for the Indonesian language.

# Word reference (kemdikbud) [link](https://kbbi.kemdikbud.go.id/Beranda/Statistik)
1. Base entries: 48,748 (44.64%)
1. Derived words: 26,312 (24.09%)
1. Compound words: 30,625 (28.04%)
1. Proverbs: 2,040 (1.87%)
1. Figurative expressions: 268 (0.25%)
1. Idioms: 1,129 (1.03%)
1. Variants: 91 (0.08%)
1. Total entries: 109,213 (100.00%)
1. Total senses: 127,775
1. Total examples: 29,495
1. Total categories: 255
1. Senses per entry: 1.170
1. Examples per sense: 0.231

# Parallel corpus Eng-Ind
1. [parallel-corpora-en-id](https://github.com/prasastoadi/parallel-corpora-en-id/)
1. [Indonesian-English-Bilingual-Corpus](https://github.com/desmond86/Indonesian-English-Bilingual-Corpus)
1. [TALPCo](https://github.com/matbahasa/TALPCo)
1. [opus](http://opus.nlpl.eu/)
1. [Multi-Wiki](https://github.com/nguyenlab/Multi-Wiki)

# Morph
1. [MALINDO_Morph](https://github.com/matbahasa/MALINDO_Morph)
1. [morphind](http://septinalarasati.com/morphind/)
1. [INDRA](https://github.com/davidmoeljadi/INDRA)

# Crawler Data
1. [Crawler](https://github.com/harryandriyan/warta-scrap) Indonesian news portal

## Sentiment Analysis
1. [Aspect and Opinion Terms Extraction for Hotel Reviews](https://github.com/jordhy97/final_project).
The corpus consists of 5000 hotel reviews from [Airy](https://www.airyrooms.com/) (78K tokens) with 5 labels. The paper is available on [arXiv](https://arxiv.org/abs/1908.04899).
1. [Aspect-Based Sentiment Analysis](https://github.com/annisanurulazhar/absa-playground).
A text classification resource for multi-label aspect categorization.

## Word Normalization
1. [Colloquial Indonesian Lexicon](https://github.com/nasalsabila/kamus-alay).
This lexicon consists of 3592 unique colloquial tokens that are mapped onto 1742 unique lemmas. The full description of this lexicon can be seen in the [paper](https://ieeexplore.ieee.org/abstract/document/8629151).

## Text Summarization
1. [IndoSum](https://github.com/kata-ai/indosum).
A collection of 20K online news article-summary pairs belonging to 6 categories and 10 sources.
It has both abstractive summaries and extractive labels.

## Text Classification
1. [SMS Spam](http://nlp.yuliadi.pro/static/dataset_sms_spam_bhs_indonesia.zip).
This corpus contains 1143 sentences, each labeled as a normal message, fraud, or promotion. It is provided by http://nlp.yuliadi.pro/dataset
1. [Hate Speech Detection](https://github.com/ialfina/id-hatespeech-detection).
This dataset consists of 713 tweets in the Indonesian language with 453 non hate speech and 260 hate speech tweets.
1. [Abusive Language Detection](https://github.com/okkyibrohim/id-abusive-language-detection).
A collection of tweets for abusive language detection in Indonesian social media. It consists of two types of labelling, abusive/not abusive and not abusive/abusive but not offensive/offensive. It also has its own colloquial Indonesian lexicon.
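
The class counts given for the hate speech dataset above (453 non-hate-speech and 260 hate-speech tweets out of 713) describe an imbalanced label distribution. As a minimal sketch, the snippet below works out each class share and the accuracy of a classifier that always predicts the majority class, using only those two figures. The label names are illustrative assumptions, not identifiers taken from the dataset.

```python
from collections import Counter

# Counts as stated in the list entry; the label names are hypothetical.
label_counts = Counter({"non_hate_speech": 453, "hate_speech": 260})


def class_shares(counts):
    """Return each label's share of the total number of examples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def majority_baseline_accuracy(counts):
    """Accuracy of a classifier that always predicts the most frequent label."""
    total = sum(counts.values())
    return max(counts.values()) / total


shares = class_shares(label_counts)
baseline = majority_baseline_accuracy(label_counts)

print(f"total tweets: {sum(label_counts.values())}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {baseline:.3f}")
```

A model trained on this dataset would need to beat the majority-class baseline of roughly 63.5% accuracy before its score says anything about hate speech detection, which is one reason to report per-class metrics as well.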
## Speech recognition
1. [TITML-IDN speech corpus](http://research.nii.ac.jp/src/en/TITML-IDN.html).
The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances.
The utterances are phonetically balanced.
The corpus itself is free to use for academic/non-commercial purposes, but interested parties should make a formal request via email to the institution.
The procedure is listed [here](http://research.nii.ac.jp/src/en/register.html).
1. [Indonesian Speech Recognition](https://github.com/frankydotid/Indonesian-Speech-Recognition).
A small corpus of 50 utterances by a single male speaker. Disclaimer: This is a school project, do not use it for any important tasks. The author is not responsible for the undesired results of using the data provided here.
1. [CMU Wilderness Multilingual Speech Dataset](https://github.com/festvox/datasets-CMU_Wilderness).
A dataset of over 700 different languages providing audio, aligned texts, and word pronunciations.
One of the languages is Indonesian. The utterances are read from the Bible and were recorded by [bible.is](https://bible.is).
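
To get a rough sense of scale for the TITML-IDN corpus listed earlier in this section, the figures in its description (11 male and 9 female speakers, 343 utterances each) can be multiplied out. This is only a back-of-the-envelope sketch that assumes every speaker recorded the full set of utterances; the variable names are illustrative.

```python
# Figures taken from the TITML-IDN entry in this list.
speakers = {"male": 11, "female": 9}
utterances_per_speaker = 343

# Assumption: every speaker recorded all 343 utterances.
total_speakers = sum(speakers.values())
total_utterances = total_speakers * utterances_per_speaker

print(f"speakers: {total_speakers}")
print(f"utterances (upper bound): {total_utterances}")
```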
### Free Books
### Courses
1. [Natural Language Processing - Coursera](https://www.coursera.org/learn/language-processing)
2. [Natural Language Processing - edX](https://www.edx.org/course/natural-language-processing-nlp)
3. [Oxford CS Deep NLP](https://github.com/oxford-cs-deepnlp-2017)

### Videos and Lectures
1. [2016 CS224D Deep Learning For Natural Language Processing Lecture Videos](https://www.youtube.com/playlist?list=PLmImxx8Char9Ig0ZHSyTqGsdhb9weEGam)
2. [Natural Language Processing](https://www.youtube.com/watch?v=mieV29RVpuQ&list=PL0ap34RKaADMjqjdSkWolD-W2VSCyRUQC)

### Papers
1. [Breaking Sticks and Ambiguities with Adaptive Skip-gram](http://arxiv.org/abs/1502.07257)
2. [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
3. [Learning the Dimensionality of Word Embeddings](http://arxiv.org/abs/1511.05392)
4. [Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols](https://papers.nips.cc/paper/6810-emergence-of-language-with-multi-agent-games-learning-to-communicate-with-sequences-of-symbols.pdf)
5. [Skip Thought Vectors](http://arxiv.org/abs/1506.06726)

### Tutorials
1. [Natural Language Processing](http://aiplaybook.a16z.com/docs/guides/nlp)
2. [Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a)
3. [Multi-Class Classification Tutorial with the Keras Deep Learning Library](https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/)
4. [Topic Modeling with Scikit Learn](https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730)
5. [Data Science with Python & R: Sentiment Classification Using Linear Methods](https://www.codementor.io/jadianes/data-science-python-r-sentiment-classification-machine-learning-du107otfg)

### Sample Code
1. [Sentiment](https://github.com/vivekn/sentiment)
2. [Name Gender Prediction](https://github.com/vickydasta/prediksi-gender-nama)
3. [Topic Modeling](https://github.com/piskvorky/topic_modeling_tutorial)
4. [POS Tagging NLTK (Bahasa Indonesia)](https://github.com/mrrizal/POS_Tag_Indonesian)
5. [Naive Bayes Document Classifier (Bahasa Indonesia)](https://github.com/mrrizal/Document_Classifier)

### Datasets
### Libraries
1. [NLTK](http://www.nltk.org/)
2. [Gensim](https://github.com/RaRe-Technologies/gensim)
3. [TextBlob](https://github.com/sloria/textblob)
4. [Spacy](https://github.com/explosion/spaCy)
5. [Sastrawi](https://github.com/sastrawi/sastrawi)
6. [Nalapa](https://github.com/anpandu/nalapa)
7. [Polyglot](https://github.com/aboSamoor/polyglot)

## Contributing
If you would like to contribute to this repository, opening a `Pull Request` is strongly encouraged, provided the resource is in the Indonesian language.

## Frequently Asked Questions (FAQ)
The FAQ answers common questions about this repository, from _naming conventions_ and basic questions to more advanced ones.

# Awesome NLP Papers
This is a collection/reading-list of awesome Natural Language Processing papers sorted by date.
### 2018
- [X] **Unsupervised Machine Translation Using Monolingual Corpora Only**, Lample et al.
[`Paper`](https://arxiv.org/abs/1711.00043)
- [X] **On the Dimensionality of Word Embeddings**, Yin et al.
[`Paper`](https://papers.nips.cc/paper/7368-on-the-dimensionality-of-word-embedding)
- [X] **An efficient framework for learning sentence representations**, Logeswaran et al.
[`Paper`](https://arxiv.org/abs/1803.02893)
- [X] **Refining Pretrained Word Embeddings Using Layer-wise Relevance Propagation**, Akira Utsumi
[`Paper`](http://aclweb.org/anthology/D18-1520)
- [X] **Domain Adapted Word Embeddings for Improved Sentiment Classification**, Sarma et al.
[`Paper`](https://arxiv.org/abs/1805.04576)
- [X] **In-domain Context-aware Token Embeddings Improve Biomedical Named Entity Recognition**, Sheikhshab et al.
[`Paper`](http://www.aclweb.org/anthology/W18-5618)
- [X] **Generalizing Word Embeddings using Bag of Subwords**, Zhao et al.
[`Paper`](https://arxiv.org/abs/1809.04259)
- [X] **What's in Your Embedding, And How It Predicts Task Performance**, Rogers et al.
[`Paper`](http://www.aclweb.org/anthology/C18-1228)
- [X] **On Learning Better Word Embeddings from Chinese Clinical Records: Study on Combining In-Domain and Out-Domain Data**, Wang et al.
[`Paper`](http://www.aclweb.org/anthology/W18-2323)
- [X] **Predicting and interpreting embeddings for out of vocabulary words in downstream tasks**, Garneau et al.
[`Paper`](http://www.aclweb.org/anthology/W18-5439)
- [X] **Addressing Low-Resource Scenarios with Character-aware Embeddings**, Papay et al.
[`Paper`](http://www.aclweb.org/anthology/W18-1204)
- [X] **Domain Adaptation for Disease Phrase Matching with Adversarial Networks**, Liu et al.
[`Paper`](http://www.aclweb.org/anthology/W18-2315)
- [X] **Investigating Effective Parameters for Fine-tuning of Word Embeddings Using Only a Small Corpus**, Komiya et al.
[`Paper`](http://www.aclweb.org/anthology/W18-3408)
- [X] **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding**, Devlin et al.
[`Paper`](https://arxiv.org/abs/1810.04805)
- [X] **Adapting Word Embeddings from Multiple Domains to Symptom Recognition from Psychiatric Notes**, Zhang et al.
[`Paper`](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5961810/)
- [ ] **Evaluation of sentence embeddings in downstream and linguistic probing tasks**, Perone et al.
[`Paper`](https://arxiv.org/abs/1806.06259)
- [ ] **Universal Sentence Encoder**, Cer et al.
[`Paper`](https://arxiv.org/abs/1803.11175)
- [X] **Deep Contextualized Word Representations**, Peters et al.
[`Paper`](https://arxiv.org/abs/1802.05365)
- [X] **Learned in Translation: Contextualized Word Vectors**, McCann et al.
[`Paper`](https://arxiv.org/abs/1708.00107)
- [X] **Concatenated p-mean Word Embeddings as Universal Cross-Lingual Sentence Representations**, Rücklé et al.
[`Paper`](https://arxiv.org/abs/1803.01400)
- [X] **A Compressed Sensing View of Unsupervised Text Embeddings, Bag-Of-n-Grams, and LSTMs**, Arora et al.
[`Paper`](https://openreview.net/pdf?id=B1e5ef-C-)

### 2017:
- [X] **Attention Is All You Need**, Vaswani et al.
[`Paper`](http://papers.nips.cc/paper/7181-attention-is-all-you-need)
- [X] **Skip-Gram – Zipf + Uniform = Vector Additivity**, Gittens et al.
[`Paper`](http://www.aclweb.org/anthology/P17-1007)
- [X] **A Simple but Tough-to-beat Baseline for Sentence Embeddings**, Arora et al.
[`Paper`](https://openreview.net/pdf?id=SyK00v5xx)
- [X] **Fast and Accurate Entity Recognition with Iterated Dilated Convolutions**, Strubell et al.
[`Paper`](https://arxiv.org/abs/1702.02098)
- [X] **Advances in Pre-Training Distributed Word Representations**, Mikolov et al.
[`Paper`](https://arxiv.org/abs/1712.09405)
- [X] **Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets**, Dror et al.
[`Paper`](https://arxiv.org/abs/1709.09500)

### 2016:
- [X] **Towards Universal Paraphrastic Sentence Embeddings**, Wieting et al.
[`Paper`](https://arxiv.org/abs/1511.08198)
- [X] **Bag of Tricks for Efficient Text Classification**, Joulin et al.
[`Paper`](https://arxiv.org/abs/1607.01759)
- [X] **Enriching Word Vectors with Subword Information**, Bojanowski et al.
[`Paper`](https://arxiv.org/abs/1607.04606)
- [X] **Assessing the Corpus Size vs. Similarity Trade-off for Word Embeddings in Clinical NLP**, Kirk Roberts
[`Paper`](http://www.aclweb.org/anthology/W16-4208)
- [X] **How to Train Good Word Embeddings for Biomedical NLP**, Chiu et al.
[`Paper`](http://www.aclweb.org/anthology/W16-2922)
- [X] **Log-Linear Models, MEMMs, and CRFs**, Michael Collins
[`Paper`](http://www.cs.columbia.edu/~mcollins/crf.pdf)
- [X] **Counter-fitting Word Vectors to Linguistic Constraints**, Mrkšić et al.
[`Paper`](https://arxiv.org/abs/1603.00892)
- [X] **Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation**, Wu et al.
[`Paper`](https://arxiv.org/abs/1609.08144)

### 2015:
- [ ] **Semi-supervised Sequence Learning**, Dai et al.
[`Paper`](https://arxiv.org/abs/1511.01432)
- [X] **Evaluating distributed word representations for capturing semantics of biomedical concepts**, Th et al.
[`Paper`](http://www.aclweb.org/anthology/W15-3820)

### 2014:
- [X] **GloVe: Global Vectors for Word Representation**, Pennington et al.
[`Paper`](https://www.aclweb.org/anthology/D14-1162)
- [X] **Linguistic Regularities in Sparse and Explicit Word Representations**, Levy and Goldberg.
[`Paper`](https://www.cs.bgu.ac.il/~yoavg/publications/conll2014analogies.pdf)
- [X] **Neural Word Embedding as Implicit Matrix Factorization**, Levy and Goldberg.
[`Paper`](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf)
- [X] **word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method**, Goldberg and Levy.
[`Paper`](https://arxiv.org/abs/1402.3722)
- [X] **What’s in a p-value in NLP?**, Søgaard et al.
[`Paper`](http://www.aclweb.org/anthology/W14-1601)
- [X] **How transferable are features in deep neural networks?**, Yosinski et al.
[`Paper`](http://papers.nips.cc/paper/5347-how-transferable-are-features-in-deep-n%E2%80%A6)
- [X] **Improving lexical embeddings with semantic knowledge**, Yu et al.
[`Paper`](http://www.aclweb.org/anthology/P14-2089)
- [X] **Retrofitting word vectors to semantic lexicons**, Faruqui et al.
[`Paper`](https://arxiv.org/abs/1411.4166)

### 2013:
- [X] **Efficient Estimation of Word Representations in Vector Space**, Mikolov et al.
[`Paper`](https://arxiv.org/pdf/1301.3781.pdf)
- [X] **Linguistic Regularities in Continuous Space Word Representations**, Mikolov et al.
[`Paper`](https://www.aclweb.org/anthology/N13-1090)
- [X] **Distributed Representations of Words and Phrases and their Compositionality**, Mikolov et al.
[`Paper`](https://arxiv.org/abs/1310.4546)

### 2012:
- [X] **An Empirical Investigation of Statistical Significance in NLP**, Berg-Kirkpatrick et al.
[`Paper`](https://dl.acm.org/citation.cfm?id=2391058)

### 2010:
- [X] **Word representations: A simple and general method for semi-supervised learning**, Turian et al.
[`Paper`](https://dl.acm.org/citation.cfm?id=1858721)

### 2008:
- [ ] **A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning**, Collobert and Weston.
[`Paper`](https://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf)

### 2006:
- [X] **Domain adaptation with structural correspondence learning**, Blitzer et al.
[`Paper`](https://dl.acm.org/citation.cfm?id=1610094)

### 2003:
- [X] **A Neural Probabilistic Language Model**, Bengio et al.
[`Paper`](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

### 1986:
- [ ] **Distributed Representations**, Hinton et al.
[`Paper`](https://web.stanford.edu/~jlmcc/papers/PDP/Chapter3.pdf)