Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-nlp-resource

awesome nlp resource
https://github.com/luyaojie/awesome-nlp-resource

  • Uncategorized

    • Uncategorized

      • CC-CEDICT
      • CMUdict - An open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations.
      • PDEV
      • VerbNet
      • FrameNet
      • PropBank - Semantic role annotations defined on a verb-by-verb basis.
      • NomBank
      • SemLink
      • Framester - A hub between FrameNet, WordNet, VerbNet, BabelNet, DBpedia, Yago, DOLCE-Zero, as well as other resources. Framester does not simply create a strongly connected knowledge graph, but also applies a rigorous formal treatment of Fillmore's frame semantics, enabling full-fledged OWL querying and reasoning on the resulting joint frame-based knowledge graph.
      • PTB
      • PDTB2.0
      • PDTB3.0
      • WikiText - Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger.
      • UNCorpus
      • CWMT - Chinese-English (ZH-EN) data collected and shared by the China Workshop on Machine Translation (CWMT) community. There are three types of data for Chinese-English machine translation: monolingual Chinese text, parallel Chinese-English text, and multiple-reference text.
      • WMT
      • Wikipedia Person and Animal Dataset
      • AG's corpus of news articles
      • Google-Snippets
      • MPQA 3.0
      • SentiWordNet
      • NRC Word-Emotion Association Lexicon
      • Stanford Sentiment TreeBank
      • SemEval-2013 Twitter - level sentiment annotation.
      • Sentihood - A dataset for targeted aspect-based sentiment analysis, which contains 5,215 sentences. *SentiHood: Targeted Aspect Based Sentiment Analysis Dataset for Urban Neighbourhoods, COLING 2016*.
      • Google News Word2vec - 300-dimensional vectors for 3 million words and phrases, trained on part of the Google News dataset (about 100 billion words).
      • GloVe Pre-trained - Pre-trained word vectors using GloVe. Trained on Wikipedia + Gigaword 5, Common Crawl, and Twitter.
      • fastText Pre-trained - Pre-trained word vectors for 294 languages, trained on Wikipedia using fastText.
      • Dependency-based Word Embedding - Pre-trained word embeddings based on **dependency** information, from *Dependency-Based Word Embeddings, ACL 2014*.
      • Meta-Embeddings - From *Meta-Embeddings: Higher-quality word embeddings via ensembles of Embedding Sets, ACL 2016*.
      • charNgram2vec - Code for pre-training the character n-gram embeddings presented in the Joint Many-Task (JMT) paper, *A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks, EMNLP 2017*.
      • ELMo - Pre-trained contextual representations from large-scale bidirectional language models, which provide large improvements for nearly all supervised NLP tasks.
      • TriviaQA - Question-answer pairs authored by trivia enthusiasts, with independently gathered evidence documents (six per question on average) that provide high-quality distant supervision for answering the questions. The dataset covers the Wikipedia domain and the Web domain.
      • NewsQA - A crowd-sourced machine reading comprehension dataset of 120K Q&A pairs.
      • HarvestingQA - A paragraph-level QA-pairs dataset (split into train, dev, and test sets) described in *Harvesting Paragraph-Level Question-Answer Pairs from Wikipedia* (ACL 2018).
      • ProPara
      • Quora Question Pairs
      • CoNLL-2003 - The CoNLL-2003 shared task concerns language-independent named entity recognition. It concentrates on four types of named entities: persons, locations, organizations, and names of miscellaneous entities that do not belong to the previous three groups.
      • Ultra-Fine Entity Typing - A task of predicting a set of free-form phrases (e.g. skyscraper, songwriter, or criminal) that describe appropriate types for the target entity.
      • Named Entity Recognition on Code-switched Data - Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. The dataset contains training and development data for tuning and testing systems in the following language pairs: Spanish-English (SPA-ENG) and Modern Standard Arabic-Egyptian (MSA-EGY).
      • TACRED - A large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Details in *Position-aware Attention and Supervised Data Improve Slot Filling, EMNLP 2017*.
      • SemEval 2018 Task7
      • TAC-KBP - A track in TAC Knowledge Base Population (KBP), which started in 2015. The goal of TAC KBP is to develop and evaluate technologies for populating knowledge bases (KBs) from unstructured text.
      • SemEval-2015 Task 4 - TimeLine: Cross-Document Event Ordering. Given a set of documents and a target entity, the task is to build an event timeline related to that entity, i.e. to detect, anchor in time, and order the events involving the target entity.
      • RED - Event-event relation (temporal, causal, subevent, and reporting) annotations over 95 English newswire, discussion forum, and narrative text documents, covering all events, times, and non-eventive entities within each document.
      • MEANTIME - Within-document and cross-document event and entity coreference.
      • BioNLP-ST 2013 - BioNLP-ST 2013 features six event extraction tasks: Genia Event Extraction for NFkB knowledge base construction, Cancer Genetics, Pathway Curation, Corpus Annotation with Gene Regulation Ontology, Gene Regulation Network in Bacteria, and Bacteria Biotopes (semantic annotation by an ontology).
      • RAMS - A dataset for multi-sentence argument linking. It contains 9,124 annotated events from news, based on an ontology of 139 event types and 65 roles.
      • M2E2 - A dataset for multimedia event extraction, which consists of 245 fully annotated news articles.
      • CaTeRS - Captures a comprehensive set of temporal and causal relations between events. CaTeRS contains a total of 1,600 sentences in the context of 320 five-sentence short stories sampled from the ROCStories corpus.
      • Causal-TimeBank - Causal-TimeBank is the TimeBank corpus taken from the TempEval-3 task, augmented with causality information in the form of C-SIGNAL and CLINK annotations. It contains 6,811 EVENTs (only events instantiated by the MAKEINSTANCE tag of TimeML), 5,118 TLINKs (temporal links), 171 CSIGNALs (causal signals), and 318 CLINKs (causal links).
      • EventCausalityData
      • TempEval-3 - The TempEval-3 shared task aims to advance research on temporal information processing.
      • TimeBank
      • TimeBank-EventTime Corpus - darmstadt.de/fileadmin/user_upload/Group_UKP/publikationen/2016/2016_Reimers_Temporal_Anchoring_of_Events.pdf).
      • UW Event Factuality Dataset - Annotations of the TempEval-3 corpus with factuality assessment labels.
      • FactBank 1.0
      • UDS
      • DLEF - Document- and sentence-level event factuality.
      • ECB 1.0 - Within- and cross-document event coreference information. The documents are grouped according to the Google News Cluster, with each group of documents representing the same seminal event (or topic).
      • EECB 1.0
      • ECB+
      • Narrative Cloze Evaluation Data
      • Event Tensor - From *Event Representations with Tensor-based Compositions, AAAI 2018*.
      • NeuralOpenIE
      • SNLI - A collection of human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE).
      • MultiNLI - The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers **a range of genres** of spoken and written text and supports a distinctive cross-genre generalization evaluation.
      • Scitail - An entailment dataset created from multiple-choice science exams and web sentences. The domain makes this dataset different in nature from previous datasets: it consists of factual sentences rather than scene descriptions.
      • Commonsense Knowledge Representation - ConceptNet-related resources. Details in *Commonsense Knowledge Base Completion, ACL 2016*.
      • ATOMIC - Commonsense knowledge organized as if-then relations with variables.
      • SenticNet
      • S2ORC: Semantic Scholar Open Research Corpus - A corpus of English-language academic papers spanning many academic disciplines. It includes rich metadata, paper abstracts, resolved bibliographic references, and structured full text for 8.1M open access papers.
      • SCIERC
      • QA-SRL - Uses question-answer pairs to model verbal predicate-argument structure. The questions start with wh-words (Who, What, Where, etc.) and contain a verb predicate from the sentence; the answers are phrases in the sentence.
      • CoNLL 2010 Uncertainty Detection
      • COLING 2018 automatic identification of verbal MWE
      • Tencent Automatic Article Commenting - A large-scale Chinese dataset with millions of real comments and a human-annotated subset characterizing the comments’ varying quality. The dataset consists of around 200K news articles and 4.5M human comments, along with rich metadata for article categories and user votes on comments.
      • Shimaoka Fine-grained - Data for fine-grained entity classification, provided in a preprocessed tokenized format. Details in *Neural architectures for fine-grained entity type classification, EACL 2017*.
      • MIT Restaurant Corpus
      • InScript - Stories annotated with scenario-specific event and participant labels.