Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-datasets

A comprehensive list of annotated training datasets classified by use case.
https://github.com/kili-technology/awesome-datasets


  • Speech Recognition

  • Document Classification

    • English

      • RVL-CDIP Dataset - The RVL-CDIP dataset consists of 400,000 grayscale images in 16 classes (letter, memo, email, file folder, form, handwritten, invoice, advertisement, budget, news article, presentation, scientific publication, questionnaire, resume, scientific report, specification), with 25,000 images per class (a loading sketch follows this list).
      • Top Streamers on Twitch
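
The RVL-CDIP entry above describes a standard single-label image classification setup: 400,000 grayscale document images balanced across 16 classes. As a rough sketch only, the snippet below loads such a dataset with torchvision, assuming the images have been reorganised into one sub-directory per class; the paths, image size and batch size are illustrative placeholders, not part of the dataset's own distribution format.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Typical preprocessing for grayscale document images such as RVL-CDIP:
# keep a single channel, resize to a fixed size, convert to a tensor.
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Assumes the images have been reorganised into one sub-directory per class
# (letter/, memo/, email/, ...) -- a hypothetical layout chosen for this sketch.
train_set = datasets.ImageFolder("rvl-cdip/train", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

print(f"{len(train_set)} images across {len(train_set.classes)} classes")
```
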
  • Document Question Answering

    • English

      • Multi-Domain Wizard-of-Oz dataset (MultiWOZ) - A task-oriented annotated corpus of human-human written conversations spanning multiple domains.
      • AmbigQA - Open-domain question answering is often ambiguous; especially when exploring new topics, it can be difficult to ask questions that have a single, unambiguous answer. AmbigQA is a new open-domain question answering task which involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question.
      • chatterbot/english
      • Coached Conversational Preference Elicitation
      • ConvAI2 dataset
      • Customer Support on Twitter
      • HotpotQA - A question answering dataset featuring natural, multi-hop questions, with a strong emphasis on supporting facts to allow for more explainable question answering systems. The dataset consists of 113,000 Wikipedia-based QA pairs; a generic record-loading sketch follows this list.
      • Maluuba goal-oriented dialogue - Goal-oriented conversations, in particular about finding flights and a hotel. The dataset contains complex conversations and decisions covering over 250 hotels, flights and destinations.
      • Natural Questions (NQ) - A large-scale corpus for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions. NQ consists of 300,000 naturally occurring questions with human-annotated answers from Wikipedia pages, for use in training QA systems. In addition, 16,000 examples have answers (to the same questions) provided by 5 different annotators, useful for evaluating the performance of the learned QA systems.
      • QuAC - Question Answering in Context, a dataset for modeling, understanding, and participating in information-seeking dialogue; it consists of information-seeking QA dialogues with 100K questions in total.
      • RecipeQA - A dataset for multimodal comprehension of cooking recipes with step-by-step instructions and images. Each RecipeQA question involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) a common understanding of images and text, (ii) capturing the temporal flow of events, and (iii) understanding procedural knowledge.
      • Relational Strategies in Customer Service (RSiCS) Dataset - Travel-related customer service data from four sources: conversation logs from three commercial customer service IVAs (intelligent virtual assistants) and airline forums on TripAdvisor.com during the month of August 2016.
      • Santa Barbara Corpus of Spoken American English
      • TREC QA Collection - A question answering collection featuring both open-domain and closed-domain questions.
      • The WikiQA corpus - A set of open-domain questions. In order to reflect the true information needs of general users, Bing query logs were used as the source of questions. Each question is linked to a Wikipedia page that potentially contains the answer.
      • Ubuntu Dialogue Corpus - Two-person conversations extracted from Ubuntu chat logs, used to obtain technical support for various Ubuntu-related issues. The dataset contains 930,000 dialogues and over 100,000,000 words.
      • NarrativeQA - A reading comprehension dataset of books and movie scripts with human-written question-and-answer pairs.
      • OpenBookQA - Modeled after open-book exams for assessing human understanding of a subject. The open book that accompanies the questions is a set of 1,329 elementary-level science facts. Approximately 6,000 questions focus on understanding these facts and applying them to new situations.
      • QASC - A question-and-answer dataset that focuses on sentence composition. It consists of 9,980 8-way multiple-choice questions on elementary school science (8,134 train, 926 dev, 920 test), and is accompanied by a corpus of 17M sentences.
      • SGD (Schema-Guided Dialogue) dataset - Multi-domain, task-oriented conversations covering 16 domains.
      • Break
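
Despite their different focuses, most of the question answering corpora listed above (HotpotQA, Natural Questions, WikiQA, and others) reduce to records that pair a question with one or more answers plus the supporting context. The sketch below shows one plausible in-memory representation and a loader for a JSON-lines file in that shape; the field names and the file path are assumptions for illustration, not the actual schema of any of these datasets.

```python
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class QAExample:
    """One question-answer record, loosely modelled on corpora such as
    HotpotQA or Natural Questions; the field names are illustrative."""
    question: str
    answers: List[str]
    context: str                       # passage(s) the answer is drawn from
    supporting_facts: List[str] = field(default_factory=list)  # e.g. sentence ids

def load_qa_jsonl(path: str) -> List[QAExample]:
    """Read a JSON-lines file where each line holds one QA record."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            examples.append(QAExample(
                question=record["question"],
                answers=record.get("answers", []),
                context=record.get("context", ""),
                supporting_facts=record.get("supporting_facts", []),
            ))
    return examples

if __name__ == "__main__":
    # "qa_train.jsonl" is a placeholder path, not a file shipped by any dataset.
    for example in load_qa_jsonl("qa_train.jsonl")[:3]:
        print(example.question, "->", example.answers[:1])
```
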
    • Multilingual

  • Key Information Extraction

    • English

      • NIST - Black-and-white images of synthesized documents: 900 simulated tax submissions, 5,590 images of completed structured form faces, and 5,590 text files containing entry field answers.
      • The Kleister NDA dataset - Non-Disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract.
      • The Kleister Charity dataset
      • CORD - A Consolidated Receipt Dataset with box-level semantic labels for parsing. Labels are the bounding box position and the text of the key information (a grouping sketch follows this list).
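
Datasets such as CORD annotate each OCR box on a receipt with a semantic label, so key information extraction can start from simply grouping boxes by label. The sketch below illustrates that idea; the field names and the example label strings are assumptions for illustration, not the dataset's official schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class LabeledBox:
    """One OCR box with a semantic label, in the spirit of receipt datasets
    such as CORD; the exact field names used here are assumptions."""
    text: str
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
    label: str                       # e.g. "menu.nm", "menu.price", "total.total_price"

def group_by_label(boxes: List[LabeledBox]) -> Dict[str, List[str]]:
    """Collect the text of all boxes that share a semantic label,
    turning box-level labels into key information fields."""
    grouped: Dict[str, List[str]] = {}
    for b in boxes:
        grouped.setdefault(b.label, []).append(b.text)
    return grouped

# A tiny hand-made example receipt, not real dataset content.
receipt = [
    LabeledBox("Espresso", (34, 120, 180, 150), "menu.nm"),
    LabeledBox("3.50",     (420, 120, 470, 150), "menu.price"),
    LabeledBox("3.50",     (420, 300, 470, 330), "total.total_price"),
]
print(group_by_label(receipt))
```
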
    • Multilingual

      • XFUND - Human-labeled forms with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese); a key-value pairing sketch follows this list.
      • GHEGA - 110 data-sheets of electronic components and 136 patents. Each group is further divided into classes: data-sheet classes share the component type and producer; patent classes share the patent source.
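
Form understanding datasets like XFUND label text segments as questions, answers or headers and link related segments, so key-value extraction amounts to following those links. The sketch below pairs question entities with the answer entities they link to, assuming a FUNSD/XFUND-style JSON annotation with fields named id, text, label and linking; treat those names and the file path as assumptions rather than the exact XFUND release format.

```python
import json
from typing import Dict, List, Tuple

def extract_key_value_pairs(path: str) -> List[Tuple[str, str]]:
    """Pair 'question' entities with the 'answer' entities they are linked to.

    Assumes a FUNSD/XFUND-style annotation whose entities carry the fields
    'id', 'text', 'label' and 'linking'; these names are assumptions, not a
    guaranteed match for the exact XFUND release format.
    """
    with open(path, encoding="utf-8") as f:
        form = json.load(f)

    entities: List[Dict] = form["document"]            # labelled text segments
    by_id = {entity["id"]: entity for entity in entities}

    pairs: List[Tuple[str, str]] = []
    for entity in entities:
        if entity["label"] != "question":
            continue
        for link in entity.get("linking", []):          # each link is a pair of entity ids
            for other_id in link:
                other = by_id.get(other_id)
                if other is not None and other["label"] == "answer":
                    pairs.append((entity["text"], other["text"]))
    return pairs

if __name__ == "__main__":
    # "form_annotation.json" is a placeholder path for one annotated form.
    for key, value in extract_key_value_pairs("form_annotation.json"):
        print(f"{key} -> {value}")
```
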
  • Optical Character Recognition

  • Document Layout Analysis

  • Instance Segmentation

  • Named-Entity Recognition