https://github.com/stanford-oval/schema2qa

# Stanford Schema2QA Dataset

## Overview
Schema2QA is the first large question answering dataset over real-world Schema.org data.
It covers 6 common domains: restaurants, hotels, people, movies, books, and music,
based on crawled Schema.org metadata from 6 different websites (Yelp, Hyatt, LinkedIn, IMDb, Goodreads, and Last.fm).
In total, there are __over 2,000,000 examples for training__, consisting of both augmented
human paraphrase data and high-quality synthetic data generated by
[Genie](https://github.com/stanford-oval/genie-toolkit).
All questions are annotated in [ThingTalk](https://wiki.almond.stanford.edu/en/thingtalk),
an executable virtual assistant programming language.

Schema2QA includes challenging evaluation questions collected from crowd workers.
Workers are prompted with only what the domain is and what properties are supported.
Thus, the sentences are natural and diverse. They also contain entities unseen during training.
The collected sentences are manually annotated with ThingTalk by the authors.
In total there are __over 5,000 examples for dev and test__.

An example of an evaluation question and its ThingTalk annotation is shown below:

_"What are the highest ranked burger joints in the 40 mile area around Asheville NC?"_
```js
sort(aggregateRating.ratingValue desc of @org.schema.Restaurant.Restaurant()
filter distance(geo, new Location("asheville nc" )) <= 40 mi &&
servesCuisine =~ "burger")[1] ;
```
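
For readers unfamiliar with ThingTalk, the following is a rough Python analogue of what the annotation above expresses: filter restaurants by distance and cuisine, sort by rating in descending order, and take the first result. It is only a sketch, not how ThingTalk programs are executed. The `Restaurant` class, the helper names, the sample records, and the Asheville coordinates are all made up for illustration, and reading `=~` as a case-insensitive substring match is an assumption.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Restaurant:
    # Hypothetical record type; field names loosely mirror the Schema.org properties.
    name: str
    lat: float
    lon: float
    rating: float
    cuisines: list = field(default_factory=list)


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine formula)."""
    earth_radius_mi = 3958.8
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_mi * math.asin(math.sqrt(a))


def top_burger_place(restaurants, center, radius_mi=40.0):
    """Analogue of: sort(rating desc of Restaurant() filter distance <= 40 mi && cuisine =~ "burger")[1]."""
    matches = [
        r for r in restaurants
        if miles_between(r.lat, r.lon, center[0], center[1]) <= radius_mi
        # Assumption: `=~` behaves like a case-insensitive substring match.
        and any("burger" in c.lower() for c in r.cuisines)
    ]
    matches.sort(key=lambda r: r.rating, reverse=True)
    # ThingTalk's [1] selects the first element (1-based indexing).
    return matches[0] if matches else None


# Illustrative coordinates and made-up restaurants, not taken from the dataset.
ASHEVILLE = (35.5951, -82.5515)
SAMPLE = [
    Restaurant("Near Burgers", 35.60, -82.55, 4.2, ["Burgers"]),
    Restaurant("Far Burgers", 35.2271, -80.8431, 4.9, ["Burgers"]),  # roughly 100 miles away
    Restaurant("Near Sushi", 35.58, -82.56, 4.8, ["Sushi"]),
    Restaurant("Other Burger", 35.70, -82.60, 4.5, ["Burger", "American"]),
]

best = top_burger_place(SAMPLE, ASHEVILLE)
```

Here `best` is the highest-rated burger place within the radius. The higher-rated "Far Burgers" is excluded by the distance filter and "Near Sushi" by the cuisine filter.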

## What's new in 2.0
The main difference is that all examples in the dataset have been reannotated
with [ThingTalk 2.0](https://github.com/stanford-oval/thingtalk/tree/v2.0.0).
ThingTalk 2.0 is a major redesign of the language that makes it more accessible,
less verbose, and more compatible with pre-trained neural networks.
More details about the changes can be found in the [release history](https://github.com/stanford-oval/thingtalk/blob/master/HISTORY.md).
The synthetic data has been regenerated with [Genie v0.8.0](https://github.com/stanford-oval/genie-toolkit/tree/v0.8.0),
which improves both quality and efficiency.
The evaluation set also received minor annotation fixes and had duplicated examples removed,
so it is slightly smaller for some domains,
but its diversity remains the same.

You can still find information about Schema2QA 1.0 [here](./doc/1.0.md).
However, we no longer recommend using Schema2QA 1.0, as its ThingTalk
annotations are outdated.

## Leaderboard
All numbers are evaluated on the Schema2QA test set, which is not included in this repository.
Please contact us at [email protected] to evaluate your model(s) on the test data.
Accuracy on dev set can be found [here](doc/dev.md).
Note that these accuracies differ from those reported in our papers because the dataset has changed.
#### Schema2QA
Models are trained on the full Schema2QA training data: synthetic data generated from manual natural language
annotations of the Schema.org properties, plus human paraphrase data. Both are augmented with crawled
real property values.

Rank | Model | Restaurants | People | Movies | Books | Music | Hotels | Average |
---- | --------------------------------------------------------------------------------| ----------- | ------ | ------ | ----- | ----- | ------ | ------- |
1 | [BART](https://arxiv.org/pdf/2009.07968.pdf) (Stanford) | 73.3% | 80.0% | 81.7% | 72.5% | 70.3% | 69.5% | 74.5% |
2 | [BERT-LSTM](https://dl.acm.org/doi/abs/10.1145/3340531.3411974) (Stanford) | 64.3% | 73.8% | 66.8% | 46.7% | 58.0% | 55.9% | 60.9% |

#### AutoQA
Models are trained on a dataset fully synthesized with [AutoQA](https://almond-static.stanford.edu/papers/autoqa-emnlp2020.pdf),
using automatically generated natural language annotations and a neural paraphraser.

Rank | Model | Restaurants | People | Movies | Books | Music | Hotels | Average |
---- | --------------------------------------------------------------------------------| ----------- | ------ | ------ | ----- | ----- | ------ | ------- |
1 | [BART](https://arxiv.org/pdf/2009.07968.pdf) (Stanford) | 77.3% | 76.2% | 83.4% | 65.1% | 62.9% | 72.2% | 72.9% |
2 | [BERT-LSTM](https://dl.acm.org/doi/abs/10.1145/3340531.3411974) (Stanford) | 62.6% | 58.4% | 60.4% | 44.0% | 50.3% | 60.4% | 56.0% |
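
The Average column in both leaderboard tables appears to be consistent with the unweighted mean of the six per-domain accuracies. This is an observation from the numbers, not something the tables state. A small check, with the values copied from the tables (the dictionary layout and function names are just for illustration):

```python
# Column order as in the tables: Restaurants, People, Movies, Books, Music, Hotels.
LEADERBOARD = {
    ("Schema2QA", "BART"): ([73.3, 80.0, 81.7, 72.5, 70.3, 69.5], 74.5),
    ("Schema2QA", "BERT-LSTM"): ([64.3, 73.8, 66.8, 46.7, 58.0, 55.9], 60.9),
    ("AutoQA", "BART"): ([77.3, 76.2, 83.4, 65.1, 62.9, 72.2], 72.9),
    ("AutoQA", "BERT-LSTM"): ([62.6, 58.4, 60.4, 44.0, 50.3, 60.4], 56.0),
}


def mean_accuracy(scores):
    """Unweighted (macro) average over domains."""
    return sum(scores) / len(scores)


def largest_gap(board):
    """Largest absolute difference between a recomputed mean and the reported average."""
    return max(abs(mean_accuracy(scores) - reported) for scores, reported in board.values())


for (dataset, model), (scores, reported) in LEADERBOARD.items():
    print(f"{dataset:10s} {model:10s} recomputed={mean_accuracy(scores):.2f} reported={reported:.1f}")
```

Because the per-domain numbers are themselves rounded to one decimal place, the recomputed means can differ from the reported averages by a small rounding margin.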

## Download links
Validation data can be found in each domain's directory in this Git repository.
The training sets can be downloaded from the following links:
- [Schema2QA training set (29.8 MB)](https://almond-static.stanford.edu/research/schema2qa2.0/schema2qa.tar.xz)
- [AutoQA training set (31.3 MB)](https://almond-static.stanford.edu/research/schema2qa2.0/autoqa.tar.xz)
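
The archives above can be fetched and unpacked with standard tools. As one option, here is a minimal Python sketch using only the standard library. The URLs are the ones listed above; the function names, the `data` target directory, and the returned file listing are illustrative choices, and nothing is assumed about the layout inside the archives.

```python
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "https://almond-static.stanford.edu/research/schema2qa2.0"
ARCHIVES = {"schema2qa": "schema2qa.tar.xz", "autoqa": "autoqa.tar.xz"}


def archive_url(name):
    """Build the download URL for one of the two training sets."""
    return f"{BASE_URL}/{ARCHIVES[name]}"


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.xz archive and return the relative paths of the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:xz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that reject unsafe members.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file())


def download_and_extract(name, dest_dir="data"):
    """Download one training set into dest_dir and unpack it into dest_dir/<name>."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / ARCHIVES[name]
    urllib.request.urlretrieve(archive_url(name), archive_path)
    return extract_archive(archive_path, dest / name)
```

Calling `download_and_extract("schema2qa")` or `download_and_extract("autoqa")` downloads the corresponding archive and returns the list of extracted files, which is a quick way to see what each archive contains.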

Detailed statistics of the dataset can be found in the [stats page](doc/stats.md).

## Getting started
This repository also contains a Makefile that runs the full data synthesis, training,
and evaluation pipeline for the Schema2QA dataset.
See the [installation](./doc/install.md) and [run](./doc/run.md) instructions for details.

## License
The dataset is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
Please cite the following papers if you use this dataset in your work:
```bib
% Schema2QA & BERT-LSTM model
@inproceedings{xu2020schema2qa,
title={Schema2QA: High-Quality and Low-Cost Q\&A Agents for the Structured Web},
author={Xu, Silei and Campagna, Giovanni and Li, Jian and Lam, Monica S},
booktitle={Proceedings of the 29th ACM International Conference on Information \& Knowledge Management},
pages={1685--1694},
year={2020}
}

% AutoQA
@inproceedings{xu2020autoqa,
title={AutoQA: From Databases to Q\&A Semantic Parsers with Only Synthetic Training Data},
author={Xu, Silei and Semnani, Sina and Campagna, Giovanni and Lam, Monica},
booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
pages={422--434},
year={2020}
}

% BART parser
@inproceedings{campagna-etal-2022-shot,
title={A Few-Shot Semantic Parser for {W}izard-of-{O}z Dialogues with the Precise {T}hing{T}alk Representation},
author={Campagna, Giovanni and Semnani, Sina and Kearns, Ryan and Koba Sato, Lucas Jun and Xu, Silei and Lam, Monica},
booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
pages={4021--4034},
month={may},
year={2022},
address={Dublin, Ireland},
publisher={Association for Computational Linguistics},
url={https://aclanthology.org/2022.findings-acl.317},
doi={10.18653/v1/2022.findings-acl.317},
}
```