Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/voidful/mctest
MCTest dataset and models.
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/voidful/mctest
- Owner: voidful
- Fork: true (mcobzarenco/mctest)
- Created: 2019-12-19T10:01:02.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2014-11-16T15:46:21.000Z (almost 10 years ago)
- Last Synced: 2024-07-20T22:58:09.969Z (4 months ago)
- Size: 1.32 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-question-answering-dataset - MC Test - Eng
README
MCTest Dataset
========

Baseline models as well as more complex ones for doing question answering on the MCTest dataset.

Dependencies:
```
protobuf
numpy
pandas
nltk
```

Word embeddings can be used from a model file created by [word2vec](https://github.com/danielfrg/word2vec).

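For orientation, here is a minimal sketch of how a word2vec binary model file can be read with the standard library alone. It assumes the usual layout written by the original word2vec tool (a text header with vocabulary size and dimension, then each word followed by a space and little-endian float32 values). The function name `read_word2vec_binary` is hypothetical, and this is not how `parse.py` loads the model; in practice the word2vec package's own loader is the better choice.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Read a word2vec-style binary stream into a {word: tuple_of_floats} dict.

    Assumed layout: b"<vocab_size> <dim>\n", then per entry
    b"<word> " + dim little-endian float32 values (+ optional b"\n").
    """
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("stream ended inside a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that may separate entries
                word.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("stream ended inside a vector")
        vectors[word.decode("utf-8")] = struct.unpack("<%df" % dim, raw)
    return vectors


# Build a tiny in-memory model so the sketch runs without a real model file.
demo = io.BytesIO()
demo.write(b"2 3\n")
demo.write(b"cat " + struct.pack("<3f", 1.0, 0.0, 0.5) + b"\n")
demo.write(b"dog " + struct.pack("<3f", 0.0, 2.0, 0.25) + b"\n")
demo.seek(0)

embeddings = read_word2vec_binary(demo)
print(sorted(embeddings))
```

To read a real file, pass a handle opened in binary mode, for example `open("mctest.vec.bin", "rb")`.
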
## Running baseline models
First, clone the repo and compile the protobuf:
```
git clone https://github.com/mcobzarenco/mctest.git
cd mctest
protoc --python_out=. mctest.proto
```

To parse the raw data (dev + train combined), remove stopwords and save it as a length-delimited protobuf flat file:
```
cat data/MCTest/mc160.dev.tsv data/MCTest/mc160.train.tsv | \
./parse.py --rm-stop data/stopwords.txt -o proto > train160-stop.words
```

Also create a file with the ground truth for dev + train:
```
cat data/MCTest/mc160.dev.ans data/MCTest/mc160.train.ans > train160.ans
```

To run the sliding window with distance baseline:
```
./baseline.py --train train160-stop.words --truth train160.ans --distance

[model]
window_size = None
distance = True

[results]
All accuracy [400]: 0.5600
Single accuracy [185]: 0.5946
Multiple accuracy [215]: 0.5302
```

#### Word embeddings
First, [word2vec](https://github.com/danielfrg/word2vec) should be installed and a model file with embeddings created.
Assuming the model file is `mctest.vec.bin`, the following command parses the raw data (dev + train combined), replaces the words with their corresponding embeddings and saves the result to disk:
```
cat data/MCTest/mc160.dev.tsv data/MCTest/mc160.train.tsv | \
./parse.py --model-file mctest.vec.bin --rm-punct -o proto > train160-punct-mctest.embed
```
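
The `-o proto` output of `parse.py` is described earlier as a length-delimited flat file. As a rough illustration of that idea, the sketch below frames arbitrary byte records with a base-128 varint length prefix, as protobuf streams commonly do. The exact framing used by `parse.py` is not documented here, so the varint prefix and the helper names `write_delimited` and `read_delimited` are assumptions.

```python
import io


def write_varint(stream, value):
    """Write a non-negative int as a base-128 varint (7 bits per byte)."""
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            stream.write(bytes([byte | 0x80]))  # more bytes follow
        else:
            stream.write(bytes([byte]))
            return


def read_varint(stream):
    """Read one varint; return None on a clean end of stream."""
    shift = 0
    result = 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None
            raise EOFError("stream ended inside a varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7


def write_delimited(stream, payload):
    """Write one record: its length as a varint, then the raw bytes."""
    write_varint(stream, len(payload))
    stream.write(payload)


def read_delimited(stream):
    """Yield each record's bytes until the stream is exhausted."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated record")
        yield payload


# Round-trip a few records through an in-memory buffer.
records = [b"story-0", b"", b"x" * 300]
buffer = io.BytesIO()
for record in records:
    write_delimited(buffer, record)
buffer.seek(0)
decoded = list(read_delimited(buffer))
print(len(decoded))
```

With real data, each payload would be the bytes from a protobuf message's `SerializeToString()`, decoded on the reading side with `ParseFromString()`.
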
To run the sliding window model over the embeddings:
```
./baseline-embed.py --train train160-punct-mctest.embed --truth train160.ans

[model]
window_size = None

All accuracy [400]: 0.5775
Single accuracy [185]: 0.6108
Multiple accuracy [215]: 0.5488
```
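
For readers unfamiliar with the sliding window baseline, the following is a minimal sketch of one common formulation: each candidate answer is scored by the best passage window that overlaps with the question and answer words, with rarer passage words weighted more. The inverse-count weighting, the default window size (the number of distinct question and answer words) and all function names are assumptions and may differ from `baseline.py`. The distance component enabled by `--distance` is not included.

```python
import math
from collections import Counter


def inverse_counts(passage):
    """Weight each passage word by log(1 + 1/count): rarer words score higher."""
    counts = Counter(passage)
    return {word: math.log(1.0 + 1.0 / n) for word, n in counts.items()}


def sliding_window_score(passage, question, answer, window_size=None):
    """Best weighted overlap between any passage window and question+answer."""
    target = set(question) | set(answer)
    size = window_size or len(target)  # assumed default when window_size is None
    weights = inverse_counts(passage)
    best = 0.0
    for start in range(max(1, len(passage) - size + 1)):
        window = passage[start:start + size]
        score = sum(weights[word] for word in window if word in target)
        best = max(best, score)
    return best


def predict(passage, question, answers):
    """Return the index of the highest-scoring candidate answer."""
    scores = [sliding_window_score(passage, question, a) for a in answers]
    return scores.index(max(scores))


# Toy example with made-up text (not taken from MCTest).
passage = "tom went to the park and found a red ball near the tree".split()
question = "what did tom find".split()
answers = [
    "a blue kite".split(),
    "a red ball".split(),
    "the old tree".split(),
    "nothing".split(),
]
choice = predict(passage, question, answers)
print(choice)
```

In the toy example the second answer wins because all three of its words fall inside a single window of the passage.
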