https://github.com/voidful/mctest

MCTest dataset and models.

Host: GitHub
URL: https://github.com/voidful/mctest
Owner: voidful
Fork: true (mcobzarenco/mctest)
Created: 2019-12-19T10:01:02.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2014-11-16T15:46:21.000Z (over 10 years ago)
Last Synced: 2024-08-05T19:35:32.015Z (11 months ago)
Size: 1.32 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

MCTest Dataset
========

Baseline models as well as more complex ones for doing question answering on the MCTest dataset.

Dependencies:
```
protobuf
numpy
pandas
nltk
```

Word embeddings can be loaded from a model file created by [word2vec](https://github.com/danielfrg/word2vec).

## Running baseline models

First, clone the repo and compile the protobuf:
```
git clone https://github.com/mcobzarenco/mctest.git
cd mctest
protoc --python_out=. mctest.proto
```

To parse the raw data (dev + train combined), remove stopwords, and save the result as a length-delimited protobuf flat file:
```
cat data/MCTest/mc160.dev.tsv data/MCTest/mc160.train.tsv | \
./parse.py --rm-stop data/stopwords.txt -o proto > train160-stop.words
```
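
The parse step above does two things: it drops stopwords and it writes one length-delimited record per story. The sketch below illustrates both ideas without depending on the repo. It assumes a varint length prefix (the common protobuf convention; the framing actually used by `parse.py` is not stated here) and uses plain UTF-8 bytes as a stand-in for a serialized message. The names `remove_stopwords`, `write_delimited`, and `read_delimited` are hypothetical.

```python
import io


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def write_delimited(stream, payload: bytes) -> None:
    """Write one record: varint length prefix, then the payload (assumed framing)."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_delimited(stream):
    """Yield payloads from a stream of length-prefixed records."""
    while True:
        size, shift, first = 0, 0, True
        while True:
            b = stream.read(1)
            if not b:
                if first:
                    return  # clean end of stream
                raise EOFError("truncated length prefix")
            first = False
            size |= (b[0] & 0x7F) << shift
            if not b[0] & 0x80:
                break
            shift += 7
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated record")
        yield payload


# Usage: filter a toy story and round-trip it through an in-memory file.
tokens = remove_stopwords("the cat sat on the mat".split(), {"the", "on"})
buf = io.BytesIO()
write_delimited(buf, " ".join(tokens).encode("utf-8"))
buf.seek(0)
records = list(read_delimited(buf))
```

Length-prefixed framing lets many variable-size records share one flat file, which is why the output can simply be redirected with `>`.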

Also create a file with the ground truth for dev + train:
```
cat data/MCTest/mc160.dev.ans data/MCTest/mc160.train.ans > train160.ans
```

To run the sliding-window baseline with the distance feature enabled:
```
./baseline.py --train train160-stop.words --truth train160.ans --distance

[model]
window_size = None
distance = True

[results]
All accuracy [400]: 0.5600
Single accuracy [185]: 0.5946
Multiple accuracy [215]: 0.5302
```
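
For orientation, here is a minimal sketch of a sliding-window score, assuming the formulation from the original MCTest paper: each candidate answer is scored by the best passage window, where matching words are weighted by their inverse count in the passage. It is not taken from `baseline.py`, it omits the distance term enabled by `--distance`, and the function names are hypothetical. The window size defaults to the number of distinct question and answer words, which is one plausible reading of `window_size = None`.

```python
import math
from collections import Counter


def sliding_window_score(passage, question, answer, window_size=None):
    """Best windowed sum of inverse-count weights for question + answer words."""
    counts = Counter(passage)
    target = set(question) | set(answer)
    size = window_size or len(target)  # assumption: default to |Q ∪ A|

    def inverse_count(word):
        # Rare passage words contribute more than frequent ones.
        return math.log(1.0 + 1.0 / counts[word])

    best = 0.0
    for start in range(len(passage)):
        window = passage[start:start + size]
        score = sum(inverse_count(w) for w in window if w in target)
        best = max(best, score)
    return best


def pick_answer(passage, question, answers):
    """Return the index of the highest-scoring candidate answer."""
    scores = [sliding_window_score(passage, question, a) for a in answers]
    return max(range(len(answers)), key=scores.__getitem__)


# Usage on a toy story (tokens are already lowercased and split).
passage = "james went to the store and bought a red ball".split()
question = "what did james buy".split()
answers = [["a", "red", "ball"], ["a", "blue", "car"]]
best_index = pick_answer(passage, question, answers)
```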

#### Word embeddings
First, install [word2vec](https://github.com/danielfrg/word2vec) and create a model file with embeddings.
Assuming the model file is `mctest.vec.bin`, the following command parses the raw data (dev + train combined), replaces each word with its embedding, and saves the result to disk:
```
cat data/MCTest/mc160.dev.tsv data/MCTest/mc160.train.tsv | \
./parse.py --model-file mctest.vec.bin --rm-punct -o proto > train160-punct-mctest.embed
```
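
Conceptually, this step strips punctuation and maps each remaining token to a vector. The sketch below shows that idea with a toy in-memory lookup table. It does not use the word2vec loader or reproduce `parse.py`. The names `strip_punct` and `embed_tokens` are hypothetical, and silently skipping out-of-vocabulary words is an assumption.

```python
import string

import numpy as np


def strip_punct(tokens):
    """Remove punctuation characters and drop tokens that become empty."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = (t.translate(table) for t in tokens)
    return [t for t in cleaned if t]


def embed_tokens(tokens, vectors, dim):
    """Stack the vectors of known tokens into a (n_tokens, dim) array."""
    rows = [vectors[t] for t in tokens if t in vectors]  # assumption: skip unknown words
    if not rows:
        return np.zeros((0, dim))
    return np.stack(rows)


# Usage with a toy 2-dimensional vocabulary standing in for a real model file.
toy_vectors = {"cat": np.array([1.0, 0.0]), "sat": np.array([0.0, 1.0])}
tokens = strip_punct(["cat", ",", "sat", "."])
matrix = embed_tokens(tokens, toy_vectors, dim=2)
```
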
To run the sliding window model over the embeddings:
```
./baseline-embed.py --train train160-punct-mctest.embed --truth train160.ans

[model]
window_size = None

All accuracy [400]: 0.5775
Single accuracy [185]: 0.6108
Multiple accuracy [215]: 0.5488
```
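
The bracketed numbers in the two result listings are question counts, so the overall accuracy should equal the count-weighted average of the "Single" and "Multiple" accuracies (presumably questions whose answer needs one sentence versus several, following the MCTest labels). The short check below confirms that the reported figures are consistent; `combine_accuracy` is a hypothetical helper, not part of the repo.

```python
def combine_accuracy(groups):
    """Count-weighted average of (question_count, accuracy) pairs."""
    total = sum(n for n, _ in groups)
    correct = sum(n * acc for n, acc in groups)
    return correct / total


# (count, accuracy) pairs copied from the two result listings above.
distance_baseline = combine_accuracy([(185, 0.5946), (215, 0.5302)])
embedding_baseline = combine_accuracy([(185, 0.6108), (215, 0.5488)])

print(f"{distance_baseline:.4f}")   # compare with the reported 0.5600
print(f"{embedding_baseline:.4f}")  # compare with the reported 0.5775
```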
