https://github.com/madrugado/gia-corpus
Corpus of exam tests for 9-graders in Russia for NLP/ML purposes
corpus natural-language-processing nlp russian-corpus
- Host: GitHub
- URL: https://github.com/madrugado/gia-corpus
- Owner: madrugado
- Created: 2016-08-31T19:28:46.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2018-02-25T10:36:19.000Z (about 7 years ago)
- Last Synced: 2025-01-05T21:42:35.483Z (4 months ago)
- Topics: corpus, natural-language-processing, nlp, russian-corpus
- Size: 353 KB
- Stars: 7
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# gia-corpus
Corpus of exam tests for 9-graders in Russia for NLP/ML purposes

## About contents
There are two folders with files: "raw" and "processed". Files in "processed" have already been cleaned of all text other than the questions and answers. The "raw" folder holds the text contents of the PDFs without post-processing.

We use (mostly) only the first twenty multiple-choice questions, since they have exactly one right answer.
*UPD:* Added a "tsv" folder with files in a format compatible with the Kaggle Allen AI Challenge (https://www.kaggle.com/c/the-allen-ai-science-challenge/data)
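
A minimal sketch of reading one of the tsv files is shown here. It assumes the files carry a header row with the column names used by the Kaggle training set (`id`, `question`, `correctAnswer`, `answerA` to `answerD`); that layout is an assumption, not something this README confirms, so check a file before relying on it. The function name `read_questions` is hypothetical.

```python
import csv
from typing import Iterator, TextIO


def read_questions(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per question from a tab-separated file.

    Assumption: the header row uses the Kaggle Allen AI column names
    (id, question, correctAnswer, answerA..answerD).
    """
    # QUOTE_NONE keeps quotation marks inside question text untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "id": row["id"],
            "question": row["question"],
            "options": {letter: row[f"answer{letter}"] for letter in "ABCD"},
            # None when the file has no correctAnswer column.
            "correct": row.get("correctAnswer"),
        }
```

To use it, open a file from the "tsv" folder with `open(path, encoding="utf-8", newline="")` and pass the handle to `read_questions`.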
## Format
The subject tag is one of:
* GE - geography
* OB - social studies
* IS - history

The year is in YYYY format.
For example, IS_2009.processed.txt
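
The naming scheme can be parsed with a short helper. This is a sketch under assumptions: the only file name given in this README is IS_2009.processed.txt, so the pattern for "raw" files (the same name with `raw` in place of `processed`) is a guess, and the names `CorpusFile` and `parse_corpus_filename` are hypothetical.

```python
import re
from typing import NamedTuple

# Subject tags as listed in this README.
SUBJECTS = {"GE": "geography", "OB": "social studies", "IS": "history"}

# Assumption: raw files are named like processed ones, with "raw" as the
# middle component. Only the "processed" form appears in the README.
_NAME_RE = re.compile(r"^(GE|OB|IS)_(\d{4})\.(raw|processed)\.txt$")


class CorpusFile(NamedTuple):
    subject: str  # human-readable subject name
    year: int     # four-digit exam year
    kind: str     # "raw" or "processed"


def parse_corpus_filename(name: str) -> CorpusFile:
    """Split a corpus file name into subject, year and kind."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised corpus file name: {name!r}")
    tag, year, kind = match.groups()
    return CorpusFile(SUBJECTS[tag], int(year), kind)
```

Rejecting unknown names with a `ValueError` keeps stray files from being silently treated as corpus data.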
## License
The contents of this repo were scraped from PDFs on the Russian Ministry of Education website, which states that the content of the PDFs is in the public domain. The license for this repo is *CC BY-NC*.