Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/graykode/toeicbert
TOEIC(Test of English for International Communication) solving using pytorch-pretrained-BERT model.
https://github.com/graykode/toeicbert
ai bert deep-learning lm mask nlp pytorch pytorch-pretrained toeic
Last synced: 3 months ago
JSON representation
TOEIC(Test of English for International Communication) solving using pytorch-pretrained-BERT model.
- Host: GitHub
- URL: https://github.com/graykode/toeicbert
- Owner: graykode
- License: mit
- Created: 2019-04-28T07:12:38.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-06-18T02:01:47.000Z (over 5 years ago)
- Last Synced: 2024-10-23T07:18:36.475Z (3 months ago)
- Topics: ai, bert, deep-learning, lm, mask, nlp, pytorch, pytorch-pretrained, toeic
- Language: Python
- Homepage:
- Size: 270 KB
- Stars: 119
- Watchers: 6
- Forks: 25
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-bert - graykode/toeicbert - pretrained-BERT model., (BERT QA & RC task:)
README
## TOEIC-BERT
### 76% Correct rate with ONLY Pre-Trained BERT model in TOEIC!!
This is project as topic: `TOEIC(Test of English for International Communication) problem solving using pytorch-pretrained-BERT model.` The reason why I used huggingface's [pytorch-pretrained-BERT model]() is for pre-training or to do fine-tune more easily. **I've solved the only blank problem, not the whole problem.** There are two types of blank issues:
1. Selecting Correct Grammar Type.
```
Q) The music teacher had me _ scales several times.
1. play (Answer)
2. to play
3. played
4. playing
```2. Selecting Correct Vocabulary Type.
```
Q) The wet weather _ her from going playing tennis.
1. interrupted
2. obstructed
3. impeded
4. discouraged (Answer)
```#### BERT Testing
1. input
```json
{
"1" : {
"question" : "Business experts predict that the upward trend is _ to continue until the end of next year.",
"answer" : "likely",
"1" : "potential",
"2" : "likely",
"3" : "safety",
"4" : "seemed"
}
}
```2. output
```
=============================
Question : Business experts predict that the upward trend is _ to continue until the end of next year.Real Answer : likely
1) potential 2) likely 3) safety 4) seemed
BERT's Answer => [likely]
```#### Why BERT?
In pretrained BERT, It contains contextual information. So It can find more contextual or grammatical sentences, not clear, a little bit. I was inspired by grammar checker from [blog post]().
> [Can We Use BERT as a Language Model to Assign a Score to a Sentence?]()
>
> BERT uses a bidirectional encoder to encapsulate a sentence from left to right and from right to left. Thus, it learns two representations of each word-one from left to right and one from right to left-and then concatenates them for many downstream tasks.## Evaluation
I had evaluated with only **pretrained BERT model(not fine-tuning)** to check grammatical or lexical error. Above mathematical expression, `X` is a question sentence. and `n` is number of questions : `{a, b, c, d}`. `C` subset means answer candidate tokens : `C` of `warranty` is `['warrant', '##y']`. `V` means total Vocabulary.
There's a problem with more than one token. I solved this problem by getting the average value of each tensor. ex) `is being formed` as `['is', 'being', 'formed']`
Then, we find argmax in `L_n(T_n)`.
```python
predictions = model(question_tensors, segment_tensors)# predictions : [batch_size, sequence_length, vocab_size]
predictions_candidates = predictions[0, masked_index, candidate_ids].mean()
```#### Result of Evaluation.
Fantastic result with **only pretrained BERT model**
- `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
- `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
- `bert-base-cased`: 12-layer, 768-hidden, 12-heads , 110M parameters
- `bert-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parametersTotal 7067 datasets: make non-deterministic with `model.eval()`
| | bert-base-uncased | bert-base-cased | bert-large-uncased | bert-large-cased |
| :---------: | :---------------: | :-------------: | :----------------: | :--------------: |
| Correct Num | 5192 | 5398 | 5321 | 5148 |
| Percent | 73.46% | 76.38% | 75.29% | 72.84% |## Quick Start with Python pip Package.
**Start with pip**
```shell
$ pip install toeicbert
```**Run & Option**
```shell
$ python -m toeicbert --model bert-base-uncased --file test.json
```- `-m, --model` : bert-model name in huggingface's pytorch-pretrained-BERT : `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`.
- `-f, --file` : json file to evalution, see json format, [test.json](test.json).
**key(question, 1, 2, 3, 4) is required options, but answer not.**
`_` in question will be replaced to `[MASK]`
```json
{
"1" : {
"question" : "The music teacher had me _ scales several times.",
"answer" : "play",
"1" : "play",
"2" : "to play",
"3" : "played",
"4" : "playing"
},
"2" : {
"question" : "The music teacher had me _ scales several times.",
"1" : "play",
"2" : "to play",
"3" : "played",
"4" : "playing"
}
}
```## Author
- Tae Hwan Jung(Jeff Jung) @graykode, Kyung Hee Univ CE(Undergraduate).
- Author Email : [[email protected]](mailto:[email protected])Thanks for Hwan Suk Gang(Kyung Hee Univ.) for collecting Dataset(`7114` datasets)