{"id":13535101,"url":"https://github.com/graykode/toeicbert","last_synced_at":"2025-10-25T17:33:15.010Z","repository":{"id":57476075,"uuid":"183870801","full_name":"graykode/toeicbert","owner":"graykode","description":"TOEIC(Test of English for International Communication) solving using pytorch-pretrained-BERT model.","archived":false,"fork":false,"pushed_at":"2019-06-18T02:01:47.000Z","size":276,"stargazers_count":121,"open_issues_count":4,"forks_count":25,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-27T20:23:07.547Z","etag":null,"topics":["ai","bert","deep-learning","lm","mask","nlp","pytorch","pytorch-pretrained","toeic"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/graykode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-28T07:12:38.000Z","updated_at":"2025-02-14T12:48:31.000Z","dependencies_parsed_at":"2022-09-07T17:13:05.548Z","dependency_job_id":null,"html_url":"https://github.com/graykode/toeicbert","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graykode%2Ftoeicbert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graykode%2Ftoeicbert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graykode%2Ftoeicbert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/graykode%2Ftoeicbert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/graykode","download_url":"https://codeload.github.com/g
raykode/toeicbert/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248525141,"owners_count":21118620,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","bert","deep-learning","lm","mask","nlp","pytorch","pytorch-pretrained","toeic"],"created_at":"2024-08-01T08:00:49.722Z","updated_at":"2025-10-25T17:33:14.949Z","avatar_url":"https://github.com/graykode.png","language":"Python","readme":"## TOEIC-BERT\n\n### 76% correct rate on TOEIC with ONLY a pre-trained BERT model!!\n\n\n\nThis project's topic is: `TOEIC (Test of English for International Communication) problem solving using the pytorch-pretrained-BERT model.` I used huggingface's [pytorch-pretrained-BERT model](\u003chttps://github.com/huggingface/pytorch-pretrained-BERT\u003e) because it makes pre-training and fine-tuning easier.  **I solve only the fill-in-the-blank questions, not the whole test.** There are two types of blank questions:\n\n1. Selecting the correct grammar form.\n\n```\nQ) The music teacher had me _ scales several times.\n  1. play (Answer)\n  2. to play\n  3. played\n  4. playing\n```\n\n2. Selecting the correct vocabulary word.\n\n```\nQ) The wet weather _ her from going playing tennis.\n  1. interrupted\n  2. obstructed\n  3. impeded\n  4. discouraged (Answer)\n```\n\n\n\n#### BERT Testing\n\n1. 
input\n\n```json\n{\n    \"1\" : {\n        \"question\" : \"Business experts predict that the upward trend is _ to continue until the end of next year.\",\n        \"answer\" : \"likely\",\n        \"1\" : \"potential\",\n        \"2\" : \"likely\",\n        \"3\" : \"safety\",\n        \"4\" : \"seemed\"\n    }\n}\n```\n\n2. output\n\n```\n=============================\nQuestion : Business experts predict that the upward trend is _ to continue until the end of next year.\n\nReal Answer : likely\n\n1) potential 2) likely 3) safety 4) seemed\n\nBERT's Answer =\u003e [likely]\n```\n\n\n\n#### Why BERT?\n\nA pretrained BERT model already encodes contextual information, so it can pick out the more contextually or grammatically natural sentence, even when the distinction is subtle. I was inspired by the grammar checker described in this [blog post](\u003chttps://www.scribendi.ai/can-we-use-bert-as-a-language-model-to-assign-score-of-a-sentence/\u003e).\n\n\u003e [Can We Use BERT as a Language Model to Assign a Score to a Sentence?](\u003chttps://www.scribendi.ai/can-we-use-bert-as-a-language-model-to-assign-score-of-a-sentence/\u003e)\n\u003e\n\u003e BERT uses a bidirectional encoder to encapsulate a sentence from left to right and from right to left. Thus, it learns two representations of each word, one from left to right and one from right to left, and then concatenates them for many downstream tasks.\n\n\n\n## Evaluation\n\n\u003cp align=\"center\"\u003e\u003cimg width=\"500\" src=\"https://raw.githubusercontent.com/graykode/toeicbert/master/images/baseline.gif\" /\u003e\u003c/p\u003e\n\nI evaluated with only the **pretrained BERT model (no fine-tuning)** to check grammatical and lexical errors. In the expression above, `X` is the question sentence and `n` indexes the answer candidates `{a, b, c, d}`. `C` is the set of wordpiece tokens for a candidate: e.g. `C` of `warranty` is `['warrant', '##y']`. `V` is the total vocabulary.\n\nCandidates that tokenize into more than one token are a problem; I handle this by averaging the prediction values over the candidate's tokens. 
ex) `is being formed` as `['is', 'being', 'formed']` \n\nThen we take the argmax over `L_n(T_n)`.\n\n\n\n\u003cp align=\"center\"\u003e\u003cimg width=\"350\" src=\"https://raw.githubusercontent.com/graykode/toeicbert/master/images/prediction.gif\" /\u003e\u003c/p\u003e\n\n```python\npredictions = model(question_tensors, segment_tensors)\n\n# predictions : [batch_size, sequence_length, vocab_size]\npredictions_candidates = predictions[0, masked_index, candidate_ids].mean()\n```\n\n\n\n#### Result of Evaluation.\n\nA fantastic result with **only a pretrained BERT model**:\n\n- `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters\n- `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters\n- `bert-base-cased`: 12-layer, 768-hidden, 12-heads, 110M parameters\n- `bert-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters\n\nTotal of 7067 questions; `model.eval()` is called to disable dropout, so results are deterministic.\n\n|             | bert-base-uncased | bert-base-cased | bert-large-uncased | bert-large-cased |\n| :---------: | :---------------: | :-------------: | :----------------: | :--------------: |\n| Correct Num |       5192        |      5398       |        5321        |       5148       |\n|   Percent   |      73.46%       |     76.38%      |       75.29%       |      72.84%      |\n\n\n\n## Quick Start with the Python pip Package\n\n**Start with pip**\n\n```shell\n$ pip install toeicbert\n```\n\n\n\n**Run \u0026 Options**\n\n```shell\n$ python -m toeicbert --model bert-base-uncased --file test.json\n```\n\n- `-m, --model` : BERT model name from huggingface's pytorch-pretrained-BERT: `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`.\n\n- `-f, --file` : JSON file to evaluate; see the format in [test.json](test.json). 
\n\n  **The keys `question`, `1`, `2`, `3`, `4` are required; `answer` is optional.**\n\n  `_` in the question will be replaced with `[MASK]`.\n\n```json\n{\n    \"1\" : {\n        \"question\" : \"The music teacher had me _ scales several times.\",\n        \"answer\" : \"play\",\n        \"1\" : \"play\",\n        \"2\" : \"to play\",\n        \"3\" : \"played\",\n        \"4\" : \"playing\"\n    },\n    \"2\" : {\n        \"question\" : \"The music teacher had me _ scales several times.\",\n        \"1\" : \"play\",\n        \"2\" : \"to play\",\n        \"3\" : \"played\",\n        \"4\" : \"playing\"\n    }\n}\n```\n\n\n\n## Author\n\n- Tae Hwan Jung (Jeff Jung) @graykode, Kyung Hee Univ. CE (undergraduate).\n- Author email: [nlkey2022@gmail.com](mailto:nlkey2022@gmail.com)\n\nThanks to Hwan Suk Gang (Kyung Hee Univ.) for collecting the dataset (`7114` questions)","funding_links":[],"categories":["BERT QA \u0026 RC task:"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgraykode%2Ftoeicbert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgraykode%2Ftoeicbert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgraykode%2Ftoeicbert/lists"}
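The readme embedded in the record above scores each answer candidate by averaging the model's [MASK]-position prediction values over the candidate's wordpiece tokens and then taking the argmax across candidates. The following is a minimal, model-free sketch of that scoring rule: the logits and the wordpiece ids are toy stand-ins (real values would come from BERT and its tokenizer), and the function names `score_candidate` / `pick_answer` are illustrative, not part of the toeicbert package.

```python
# Sketch of the multi-token candidate scoring described in the readme.
# A candidate that splits into several wordpieces gets the *average* of the
# [MASK]-position logits over its token ids; the answer is the argmax.

def score_candidate(mask_logits, candidate_token_ids):
    """Average the [MASK]-position logits over the candidate's wordpiece ids."""
    return sum(mask_logits[i] for i in candidate_token_ids) / len(candidate_token_ids)

def pick_answer(mask_logits, candidates):
    """candidates: dict mapping answer text -> list of wordpiece token ids."""
    return max(candidates, key=lambda c: score_candidate(mask_logits, candidates[c]))

# Toy logits at the [MASK] position (index = token id); not real BERT output.
logits = [0.1, 2.0, 0.5, 1.5, 0.2, 1.0]

candidates = {
    "potential": [0],
    "likely": [1],     # single wordpiece, score = 2.0
    "safety": [2],
    "seemed": [4, 5],  # multi-wordpiece, score = (0.2 + 1.0) / 2 = 0.6
}

print(pick_answer(logits, candidates))  # -> likely
```

With a real model, `mask_logits` would be `predictions[0, masked_index]` from the `predictions` tensor shown in the readme's snippet, and the token ids would come from the BERT tokenizer.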