{"id":13535204,"url":"https://github.com/Hoiy/berserker","last_synced_at":"2025-04-02T00:32:49.565Z","repository":{"id":96219354,"uuid":"164325882","full_name":"Hoiy/berserker","owner":"Hoiy","description":"Berserker - BERt chineSE woRd toKenizER ","archived":false,"fork":false,"pushed_at":"2019-02-25T14:02:20.000Z","size":178,"stargazers_count":17,"open_issues_count":3,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-11-02T23:32:34.063Z","etag":null,"topics":["bert","bert-chinese","chinese-nlp","chinese-word-segmentation","nlp","sequence-to-sequence","state-of-the-art","tensorflow","tokenizer","tpu"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Hoiy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-01-06T16:40:11.000Z","updated_at":"2022-07-04T04:12:18.000Z","dependencies_parsed_at":"2023-04-22T17:22:18.229Z","dependency_job_id":null,"html_url":"https://github.com/Hoiy/berserker","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hoiy%2Fberserker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hoiy%2Fberserker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hoiy%2Fberserker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hoiy%2Fberserker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Hoiy","download_url":"https://codeload.github.
com/Hoiy/berserker/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246735354,"owners_count":20825221,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","bert-chinese","chinese-nlp","chinese-word-segmentation","nlp","sequence-to-sequence","state-of-the-art","tensorflow","tokenizer","tpu"],"created_at":"2024-08-01T08:00:51.256Z","updated_at":"2025-04-02T00:32:49.236Z","avatar_url":"https://github.com/Hoiy.png","language":"Python","funding_links":[],"categories":["BERT  NER  task:"],"sub_categories":[],"readme":"# Berserker\nBerserker (BERt chineSE woRd toKenizER) is a Chinese tokenizer built on top of Google's [BERT](https://github.com/google-research/bert) model.\n\n## Installation\n```bash\npip install basaka\n```\n\n## Usage\n```python\nimport berserker\n\nberserker.load_model() # A one-off installation\nberserker.tokenize('姑姑想過過過兒過過的生活。') # ['姑姑', '想', '過', '過', '過兒', '過過', '的', '生活', '。']\n```\n\n## Benchmark\nThe table below shows that Berserker achieves state-of-the-art F1 scores on the [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) [dataset](http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip).\n\nThe results below were obtained by training for 15 epochs on each dataset with a batch size of 64.\n\n|                    | PKU      | CITYU    | MSR      | AS       |\n|--------------------|----------|----------|----------|----------|\n| Liu et al. (2016)  | **96.8** | --       | 97.3     | --       |\n| Yang et al. (2017) | 96.3     | 96.9     | 97.5     | 95.7     |\n| Zhou et al. 
(2017) | 96.0     | --       | 97.8     | --       |\n| Cai et al. (2017)  | 95.8     | 95.6     | 97.1     | --       |\n| Chen et al. (2017) | 94.3     | 95.6     | 96.0     | 94.6     |\n| Wang and Xu (2017) | 96.5     | --       | 98.0     | --       |\n| Ma et al. (2018)   | 96.1     | **97.2** | 98.1     | 96.2     |\n|--------------------|----------|----------|----------|----------|\n| Berserker          | 96.6     | 97.1     | **98.4** | **96.5** |\n\nReference: [Ji Ma, Kuzman Ganchev, David Weiss - State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://arxiv.org/pdf/1808.06511.pdf)\n\n## Limitation\nSince Berserker ~~is muscular~~ is based on BERT, it has a large model size (~300MB) and runs slowly on CPU. Berserker is just a proof of concept of what can be achieved with BERT.\n\nCurrently, the default model provided is trained on the [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) [PKU dataset](http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip). We plan to release more pretrained models in the future.\n\n## Architecture\nBerserker is fine-tuned on TPU, starting from the [pretrained Chinese BERT model](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip). It is connected to a single dense layer, which is applied to all tokens to produce a sequence of [0, 1] outputs, where 1 denotes a split.\n\n## Training\nWe provide the source code for training under the `trainer` subdirectory. Feel free to contact me if you need any help reproducing the results.\n\n## Bonus Video\n[\u003cimg src=\"https://img.youtube.com/vi/H_xmyvABZnE/maxres1.jpg\" alt=\"Yachae!! BERSERKER!!\"/\u003e](https://www.youtube.com/watch?v=H_xmyvABZnE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHoiy%2Fberserker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FHoiy%2Fberserker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHoiy%2Fberserker/lists"}