{"id":13488418,"url":"https://github.com/Determined22/zh-NER-TF","last_synced_at":"2025-03-28T01:35:11.908Z","repository":{"id":41268494,"uuid":"101034201","full_name":"Determined22/zh-NER-TF","owner":"Determined22","description":"A very simple BiLSTM-CRF model for Chinese Named Entity Recognition 中文命名实体识别 (TensorFlow)","archived":false,"fork":false,"pushed_at":"2022-04-18T23:01:50.000Z","size":112390,"stargazers_count":2323,"open_issues_count":78,"forks_count":938,"subscribers_count":61,"default_branch":"master","last_synced_at":"2024-10-16T09:41:32.633Z","etag":null,"topics":["bilstm-crf-model","named-entity-recognition","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Determined22.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-08-22T07:25:32.000Z","updated_at":"2024-10-13T13:26:47.000Z","dependencies_parsed_at":"2022-07-14T10:48:39.367Z","dependency_job_id":null,"html_url":"https://github.com/Determined22/zh-NER-TF","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Determined22%2Fzh-NER-TF","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Determined22%2Fzh-NER-TF/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Determined22%2Fzh-NER-TF/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Determined22%2Fzh-NER-TF/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Determined22","download_url":"https://codeload.github.
com/Determined22/zh-NER-TF/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222076969,"owners_count":16927098,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bilstm-crf-model","named-entity-recognition","tensorflow"],"created_at":"2024-07-31T18:01:15.352Z","updated_at":"2025-03-28T01:35:11.892Z","avatar_url":"https://github.com/Determined22.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# A simple BiLSTM-CRF model for Chinese Named Entity Recognition\n\nThis repository contains the code for building a very simple __character-based BiLSTM-CRF sequence labeling model__ for the Chinese Named Entity Recognition task. Its goal is to recognize three types of named entity: PERSON, LOCATION and ORGANIZATION.\n\nThis code works on __Python 3 \u0026 TensorFlow 1.2__. The repository [https://github.com/guillaumegenthial/sequence_tagging](https://github.com/guillaumegenthial/sequence_tagging) was a great help.\n\n## Model\n\nThis model is similar to the models described in papers [1] and [2]. Its structure is shown in the following illustration:\n\n![Network](./pics/pic1.png)\n\nFor a Chinese sentence, each character is assigned a tag from the set {O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG}.\n\nThe first layer, the __look-up layer__, transforms each character's representation from a one-hot vector into a *character embedding*. In this code I initialize the embedding matrix randomly. 
We could add some linguistic knowledge later. For example, tokenize the text and use pre-trained word-level embeddings, then augment each character embedding with the word embedding of the token that contains it. In addition, we can build the character embedding by combining low-level features (see section 4.1 of paper [2] and section 3.3 of paper [3] for more details).\n\nThe second layer, the __BiLSTM layer__, can efficiently use *both past and future* input information and extract features automatically.\n\nThe third layer, the __CRF layer__, assigns a tag to each character in a sentence. If we used a Softmax layer for labeling, we might get ill-formed tag sequences because the Softmax layer labels each position independently. We know that 'I-LOC' cannot follow 'B-PER', but Softmax does not. Compared to Softmax, a CRF layer can use *sentence-level tag information* and model the transition behavior between each pair of tags.\n\n## Dataset\n\n|    | #sentence | #PER | #LOC | #ORG |\n| :----: | :---: | :---: | :---: | :---: |\n| train  | 46364 | 17615 | 36517 | 20571 |\n| test   | 4365  | 1973  | 2877  | 1331  |\n\nIt looks like a portion of the [MSRA corpus](http://sighan.cs.uchicago.edu/bakeoff2006/). I downloaded the dataset from the link in `./data_path/original/link.txt`.\n\n### data files\n\nThe directory `./data_path` contains:\n\n- the preprocessed data files, `train_data` and `test_data` \n- a vocabulary file `word2id.pkl` that maps each character to a unique id  \n\nTo generate the vocabulary file, please refer to the code in `data.py`. 
\n\n### data format\n\nEach data file should be in the following format:\n\n```\n中\tB-LOC\n国\tI-LOC\n很\tO\n大\tO\n\n句\tO\n子\tO\n结\tO\n束\tO\n是\tO\n空\tO\n行\tO\n\n```\n\nIf you want to use your own dataset, please: \n\n- transform your corpus to the above format\n- generate a new vocabulary file\n\n## How to Run\n\n### train\n\n`python main.py --mode=train `\n\n### test\n\n`python main.py --mode=test --demo_model=1521112368`\n\nPlease set the parameter `--demo_model` to the model that you want to test. `1521112368` is the model trained by me. \n\nAn official evaluation tool for computing metrics: [here (click 'Instructions')](http://sighan.cs.uchicago.edu/bakeoff2006/)\n\nMy test performance:\n\n| P     | R     | F     | F (PER)| F (LOC)| F (ORG)|\n| :---: | :---: | :---: | :---: | :---: | :---: |\n| 0.8945 | 0.8752 | 0.8847 | 0.8688 | 0.9118 | 0.8515\n\n### demo\n\n`python main.py --mode=demo --demo_model=1521112368`\n\nYou can input one Chinese sentence and the model will return the recognition result:\n\n![demo_pic](./pics/pic2.png)\n\n## Reference\n\n\\[1\\] [Bidirectional LSTM-CRF Models for Sequence Tagging](https://arxiv.org/pdf/1508.01991v1.pdf)\n\n\\[2\\] [Neural Architectures for Named Entity Recognition](http://aclweb.org/anthology/N16-1030)\n\n\\[3\\] [Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition](https://link.springer.com/chapter/10.1007/978-3-319-50496-4_20)\n\n\\[4\\] [https://github.com/guillaumegenthial/sequence_tagging](https://github.com/guillaumegenthial/sequence_tagging)  \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDetermined22%2Fzh-NER-TF","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDetermined22%2Fzh-NER-TF","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDetermined22%2Fzh-NER-TF/lists"}