{"id":20106184,"url":"https://github.com/himkt/pyner","last_synced_at":"2025-08-03T11:32:23.931Z","repository":{"id":44198539,"uuid":"103533028","full_name":"himkt/pyner","owner":"himkt","description":"🌈 Implementation of Neural Network based Named Entity Recognizer (Lample+, 2016) using Chainer.","archived":false,"fork":false,"pushed_at":"2022-12-08T06:53:04.000Z","size":364,"stargazers_count":45,"open_issues_count":4,"forks_count":6,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-11-23T16:51:39.645Z","etag":null,"topics":["chainer","deep-learning","named-entity-recognition","neural-networks","sequence-labeling"],"latest_commit_sha":null,"homepage":"https://speakerdeck.com/himkt/implement-neural-named-entity-tagger","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/himkt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-09-14T13:03:37.000Z","updated_at":"2024-01-04T16:17:13.000Z","dependencies_parsed_at":"2023-01-24T20:45:11.859Z","dependency_job_id":null,"html_url":"https://github.com/himkt/pyner","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/himkt%2Fpyner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/himkt%2Fpyner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/himkt%2Fpyner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/himkt%2Fpyner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/himkt","download_url":"https://codeload.gith
ub.com/himkt/pyner/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228540832,"owners_count":17934030,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chainer","deep-learning","named-entity-recognition","neural-networks","sequence-labeling"],"created_at":"2024-11-13T17:49:18.375Z","updated_at":"2024-12-07T00:08:35.921Z","avatar_url":"https://github.com/himkt.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\u003cimg src=\"./static/image/pyner.png\" width=\"600\"/\u003e\u003c/div\u003e\n\n\n# PyNER: Toolkit for sequence labeling in Chainer\n\n[![CircleCI](https://circleci.com/gh/himkt/pyner.svg?style=svg)](https://circleci.com/gh/himkt/pyner)\n[![GitHub stars](https://img.shields.io/github/stars/himkt/pyner.svg?maxAge=2592000\u0026colorB=blue)](https://github.com/himkt/pyner/stargazers)\n[![GitHub issues](https://img.shields.io/github/issues/himkt/pyner.svg?colorB=red)](https://github.com/himkt/pyner/issues)\n[![GitHub release](https://img.shields.io/github/release/himkt/pyner.svg?maxAge=2592000\u0026colorB=yellow)](https://github.com/himkt/pyner)\n[![MIT License](http://img.shields.io/badge/license-MIT-green.svg?style=flat)](LICENSE)\n\nPyNER is a sequence labeling toolkit that allows researchers and developers to\ntrain and evaluate neural sequence labeling methods.\n\n\n# QuickStart\n\nYou can try `pyner` on a local machine or in a Docker container.\n\n## 1. 
Local Machine\n\n- setup (if you have not installed [pipenv](https://github.com/pypa/pipenv), please install it first)\n\n```\npoetry install\n```\n\n- train\n\n```\n# If a GPU is not available, specify `--device -1`\npipenv run python pyner/named_entity/train.py config/training/conll2003.lample.yaml --device 0\n```\n\n## 2. Docker container\n\n- build container\n\n```\nmake build\n```\n\n- launch container\n\n```\nmake start\n```\n\n- train\n\nRun this command inside the Docker container.\n\n```\n# If a GPU is not available, specify `--device -1`\npython3 train.py config/training/conll2003.lample.yaml --device 0\n```\n\nThis experiment uses the CoNLL 2003 dataset.\nPlease read the following \"Prepare dataset\" section.\n\n\n# Prepare dataset\n\nWe use the same data format as [deep-crf](https://github.com/aonotas/deep-crf).\n\n```\n$ head -n 15 data/processed/CoNLL2003_BIOES/train.txt\nEU      S-ORG\nrejects O\nGerman  S-MISC\ncall    O\nto      O\nboycott O\nBritish S-MISC\nlamb    O\n.       O\n\nPeter   B-PER\nBlackburn       E-PER\n\nBRUSSELS        S-LOC\n1996-08-22      O\n```\n\nTo reproduce the results in [Lample's paper](https://aclweb.org/anthology/N16-1030),\nyou have to take a few steps to prepare the datasets.\n\n## 1. Prepare CoNLL 2003 Dataset\n\nWe can't include the CoNLL 2003 dataset in this repository due to legal limitations.\nInstead, PyNER offers a way to create the dataset from the original CoNLL 2003 files.\n\nOnce you have obtained the CoNLL 2003 dataset, you should have the following three files.\n\n- eng.iob.testa\n- eng.iob.testb\n- eng.iob.train\n\nPlease put them in the same directory (e.g. 
`data/external/CoNLL2003`).\n\n```\n$ tree data/external/CoNLL2003\ndata/external/CoNLL2003\n├── eng.iob.testa\n├── eng.iob.testb\n└── eng.iob.train\n```\n\nThen, you can create the dataset for pyner with the following command.\nAfter running the command, `./data/processed/CoNLL2003` will be generated for you.\n\n```\n$ python bin/parse_CoNLL2003.py \\\n  --data-dir     data/external/CoNLL2003 \\\n  --output-dir   data/processed/CoNLL2003 \\\n  --convert-rule iob2bio\n2019-09-24 23:43:39,299 INFO root :create dataset for CoNLL2003\n2019-09-24 23:43:39,299 INFO root :create corpus parser\n2019-09-24 23:43:39,300 INFO root :parsing corpus for training\n2019-09-24 23:44:02,240 INFO root :parsing corpus for validating\n2019-09-24 23:44:04,397 INFO root :parsing corpus for testing\n2019-09-24 23:44:06,507 INFO root :Create train dataset\n2019-09-24 23:44:06,705 INFO root :Create valid dataset\n2019-09-24 23:44:06,755 INFO root :Create test dataset\n2019-09-24 23:44:06,800 INFO root :Create vocabulary\n$\n$ tree data/processed/CoNLL2003\ndata/processed/CoNLL2003\n├── test.txt\n├── train.txt\n├── valid.txt\n├── vocab.chars.txt\n├── vocab.tags.txt\n└── vocab.words.txt\n```\n\n\n## 2. Prepare pre-trained Word Embeddings used in Lample's paper\n\nUsing pre-trained word embeddings significantly improves the performance of NER.\nLample et al. 
also use pre-trained word embeddings.\nThey use Skip-N-Gram embeddings, which can be downloaded from [Official repo's issue].\nTo use them, please run `make get-lample` before running `make build`.\n(If you want to use GloVe embeddings, please run `make get-glove`.)\n\n```\n$ make get-lample\nrm -rf data/external/GloveEmbeddings\nmkdir -p data/external/LampleEmbeddings\nmkdir -p data/processed/LampleEmbeddings\npython bin/fetch_lample_embedding.py\npython bin/prepare_embeddings.py \\\n                data/external/LampleEmbeddings/skipngram_100d.txt \\\n                data/processed/LampleEmbeddings/skipngram_100d \\\n                --format word2vec\nsaved model\n$\n$ ls -1 data/processed/LampleEmbeddings\nskipngram_100d\nskipngram_100d.vectors.npy\n```\n\nCongratulations! All preparation steps are done.\nNow you can train Lample's LSTM-CRF.\nPlease run the command:\n- Local machine: `python3 pyner/named_entity/train.py config/training/conll2003.lample.yaml --device 0`\n- Docker container: `python3 train.py config/training/conll2003.lample.yaml --device 0`\n\n\n# Inference and Evaluate\n\nYou can test your model using `pyner/named_entity/inference.py`.\nThe only thing you have to pass to `inference.py` is the path to the model dir.\nThe model dir is defined in the config file (**output**).\n\n```\n$ cat config/training/conll2003.lample.yaml\niteration: \"./config/iteration/long.yaml\"\nexternal: \"./config/external/conll2003.yaml\"\nmodel: \"./config/model/lample.yaml\"\noptimizer: \"./config/optimizer/sgd_with_clipping.yaml\"\npreprocessing: \"./config/preprocessing/znorm.yaml\"\noutput: \"./model/conll2003.lample\"  # model dir is here!!\n```\n\nIf you successfully train the model, some files are generated in `model/conll2003.lample.skipngram.YYYY-MM-DDTxx:xx:xx.xxxxxx`.\n\n```\n$ ls -1 
model/conll2003.lample.skipngram.2019-09-24T07:02:33.536822\nargs\nlog\nsnapshot_epoch_0001\nsnapshot_epoch_0002\nsnapshot_epoch_0003\nsnapshot_epoch_0004\n...\nsnapshot_epoch_0148\nsnapshot_epoch_0149\nsnapshot_epoch_0150\nvalidation.main.fscore.epoch_031.pred  # here!!\n```\n\nRunning `python3 pyner/named_entity/inference.py` will generate prediction results in `model/conll2003.lample.skipngram.YYYY-MM-DDTxx:xx:xx.xxxxxx`.\nThe file name will be `{metrics}.epoch_{xxx}.pred`.\n`inference.py` checks the log and selects the model which achieves the highest F1 score on the development set.\nYou can also use other selection criteria, such as the loss value or a specific epoch.\n\n- Dev loss: `python3 pyner/named_entity/inference.py --metrics validation/main/loss model/conll2003.lample.skipngram.2019-09-24T07:02:33.536822`\n- Specific epoch: `python3 pyner/named_entity/inference.py --epoch 1 model/conll2003.lample.skipngram.2019-09-24T07:02:33.536822`\n\nOnce you have generated a prediction file, it's time to evaluate the model's performance.\n[conlleval](https://www.clips.uantwerpen.be/conll2000/chunking/output.html) is the standard script to evaluate CoNLL Chunking/NER tasks.\nFirst of all, we have to download `conlleval`.\nRunning the command `make get-conlleval` will download `conlleval` to the current directory.\nThen, evaluate!!!\n\n```\n$ ./conlleval \u003c model/conll2003.lample.skipngram.2019-09-24T07:02:33.536822/validation.main.fscore.epoch_139.pred\nprocessed 46435 tokens with 5628 phrases; found: 5651 phrases; correct: 5134.\naccuracy:  97.82%; precision:  90.85%; recall:  91.22%; FB1:  91.04\n              LOC: precision:  93.41%; recall:  92.18%; FB1:  92.79  1640\n             MISC: precision:  80.66%; recall:  80.66%; FB1:  80.66  693\n              ORG: precision:  88.72%; recall:  89.79%; FB1:  89.26  1676\n              PER: precision:  94.76%; recall:  96.23%; FB1:  95.49  1642\n```\n\nThe F1 score on the test set is 91.04, which is approximately the same as the result in 
Lample's paper! (90.94)\n\n\n### Reference\n- [Neural Architectures for Named Entity Recognition]\n  - NAACL2016, Lample et al.\n\n\n[Neural Architectures for Named Entity Recognition]: https://aclweb.org/anthology/N16-1030\n[Official repo's issue]: https://github.com/glample/tagger/issues/44\n[CoNLL 2003]: https://www.clips.uantwerpen.be/conll2003/ner/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhimkt%2Fpyner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhimkt%2Fpyner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhimkt%2Fpyner/lists"}