{"id":13436610,"url":"https://github.com/fastnlp/fastHan","last_synced_at":"2025-03-18T21:30:42.766Z","repository":{"id":45210431,"uuid":"217049502","full_name":"fastnlp/fastHan","owner":"fastnlp","description":"fastHan是基于fastNLP与pytorch实现的中文自然语言处理工具，像spacy一样调用方便。","archived":false,"fork":false,"pushed_at":"2023-12-09T08:22:51.000Z","size":6591,"stargazers_count":754,"open_issues_count":13,"forks_count":87,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-03-11T07:02:06.459Z","etag":null,"topics":["bert","cws","fastnlp","joint-model","ner","parser","pos","python","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fastnlp.png","metadata":{"files":{"readme":"README.EN.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-10-23T12:16:49.000Z","updated_at":"2025-02-27T14:37:03.000Z","dependencies_parsed_at":"2022-08-28T16:03:04.101Z","dependency_job_id":"bb4db8c3-d2f2-4b82-aa85-b383650a5b9d","html_url":"https://github.com/fastnlp/fastHan","commit_stats":{"total_commits":65,"total_committers":5,"mean_commits":13.0,"dds":0.1384615384615384,"last_synced_commit":"9444eca53c5fbad965e3157baf4b4bbfb2a27b74"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fastnlp%2FfastHan","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fastnlp%2FfastHan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fastnlp%2FfastHan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fastnlp%2FfastHan
/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fastnlp","download_url":"https://codeload.github.com/fastnlp/fastHan/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244173542,"owners_count":20410300,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","cws","fastnlp","joint-model","ner","parser","pos","python","pytorch"],"created_at":"2024-07-31T03:00:50.701Z","updated_at":"2025-03-18T21:30:42.746Z","avatar_url":"https://github.com/fastnlp.png","language":"Python","readme":"# fastHan\n## Brief Introduction\nfastHan is developed based on [fastNLP](https://github.com/fastnlp/fastNLP) and pytorch. It is as convenient to use as spacy.\n\nIts core is a BERT-based joint model trained on 15 corpora, which can handle Chinese word segmentation, part-of-speech tagging, dependency parsing and named entity recognition.\n\nStarting from fastHan 2.0, fastHan also supports word segmentation and POS tagging for ancient Chinese. In addition, fastHan can handle Chinese AMR tasks. 
fastHan performs well on all of these tasks, approaching or even surpassing the SOTA models on some datasets.\n\n**Finally, if you are interested in ancient Chinese word segmentation and POS tagging, you may also want to check out another work from our lab, [bert-ancient-chinese](https://blog.csdn.net/Ji_Huai/article/details/125209985)([paper](https://aclanthology.org/2022.lt4hala-1.25/)).**\n\n## Citation\nIf you use the fastHan toolkit in your work, you can cite this [paper](https://arxiv.org/abs/2009.08633):\nZhichao Geng, Hang Yan, Xipeng Qiu and Xuanjing Huang, fastHan: A BERT-based Multi-Task Toolkit for Chinese NLP, ACL, 2021.\n\n```\n@inproceedings{geng-etal-2021-fasthan,\n  author = {Geng, Zhichao and Yan, Hang and Qiu, Xipeng and Huang, Xuanjing},\n  title = {fastHan: A BERT-based Multi-Task Toolkit for Chinese NLP},\n  booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations},\n  year = {2021},\n  pages = {99--106},\n  url = {https://aclanthology.org/2021.acl-demo.12}\n}\n```\n\n## Install\nTo install fastHan, your environment has to satisfy the requirements below:\n\n- torch\u003e=1.8.0\n- fastNLP\u003e=1.0.0\n  - Note: **Before version 2.0**, fastHan required fastNLP versions lower than 1.0.0.\n- transformers\u003e=4.0.0\n\nYou can execute the following command to complete the installation:\n\n```\npip install fastHan\n```\n\nOr you can install fastHan from GitHub:\n```\ngit clone git@github.com:fastnlp/fastHan.git\ncd fastHan\npython setup.py install\n```\n\n## **Quick Start**\n\nUsing fastHan is quite simple. 
There are two steps: load the model, then call it.\n\n**Load the model**\n\nExecute the following code to load the fastHan model:\n\n```\nfrom fastHan import FastHan\nmodel=FastHan()\n```\n\nThe first time the model is loaded, fastHan will automatically download its parameters from our server.\n\nThe fastHan 2.0 model is based on a 12-layer BERT model. If you need a smaller model, you can download versions prior to fastHan 2.0.\n\nExecute the following code to load the **FastCAMR model**:\n\n```\nfrom fastHan import FastCAMR\ncamr_model=FastCAMR()\n```\nThe first time the model is loaded, fastHan will automatically download its parameters from our server.\n\nBesides, users who download the parameters manually can also load the model from a path, e.g.:\n\n```\nmodel=FastHan(url=\"/remote-home/pywang/finetuned_model\")\ncamr_model=FastCAMR(url=\"/remote-home/pywang/finetuned_camr_model\")\n```\n\n**Call the model**\n\nAn example of calling the model is shown below:\n\n```\nsentence=\"郭靖是金庸笔下的男主角。\"\nanswer=model(sentence)\nprint(answer)\nanswer=model(sentence,target=\"Parsing\")\nprint(answer)\nanswer=model(sentence,target=\"NER\")\nprint(answer)\n```\nand the output will be:\n\n```\n[['郭靖', '是', '金庸', '笔', '下', '的', '男', '主角', '。']]\n[[['郭靖', 2, 'top', 'NR'], ['是', 0, 'root', 'VC'], ['金庸', 4, 'nn', 'NR'], ['笔', 5, 'lobj', 'NN'], ['下', 8, 'assmod', 'LC'], ['的', 5, 'assm', 'DEG'], ['男', 8, 'amod', 'JJ'], ['主角', 2, 'attr', 'NN'], ['。', 2, 'punct', 'PU']]]\n[[['郭靖', 'NR'], ['金庸', 'NR']]]\n```\narg list:\n- **target**: the value can be one of ['CWS', 'POS', 'CWS-guwen', 'POS-guwen', 'NER', 'Parsing']; the default value is 'CWS'.\n  - fastHan uses the CTB label set for POS and Parsing, and the MSRA label set for NER.\n- **use_dict**: whether to use the user lexicon; False by default.\n- **return_list**: whether to return results as a list; True by default.\n- **return_loc**: whether to return the start position of each word; False by default. 
It can be used for the spanF metric.\n\nA simple example of calling the CAMR model on a Chinese sentence is as follows:\n\n```\nsentence = \"这样 的 活动 还 有 什么 意义 呢 ？\"\nanswer = camr_model(sentence)\nfor ans in answer:\n    print(ans)\n```\n\nThe model will output the following information:\n\n```\n(x5/有-03\n        :mod()(x4/还)\n        :arg1()(x7/意义\n                :mod()(x11/amr-unknown))\n        :mode(x12/interrogative)(x13/expressive)\n        :time(x2/的)(x3/活动-01\n                :arg0-of(x2/的-01)(x1/这样)))\n```\n\nNote that sentences entered into the FastCAMR model must already be word-segmented, with words separated by spaces. If the original sentence is not segmented, you can first segment it with fastHan's word segmentation function and then pass the space-separated sentence to the FastCAMR model.\n\n**Change device**\n\nUsers can use the **set_device** function to change the model's device:\n\n```\nmodel.set_device('cuda:0')\nmodel.set_device('cpu')\n```\n\n## **Advanced Features**\n\n**Finetuning**\n\nUsers can finetune fastHan on their own dataset. An example of finetuning fastHan is shown as follows:\n```\nfrom fastHan import FastHan\n\nmodel=FastHan('large')\n\n# train data file path\ncws_url='train.dat'\n\nmodel.set_device(0)\nmodel.finetune(data_path=cws_url,task='CWS',save=True,save_url='finetuned_model')\n```\nBy calling set_device, finetuning can be accelerated on a GPU. When finetuning, the training data needs to be formatted into a file.\n\nFor CWS, one line corresponds to one piece of data, and each word is separated by a space.\n\nExample:\n\n    上海 浦东 开发 与 法制 建设 同步\n    新华社 上海 二月 十日 电 （ 记者 谢金虎 、 张持坚 ）\n    ...\n\nFor NER, we use the same format and label set as the MSRA dataset. 
\n\nExample:\n\n    札 B-NS\n    幌 E-NS\n    雪 O\n    国 O\n    庙 O\n    会 O\n    。 O\n    \n    主 O\n    道 O\n    上 O\n    的 O\n    雪 O\n    \n    ...\n\n\nFor POS and dependency parsing, we use the same format and label set as the CTB9 dataset.\n\nExample:\n\n    1       印度    _       NR      NR      _       3       nn      _       _\n    2       海军    _       NN      NN      _       3       nn      _       _\n    3       参谋长  _       NN      NN      _       5       nsubjpass       _       _\n    4       被      _       SB      SB      _       5       pass    _       _\n    5       解职    _       VV      VV      _       0       root    _       _\n    \n    1       新华社  _       NR      NR      _       7       dep     _       _\n    2       新德里  _       NR      NR      _       7       dep     _       _\n    3       １２月  _       NT      NT      _       7       dep     _       _\n    ...\n\narg list:\n- **data_path**: str, the path of the file containing training data.\n- **task**: str, the task to be finetuned; one of 'CWS', 'POS', 'Parsing', 'NER'.\n- **lr**: float, learning rate; 1e-5 by default.\n- **n_epochs**: int, number of finetuning epochs; 1 by default.\n- **batch_size**: int, batch size; 8 by default.\n- **save**: bool, whether to save the model after finetuning; False by default.\n- **save_url**: str, the path to save the model; None by default.\n\n**camr_model also has a finetuning function**; an example is shown below:\n\n```\nfrom fastHan import FastCAMR\n\ncamr_model=FastCAMR()\n\n# train data file path\ncws_url='train.dat'\n\ncamr_model.set_device(0)\ncamr_model.finetune(data_path=cws_url,save=True,save_url='finetuned_model')\n```\n\nCalling set_device before finetuning enables GPU acceleration. 
Fine-tuning involves formatting the training data into a file.\n\nThe data set file should follow the format of the Chinese AMR corpus CAMR 1.0, as shown below.\n\nExample:\n\n```\n# ::id export_amr.1322 ::2017-01-04\n# ::snt 这样 的 活动 还 有 什么 意义 呢 ？\n# ::wid x1_这样 x2_的 x3_活动 x4_还 x5_有 x6_什么 x7_意义 x8_呢 x9_？ x10_\n(x5/有-03 \n    :mod()(x4/还) \n    :arg1()(x7/意义 \n        :mod()(x11/amr-unknown)) \n    :mode()(x2/的) \n    :mod-of(x12/的而)(x1/这样))\n\n\n# ::id export_amr.1327 ::2017-01-04\n# ::snt 并且 还 有 很多 高层 的 人物 哦 ！\n# ::wid x1_并且 x2_还 x3_有 x4_很多 x5_高层 x6_的 x7_人物 x8_哦 x9_！ x10_\n(x11/and \n    :op2(x1/并且)(x3/有-03 \n        :mod()(x2/还) \n        :arg1()(x7/人物 \n            :mod-of(x6/的)(x5/高层) \n            :quant()(x12/-))) \n    :mode()(x13/- \n        :expressive()(x14/-)))\n        \n...\n```\n\nFor the meaning of the relevant formats, please refer to the CAMR 1.0 standard of the Chinese AMR corpus.\n\nThis function takes the following arguments:\n\n- :param str data_path:  The path to the data set file for finetuning.\n\n- :param float lr:     The learning rate for finetuning. The default value is 1e-5.\n\n- :param int n_epochs:   The number of finetuning epochs. The default value is 1.\n\n- :param int batch_size:  The number of data items in each batch. The default value is 8.\n\n- :param bool save:    Whether to save the finetuned model. The default value is False.\n\n- :param str save_url:   If the model is saved, this value is the path to save the model.\n\n**User lexicon**\n\nUsers can use **add_user_dict** to add their own lexicon, which affects the weights fed into the CRF. The argument of this function can be a list of words or the path of a lexicon file (in the file, words are separated by '\\n').\n\nUsers can use **set_user_dict_weight** to set the weight coefficient of the user lexicon, 0.05 by default. 
Users can tune this value on a dev set.\n\nUsers can use **remove_user_dict** to remove a previously added lexicon.\n```\nsentence=\"奥利奥利奥\"\nprint(model(sentence))\nmodel.add_user_dict([\"奥利\",\"奥利奥\"])\nprint(model(sentence,use_dict=True))\n```\nThe output will be:\n```\n[['奥利奥利奥']]\n[['奥利', '奥利奥']]\n```\n\n**Segmentation style**\n\nSegmentation style refers to the 10 CWS corpora used in our training phase. The model can distinguish between these corpora and follow each corpus's segmentation criterion, so the segmentation style corresponds to the criterion of the chosen corpus. Users can use set_cws_style to change the style, e.g.:\n\n```\nsentence=\"一个苹果。\"\nprint(model(sentence,'CWS'))\nmodel.set_cws_style('cnc')\nprint(model(sentence,'CWS'))\n```\nThe output will be:\n\n```\n[['一', '个', '苹果', '。']]\n[['一个', '苹果', '。']]\n```\nOur corpora include SIGHAN2005 (MSR, PKU, AS, CITYU), SXU, CTB6, CNC, WTB, ZX and UD.\n\n**Input and Output**\n\nThe model's input can be a string or a list of strings. If the input is a list, the model processes it as one batch, so users need to control the batch size themselves.\n\nThe model's output can be a list, or the Sentence and Token objects defined in fastHan. The model outputs a list by default.\n\nIf \"return_list\" is False, fastHan outputs a list of Sentence objects, and each Sentence consists of Token objects. Each Token represents a word after segmentation and has the attributes pos, head, head_label, ner and loc.\n\nAn example is shown as follows:\n\n```\nsentence=[\"我爱踢足球。\",\"林丹是冠军！\"]\nanswer=model(sentence,'Parsing',return_list=False)\nfor i,sentence in enumerate(answer):\n    print(i)\n    for token in sentence:\n        print(token,token.pos,token.head,token.head_label)\n```\nThe output will be:\n\n```\n0\n我 PN 2 nsubj\n爱 VV 0 root\n踢 VV 2 ccomp\n足球 NN 3 dobj\n。 PU 2 punct\n1\n林丹 NR 2 top\n是 VC 0 root\n冠军 NN 2 attr\n！ PU 2 punct\n```\n\n## **Performance**\n\n### Generalization test\nGeneralization is the most important attribute for an NLP toolkit. 
We conducted a CWS test on the dev and test sets of the Weibo dataset, and compared fastHan with jieba, THULAC, LTP 4.0 and SnowNLP. The results are as follows (spanF metric):\n\n\n dataset | SnowNLP | jieba | THULAC | LTP4.0 base | fastHan large | fastHan (fine-tuned) \n--- | --- | --- | --- | --- | --- | ---\nWeibo devset|0.7999|0.8319 |0.8649|0.9182|0.9314 |0.9632\nWeibo testset|0.7965 | 0.8358 | 0.8665 | 0.9205 | 0.9338 | 0.9664\n\nfastHan performs much better than SnowNLP, jieba and THULAC. Compared with LTP 4.0, fastHan's model is much smaller (262 MB vs. 492 MB) and its score is 1.3 percentage points higher.\n\n### Accuracy test\nWe use the following corpora to train fastHan and run the accuracy test:\n\n- CWS: AS, CITYU, CNC, CTB, MSR, PKU, SXU, UDC, WTB, ZX\n- NER: MSRA, OntoNotes\n- POS \u0026 Parsing: CTB9\n\nWe also perform a speed test with an Intel Core i5-9400f + NVIDIA GeForce GTX 1660ti, with batch size set to 8.\n\nResults are as follows:\n\n\n| model      | CWS   | POS   | NER MSRA | CWS-guwen | POS-guwen | NER OntoNotes | Parsing | speed(sent/s), cpu | speed(sent/s), gpu |\n| ---------- | ----- | ----- | -------- | --------- | --------- | ------------- | ------- | ------------------ | ------------------ |\n| SOTA       | 97.1  | 93.15 | 96.09    | ——        | ——        | 81.82         | 81.71   | ——                 | ——                 |\n| base       | 97.27 | 94.88 | 94.33    | ——        | ——        | 82.86         | 76.71   | 25-55              | 22-111             |\n| large      | 97.41 | 95.66 | 95.50    | ——        | ——        | 83.82         | 81.38   | 14-28              | 21-97              |\n| FastHan2.0 | 97.50 | 95.92 | 95.79    | 93.29     | 86.53     | 82.76         | 81.31   | 2-10               | 20-60              |\n\n**In fastHan2.0, ancient Chinese processing has reached a very high level. 
If you pursue better performance and have some understanding of BERT and the transformers library, please feel free to check out another work from our lab, [bert-ancient-chinese](https://blog.csdn.net/Ji_Huai/article/details/125209985).**\n\nThe SOTA results come from the following papers:\n\n1. Huang W, Cheng X, Chen K, et al. Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning[J]. arXiv: Computation and Language, 2019.\n2. Hang Yan, Xipeng Qiu, and Xuanjing Huang. \"A Graph-based Model for Joint Chinese Word Segmentation and Dependency Parsing.\" Transactions of the Association for Computational Linguistics 8 (2020): 78-92.\n3. Meng Y, Wu W, Wang F, et al. Glyce: Glyph-vectors for Chinese Character Representations[J]. arXiv: Computation and Language, 2019.\n4. Xiaonan Li, Hang Yan, Xipeng Qiu, and Xuanjing Huang. 2020. FLAT: Chinese NER using flat-lattice transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6836–6842, Online. Association for Computational Linguistics.\n\n","funding_links":[],"categories":["Uncategorized"],"sub_categories":["Uncategorized"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffastnlp%2FfastHan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffastnlp%2FfastHan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffastnlp%2FfastHan/lists"}