{"id":16269267,"url":"https://github.com/thinkwee/multiling2019_wiki","last_synced_at":"2025-08-21T05:37:42.413Z","repository":{"id":79907434,"uuid":"245739806","full_name":"thinkwee/multiling2019_wiki","owner":"thinkwee","description":"code for BUPT paper in workshop MultiLing2019@RANLP","archived":false,"fork":false,"pushed_at":"2021-07-19T07:51:04.000Z","size":4289,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-14T11:33:34.923Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thinkwee.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-08T02:35:25.000Z","updated_at":"2021-07-19T07:51:07.000Z","dependencies_parsed_at":"2023-05-04T15:01:41.882Z","dependency_job_id":null,"html_url":"https://github.com/thinkwee/multiling2019_wiki","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thinkwee%2Fmultiling2019_wiki","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thinkwee%2Fmultiling2019_wiki/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thinkwee%2Fmultiling2019_wiki/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thinkwee%2Fmultiling2019_wiki/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thinkwee","download_url":"https://cod
eload.github.com/thinkwee/multiling2019_wiki/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247867365,"owners_count":21009240,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-10T18:07:50.631Z","updated_at":"2025-04-08T15:17:33.144Z","avatar_url":"https://github.com/thinkwee.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# multiling2019_wiki\n\n- Wikipedia entry-title extraction task for Multiling 2019\n- Based on BERT, fine-tuned with CRF/NMT\n- Paper: [Multilingual Wikipedia Summarization and Title Generation On Low Resource Corpus](https://www.aclweb.org/anthology/W19-8904.pdf)\n- Code is based on sberbank-ai's implementation, [ner-bert](https://github.com/sberbank-ai/ner-bert)\n\n# Description\n\n- See the [task description](http://multiling.iit.demokritos.gr/pages/view/1651/task-headline-generation)\n- Download the data [here](https://drive.google.com/open?id=1a2p5ZfnFVLp2JfYgqQQSkCndzqEztIK0)\n- Download the test set [here](https://drive.google.com/open?id=1B-r7lJYJ7Kk5qmoy0WUzk4O045Lurai5)\n- Preprocessed test-set summaries are in ```/WikiTrain19/processed_summary```; extract them together with the test set into ```/WikiTrain19/test/```\n\n# Methods\n- [x] Baseline1: named entity recognition with spaCy multilingual models (seven languages); a recognized entity is selected directly as the title\n- [x] Baseline2: dependency parsing with spaCy single-language models (eight languages); the subject is extracted as the title\n- [x] Model1: [Multilingual BERT + BiLSTM + Attn + CRF](https://github.com/sberbank-ai/ner-bert), sequence labeling of the title position\n- [x] Model2: [Multilingual BERT + BiLSTM + Attn + NMT](https://github.com/sberbank-ai/ner-bert), sequence labeling of the title position (Seq2Seq)\n- [ ] Model3: [MUSE](https://github.com/facebookresearch/MUSE) + BiLSTM + CRF, sequence labeling of the title position\n\n# Data Pipeline\n- Download the dataset and extract it to ```./WikiTrain19/full   ./WikiTrain19/clipped```
\n- Run ```./preprocess.py``` to produce the tagged data ```./WikiTrain19/tagged_data``` and the raw corpus ```./WikiTrain19/raw_data```\n- The raw corpus can be used directly to run the baselines in ```./simple_method.ipynb```\n- To use the tagged data, run ```./combine_shuffle_divide.sh``` for post-processing, and download the [pretrained multilingual BERT model](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip) to ```./ner-bert/datadrive/multi_cased_L-12_H-768_A-12```\n- Run ```./ner-bert/examples/multiling-2019.ipynb ./ner-bert/examples/multiling-2019-nmt.ipynb``` to train the BERT models; the trained models are saved under ```./ner-bert/datadrive/models/multiling-2019/```\n\n# Results\n- Baseline1\n    - en: 0.105\n    - de: 0.48\n    - es: 0.013\n    - fr: 0.397\n    - it: 0.021\n    - pt: 0.404\n    - ru: 0.25\n    \n- Baseline2\n    - en: 0.507\n    - de: 0.490\n    - fr: 0.4\n    - es: 0.451\n    - el: 0.4\n    - pt: 0.433\n    - it: 0.523\n    - nl: 0.567\n    \n- Model1 (10 iterations)\n\n```\n              precision    recall  f1-score   support\n\n      B_MISC      0.665     0.443     0.532      1788\n      I_MISC      0.733     0.379     0.500      1348\n\n   micro avg      0.690     0.415     0.519      3136\n   macro avg      0.699     0.411     0.516      3136\nweighted avg      0.694     0.415     0.518      3136\n\n              precision    recall  f1-score   support\n\n           O      0.972     0.985     0.979     35451\n        MISC      0.601     0.447     0.513      1788\n\n    accuracy                          0.959     37239\n   macro avg      0.787     0.716     0.746     37239\nweighted avg      0.955     0.959     0.956     37239\n\n```\n\n- Model2 (10 iterations)\n\n```\n              precision    recall  f1-score   support\n\n      B_MISC      0.890     0.910     0.900      1802\n      I_MISC      0.888     0.902     0.895      1339\n\n   micro avg      0.889     0.907     0.898      3141\n   macro avg      0.889     0.906     0.897      3141\nweighted avg      0.889     0.907     0.898      3141\n\n              precision    recall  f1-score   support\n\n           O      0.996     0.994     0.995     34664
\n        MISC      0.888     0.916     0.902      1788\n\n    accuracy                          0.990     36452\n   macro avg      0.942     0.955     0.948     36452\nweighted avg      0.990     0.990     0.990     36452\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthinkwee%2Fmultiling2019_wiki","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthinkwee%2Fmultiling2019_wiki","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthinkwee%2Fmultiling2019_wiki/lists"}