# Address Segmentation

The growth of the internet has driven exponential growth in the number of addresses, and this flood of data poses challenges across many industries. This project provides a deep-learning-based address segmenter that uses supervised learning with a BERT + BiLSTM + CRF architecture to segment addresses from a semantic perspective.

---
## Download and Installation
Windows:
- ~~Clone the code (GitHub currently hosts only the code; the model files are too large, so for convenience this route is not used for now)~~
- Open the project in an IDE (PyCharm recommended)
- Install the required packages

---

## Project Structure
```
└── .
    ├── bert_base
        ├── bert                                # Google's released BERT source code
        └── chinese_L-12_H-768_A-12             # BERT model
    ├── data
        └── dataset                             # final dataset used for training
            ├── dev.txt
            ├── test.txt
            └── train.txt
    ├── sample_files                            # files used by the sample code
    ├── other
        ├── pictures                            # images used in the README
        ├── predict_base.py                     # core segmentation code
        └── preprocessing.py                    # preprocessing code, mainly for generating the final training dataset
    ├── output                                  # trained models, logs, and intermediate files
    ├── train                                   # network architecture, accuracy computation, hyperparameter settings
    ├── predict.py                              # sample segmentation script
    └── train.py                                # sample training script
```
---

## How to Run the Code and Segment Addresses
The current version supports segmenting a single address or Excel-family files (xlsx, csv, etc.). The main method of predict.py in the root directory contains two code blocks; run whichever one the comments indicate.
- Segmenting a single address
    - Run predict.py in the root directory directly; the result looks like this:
    ![single-address segmentation result](./other/pictures/单条地址分词效果.png)
- Segmenting every address in a file
    - In predict.py in the root directory, comment out the *single-file prediction block*, enable the *whole-file prediction block*, and run predict.py.

## How to Train a Model on Your Own Data
### Overview
The current version of the project uses supervised learning. To ensure label accuracy, we selected more than 1,000 hand-labeled addresses from different provinces with different characteristics for model training. We divide addresses into the following 11 address elements:
![address element description](./other/pictures/切分地址要素层级说明.png)

Addresses are then labeled according to the prescribed hierarchy, as shown below:
![labeling example](./other/pictures/打标签示例.png)

With labeled data and Google's released BERT pre-trained language model, you can build your own network and train your own model.

### Building the Dataset
In deep learning, data quality has a large impact on the final results. Beyond the fully manual labeling described in the overview, the currently recommended approach is: first run the trained model over the data you want to train on, manually spot-check and correct the results, and then train. For the raw labeled data format, see **data/sample_files/手工标记好的示例地址.xlsx**.

Finally, run the main method in **other/preprocessing.py** to generate the final dataset. As an example, **data/sample_files/手工标记好的示例地址.xlsx** has already been used to generate the final dataset: dev.txt, test.txt, and train.txt under **data/dataset**.
In that example, all addresses are split into three parts, a training set, a test set, and a validation set, at 60%, 20%, and 20% respectively.

### Hyperparameter Tuning
All hyperparameters used for training are set in the **train/helper.py** file; look for the lines with Chinese comments. We selected 10 commonly used hyperparameters that you can modify.
### Training
With the dataset prepared and the hyperparameters tuned, simply run **train.py**; the model will be generated in the output_dir specified in **train/helper.py**.

That completes training.

---
## Afterword
- After future maintenance and upgrades, you can pull the code directly from [github](https://github.com/SuperMap/address-matching).

## Reference
>https://github.com/macanv/BERT-BiLSTM-CRF-NER
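As a minimal sketch of the 60/20/20 split described above — not the project's actual **other/preprocessing.py**, whose internals are not shown here, and with a hypothetical `split_dataset` helper name — the partitioning of labeled addresses might look like:

```python
import random

def split_dataset(samples, seed=42):
    """Shuffle labeled samples and split them 60/20/20 into
    train, test, and dev sets (hypothetical helper; the real
    logic lives in other/preprocessing.py)."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.6)
    n_test = int(n * 0.2)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    dev = shuffled[n_train + n_test:]  # remainder goes to the dev set
    return train, test, dev

# Example with 10 dummy labeled addresses
addresses = [f"address_{i}" for i in range(10)]
train, test, dev = split_dataset(addresses)
print(len(train), len(test), len(dev))  # 6 2 2
```

Assigning the remainder to the dev set keeps the three parts an exact partition of the input even when the total count is not divisible by five.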