{"id":19543525,"url":"https://github.com/hitsz-ids/auto-regex","last_synced_at":"2025-04-26T17:32:30.992Z","repository":{"id":37104535,"uuid":"492668666","full_name":"hitsz-ids/auto-regex","owner":"hitsz-ids","description":"automatic regex generation tool","archived":false,"fork":false,"pushed_at":"2023-07-15T12:31:48.000Z","size":9397,"stargazers_count":60,"open_issues_count":3,"forks_count":18,"subscribers_count":2,"default_branch":"main","last_synced_at":"2023-07-15T12:40:50.018Z","etag":null,"topics":["auto-regex","data-protection","privacy","python","sensitive-data-discovery"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hitsz-ids.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-05-16T03:16:10.000Z","updated_at":"2023-05-19T10:55:59.000Z","dependencies_parsed_at":"2023-02-08T17:46:26.869Z","dependency_job_id":null,"html_url":"https://github.com/hitsz-ids/auto-regex","commit_stats":null,"previous_names":[],"tags_count":1,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hitsz-ids%2Fauto-regex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hitsz-ids%2Fauto-regex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hitsz-ids%2Fauto-regex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hitsz-ids%2Fauto-regex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hitsz-ids","download_url":"https://codeload.github.com/hitsz-ids/auto-regex/tar.gz/refs/heads/main","host":{"na
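me":"GitHub","url":"https://github.com","kind":"github","repositories_count":224041397,"owners_count":17245884,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["auto-regex","data-protection","privacy","python","sensitive-data-discovery"],"created_at":"2024-11-11T03:19:24.741Z","updated_at":"2024-11-11T03:19:25.387Z","avatar_url":"https://github.com/hitsz-ids.png","language":"Python","funding_links":[],"categories":["开发工具\u0026框架"],"sub_categories":[],"readme":"# auto-regex\n\nauto-regex is an automatic regular expression generation tool. Given a small number of user-provided samples of one data type, it learns the pattern features of that type and automatically generates a regular expression that recognizes it, making regex authoring more efficient in data type identification scenarios.\n\n\n\n## Table of contents\n\n- [Use cases](#use-cases)\n- [Key features](#key-features)\n- [Installation](#installation)\n- [Quick start](#quick-start)\n- [API](#API)\n- [Maintainers](#maintainers)\n- [How to contribute](#how-to-contribute)\n- [License](#license)\n- [Used by](#Used-by)\n\n\n\n## Use cases\n\n- Data classification and grading\n\n  In data classification and grading, a database contains a large number of tables and fields, and manually inspecting, analyzing, and labeling the sensitive type of each one is inefficient. With auto-regex, for each sensitive type you only need to inspect a few tables by hand, find one column of that type, and feed it to the tool. The generated regular expression can then be used to identify that sensitive type across the many remaining table fields in the database.\n\n- Sensitive data identification in data flows\n\n  When data from a database flows between applications, the sensitivity labels assigned during classification and grading are usually not preserved. Regular expressions generated by auto-regex can identify sensitive data at key nodes of the data flow, so that the movement of sensitive data can be tracked.\n\n  \n\n## Key features\n\n+ Automatically learns and generates regular expressions from positive and negative samples.\n+ Takes frequent substrings in the sample strings into account, so it can capture fine-grained features of the data.\n\n\n\n## Installation\n\nInstalling with pip is recommended:\n\n```bash\npip install auto-regex\n```\n\nThis fetches and installs the latest stable release from [PyPI](https://pypi.org/).\n\n\n\n## Quick start\n\n```python\nfrom auto_regex.generator import generate\nregex_name = 'id_card'\ntrain_data_file = 'tests/data/ID_CARD.csv'  # data file in this project's tests directory\n\nresult = generate(regex_name, train_data_file, init_population_size=500, max_iterations=100)\n```\n\nThis outputs several regular expressions together with their evaluation metrics on the sample data, so you can choose among them:\n\n```bash\nname: id_card, pattern: 
\\d{6,6}19\\d{9,9}\\w|\\d{6,6}20\\d{9,9}\\w, precision: 1.0, recall: 1.0\nname: \\d\\d\\d\\d\\d\\d19\\d\\d\\d\\d\\d\\d\\d\\d\\d\\w precision: 1.0, recall: 0.6144\nname: \\d{6,6}20\\d{9,9}\\w precision: 1.0, recall: 0.3856\n```\n\nThe train_data_file file has two columns: the first, named 'positive', holds positive samples, and the second, named 'negative', holds negative samples. For the ID card number type, the positive samples are ID card numbers and the negative samples are data that are not ID card numbers, such as phone numbers.\n\n\n\n## API\n\nauto-regex provides a regular expression generation interface. For the interface parameters, see the [API documentation](https://hitsz-ids.github.io/auto-regex/docs/zh/generate).\n\n\n\n## How it works\n\nThe approach is mainly based on the following paper:\n\n[Revisiting Regex Generation for Modeling Industrial Applications by Incorporating Byte Pair Encoder](https://arxiv.org/abs/2005.02558)\n\n\n\n## Test results\n\nRegular expressions were generated from data such as ID card numbers and unified social credit codes. Their evaluation metrics on new test data are as follows:\n\n```\n                   precision    recall  f1-score   support\n                   \n           ID_CARD     0.9997    1.0000    0.9999     10000\nSOCIAL_CREDIT_CODE     0.4784    1.0000    0.6472     10000\n      MOBILE_PHONE     0.9890    0.0898    0.1646     10000\n              DATE     1.0000    1.0000    1.0000     10000\n         BANK_CARD     0.3204    0.7423    0.4476     10000\n       DOMAIN_NAME     1.0000    0.8446    0.9158     10000\n             EMAIL     1.0000    0.0229    0.0448     10000\n         TELEPHONE     0.8938    0.6079    0.7236     10000\n              IPV4     1.0000    0.0414    0.0795     10000\n          POSTCODE     1.0000    1.0000    1.0000     10000\n          PASSPORT     1.0000    1.0000    1.0000     10000\n               MAC     1.0000    1.0000    1.0000     10000\n     LICENSE_PLATE     1.0000    0.9494    0.9740     10000\n\n         micro avg     0.7725    0.7153    0.7428    130000\n         macro avg     0.8986    0.7153    0.6921    130000\n```\n\n\n\n## Differences from similar tools\n\nOther regular expression generation tools, such as [Regex-generator](https://github.com/maojui/Regex-Generator), aim to learn the features of a given set of strings and do not address recognizing new data of the same type outside that set.\n\n\n\n## Maintainers\n\nThe auto-regex open source project was initiated by the **Institute of Data Security, Harbin Institute of Technology (Shenzhen)**. If you are interested in auto-regex and would like to help improve it, you are welcome to join our open source community.\n\n### Owner\n\n+ Longice(zekuncao@gmail.com) \n\n### Maintainer\n\n+ 
Longice(zekuncao@gmail.com) \n\nYou can contact the project Owner; once your application is approved, you can become one of the auto-regex Maintainers.\n\n\n\n## How to contribute\n\nYou are very welcome to join! [Open an Issue](https://github.com/hitsz-ids/auto-regex/issues/new) or submit a Pull Request.\n\n\n\n## License\n\nThe auto-regex open source project is released under the Apache-2.0 license. For details, see [LICENSE](https://github.com/hitsz-ids/auto-regex/blob/main/LICENSE).\n\n\n\n## Used by\n\n\u003cimg src=\"docs/imgs/组织.png\" alt=\"Organizations\" style=\"zoom:50%;\" /\u003e\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhitsz-ids%2Fauto-regex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhitsz-ids%2Fauto-regex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhitsz-ids%2Fauto-regex/lists"}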