{"id":13512155,"url":"https://github.com/wainshine/Company-Names-Corpus","last_synced_at":"2025-03-30T22:32:04.454Z","repository":{"id":38239505,"uuid":"152345999","full_name":"wainshine/Company-Names-Corpus","owner":"wainshine","description":"Company name corpus. Organization name corpus. Company short names, abbreviations, brand terms, and enterprise names. Useful for Chinese word segmentation and organization named-entity recognition.","archived":false,"fork":false,"pushed_at":"2024-03-27T04:57:45.000Z","size":273664,"stargazers_count":1255,"open_issues_count":3,"forks_count":373,"subscribers_count":48,"default_branch":"master","last_synced_at":"2025-03-03T09:43:30.112Z","etag":null,"topics":["company","corpus","dataset","dict","ner"],"latest_commit_sha":null,"homepage":"https://open.namemoe.com/","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wainshine.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-10T01:42:55.000Z","updated_at":"2025-02-24T07:57:48.000Z","dependencies_parsed_at":"2024-11-01T14:30:43.362Z","dependency_job_id":"e39a1706-6f24-4495-8871-dd6304741a68","html_url":"https://github.com/wainshine/Company-Names-Corpus","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wainshine%2FCompany-Names-Corpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wainshine%2FCompany-Names-Corpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wainshine%2FCompany-Names-Corpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wainshine%2FCo
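mpany-Names-Corpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wainshine","download_url":"https://codeload.github.com/wainshine/Company-Names-Corpus/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246390896,"owners_count":20769475,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["company","corpus","dataset","dict","ner"],"created_at":"2024-08-01T03:01:32.989Z","updated_at":"2025-03-30T22:31:59.440Z","avatar_url":"https://github.com/wainshine.png","language":null,"funding_links":[],"categories":["Others","Contents 列表","Corpus 中文语料"],"sub_categories":["综合内容","Multi-Modal Representation \u0026 Retrieval 多模态表征与检索"],"readme":"# Company Names Corpus (Company-Names-Corpus)\n\n\u003cstrong\u003eAbout NameMoe (萌名)\u003c/strong\u003e\n\nNameMoe is a new naming product built on big data and natural language processing.\n\nA word segmentation tool is used to segment a large volume of text and count word frequencies. After data cleaning, this yields a dictionary of tens of millions of personal names. Each name is then annotated with gender, age, pinyin, sentiment, a name index, and other attributes, producing a Chinese personal-name graph of more than 56 million entries.\n\nTo filter bad cases out of the personal names and organization names, I collected and built many industry dictionaries; the Company Names Corpus (Company-Names-Corpus) is one of them.\n\nThis subproject can be used for Chinese word segmentation, organization name recognition, and similar tasks.\n\n---\n\nPS1: Beyond personal interest, I maintain this project mainly because the work gives me a task-driven way to keep learning and practicing NLP, KG, AI, and related cutting-edge technologies.\n\nPS2: I am currently job hunting and would welcome internal referrals. I am a senior product manager focused on mobile healthcare, SaaS back ends, and 
artificial intelligence.\n\nPS3: Please do not submit politics-related issues. Thank you.\n\nPS4: If you mirror this project to a platform in mainland China, please set the download cost to 0 points and keep the GitHub link.\n\n---\n\n\u003cstrong\u003eCompany Names Corpus (Company-Names-Corpus)\u003c/strong\u003e\n\nData size: 4.8 million entries.\n\nSource: aggregated from multiple dictionaries.\n\nData cleaning: cleaned, but many bad cases remain.\n\n---\n\n\u003cstrong\u003eOrganization Names Corpus (Organization-Names-Corpus)\u003c/strong\u003e\n\nData size: 1.1 million entries.\n\nSource: aggregated from multiple dictionaries.\n\nData cleaning: cleaned, but many bad cases remain.\n\n---\n\n\u003cstrong\u003eCompany short names, brand terms, etc. (Company-Shorter-Form)\u003c/strong\u003e\n\nData size: 280,000 entries.\n\nSource: aggregated from multiple dictionaries.\n\nData cleaning: cleaned, but many bad cases remain.\n\n---\n\n## Stargazers over time\n\n[![Stargazers over time](https://starchart.cc/wainshine/Company-Names-Corpus.svg)](https://starchart.cc/wainshine/Company-Names-Corpus)\n\n---\n\n\u003cstrong\u003eUpdate log:\u003c/strong\u003e\n\nRemoved more than 3,000 entries that are not company names. -2018.10.31\n\n\u003cdel\u003eAdded 100,000 company short names and brand terms. -2018.12.30\u003c/del\u003e\n\nAdded 280,000 company short names and brand terms. -2019.03.23\n\nRemoved more than 20,000 low-quality company names and organization names. -2019.04.15\n\nRemoved more than 3,000 entries that are not company names. -2019.07.27\n\nRemoved more than 20,000 low-quality company names and organization names. -2019.12.25\n\nRemoved more than 20,000 low-quality company names and organization names. -2020.12.13\n\nRemoved more than 20,000 low-quality company names, organization names, and short names. -2021.05.05\n\nRemoved more than 20,000 low-quality company names, organization names, and short names. -2022.11.30\n\n---\n\nCompiled by @萌名NameMoe\n\n2024.03.27\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwainshine%2FCompany-Names-Corpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwainshine%2FCompany-Names-Corpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwainshine%2FCompany-Names-Corpus/lists"}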