{"id":13788245,"url":"https://github.com/smoothnlp/SmoothNLP","last_synced_at":"2025-05-12T02:33:08.719Z","repository":{"id":41547094,"uuid":"167669331","full_name":"smoothnlp/SmoothNLP","owner":"smoothnlp","description":"专注于可解释的NLP技术 An NLP Toolset With A Focus on Explainable Inference","archived":true,"fork":false,"pushed_at":"2021-02-03T08:08:42.000Z","size":7038,"stargazers_count":624,"open_issues_count":21,"forks_count":112,"subscribers_count":21,"default_branch":"master","last_synced_at":"2024-11-18T02:37:11.050Z","etag":null,"topics":["depedency-parsing","nlp","nlp-pipeline","postagging","python","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/smoothnlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-01-26T08:59:52.000Z","updated_at":"2024-10-05T02:12:45.000Z","dependencies_parsed_at":"2022-08-10T02:45:41.011Z","dependency_job_id":null,"html_url":"https://github.com/smoothnlp/SmoothNLP","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smoothnlp%2FSmoothNLP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smoothnlp%2FSmoothNLP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smoothnlp%2FSmoothNLP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smoothnlp%2FSmoothNLP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/smoothnlp","download_url":"https://codeload.github.com/smoothnlp/Smoo
thNLP/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253662781,"owners_count":21944129,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["depedency-parsing","nlp","nlp-pipeline","postagging","python","tokenizer"],"created_at":"2024-08-03T21:00:40.276Z","updated_at":"2025-05-12T02:33:07.495Z","avatar_url":"https://github.com/smoothnlp.png","language":"Java","funding_links":[],"categories":["Chinese NLP Toolkits 中文NLP工具","Java","人工智能"],"sub_categories":["Toolkits 综合NLP工具包"],"readme":"# [SmoothNLP](http://www.smoothnlp.com)\n![Version](https://img.shields.io/badge/Version-0.4-green.svg) ![Python3](https://img.shields.io/badge/Python-3-blue.svg?style=flat) [![star this repo](http://githubbadges.com/star.svg?user=smoothnlp\u0026repo=SmoothNLP)](https://github.com/smoothnlp/SmoothNLP/stargazers) [![fork this repo](http://githubbadges.com/fork.svg?user=smoothnlp\u0026repo=SmoothNLP\u0026color=fff\u0026background=007ec6)](http://github.com/smoothnlp/SmoothNLP/fork)\n****\t\n\n| Author | Email | \n| ----- | ------ | \n| Victor | zhangruinan@smoothnlp.com |\n| Yinjun | yinjun@smoothnlp.com |\n| 海蜇 | yuzhe_wang@smoothnlp.com | \n\n****\n\n\n\u003c!-- TOC --\u003e\n\n- [SmoothNLP](#smoothnlp)\n    - [Install 安装](#install-安装)\n    - [知识图谱](#知识图谱)\n        - [调用示例\u0026可视化](#调用示例可视化)\n    - [NLP基础Pipelines](#nlp基础pipelines)\n        - [1. Tokenize分词](#1-tokenize分词)\n        - [2. Postag词性标注](#2-postag词性标注)\n        - [3. NER 实体识别](#3-ner-实体识别)\n        - [4. 金融实体识别](#4-金融实体识别)\n        - [5. 
依存句法分析](#5-依存句法分析)\n        - [6. 切句](#6-切句)\n        - [7. 多线程支持](#7-多线程支持)\n        - [8. 日志](#8-日志)\n    - [无监督学习](#无监督学习)\n        - [新词挖掘](#新词挖掘)\n        - [事件聚类](#事件聚类)\n    - [有监督学习](#有监督学习)\n        - [(资讯)事件分类](#资讯事件分类)\n    - [Tutorial](#tutorial)\n    - [服务说明](#服务说明)\n        - [声明](#声明)\n        - [Pro 专业版本](#pro-专业版本)\n        - [常见问题](#常见问题)\n    - [设置字体](#设置字体)\n    - [彩蛋](#彩蛋)\n\n\u003c!-- /TOC --\u003e\n\n\n## Install 安装\n通过`pip`安装\n```shell\npip install smoothnlp\u003e=0.4.0\n```\n\n通过源代码安装最新版本\n```shell\ngit clone https://github.com/smoothnlp/SmoothNLP.git\ncd SmoothNLP\npython setup.py install\n```\n\n\n## 知识图谱\n\u003e 仅支持SmoothNLP `V0.3.0`以后的版本; 以下展示为`V0.4`版本后样例:\n\n### 调用示例\u0026可视化\n\n```python\nfrom smoothnlp.algorithm import kg\nfrom kgexplore import visual\nngrams = kg.extract_ngram([\"SmoothNLP在V0.3版本中正式推出知识抽取功能\",\n                            \"SmoothNLP专注于可解释的NLP技术\",\n                            \"SmoothNLP支持Python与Java\",\n                            \"SmoothNLP将帮助工业界与学术界更加高效的构建知识图谱\",\n                            \"SmoothNLP是上海文磨网络科技公司的开源项目\",\n                            \"SmoothNLP在V0.4版本中推出对图谱节点的分类功能\",\n                            \"KGExplore是SmoothNLP的一个子项目\"])\nvisual.visualize(ngrams,width=12,height=10)\n```\n\n![SmoothNLP_KG_Demo](/tutorials/知识图谱/0.4demo.png)\n\n\n\u003e 功能说明\n* V0.4版本中支持的边关系(edge-type), 包括: `事件触发`, `状态描述`, `属性描述`, `数值描述`. \n* V0.4版本中支持的节点种类(node-type), 包括:  `产品`、`地区`、`公司与品牌`、`货品`、`机构`、`人物`、`修饰短语`、`其他`. 
\n\n ---------\n\n## NLP基础Pipelines\n\n### 1.Tokenize分词\n```python\n\u003e\u003e import smoothnlp \n\u003e\u003e smoothnlp.segment('欢迎在Python中使用SmoothNLP')\n['欢迎', '在', 'Python', '中', '使用', 'SmoothNLP']\n```\n\n### 2.Postag词性标注\n\n[词性标注标签解释wiki](https://github.com/smoothnlp/SmoothNLP/wiki/%E8%AF%8D%E6%80%A7%E6%A0%87%E6%B3%A8%E8%A7%A3%E9%87%8A%E6%96%87%E6%A1%A3)\n\n```python\n\u003e\u003e smoothnlp.postag('欢迎在Python中使用SmoothNLP')\n[{'token': '欢迎', 'postag': 'VV'},\n {'token': '在', 'postag': 'P'},\n {'token': 'Python', 'postag': 'NN'},\n {'token': '中', 'postag': 'LC'},\n {'token': '使用', 'postag': 'VV'},\n {'token': 'SmoothNLP', 'postag': 'NN'}]\n```\n\n\n### 3.NER 实体识别\n```python\n\u003e\u003e smoothnlp.ner(\"中国平安2019年度长期服务计划于2019年5月7日至5月14日通过二级市场完成购股\" )\n[{'charStart': 0, 'charEnd': 4, 'text': '中国平安', 'nerTag': 'COMPANY_NAME', 'sTokenList': {'1': {'token': '中国平安', 'postag': None}}, 'normalizedEntityValue': '中国平安'},\n{'charStart': 4, 'charEnd': 9, 'text': '2019年', 'nerTag': 'NUMBER', 'sTokenList': {'2': {'token': '2019年', 'postag': 'CD'}}, 'normalizedEntityValue': '2019年'},\n{'charStart': 17, 'charEnd': 26, 'text': '2019年5月7日', 'nerTag': 'DATETIME', 'sTokenList': {'8': {'token': '2019年5月', 'postag': None}, '9': {'token': '7日', 'postag': None}}, 'normalizedEntityValue': '2019年5月7日'},\n{'charStart': 27, 'charEnd': 32, 'text': '5月14日', 'nerTag': 'DATETIME', 'sTokenList': {'11': {'token': '5月', 'postag': None}, '12': {'token': '14日', 'postag': None}}, 'normalizedEntityValue': '5月14日'}]\n```\n\n\n### 4. 金融实体识别\n```python\n\u003e\u003e smoothnlp.company_recognize(\"旷视科技预计将在今年9月在港IPO\")\n[{'charStart': 0,\n  'charEnd': 4,\n  'text': '旷视科技',\n  'nerTag': 'COMPANY_NAME',\n  'sTokenList': {'1': {'token': '旷视科技', 'postag': None}},\n  'normalizedEntityValue': '旷视科技'}]\n```\n\n\n### 5. 
依存句法分析\n\u003e 注意, `smoothnlp.dep_parsing`返回的`Index=0` 为 dummy的`root`token.\n\n[依存句法分析标签解释wiki](https://github.com/smoothnlp/SmoothNLP/wiki/%E4%BE%9D%E5%AD%98%E5%8F%A5%E6%B3%95%E5%88%86%E6%9E%90%E8%A7%A3%E9%87%8A%E6%96%87%E6%A1%A3)\n\n```python\nsmoothnlp.dep_parsing(\"特斯拉是全球最大的电动汽车制造商。\")\n\u003e [{'relationship': 'top', 'dependentIndex': 2, 'targetIndex': 1},\n  {'relationship': 'root', 'dependentIndex': 0, 'targetIndex': 2},\n  {'relationship': 'dep', 'dependentIndex': 5, 'targetIndex': 3},\n  {'relationship': 'advmod', 'dependentIndex': 5, 'targetIndex': 4},\n  {'relationship': 'ccomp', 'dependentIndex': 2, 'targetIndex': 5},\n  {'relationship': 'cpm', 'dependentIndex': 5, 'targetIndex': 6},\n  {'relationship': 'amod', 'dependentIndex': 8, 'targetIndex': 7},\n  {'relationship': 'attr', 'dependentIndex': 2, 'targetIndex': 8},\n  {'relationship': 'attr', 'dependentIndex': 2, 'targetIndex': 9},\n  {'relationship': 'punct', 'dependentIndex': 2, 'targetIndex': 10}]\n```\n\n### 6. 切句\n```python\nsmoothnlp.split2sentences(\"句子1!句子2!\")\n\u003e ['句子1!', '句子2!']\n```\n\n### 7. 多线程支持\n\u003e SmoothNLP 默认使用2个Thread进行服务调用; \n```python\nfrom smoothnlp import config\nconfig.setNumThreads(2)\n```\n\n### 8. 日志\n```python\nfrom smoothnlp import config\nconfig.setLogLevel(\"DEBUG\")  ## 设定日志级别\n```\n\n-----\n\n## 无监督学习\n### 新词挖掘\n[算法介绍](https://zhuanlan.zhihu.com/p/80385615) | [使用说明](https://github.com/smoothnlp/SmoothNLP/tree/master/tutorials/%E6%96%B0%E8%AF%8D%E5%8F%91%E7%8E%B0)\n\n### 事件聚类\n该功能我们目前仅支持商业化的解决方案支持, 与线上服务. 
详情可联系  business@smoothnlp.com\n\n**效果演示**\n```json\n[\n  {\n    \"url\": \"https://36kr.com/p/5167309\",\n    \"title\": \"Facebook第三次数据泄露，可能导致680万用户私人照片泄露\",\n    \"pub_ts\": 1544832000\n  },\n  {\n    \"url\": \"https://www.pencilnews.cn/p/24038.html\",\n    \"title\": \"热点 | Facebook将因为泄露700万用户个人照片 面临16亿美元罚款\",\n    \"pub_ts\": 1544832000\n  },\n  {\n    \"url\": \"https://finance.sina.com.cn/stock/usstock/c/2018-12-15/doc-ihmutuec9334184.shtml\",\n    \"title\": \"Facebook再曝新数据泄露 6800万用户或受影响\",\n    \"pub_ts\": 1544844120\n  }\n]\n```\n\u003e 吐槽: 新浪小编数据错误... 夸大事实, 真实情况Facebook并没有泄露6800万张照片\n\n## 有监督学习\n### (资讯)事件分类\n该功能我们目前仅支持商业化的解决方案支持, 与线上服务. 详情可联系  business@smoothnlp.com; 线上服务支持[API输出]()\n\n**效果**\n\n| 事件名称 | AUC | Precision|\n| --- | -- | -- |\n| 投资并购 | 0.996 |0.982|\n| 企业合作 | 0.977 |0.885|\n| 董监高管 | 0.982 |0.940|\n| 营收报导 | 0.994 |0.960|\n| 企业签约 | 0.993 |0.904|\n| 商业拓展 | 0.968 |0.869|\n| 产品报道 | 0.977 |0.911|\n| 产业政策 | 0.990 |0.879|\n| 经营不善 | 0.981 |0.765|\n| 违规约谈 | 0.951 |0.890|\n\n-------\n\n参考文献\n* [ASER](https://arxiv.org/abs/1905.00270)\n* [HanLP](https://github.com/hankcs/hanlp)\n\n----------\n\n## Tutorial\n- [多线程调用](tutorials/多进程调用/SmoothNLP多线程调用Demo.ipynb)\n\n\n## 服务说明\n\n### 声明\n1. SmoothNLP通过**云端微服务**提供完整的REST文本解析及相关服务应用. 对于开源爱好者等一般用户, 目前我们提供qps\u003c=5的服务支持; 对于商业用户, 我们提供不受限制的云端账号或本地部署方案. \n2. 包括:切词,词性标注,依存句法分析等基础NLP任务由java代码实现, 在文件夹`smoothnlp_maven`下. 可通过 `maven`编译打包\n3. 如果您寻求商业化的NLP或知识图谱解决方案, 欢迎邮件至 business@smoothnlp.com\n\n### Pro 专业版本\nSmoothNLP Pro 支持稳定可靠的企业级用户, [使用文档](https://github.com/smoothnlp/SmoothNLP/tree/master/tutorials/Pro%E4%B8%93%E4%B8%9A%E7%89%88); 如需试用或购买, 请联系 contact@smoothnlp.com\n\n\n### 常见问题\n1.  注意, 在0.2.20版本调整后, 以下基础Pipeline功能仅对字符串长度做出了限制(不超过200). 如对较长corpus进行处理, 请先使用`smoothnlp.split2sentences` 进行切句预处理\n2. 
知识图谱可视化部分(V0.4版本以前)默认支持字体`SimHei`,大多数环境下的matplotlib不支持中文字体, 我们提供字体包的[下载链接](http://storm.cloud.smoothnlp.com/s/HHM6KkmPymie4RA); 您可以通过运行以下代码, 将`SimHei`字体加载入matplotlib字体库\n\n```python\nimport matplotlib.pyplot as plt\nimport matplotlib.font_manager as font_manager\n## 设置字体\nfont_dirs = ['simhei/']\nfont_files = font_manager.findSystemFonts(fontpaths=font_dirs)\n## createFontList 已在新版 matplotlib 中移除, 改用 addfont 逐个注册字体文件\nfor font_file in font_files:\n    font_manager.fontManager.addfont(font_file)\nplt.rcParams['font.family'] = \"SimHei\"\n```\n\n## 彩蛋\n1. 如果你对本项目, 有任何建议或者想成为联合开发者, 欢迎提交issue或pull request; 作为回赠, 我们会提供数据分享或 [kgexplore](https://github.com/smoothnlp/KGExplore) 的免费数据体验\n2. 如果你对NLP相关算法或应用场景感兴趣, 但是却缺少实现数据, 我们提供免费的数据支持, [下载](https://github.com/smoothnlp/FinancialDatasets). \n3. 如果你是高校学生, 寻求`NLP`或`知识图谱`相关的研究素材, 甚至是实习机会. 欢迎邮件到 contact@smoothnlp.com\n\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmoothnlp%2FSmoothNLP","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsmoothnlp%2FSmoothNLP","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmoothnlp%2FSmoothNLP/lists"}