{"id":13484169,"url":"https://github.com/SeanLee97/xmnlp","last_synced_at":"2025-03-27T16:30:34.392Z","repository":{"id":29359261,"uuid":"120146329","full_name":"SeanLee97/xmnlp","owner":"SeanLee97","description":"xmnlp：提供中文分词, 词性标注, 命名体识别，情感分析，文本纠错，文本转拼音，文本摘要，偏旁部首，句子表征及文本相似度计算等功能","archived":false,"fork":false,"pushed_at":"2022-11-12T03:29:39.000Z","size":120006,"stargazers_count":1275,"open_issues_count":9,"forks_count":189,"subscribers_count":28,"default_branch":"master","last_synced_at":"2025-03-24T12:07:01.406Z","etag":null,"topics":["lexical-analysis","ner","nlp","pinyin","postagging","radical","segmentation","sentence-embeddings","sentence-similarity","sentiment-analysis","spell-checker"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SeanLee97.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-02-04T01:48:44.000Z","updated_at":"2025-03-20T02:18:48.000Z","dependencies_parsed_at":"2023-01-14T15:00:37.905Z","dependency_job_id":null,"html_url":"https://github.com/SeanLee97/xmnlp","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanLee97%2Fxmnlp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanLee97%2Fxmnlp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanLee97%2Fxmnlp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SeanLee97%2Fxmnlp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SeanLee97","downlo
ad_url":"https://codeload.github.com/SeanLee97/xmnlp/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245882181,"owners_count":20687842,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lexical-analysis","ner","nlp","pinyin","postagging","radical","segmentation","sentence-embeddings","sentence-similarity","sentiment-analysis","spell-checker"],"created_at":"2024-07-31T17:01:20.180Z","updated_at":"2025-03-27T16:30:32.331Z","avatar_url":"https://github.com/SeanLee97.png","language":"Python","readme":"\u003cp align='center'\u003e\u003cimg src='docs/xmnlp-logo.png' width=350 /\u003e\u003c/p\u003e\n\n\u003cp align='center'\u003exmnlp: 一款开箱即用的开源中文自然语言处理工具包\u003c/p\u003e\n\n\u003cp align='center'\u003eXMNLP: An out-of-the-box Chinese Natural Language Processing Toolkit\u003c/p\u003e\n\n\n\u003cdiv align='center'\u003e\n\n[![pypi](https://img.shields.io/pypi/v/xmnlp?style=for-the-badge)](https://pypi.org/project/xmnlp/)\n[![pypi downloads](https://img.shields.io/pypi/dm/xmnlp?style=for-the-badge)](https://pypi.org/project/xmnlp/)\n[![python version](https://img.shields.io/badge/python-3.6,3.7,3.8-orange.svg?style=for-the-badge)]()\n[![onnx](https://img.shields.io/badge/onnx,onnxruntime-orange.svg?style=for-the-badge)]()\n[![support os](https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-yellow.svg?style=for-the-badge)]()\n[![GitHub license](https://img.shields.io/github/license/SeanLee97/xmnlp?style=for-the-badge)](https://github.com/SeanLee97/xmnlp/blob/master/LICENSE)\n\n\u003c/div\u003e\n\n\n---\n\n\n\u003ca 
name=\"overview\"\u003e\u003c/a\u003e\n# Feature Overview\n\n\n- Chinese lexical analysis (RoBERTa + CRF finetune)\n   - Word segmentation\n   - Part-of-speech tagging\n   - Named entity recognition\n   - Custom dictionary support\n- Chinese spell checking (Detector + Corrector SpellCheck)\n- Text summarization \u0026 keyword extraction (Textrank)\n- Sentiment analysis (RoBERTa finetune)\n- Text-to-pinyin conversion (Trie)\n- Chinese character radicals (HashMap)\n- [Sentence representation and similarity computation](https://mp.weixin.qq.com/s/DFUXUnlH_5BlxwyQYeB2xw)\n\n\n\u003ca name=\"outline\"\u003e\u003c/a\u003e\n# Outline\n\n- [1. Installation](#installation)\n  - [Model download](#installation-download)\n  - [Model configuration](#installation-configure)\n- [2. Usage](#usage)\n  - [Default segmentation: seg](#usage-seg)\n    - [Fast segmentation: fast_seg](#usage-fast_seg)\n    - [Deep segmentation: deep_seg](#usage-deep_seg)\n  - [POS tagging: tag](#usage-tag)\n    - [Fast POS tagging: fast_tag](#usage-fast_tag)\n    - [Deep POS tagging: deep_tag](#usage-deep_tag)\n  - [Custom dictionary for segmentation \u0026 POS tagging](#usage-userdict)\n  - [Named entity recognition: ner](#usage-ner)\n  - [Keyword extraction: keyword](#usage-keyword)\n  - [Key sentence extraction: keyphrase](#usage-keyphrase)\n  - [Sentiment recognition: sentiment](#usage-sentiment)\n  - [Pinyin extraction: pinyin](#usage-pinyin)\n  - [Radical extraction: radical](#usage-radical)\n  - [Text error correction: checker](#usage-checker)\n  - [Sentence representation and similarity computation: sentence_vector](#usage-sentence_vector)\n  - [Parallel processing](#usage-parallel)\n- [3. More](#more)\n  - [Contributors](#more-contribution)\n  - [Citation](#more-citation)\n  - [Custom development](#more-business)\n  - [Discussion group](#more-contact)\n- [Reference](#reference)\n- [License](#license)\n\n\n\u003ca name=\"installation\"\u003e\u003c/a\u003e\n## 1. 
Installation\n\n\u003cbr /\u003eInstall the latest version of xmnlp:\u003cbr /\u003e\n\n```bash\npip install -U xmnlp\n```\n\n\u003cbr /\u003eUsers in mainland China can add an index-url to install from a mirror:\u003cbr /\u003e\n\n```bash\npip install -i https://pypi.tuna.tsinghua.edu.cn/simple -U xmnlp\n```\n\nAfter installing the package, you also need to download the model weights before xmnlp can be used.\n\n\u003cbr /\u003e\n\n\u003ca name=\"installation-download\"\u003e\u003c/a\u003e\n### Model download\n\n\u003cbr /\u003eDownload the models that match your xmnlp version. If you are not sure which version you have, run `python -c 'import xmnlp; print(xmnlp.__version__)'` to check.\u003cbr /\u003e\n\n\n| Model name | Supported versions | Download links |\n| --- | --- | --- |\n| xmnlp-onnx-models-v5.zip | v0.5.0, v0.5.1, v0.5.2, v0.5.3 | [Feishu](https://wao3cag89c.feishu.cn/file/boxcnppW9Vbd9SSoZEnJdP32Dsg) \\[IGHI\\] \\| [Baidu Netdisk](https://pan.baidu.com/s/1YBqD-L5spNg0VOPSDPN3iA) \\[l9id\\] |\n| xmnlp-onnx-models-v4.zip | v0.4.0 | [Feishu](https://wao3cag89c.feishu.cn/file/boxcnwdZ9PTtCurhkddlsXrIr0c) \\[DKLa\\] \\| [Baidu Netdisk](https://pan.baidu.com/s/1qIHDwXJv18AAv0w72FzrjQ) \\[j1qi\\] |\n| xmnlp-onnx-models-v3.zip | v0.3.2, v0.3.3 | [Feishu](https://wao3cag89c.feishu.cn/file/boxcnG5OVqqM8kxtQilt5DachE2) \\[o4bA\\] \\| [Baidu Netdisk](https://pan.baidu.com/s/1DsIec7W5CEJ8UNInezgm0Q) \\[9g7e\\] |\n\n\n\u003ca name=\"installation-configure\"\u003e\u003c/a\u003e\n### Model configuration\n\nAfter downloading the models, set the model path so that xmnlp can run. Two configuration methods are available.\n\n**Method 1: set an environment variable (recommended)**\n\n\u003cbr /\u003eAfter unzipping the downloaded models, set an environment variable that points to the model directory. On Linux, for example:\u003cbr /\u003e\n\n```bash\nexport XMNLP_MODEL=/path/to/xmnlp-models\n```\n\n\n**Method 2: set the path with a function call**\n\n\u003cbr /\u003eSet the model path before calling xmnlp:\u003cbr /\u003e\n\n```python\nimport xmnlp\n\nxmnlp.set_model('/path/to/xmnlp-models')\n```\n\n\u003cbr /\u003e* The `/path/to/` above is only a placeholder; replace it with the actual model directory.\u003cbr /\u003e\n\n\n\n\u003ca name=\"usage\"\u003e\u003c/a\u003e\n## 2. 
Usage\n\n\n\u003ca name=\"usage-seg\"\u003e\u003c/a\u003e\n### xmnlp.seg(text: str) -\u003e List[str]\n\n\u003cbr /\u003eChinese word segmentation (default). Segments text with backward maximum matching and uses RoBERTa + CRF for new-word detection.\u003cbr /\u003e\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A list of the segmented words\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"\"\"xmnlp 是一款开箱即用的轻量级中文自然语言处理工具🔧。\"\"\"\n\u003e\u003e\u003e print(xmnlp.seg(text))\n['xmnlp', '是', '一款', '开箱', '即用', '的', '轻量级', '中文', '自然语言', '处理', '工具', '🔧', '。']\n```\n\n\u003cbr /\u003e\n\n\u003ca name=\"usage-fast_seg\"\u003e\u003c/a\u003e\n### xmnlp.fast_seg(text: str) -\u003e List[str]\n\n\u003cbr /\u003eSegments text with backward maximum matching only, without new-word detection, so it is faster.\u003cbr /\u003e\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A list of the segmented words\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"\"\"xmnlp 是一款开箱即用的轻量级中文自然语言处理工具🔧。\"\"\"\n\u003e\u003e\u003e print(xmnlp.fast_seg(text))\n['xmnlp', '是', '一款', '开箱', '即', '用', '的', '轻量级', '中文', '自然语言', '处理', '工具', '🔧', '。']\n```\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"usage-deep_seg\"\u003e\u003c/a\u003e\n### xmnlp.deep_seg(text: str) -\u003e List[str]\n\n\u003cbr /\u003eBased on the RoBERTa + CRF model, so it is slower. The deep interfaces currently support Simplified Chinese only, not Traditional Chinese.\u003cbr /\u003e\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A list of the segmented words\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"\"\"xmnlp 是一款开箱即用的轻量级中文自然语言处理工具🔧。\"\"\"\n\u003e\u003e\u003e print(xmnlp.deep_seg(text))\n['xmnlp', '是', '一款', '开箱', '即用', '的', '轻', '量级', '中文', '自然', '语言', '处理', '工具', '🔧', '。']\n```\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"usage-tag\"\u003e\u003c/a\u003e\n### xmnlp.tag(text: str) -\u003e List[Tuple[str, str]]\n\n\u003cbr /\u003ePart-of-speech tagging.\u003cbr /\u003e\n\u003cbr 
/\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A list of (word, part-of-speech tag) tuples\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"\"\"xmnlp 是一款开箱即用的轻量级中文自然语言处理工具🔧。\"\"\"\n\u003e\u003e\u003e print(xmnlp.tag(text))\n[('xmnlp', 'eng'), ('是', 'v'), ('一款', 'm'), ('开箱', 'n'), ('即用', 'v'), ('的', 'u'), ('轻量级', 'b'), ('中文', 'nz'), ('自然语言', 'l'), ('处理', 'v'), ('工具', 'n'), ('🔧', 'x'), ('。', 'x')]\n```\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"usage-fast_tag\"\u003e\u003c/a\u003e\n### xmnlp.fast_tag(text: str) -\u003e List[Tuple[str, str]]\n\n\u003cbr /\u003eBased on backward maximum matching only, without new-word detection, so it is faster.\u003cbr /\u003e\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A list of (word, part-of-speech tag) tuples\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"\"\"xmnlp 是一款开箱即用的轻量级中文自然语言处理工具🔧。\"\"\"\n\u003e\u003e\u003e print(xmnlp.fast_tag(text))\n[('xmnlp', 'eng'), ('是', 'v'), ('一款', 'm'), ('开箱', 'n'), ('即', 'v'), ('用', 'p'), ('的', 'uj'), ('轻量级', 'b'), ('中文', 'nz'), ('自然语言', 'l'), ('处理', 'v'), ('工具', 'n'), ('🔧', 'x'), ('。', 'x')]\n```\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"usage-deep_tag\"\u003e\u003c/a\u003e\n### xmnlp.deep_tag(text: str) -\u003e List[Tuple[str, str]]\n\n\u003cbr /\u003eBased on the RoBERTa + CRF model, so it is slower. The deep interfaces currently support Simplified Chinese only, not Traditional Chinese.\u003cbr /\u003e\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A list of (word, part-of-speech tag) tuples\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"\"\"xmnlp 是一款开箱即用的轻量级中文自然语言处理工具🔧。\"\"\"\n\u003e\u003e\u003e print(xmnlp.deep_tag(text))\n[('xmnlp', 'x'), ('是', 'v'), ('一款', 'm'), ('开箱', 'v'), ('即用', 'v'), ('的', 'u'), ('轻', 'nz'), ('量级', 'b'), ('中文', 'nz'), ('自然', 'n'), ('语言', 'n'), ('处理', 'v'), ('工具', 'n'), ('🔧', 'w'), ('。', 'w')]\n```\n\n\u003cbr /\u003e\n\n\n\u003ca 
name=\"usage-userdict\"\u003e\u003c/a\u003e\n### Custom dictionary for segmentation \u0026 POS tagging\n\nUser-defined dictionaries are supported. The dictionary format is:\n\n```\nword1 pos1\nword2 pos2\n```\n\nThe jieba dictionary format is also supported:\n\n```\nword1 freq1 pos1\nword2 freq2 pos2\n```\n\nNote: the fields on each line are separated by spaces.\n\n\n\u003cbr /\u003e**Usage example:**\u003cbr /\u003e\n\n```python\nfrom xmnlp.lexical.tokenization import Tokenization\n\n# Define the tokenizer\n# user_dict_path is the path to your dictionary file; texts is the input to process\n# detect_new_word controls whether new words are detected (default True); setting it to False is faster\ntokenizer = Tokenization(user_dict_path, detect_new_word=True)\n\n# Word segmentation\ntokenizer.seg(texts)\n# Part-of-speech tagging\ntokenizer.tag(texts)\n```\n\n\u003cbr /\u003e\n\n\u003ca name=\"usage-ner\"\u003e\u003c/a\u003e\n### xmnlp.ner(text: str) -\u003e List[Tuple[str, str, int, int]]\n\n\u003cbr /\u003eNamed entity recognition. The supported entity types are:\n\n- TIME: time\n- LOCATION: location\n- PERSON: person\n- JOB: occupation\n- ORGANIZAIRION: organization\n\n\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A list of (entity, entity type, start position, end position) tuples\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"现任美国总统是拜登。\"\n\u003e\u003e\u003e print(xmnlp.ner(text))\n[('美国', 'LOCATION', 2, 4), ('总统', 'JOB', 4, 6), ('拜登', 'PERSON', 7, 9)]\n```\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"usage-keyword\"\u003e\u003c/a\u003e\n### xmnlp.keyword(text: str, k: int = 10, stopword: bool = True, allowPOS: Optional[List[str]] = None) -\u003e List[Tuple[str, float]]\n\n\u003cbr /\u003eExtracts keywords from text using the Textrank algorithm.\u003cbr /\u003e\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n- k: number of keywords to return\n- stopword: whether to remove stopwords\n- allowPOS: the part-of-speech tags to allow\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A list of (keyword, weight) tuples\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"\"\"自然语言处理: 是人工智能和语言学领域的分支学科。\n    ...: 在这此领域中探讨如何处理及运用自然语言；自然语言认知则是指让电脑“懂”人类的\n    ...: 语言。\n    ...: 自然语言生成系统把计算机数据转化为自然语言。自然语言理解系统把自然语言转化\n    ...: 为计算机程序更易于处理的形式。\"\"\"\n\u003e\u003e\u003e print(xmnlp.keyword(text))\n[('自然语言', 2.3000579596585897), ('语言', 1.4734141257937314), ('计算机', 
1.3747500999598312), ('转化', 1.2687686226652466), ('系统', 1.1171384775870152), ('领域', 1.0970728069617324), ('人类', 1.0192131829490039), ('生成', 1.0075197087342542), ('认知', 0.9327188339671753), ('指', 0.9218423928455112)]\n```\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"usage-keyphrase\"\u003e\u003c/a\u003e\n### xmnlp.keyphrase(text: str, k: int = 10, stopword: bool = False) -\u003e List[str]\n\n\u003cbr /\u003eExtracts key sentences from text using the Textrank algorithm.\u003cbr /\u003e\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n- k: number of key sentences to return\n- stopword: whether to remove stopwords\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A list of key sentences\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"\"\"自然语言处理: 是人工智能和语言学领域的分支学科。\n    ...: 在这此领域中探讨如何处理及运用自然语言；自然语言认知则是指让电脑“懂”人类的\n    ...: 语言。\n    ...: 自然语言生成系统把计算机数据转化为自然语言。自然语言理解系统把自然语言转化\n    ...: 为计算机程序更易于处理的形式。\"\"\"\n\u003e\u003e\u003e print(xmnlp.keyphrase(text, k=2))\n['自然语言理解系统把自然语言转化为计算机程序更易于处理的形式', '自然语言生成系统把计算机数据转化为自然语言']\n```\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"usage-sentiment\"\u003e\u003c/a\u003e\n### xmnlp.sentiment(text: str) -\u003e Tuple[float, float]\n\n\u003cbr /\u003eSentiment recognition. The model is trained on e-commerce review corpora and is suited to sentiment recognition in e-commerce scenarios.\u003cbr /\u003e\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A tuple in the form (negative sentiment probability, positive sentiment probability)\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"这本书真不错，下次还要买\"\n\u003e\u003e\u003e print(xmnlp.sentiment(text))\n(0.02727833203971386, 0.9727216958999634)\n```\n\n\u003cbr /\u003e\n\n\u003ca name=\"usage-pinyin\"\u003e\u003c/a\u003e\n### xmnlp.pinyin(text: str) -\u003e List[str]\n\n\u003cbr /\u003eConverts text to pinyin.\u003cbr /\u003e\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A list of pinyin syllables\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = 
\"自然语言处理\"\n\u003e\u003e\u003e print(xmnlp.pinyin(text))\n['Zi', 'ran', 'yu', 'yan', 'chu', 'li']\n```\n\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"usage-radical\"\u003e\u003c/a\u003e\n### xmnlp.radical(text: str) -\u003e List[str]\n\n\u003cbr /\u003eExtracts the radical of each character in the text.\u003cbr /\u003e\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- A list of radicals\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"自然语言处理\"\n\u003e\u003e\u003e print(xmnlp.radical(text))\n['自', '灬', '讠', '言', '夂', '王']\n```\n\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"usage-checker\"\u003e\u003c/a\u003e\n### xmnlp.checker(text: str, suggest: bool = True, k: int = 5, max_k: int = 200) -\u003e Union[ List[Tuple[int, str]], Dict[Tuple[int, str], List[Tuple[str, float]]]]\n\n\u003cbr /\u003eText error correction.\u003cbr /\u003e\n\u003cbr /\u003e**Parameters:**\u003cbr /\u003e\n\n- text: input text\n- suggest: whether to return suggested corrections\n- k: number of suggestions to return\n- max_k: maximum number of pinyin searches (keeping the default is recommended)\n\n\n\u003cbr /\u003e**Returns:**\u003cbr /\u003e\n\n- When suggest is False, a list of (index of the wrong word, wrong word) tuples. When suggest is True, a dict whose keys are (index of the wrong word, wrong word) tuples and whose values are lists of (suggested word, weight) tuples.\n\n\n\u003cbr /\u003e**Example:**\u003cbr /\u003e\n\n```python\n\u003e\u003e\u003e import xmnlp\n\u003e\u003e\u003e text = \"不能适应体育专业选拔人材的要求\"\n\u003e\u003e\u003e print(xmnlp.checker(text))\n{(11, '材'): [('才', 1.58528071641922), ('材', 1.0009655653266236), ('裁', 1.0000178480604518), ('员', 0.35814568400382996), ('士', 0.011077565141022205)]}\n```\n\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"usage-sentence_vector\"\u003e\u003c/a\u003e\n### xmnlp.sv.SentenceVector(model_dir: Optional[str] = None, genre: str = '通用', max_length: int = 512)\n\nConstructor for SentenceVector.\n\n- model_dir: model directory; by default the model weights provided by xmnlp are loaded\n- genre: content type; three are currently supported: ['通用', '金融', '国际'] (general, finance, international)\n- max_length: maximum length of the input text, default 512\n\nSentenceVector has the following three member functions:\n\n### xmnlp.sv.SentenceVector.transform(self, text: str) -\u003e np.ndarray\n### xmnlp.sv.SentenceVector.similarity(self, x: Union[str, np.ndarray], y: 
Union[str, np.ndarray]) -\u003e float\n### xmnlp.sv.SentenceVector.most_similar(self, query: str, docs: List[str], k: int = 1, **kwargs) -\u003e List[Tuple[str, float]]\n\n- query: the query text\n- docs: the list of documents\n- k: number of most similar texts to return (top k)\n- kwargs: KDTree parameters; see [sklearn.neighbors.KDTree](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html)\n\nUsage example:\n\n```python\nimport numpy as np\nfrom xmnlp.sv import SentenceVector\n\n\nquery = '我想买手机'\ndocs = [\n    '我想买苹果手机',\n    '我喜欢吃苹果'\n]\n\nsv = SentenceVector(genre='通用')\nfor doc in docs:\n    print('doc:', doc)\n    print('similarity:', sv.similarity(query, doc))\nprint('most similar doc:', sv.most_similar(query, docs))\nprint('query representation shape:', sv.transform(query).shape)\n```\n\nOutput:\n\n```\ndoc: 我想买苹果手机\nsimilarity: 0.68668646\ndoc: 我喜欢吃苹果\nsimilarity: 0.3020076\nmost similar doc: [('我想买苹果手机', 16.255546509314417)]\nquery representation shape: (312,)\n```\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"usage-parallel\"\u003e\u003c/a\u003e\n### Parallel processing\n\nNewer versions no longer provide dedicated parallel interfaces. Use `xmnlp.utils.parallel_handler` to define parallel interfaces instead.\n\nThe interface is:\n\n```python\nxmnlp.utils.parallel_handler(callback: Callable, texts: List[str], n_jobs: int = 2, **kwargs) -\u003e Generator[List[Any], None, None]\n```\n\nUsage example:\n\n```python\nfrom functools import partial\n\nimport xmnlp\nfrom xmnlp.utils import parallel_handler\n\n\n# texts is a list of input strings (sample inputs shown here)\ntexts = ['我想买苹果手机', '我喜欢吃苹果']\n\nseg_parallel = partial(parallel_handler, xmnlp.seg)\n# parallel_handler is declared to return a generator, so collect the results before printing\nprint(list(seg_parallel(texts)))\n```\n\n\u003cbr /\u003e\n\n\n\u003ca name=\"more\"\u003e\u003c/a\u003e\n## 3. 
More\n\n\n\u003ca name=\"more-contribution\"\u003e\u003c/a\u003e\n### Contributors\n\n\u003cbr /\u003eContributions are welcome. Let's build a simple, easy-to-use Chinese NLP toolkit together. \u003cbr /\u003e\n\n\u003ca name=\"more-citation\"\u003e\u003c/a\u003e\n### Citation\n\n\n```bibtex\n@misc{\n  xmnlp,\n  title={XMNLP: A Lightweight Chinese Natural Language Processing Toolkit},\n  author={Xianming Li},\n  year={2018},\n  publisher={GitHub},\n  howpublished={\\url{https://github.com/SeanLee97/xmnlp}},\n}\n```\n\n\u003cbr /\u003e\n\n\u003ca name=\"more-business\"\u003e\u003c/a\u003e\n### Custom development\n\n\u003cbr /\u003eThe author works on NLP research and its practical application, in areas including information extraction and sentiment classification.\u003cbr /\u003e\n\u003cbr /\u003eFor other applied NLP needs, contact [xmlee97@gmail.com](mailto:xmlee97@gmail.com) (this is a paid service; for xmnlp bugs, please open an issue directly).\u003cbr /\u003e\n\u003cbr /\u003e\n\n\u003ca name=\"more-contact\"\u003e\u003c/a\u003e\n### Discussion group\n\n\u003cbr /\u003eSearch for and follow the WeChat official account `xmnlp-ai`, then choose “交流群” (discussion group) from the menu to join.\u003cbr /\u003e\n\u003cbr /\u003e\n\n\u003ca name=\"reference\"\u003e\u003c/a\u003e\n## Reference\n\n\u003cbr /\u003eThe main data used in this project:\u003cbr /\u003e\n\n- Lexical analysis and text error correction: People's Daily corpus\n- Sentiment recognition: [ChineseNlpCorpus](https://github.com/SophonPlus/ChineseNlpCorpus)\n\n\n\u003ca name=\"license\"\u003e\u003c/a\u003e\n## License\n\n\u003cbr /\u003e[Apache 2.0](https://github.com/SeanLee97/xmnlp/blob/master/LICENSE)\u003cbr /\u003e\n\u003cbr /\u003e\n\n\u003cp style='font-size: 14px; color: #666666'\u003e\nMost models are built on \u003ca href='https://github.com/4AI/langml'\u003eLangML\u003c/a\u003e\n\u003c/p\u003e\n","funding_links":[],"categories":["Uncategorized","Python","Chinese NLP Toolkits 中文NLP工具"],"sub_categories":["Uncategorized","Toolkits 综合NLP工具包"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSeanLee97%2Fxmnlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSeanLee97%2Fxmnlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSeanLee97%2Fxmnlp/lists"}