{"id":13704666,"url":"https://github.com/lizhichao/VicWord","last_synced_at":"2025-05-05T10:30:44.323Z","repository":{"id":43961465,"uuid":"115339213","full_name":"lizhichao/VicWord","owner":"lizhichao","description":" 一个纯php分词","archived":false,"fork":false,"pushed_at":"2020-12-30T01:54:26.000Z","size":2505,"stargazers_count":596,"open_issues_count":2,"forks_count":112,"subscribers_count":26,"default_branch":"master","last_synced_at":"2024-11-06T13:06:30.611Z","etag":null,"topics":["php","segmentation","split","word"],"latest_commit_sha":null,"homepage":null,"language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lizhichao.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-12-25T13:15:53.000Z","updated_at":"2024-08-29T07:48:03.000Z","dependencies_parsed_at":"2022-09-15T20:10:20.334Z","dependency_job_id":null,"html_url":"https://github.com/lizhichao/VicWord","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lizhichao%2FVicWord","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lizhichao%2FVicWord/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lizhichao%2FVicWord/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lizhichao%2FVicWord/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lizhichao","download_url":"https://codeload.github.com/lizhichao/VicWord/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224
440046,"owners_count":17311576,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["php","segmentation","split","word"],"created_at":"2024-08-02T21:01:16.543Z","updated_at":"2024-11-13T11:31:38.838Z","avatar_url":"https://github.com/lizhichao.png","language":"PHP","funding_links":[],"categories":["目录","类库"],"sub_categories":["字符串 Strings","文本处理"],"readme":"# VicWord: a word segmenter written in pure PHP\n\n\u003ca href=\"https://github.com/996icu/996.ICU/blob/master/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/support-996.icu-red.svg\"\u003e\u003c/a\u003e\n\nQQ discussion group: 731475644\n\n## Installation\n\n```shell\ncomposer require lizhichao/word\n```\n\n\n\n## Segmentation overview\n- Three segmentation methods are provided\n    - `getWord`: longest-match-first segmentation. Fastest\n    - `getShortWord`: fine-grained segmentation. Slightly slower than the fastest\n    - `getAutoWord`: automatic segmentation. Best results\n- Custom dictionaries are supported: you can add your own words to the dictionary. The dictionary supports the text format `json` and the binary format `igb`.\nBinary dictionaries are smaller and load faster\n- `dict.igb` contains 175662 words. Contributions of new words to `dict.txt` are welcome; the format is (word \\t idf \\t part of speech)\n    - To get the idf, search for the word on Baidu and compute `Math.log(100000001 / number of results)`. If you know a better method, contributions are welcome.\n    - Part of speech: [punctuation, noun, verb, adjective, distinguishing word, pronoun, numeral, measure word, adverb, preposition, conjunction, particle, modal particle, onomatopoeia, interjection]; use the index, with punctuation being 0\n- Comparison of the three segmentation results\n```php\n$fc = new VicWord();\n$arr = $fc-\u003egetWord('北京大学生喝进口红酒，在北京大学生活区喝进口红酒');\n//北京大学|生喝|进口|红酒|，|在|北京大学|生活区|喝|进口|红酒\n//$arr is an array; each element has the structure [word, word position, part of speech, whether the word is in the dictionary]. Only the words are listed here\n\n$arr = $fc-\u003egetShortWord('北京大学生喝进口红酒，在北京大学生活区喝进口红酒');\n//北京|大学|生喝|进口|红酒|，|在|北京|大学|生活|区喝|进口|红酒\n\n$arr = $fc-\u003egetAutoWord('北京大学生喝进口红酒，在北京大学生活区喝进口红酒');\n//北京|大学生|喝|进口|红酒|，|在|北京大学|生活区|喝|进口|红酒\n\n//For comparison\n//QQ's segmenter http://nlp.qq.com/semantic.cgi#page2\n//Baidu's segmenter http://ai.baidu.com/tech/nlp/lexical\n\n```\n## Segmentation speed\nMachine: Alibaba Cloud, `Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz`   \n`getWord`: 
1.4 million characters per second  \n`getShortWord`: 1.38 million characters per second  \n`getAutoWord`: 400,000 characters per second  \nThe test text is a 5,000-character passage copied from Baidu Baike\n\n## Building a dictionary\n- The dictionary supports any UTF-8 characters   \n- Dictionary size does not affect segmentation speed  \n\nThere is only one method: VicDict-\u003eadd(word, part of speech = null)\n```php\nrequire __DIR__.'/Lib/VicDict.php';\n\n//Two dictionary formats are currently supported: igb and json. igb requires the igbinary extension; igb files are smaller and load faster\n$path = ''; //dictionary path\n$dict = new VicDict($path);\n\n//Add a word to the dictionary: add(word, part of speech). Language-independent; any UTF-8 encoded characters are accepted\n$dict-\u003eadd('中国','n');\n\n//Save the dictionary\n$dict-\u003esave();\n```\n\n## demo \n[demo](http://blogs.vicsdf.com/my/fc)\n\n## Other software by this author\n* [A minimal, high-performance PHP framework supporting [swoole | php-fpm] environments](https://github.com/lizhichao/one)\n* [ClickHouse TCP client](https://github.com/lizhichao/one-ck)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flizhichao%2FVicWord","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flizhichao%2FVicWord","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flizhichao%2FVicWord/lists"}