{"id":13542839,"url":"https://github.com/koth/kcws","last_synced_at":"2025-05-15T17:05:22.946Z","repository":{"id":48663428,"uuid":"74712719","full_name":"koth/kcws","owner":"koth","description":"Deep Learning Chinese Word Segment ","archived":false,"fork":false,"pushed_at":"2018-05-18T03:22:58.000Z","size":14014,"stargazers_count":2081,"open_issues_count":44,"forks_count":645,"subscribers_count":162,"default_branch":"master","last_synced_at":"2025-04-07T22:08:21.042Z","etag":null,"topics":["chinese-text-segmentation","deep-learning","nlp","pos-tagger","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/koth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-11-25T00:17:50.000Z","updated_at":"2025-03-10T23:02:14.000Z","dependencies_parsed_at":"2022-09-15T18:23:31.634Z","dependency_job_id":null,"html_url":"https://github.com/koth/kcws","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koth%2Fkcws","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koth%2Fkcws/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koth%2Fkcws/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koth%2Fkcws/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/koth","download_url":"https://codeload.github.com/koth/kcws/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254384988,"owners_count":220
62422,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese-text-segmentation","deep-learning","nlp","pos-tagger","tensorflow"],"created_at":"2024-08-01T11:00:18.571Z","updated_at":"2025-05-15T17:05:17.924Z","avatar_url":"https://github.com/koth.png","language":"C++","readme":"\n### References\n\nThe BiLSTM+CRF model in this project follows the paper http://www.aclweb.org/anthology/N16-1030; the IDCNN+CRF model follows https://arxiv.org/abs/1702.02098\n\n### Build\n\n1. Install the bazel build tool and TensorFlow (this project currently requires tf 1.0.0-alpha or later)\n2. Change to the project's code directory and run ./configure\n3. Build the backend service\n\n   \u003e bazel build //kcws/cc:seg_backend_api\n\n### Training\n\n1. Follow the 待字闺中 WeChat official account and reply \"kcws\" to get the corpus download address:\n   \n   ![logo](https://github.com/koth/kcws/blob/master/docs/qrcode_dzgz.jpg?raw=true \"待字闺中\")\n   \n   \n2. Extract the corpus into a directory\n\n3. 
Change to the code directory and run:\n  \u003e python kcws/train/process_anno_file.py \u003ccorpus dir\u003e pre_chars_for_w2v.txt\n  \n  \u003e bazel build third_party/word2vec:word2vec\n  \n  \u003e First build a preliminary vocabulary\n  \n  \u003e ./bazel-bin/third_party/word2vec/word2vec -train pre_chars_for_w2v.txt -save-vocab pre_vocab.txt -min-count 3\n  \n  \u003e Handle low-frequency characters\n  \n  \u003e python kcws/train/replace_unk.py pre_vocab.txt pre_chars_for_w2v.txt chars_for_w2v.txt\n  \u003e \n  \u003e Train word2vec\n  \u003e \n  \u003e ./bazel-bin/third_party/word2vec/word2vec -train chars_for_w2v.txt -output vec.txt -size 50 -sample 1e-4 -negative 5 -hs 1 -binary 0 -iter 5\n  \u003e \n  \u003e Build the training-corpus generation tool\n  \u003e \n  \u003e bazel build kcws/train:generate_training\n  \u003e \n  \u003e Generate the corpus\n  \u003e \n  \u003e ./bazel-bin/kcws/train/generate_training vec.txt \u003ccorpus dir\u003e all.txt\n  \u003e \n  \u003e Produce the train.txt and test.txt files\n  \u003e \n  \u003e python kcws/train/filter_sentence.py all.txt\n  \n4. With TensorFlow installed, change to the kcws code directory and run:\n\n  \u003e python kcws/train/train_cws.py --word2vec_path vec.txt --train_data_path \u003cabsolute path to train.txt\u003e --test_data_path test.txt --max_sentence_len 80 --learning_rate 0.001\n  (The IDCNN model is used by default; pass \"--use_idcnn False\" to switch to the BiLSTM model)\n  \n5. Generate the vocab\n  \u003e bazel build kcws/cc:dump_vocab\n  \n  \u003e ./bazel-bin/kcws/cc/dump_vocab vec.txt kcws/models/basic_vocab.txt\n  \n6. Export the trained model\n \u003e  python tools/freeze_graph.py --input_graph logs/graph.pbtxt --input_checkpoint logs/model.ckpt --output_node_names \"transitions,Reshape_7\" --output_graph kcws/models/seg_model.pbtxt\n\n7. Download the POS-tagging model (a temporary measure; training and export of the POS model will be documented later)\n\n   \u003e Download pos_model.pbtxt from https://pan.baidu.com/s/1bYmABk into the kcws/models/ directory\n\n8. 
Run the web service\n \u003e  ./bazel-bin/kcws/cc/seg_backend_api --model_path=kcws/models/seg_model.pbtxt (absolute path to seg_model.pbtxt)   --vocab_path=kcws/models/basic_vocab.txt   --max_sentence_len=80\n\n### POS-tagging training notes\n\nhttps://github.com/koth/kcws/blob/master/pos_train.md\n\n### Custom dictionary\nCustom dictionaries are currently applied at the decoding stage; see kcws/cc/test_seg.cc for concrete usage.\nThe dictionary is a plain-text file; each line has the format:\n\u003e\u003ccustom entry\u003e\\t\u003cweight\u003e\n\nFor example:\n\u003e蓝瘦香菇\t4\n\nThe weight is a positive integer, typically 4 or greater; the larger the value, the more important the entry\n \n### Demo\nhttp://45.32.100.248:9090/\n\nAlso: a company-name recognition demo trained with the same kind of model:\n\nhttp://45.32.100.248:18080\n","funding_links":[],"categories":["Segmentation","Machine Learning","C++","Chinese NLP Toolkits 中文NLP工具","Projects built with Bazel"],"sub_categories":["Word Segmentation","Chinese Word Segment 中文分词","non-Google projects"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoth%2Fkcws","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkoth%2Fkcws","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoth%2Fkcws/lists"}
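
The custom-dictionary format in the embedded README (one `<entry>\t<weight>` pair per line, weight a positive integer, typically 4 or greater) can be illustrated with a minimal parser sketch. This is not part of kcws; the function name `load_user_dict` is hypothetical, and the actual decode-time lookup lives in the C++ code (see kcws/cc/test_seg.cc).

```python
def load_user_dict(text):
    """Parse a kcws-style custom dictionary: one "<entry>\\t<weight>" per line.

    Blank lines are skipped. Returns a dict mapping entry -> integer weight;
    larger weights mark more important entries.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        word, weight = line.split("\t", 1)
        entries[word] = int(weight)
    return entries

# Example using the entry from the README:
sample = "蓝瘦香菇\t4\n"
print(load_user_dict(sample))  # {'蓝瘦香菇': 4}
```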