{"id":19696411,"url":"https://github.com/jayyip/cws-tensorflow","last_synced_at":"2026-04-02T16:02:01.788Z","repository":{"id":119138800,"uuid":"86705220","full_name":"JayYip/cws-tensorflow","owner":"JayYip","description":"基于Tensorflow的中文分词模型","archived":false,"fork":false,"pushed_at":"2018-12-21T06:15:17.000Z","size":2595,"stargazers_count":26,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-29T11:41:58.660Z","etag":null,"topics":["nlp","tensorflow","word-segmentation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JayYip.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-03-30T13:23:43.000Z","updated_at":"2024-12-17T14:25:40.000Z","dependencies_parsed_at":"2023-07-09T23:46:10.459Z","dependency_job_id":null,"html_url":"https://github.com/JayYip/cws-tensorflow","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/JayYip/cws-tensorflow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JayYip%2Fcws-tensorflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JayYip%2Fcws-tensorflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JayYip%2Fcws-tensorflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JayYip%2Fcws-tensorflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JayYip","download_url":"https://code
load.github.com/JayYip/cws-tensorflow/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JayYip%2Fcws-tensorflow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31309583,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp","tensorflow","word-segmentation"],"created_at":"2024-11-11T19:35:01.273Z","updated_at":"2026-04-02T16:02:01.766Z","avatar_url":"https://github.com/JayYip.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TensorFlow Chinese Word Segmentation Model\r\n\r\nNote: If you need higher accuracy, use https://github.com/JayYip/bert-multitask-learning instead.\r\n\r\nSome of the code is adapted from the [TensorFlow Model Zoo](https://github.com/tensorflow/models).\r\n\r\nEnvironment:\r\n\r\n- Python 3.5 / Python 2.7\r\n- TensorFlow r1.4\r\n- Windows / Ubuntu 16.04\r\n- hanziconv 0.3.2\r\n- numpy\r\n\r\n## Training the Model\r\n\r\n### 1. Build the training data\r\nChange into the `data` directory and run the following commands:\r\n\r\n```\r\nDATA_OUTPUT=\"output_dir\"\r\n\r\npython build_pku_msr_input.py \\\r\n --num_threads=4 \\\r\n --output_dir=${DATA_OUTPUT}\r\n```\r\n\r\n### 2. Character embeddings\r\n\r\n#### 2.1 Pretrained character embeddings\r\n1. In `configuration.py`, set `self.random_embedding` in `ModelConfig` to `False`.\r\n2. 
Download the Chinese character embedding dataset from [Polyglot](https://sites.google.com/site/rmyeid/projects/polyglot) into the project directory, then run `process_chr_embedding.py` from the project directory.\r\n\r\n```\r\nEMBEDDING_DIR=...\r\nVOCAB_DIR=...\r\n\r\npython process_chr_embedding.py \\\r\n --chr_embedding_dir=${EMBEDDING_DIR} \\\r\n --vocab_dir=${VOCAB_DIR}\r\n```\r\n\r\n#### 2.2 Randomly initialized character embeddings\r\n\r\nIn `configuration.py`, set `self.random_embedding` in `ModelConfig` to `True`.\r\n\r\n### 3. Train the model\r\n\r\nAdjust the model and training parameters in `configuration.py` as needed, then start training.\r\nAny of the arguments below that are omitted fall back to their default values.\r\n\r\n```\r\nTRAIN_INPUT=\"data/${DATA_OUTPUT}\"\r\nMODEL=\"save_model\"\r\n\r\npython train.py \\\r\n --input_file_dir=${TRAIN_INPUT} \\\r\n --train_dir=${MODEL} \\\r\n --log_every_n_steps=10\r\n```\r\n\r\n## Segmenting text with a trained model\r\n\r\nInput files must be UTF-8 encoded; files with the extensions 'txt', 'csv', and 'utf8' are detected.\r\n\r\n```\r\nINF_INPUT=...\r\nINF_OUTPUT=...\r\n\r\npython inference.py \\\r\n --input_file_dir=${INF_INPUT} \\\r\n --train_dir=${MODEL} \\\r\n --vocab_dir=${VOCAB_DIR} \\\r\n --out_dir=${INF_OUTPUT}\r\n```\r\n\r\n## Adapting the algorithm to your needs\r\n\r\nThis model uses a unidirectional LSTM + CRF, but it is structured so that the algorithm can be modified. In the ```lstm_based_cws_model.py``` file\r\n\r\n\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjayyip%2Fcws-tensorflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjayyip%2Fcws-tensorflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjayyip%2Fcws-tensorflow/lists"}