{"id":23854045,"url":"https://github.com/lewangdev/scel2txt","last_synced_at":"2025-09-08T01:31:53.900Z","repository":{"id":40483965,"uuid":"249390755","full_name":"lewangdev/scel2txt","owner":"lewangdev","description":"搜狗细胞词库转鼠须管（Rime）词库","archived":false,"fork":false,"pushed_at":"2023-04-07T09:21:03.000Z","size":29,"stargazers_count":140,"open_issues_count":0,"forks_count":21,"subscribers_count":3,"default_branch":"master","last_synced_at":"2023-11-07T17:51:56.896Z","etag":null,"topics":["golang","python3","rime","sogou-pinyin-dict-to-txt","squrrel"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lewangdev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-03-23T09:44:46.000Z","updated_at":"2023-11-07T17:20:28.000Z","dependencies_parsed_at":"2022-08-09T21:50:48.822Z","dependency_job_id":null,"html_url":"https://github.com/lewangdev/scel2txt","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lewangdev%2Fscel2txt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lewangdev%2Fscel2txt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lewangdev%2Fscel2txt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lewangdev%2Fscel2txt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lewangdev","download_url":"https://codeload.github.com/lewangdev/scel2txt/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"githu
b","repositories_count":232271860,"owners_count":18497766,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["golang","python3","rime","sogou-pinyin-dict-to-txt","squrrel"],"created_at":"2025-01-02T23:51:36.166Z","updated_at":"2025-01-02T23:51:36.757Z","avatar_url":"https://github.com/lewangdev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# scel2txt\n\nConverts Sogou cell dictionaries (`.scel`) into Squirrel (Rime) dictionaries. Python3 and Golang implementations are provided.\n\n## Usage\n\nPut the `*.scel` files downloaded from the [official Sogou dictionary site](https://pinyin.sogou.com/dict/) into the `scel` folder, then run\n\n### Python\n\n```shell\npython3 scel2txt.py\n```\n\n### Or download the prebuilt binary [scel2txt-darwin-amd64-0.0.1.gz](https://github.com/lewangdev/scel2txt/releases/download/v0.0.1/scel2txt-darwin-amd64-0.0.1.gz)\n\n```shell\ngunzip scel2txt-darwin-amd64-0.0.1.gz\nchmod +x scel2txt-darwin-amd64-0.0.1\n./scel2txt-darwin-amd64-0.0.1\n```\n## Generated files\n\n* One dictionary file per input, with the same name and a .txt suffix\n* All *.txt files are automatically merged into `luna_pinyin.sogou.dict.yaml`\n\n\n## Sogou cell dictionary (scel file) format\n\nA Unicode-encoded file stored in a fixed layout, in which every two bytes represent one character (a Chinese character or an English letter).  \n\nIt consists of two main parts: \n\n1. The global pinyin table, at file offset 0x1540+4, with the format (py_idx, py_len, py_str)\n    - py_idx: a two-byte integer, the index of this pinyin\n    - py_len: a two-byte integer, the byte length of the pinyin\n    - py_str: the pinyin itself, two bytes per character, py_len bytes in total\n\n2. 
The Chinese phrase table, at file offset 0x2628 or 0x26c4, with the format (word_count, py_idx_count, py_idx_data, (word_len, word_str, ext_len, ext){word_count}), where the group (word_len, word_str, ext_len, ext) repeats word_count times, meaning that word_count words share the same pinyin\n    - word_count: a two-byte integer, the number of homophones\n    - py_idx_count: a two-byte integer, the number of pinyin indexes\n    - py_idx_data: a sequence of two-byte integers, each one the index of a pinyin\n    - word_len: a two-byte integer, the byte length of the Chinese phrase\n    - word_str: the Chinese phrase, two bytes per Chinese character, word_len bytes in total\n    - ext_len: a two-byte integer, possibly the length of the extension data; it appears to always be 10\n    - ext: the extension data, 10 bytes in total; the first two bytes are an integer (possibly the word frequency, unconfirmed) and the last eight bytes are all 0; ext_len and ext together take 12 bytes\n\n\n## Dictionaries tested so far\n\n* [网络流行新词【官方推荐】](https://pinyin.sogou.com/dict/detail/index/4), 24923 words\n* [最详细的全国地名大全](https://pinyin.sogou.com/dict/detail/index/1316), 114572 words\n* [开发大神专用词库【官方推荐】](https://pinyin.sogou.com/dict/detail/index/75228), 430 words\n* [中国高等院校（大学）大全【官方推荐】](https://pinyin.sogou.com/dict/detail/index/20647), 7192 words\n* [宋词精选【官方推荐】](https://pinyin.sogou.com/dict/detail/index/3), 7297 words\n* [成语俗语【官方推荐】](https://pinyin.sogou.com/dict/detail/index/15097), 46785 words\n* [计算机词汇大全【官方推荐】](https://pinyin.sogou.com/dict/detail/index/15117), 10300 words\n* [论语大全【官方推荐】](https://pinyin.sogou.com/dict/detail/index/22406), 2907 words\n* [歇后语集锦【官方推荐】](https://pinyin.sogou.com/dict/detail/index/22418), 1926 words\n* [数学词汇大全【官方推荐】](https://pinyin.sogou.com/dict/detail/index/15202), 15992 words\n* [物理词汇大全【官方推荐】](https://pinyin.sogou.com/dict/detail/index/15203), 13107 words\n* [中国历史词汇大全【官方推荐】](https://pinyin.sogou.com/dict/detail/index/15130), 20526 words\n* [饮食大全【官方推荐】](https://pinyin.sogou.com/dict/detail/index/15201), 6918 words\n* [上海市城市信息精选](https://pinyin.sogou.com/dict/detail/index/19430), 37757 words\n* [linux少量术语](https://pinyin.sogou.com/dict/detail/index/225), 136 words\n\n## References\n\n1. [scel2mmseg](https://raw.githubusercontent.com/archerhu/scel2mmseg/master/scel2mmseg.py)\n2. 
[scel-to-txt](https://raw.githubusercontent.com/xwzhong/small-program/master/scel-to-txt/scel2txt.py)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flewangdev%2Fscel2txt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flewangdev%2Fscel2txt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flewangdev%2Fscel2txt/lists"}