{"id":15047536,"url":"https://github.com/wangfenjin/simple","last_synced_at":"2025-05-15T07:05:22.407Z","repository":{"id":42072489,"uuid":"244509610","full_name":"wangfenjin/simple","owner":"wangfenjin","description":"支持中文和拼音的 SQLite fts5 全文搜索扩展 ｜ A SQLite3 fts5 tokenizer which supports Chinese and PinYin","archived":false,"fork":false,"pushed_at":"2025-05-11T14:41:05.000Z","size":992,"stargazers_count":681,"open_issues_count":18,"forks_count":97,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-05-11T15:30:02.093Z","etag":null,"topics":["chinese","cpp14","fts","fts5","pinyin","sqlite","sqlite3","sqlite3-fts5","tokenizer"],"latest_commit_sha":null,"homepage":"https://www.wangfenjin.com/posts/simple-tokenizer/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wangfenjin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-03-03T00:59:07.000Z","updated_at":"2025-05-11T14:41:07.000Z","dependencies_parsed_at":"2023-12-08T12:24:51.713Z","dependency_job_id":"0e575bdd-d686-4713-b1c2-8460cc4b5888","html_url":"https://github.com/wangfenjin/simple","commit_stats":{"total_commits":124,"total_committers":11,"mean_commits":"11.272727272727273","dds":"0.24193548387096775","last_synced_commit":"5ed7ca8ea8bb07610b54a0b15476d852c09ce479"},"previous_names":[],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangfenjin%2Fsimple","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangfenjin%2F
simple/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangfenjin%2Fsimple/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wangfenjin%2Fsimple/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wangfenjin","download_url":"https://codeload.github.com/wangfenjin/simple/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254292040,"owners_count":22046426,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese","cpp14","fts","fts5","pinyin","sqlite","sqlite3","sqlite3-fts5","tokenizer"],"created_at":"2024-09-24T20:59:53.388Z","updated_at":"2025-05-15T07:05:17.398Z","avatar_url":"https://github.com/wangfenjin.png","language":"C++","readme":"[![Downloads](https://img.shields.io/github/downloads/wangfenjin/simple/total)](https://img.shields.io/github/downloads/wangfenjin/simple/total)\n[![build](https://github.com/wangfenjin/simple/workflows/CI/badge.svg)](https://github.com/wangfenjin/simple/actions?query=workflow%3ACI)\n[![codecov](https://codecov.io/gh/wangfenjin/simple/branch/master/graph/badge.svg?token=8SHLFZ3RB4)](https://codecov.io/gh/wangfenjin/simple)\n[![CodeFactor](https://www.codefactor.io/repository/github/wangfenjin/simple/badge)](https://www.codefactor.io/repository/github/wangfenjin/simple)\n[![License: MIT](https://img.shields.io/badge/Dual_License-MIT_or_GPL_v3_later-blue.svg)](https://github.com/wangfenjin/simple/blob/master/LICENSE)\n\n# Simple tokenizer\n\nsimple is a [sqlite3 
fts5](https://www.sqlite.org/fts5.html) extension that supports Chinese and Pinyin. It fully implements option 4 from the article [Solving the polyphonic-character problem in WeChat mobile full-text search](https://cloud.tencent.com/developer/article/1198371), providing simple and efficient search over Chinese text and Pinyin.\n\nImplementation notes: https://www.wangfenjin.com/posts/simple-tokenizer/\n\nOn top of this, [cppjieba](https://github.com/yanyiwu/cppjieba) can be used for more precise phrase matching; see https://www.wangfenjin.com/posts/simple-jieba-tokenizer/ for details.\n\n## Usage\n\n### Using it from code\n\n* Download a prebuilt extension from https://github.com/wangfenjin/simple/releases and see the examples directory, which currently has examples for C++, Python, Go and node-sqlite3.\n* For iOS, see:\n  - [#73](https://github.com/wangfenjin/simple/pull/73)\n  - the [demo](https://github.com/hxicoder/DBDemo) provided by [@hxicoder](https://github.com/hxicoder)\n  - the [demo](https://github.com/pipi32167/SQLiteSwiftDemo) provided by [@pipi32167](https://github.com/pipi32167)\n* Examples of use from Rust: https://github.com/wangfenjin/simple/issues/89 https://github.com/fundon/tiny-docs-se\n* Java example: https://github.com/wangfenjin/sqlite-java-connect\n* C# example: https://github.com/dudylan/SqliteCheck/\n* Rust example: https://github.com/xuxiaocheng0201/libsimple/\n* Android / Flutter example: https://github.com/SageMik/sqlite3_simple\n\n### Using it from the command line\n\nFirst make sure the sqlite build you are using supports the fts5 extension. To check, run the statement below; if it fails with an error such as `no such function: fts5`, the build lacks fts5:\n```sql\nselect fts5(?1);\n```\nThe extension is then ready to use. For concrete examples, see [example.sql](./example.sql) and [cpp](https://github.com/wangfenjin/simple/blob/master/examples/cpp/main.cc)\n\n```\n$ ./sqlite3\nSQLite version 3.32.3 2020-06-18 14:00:33\nEnter \".help\" for usage hints.\nConnected to a transient in-memory database.\nUse \".open FILENAME\" to reopen on a persistent database.\nsqlite\u003e .load libsimple\nsqlite\u003e CREATE VIRTUAL TABLE t1 USING fts5(text, tokenize = 'simple');\nsqlite\u003e INSERT INTO t1 VALUES ('中华人民共和国国歌');\nsqlite\u003e select simple_highlight(t1, 0, '[', ']') as text from t1 where text match simple_query('中华国歌');\n[中华]人民共和[国国歌]\nsqlite\u003e select simple_highlight(t1, 0, '[', ']') as text from t1 where text match jieba_query('中华国歌');\n[中华]人民共和国[国歌]\nsqlite\u003e select simple_highlight(t1, 0, '[', ']') as text from t1 where text 
match simple_query('中华人民共和国');\n[中华人民共和国国]歌\nsqlite\u003e select simple_highlight(t1, 0, '[', ']') as text from t1 where text match jieba_query('中华人民共和国');\n[中华人民共和国]国歌\n```\n\n## Features\n\n1. The simple tokenizer tokenizes Chinese and Pinyin; Pinyin support can be turned on or off with a switch.\n2. simple_query() builds the match query automatically, so users do not need to learn the fts5 query syntax.\n3. simple_highlight() highlights matched terms. It is similar to sqlite's built-in highlight, but groups consecutive matched terms into a single highlighted span, which is arguably closer to what users want.\n4. simple_highlight_pos() returns the positions of the matched terms, leaving it to the caller to decide how to use them.\n5. simple_snippet() extracts a snippet around the match. It is similar to sqlite's built-in snippet, likewise enhanced to group consecutive matched terms together.\n6. jieba_query() tokenizes the query with jieba, giving more precise matches without changing the index. jieba support can be disabled with `-DSIMPLE_WITH_JIEBA=OFF` [#35](https://github.com/wangfenjin/simple/pull/35)\n7. jieba_dict() sets the dict directory. It only needs to be called once, and must be called before jieba_query().\n\n## Development\n\n### Building\n\nBuild with a compiler that supports C++14 or later. Running ./build-and-run from the repository root builds everything needed and runs the tests. Build output goes to the output directory.\n\nYou can also run cmake manually:\n```shell\nmkdir build; cd build\ncmake ..\nmake -j 12\nmake install\n```\n\nBuilding for iOS is supported:\n```\n./build-ios.sh\n```\n\n### Code\n- `src/entry`: entry point; registers the sqlite tokenizer and functions\n- `src/simple_tokenizer`: tokenizer implementation\n- `src/simple_highlight`: highlight function, adapted from the built-in highlight function so that adjacent matched words are highlighted as one span\n- `src/pinyin`: Chinese-to-Pinyin conversion and splitting of Pinyin queries\n\n## TODO\n\n- [x] Add CI/CD\n- [x] Add usage examples; see [cpp](https://github.com/wangfenjin/simple/blob/master/examples/cpp/main.cc) [python3](https://github.com/wangfenjin/simple/blob/master/examples/python3/db_connector.py)\n- [x] Make some parameters configurable, such as the path to the Pinyin file (the file is now bundled into the so)\n- [x] Reduce dependencies and shrink the so\n- [x] Publish performance numbers: loading the extension takes under 2 ms; the first use of Pinyin loads the Pinyin file, about 500 ms; the first use of jieba loads the jieba dictionary files, about 4 s.\n\n## Star History\n\n[![Star History 
Chart](https://api.star-history.com/svg?repos=wangfenjin/simple\u0026type=Date)](https://www.star-history.com/#wangfenjin/simple\u0026Date)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwangfenjin%2Fsimple","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwangfenjin%2Fsimple","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwangfenjin%2Fsimple/lists"}