{"id":21884037,"url":"https://github.com/leiless/sqlite3-ngram","last_synced_at":"2025-10-08T20:32:54.037Z","repository":{"id":140255207,"uuid":"385983687","full_name":"leiless/sqlite3-ngram","owner":"leiless","description":"SQLite3 FTS5 n-gram tokenizer (WIP)","archived":false,"fork":false,"pushed_at":"2021-11-16T12:23:45.000Z","size":207,"stargazers_count":16,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-01-20T02:50:42.960Z","etag":null,"topics":["fts","fts5","ngram","sqlite","sqlite-extension","sqlite3","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/leiless.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-14T15:14:32.000Z","updated_at":"2024-11-24T12:04:01.000Z","dependencies_parsed_at":null,"dependency_job_id":"f3463c85-6cb6-40b7-8ce9-5a9af6b3a11a","html_url":"https://github.com/leiless/sqlite3-ngram","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leiless%2Fsqlite3-ngram","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leiless%2Fsqlite3-ngram/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leiless%2Fsqlite3-ngram/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leiless%2Fsqlite3-ngram/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/leiless","downl
oad_url":"https://codeload.github.com/leiless/sqlite3-ngram/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235752207,"owners_count":19039856,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fts","fts5","ngram","sqlite","sqlite-extension","sqlite3","tokenizer"],"created_at":"2024-11-28T10:11:57.291Z","updated_at":"2025-10-08T20:32:48.714Z","avatar_url":"https://github.com/leiless.png","language":"C++","readme":"# `sqlite3-ngram`\n\n`ngram` is a SQLite3 FTS5 [n-gram](https://en.wikipedia.org/wiki/N-gram#Examples) tokenizer. It tokenizes the input text into n-grams in the computational-linguistics sense.\n\nFor the input text `Hello 新 世界`:\n\n- `ngram = 1`\n\n  `Hello`, `新`, `世`, `界`\n\n- `ngram = 2`\n\n  `Hello`, `新`, `新世`, `世界`\n\n- `ngram = 3`\n\n  `Hello`, `新`, `新世`, `新世界`\n\nTokenization is based on [UTF-8](https://en.wikipedia.org/wiki/UTF-8#Encoding) character boundaries and character-category boundaries.\n\nThe supported n-gram size is currently in the range `[1, 4]`; larger sizes could be supported, but that is usually unnecessary.\n\nThis tokenizer extension can be used as a fallback (generic) tokenizer for full-text search (FTS).\n\n## Build\n\n```bash\n# Tested under Podman; Docker should also work.\ncontainer/build.sh\n```\n\n## Usage\n\n```sql\n-- Load the ngram extension first\n.load build/libngram.so\n-- By default N = 2; valid N is in the range [1, 4]\nCREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'ngram');\n-- Alternatively, set N explicitly (replace N with a value in [1, 4])\nCREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'ngram gram N');\n\n-- Or check sql/load-ext.sql for example usage\n-- sqlite3 \u003c 
sql/load-ext.sql\n```\n\n## Advanced usage\n\nYou can combine this tokenizer with SQLite's built-in [`porter`](https://www.sqlite.org/fts5.html#porter_tokenizer) tokenizer:\n\n```sql\n.load build/libngram.so\nCREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter ngram gram N');\n```\n\nIn that case, if you tokenize the word `direct`, then `directed`, `directing`, `direction`, `directly`, and similar forms can all be reduced to `direct` and therefore match.\n\n## Limitation\n\nCurrently only UTF-8 strings are supported for tokenization, though this is usually not a big concern.\n\n## Credits\n\nThis project was inspired by the following project:\n\n[wangfenjin/simple - SQLite FTS5 extension supporting Chinese (Simplified and Traditional) and Pinyin](https://github.com/wangfenjin/simple)\n\n## TODO\n\n* [ ] Implement `ngram_highlight()` function\n* [ ] Add more test cases\n* [ ] Enable build \u0026 test CI\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleiless%2Fsqlite3-ngram","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleiless%2Fsqlite3-ngram","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleiless%2Fsqlite3-ngram/lists"}