Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/leiless/sqlite3-ngram
SQLite3 FTS5 n-gram tokenizer (WIP)
https://github.com/leiless/sqlite3-ngram
fts fts5 ngram sqlite sqlite-extension sqlite3 tokenizer
Last synced: 16 days ago
JSON representation
SQLite3 FTS5 n-gram tokenizer (WIP)
- Host: GitHub
- URL: https://github.com/leiless/sqlite3-ngram
- Owner: leiless
- License: mit
- Created: 2021-07-14T15:14:32.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-11-16T12:23:45.000Z (about 3 years ago)
- Last Synced: 2025-01-20T02:50:42.960Z (23 days ago)
- Topics: fts, fts5, ngram, sqlite, sqlite-extension, sqlite3, tokenizer
- Language: C++
- Homepage:
- Size: 202 KB
- Stars: 16
- Watchers: 4
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# `sqlite3-ngram`
`ngram` is a SQLite3 FTS5 [n-gram](https://en.wikipedia.org/wiki/N-gram#Examples) tokenizer, it tokenize the input text in computational linguistics level.
For the input text `Hello 新 世界`:
- `ngram = 1`
`Hello`, `新`, `世`, `界`
- `ngram = 2`
`Hello`, `新`, `新世`, `世界`
- `ngram = 3`
`Hello`, `新`, `新世`, `新世界`
The tokenization is based on [UTF-8](https://en.wikipedia.org/wiki/UTF-8#Encoding) character and character category boundary.
The ngram currently support is in range `[1, 4]`, larger ngram can be supported but it's usually unnecessary.
This tokenizer extension can be used as a fallback(generic) tokenizer for FTS purpose.
## Build
```bash
# Tested under podman, docker should also be ok.
container/build.sh
```## Usage
```sql
-- First load the ngram extension
.load build/libngram.so
-- By default N = 2, valid N is in range [1, 4]
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'ngram');
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'ngram gram N');-- Or check sql/load-ext.sql for example usage
-- sqlite3 < sql/load-ext.sql
```## Advance usage
You can integrate this tokenizer with the SQLite3 official [`porter`](https://www.sqlite.org/fts5.html#porter_tokenizer) tokenizer:
```sql
.load build/libngram.so
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter ngram gram N');
```In such case, if you tokenized the word `direct`. `directed`, `directing`, `direction`, `directly`... all can be coalesced into `direct` and thus hit a match.
## Limitation
Currently only the UTF-8 string is supported for tokenization, usually not a big concern though.
## Credits
This project was inspired from the following projects:
[wangfenjin/simple - 支持中文(简体和繁体)和拼音的 SQLite fts5 扩展](https://github.com/wangfenjin/simple)
## TODO
* [ ] Implement `ngram_highlight()` function
* [ ] Add more test cases
* [ ] Enable build & test CI