{"id":30762685,"url":"https://github.com/cwt/fts5-icu-tokenizer","last_synced_at":"2025-09-04T15:09:47.441Z","repository":{"id":309999432,"uuid":"1013646425","full_name":"cwt/fts5-icu-tokenizer","owner":"cwt","description":"FTS5 ICU Tokenizer for SQLite (mirror)","archived":false,"fork":false,"pushed_at":"2025-08-15T02:32:25.000Z","size":7,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-15T04:21:32.032Z","etag":null,"topics":["fts5","sqlite","tokenizer"],"latest_commit_sha":null,"homepage":"https://sr.ht/~cwt/fts5-icu-tokenizer/","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cwt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-04T08:33:10.000Z","updated_at":"2025-08-15T02:32:23.000Z","dependencies_parsed_at":"2025-08-15T04:21:33.445Z","dependency_job_id":"6023cb2f-66b6-429e-abb0-42dd73c0fc29","html_url":"https://github.com/cwt/fts5-icu-tokenizer","commit_stats":null,"previous_names":["cwt/fts5-icu-tokenizer"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/cwt/fts5-icu-tokenizer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cwt%2Ffts5-icu-tokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cwt%2Ffts5-icu-tokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cwt%2Ffts5-icu-tokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cwt%2Ffts5-icu-tokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cwt","download_url":"https://codeload.github.com/cwt/fts5-icu-tokenizer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cwt%2Ffts5-icu-tokenizer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273626677,"owners_count":25139527,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-04T02:00:08.968Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fts5","sqlite","tokenizer"],"created_at":"2025-09-04T15:09:41.687Z","updated_at":"2025-09-04T15:09:47.430Z","avatar_url":"https://github.com/cwt.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FTS5 ICU Tokenizer for SQLite\n\nThis project provides a custom FTS5 tokenizer for SQLite that uses the International Components for Unicode (ICU) library to provide robust word segmentation for various languages.\n\nIt is written in C for maximum stability and performance, making it suitable for high-availability systems. The target locale is configurable at build time.\n\n## Prerequisites\n\nBefore you begin, ensure you have the following installed on your system:\n\n- **CMake** (version 3.10 or higher)\n- A **C Compiler** (GCC, Clang, or MSVC)\n- **SQLite3** development libraries (`libsqlite3-dev` on Debian/Ubuntu, `sqlite-devel` on Fedora/CentOS)\n- **ICU** development libraries (`libicu-dev` on Debian/Ubuntu, `libicu-devel` on Fedora/CentOS)\n\n## Building and Installing\n\nThis project uses a standard CMake build process. The target locale can be specified using the `LOCALE` variable.\n\n### 1. Create a build directory\n\nIt's best practice to build the project in a separate directory.\n\n```bash\nmkdir build\ncd build\n```\n\n### 2. Configure the project with CMake\n\nThis step generates the native build files (e.g., Makefiles). You can specify the locale here.\n\n**Example for Thai (`th`):** This will produce `fts5_icu_th.so` and register the tokenizer as `icu_th`.\n\n```bash\ncmake .. -DLOCALE=th\n```\n\n**Example for Chinese (`cn`):** This will produce `fts5_icu_cn.so` and register the tokenizer as `icu_cn`.\n\n```bash\ncmake .. -DLOCALE=cn\n```\n\n**Example for a Universal Tokenizer:** If you omit the `LOCALE` option, it will build a generic tokenizer named `fts5_icu.so` that uses the default ICU word breaker and registers as `icu`.\n\n```bash\ncmake ..\n```\n\n### 3. Compile the project\n\n```bash\ncmake --build .\n```\n\n_(Alternatively, on Linux/macOS, you can just run `make`)_\n\n### 4. Install the library (optional)\n\nThis will copy the compiled shared library to a standard system location (e.g., `/usr/local/lib`).\n\n```bash\nsudo cmake --build . --target install\n```\n\n_(Alternatively, on Linux/macOS, you can just run `sudo make install`)_\n\n## Building on Windows\n\nThis project can be built on Windows using Visual Studio and CMake. Here's how:\n\n### Prerequisites for Windows\n\n1. **Visual Studio 2022** with C++ development tools\n   - Download from https://visualstudio.microsoft.com/\n   - Ensure you install the \"Desktop development with C++\" workload\n2. **CMake** 3.10 or higher\n   - Download from https://cmake.org/download/\n   - Ensure CMake is added to your system PATH during installation\n3. **SQLite** pre-compiled binaries and source code\n4. **ICU4C** pre-compiled binaries\n\n### Step 1: Download and Extract Dependencies\n\n#### SQLite\n1. Download the pre-compiled SQLite binaries:\n   - Visit https://www.sqlite.org/download.html\n   - Download the \"Precompiled Binaries for Windows\" (sqlite-dll-win64-x64-*.zip)\n   - Extract to a directory of your choice (e.g., `C:\\sqlite`)\n\n2. Download the SQLite source code:\n   - From the same page, download \"Source Code\" (sqlite-src-*.zip)\n   - Extract to a directory of your choice (e.g., `C:\\sqlite-src`)\n\n#### ICU4C\n1. Download pre-compiled ICU4C binaries:\n   - Visit https://github.com/unicode-org/icu/releases\n   - Download the latest Windows binaries (e.g., `icu4c-*-Win64-msvc.zip`)\n   - Extract to a directory of your choice (e.g., `C:\\icu`)\n\n### Step 2: Generate SQLite Header Files\n\n1. Open \"Developer PowerShell for VS 2022\" (from Start Menu) - this is the recommended shell\n2. Navigate to your SQLite source directory:\n   ```powershell\n   cd C:\\sqlite-src\n   ```\n3. Generate the sqlite3.h header file:\n   ```powershell\n   nmake /f Makefile.msc sqlite3.h\n   ```\n4. Create an include directory in your SQLite binaries folder:\n   ```powershell\n   mkdir C:\\sqlite\\include\n   ```\n5. Copy the generated header files:\n   ```powershell\n   copy sqlite3.h C:\\sqlite\\include\\sqlite3.h\n   copy src\\sqlite3ext.h C:\\sqlite\\include\\sqlite3ext.h\n   ```\n\n### Step 3: Generate SQLite Import Library\n\n1. In the same PowerShell window, navigate to your SQLite binaries directory:\n   ```powershell\n   cd C:\\sqlite\n   ```\n2. Generate the import library from the DEF file:\n   ```powershell\n   lib /def:sqlite3.def /out:sqlite3.lib /machine:x64\n   ```\n\n### Step 4: Build the FTS5 ICU Tokenizer\n\n1. Clone or download this repository to a directory of your choice\n2. Create a build directory:\n   ```powershell\n   mkdir build\n   cd build\n   ```\n3. Configure with CMake (replace paths with your actual paths):\n   ```powershell\n   cmake -G \"Visual Studio 17 2022\" -T host=x64 -A x64 .. `\n     -DICU_ROOT=\"C:\\icu\" `\n     -DSQLite3_INCLUDE_DIR=\"C:\\sqlite\\include\" `\n     -DSQLite3_LIBRARY=\"C:\\sqlite\\sqlite3.lib\"\n   ```\n4. Build the project:\n   ```powershell\n   cmake --build . --config Release\n   ```\n\n### Step 5: Using the Extension\n\nAfter successful compilation, you'll find `fts5_icu.dll` in the `build\\Release` directory. To use it with SQLite:\n\n#### Method 1: Simple Load (Windows)\n\nThe easiest and most reliable method on Windows is to copy the built `fts5_icu.dll` along with `icudt77.dll` and `icuuc77.dll` from the pre-compiled ICU4C (in this case ICU77) to your current directory, then use the command:\n\n```sql\n.load fts5_icu.dll\n\nCREATE VIRTUAL TABLE documents USING fts5(\n    content,\n    tokenize = 'icu'\n);\n```\n\nThis method is recommended over full path loading because of how Windows resolves DLL dependencies. When you use a full path to load the extension (e.g., `.load ./build/Release/fts5_icu.dll`), Windows may not automatically search for the required ICU DLLs (`icudt77.dll` and `icuuc77.dll`) in the same directory. Instead, it follows the Windows DLL search order, which typically looks in:\n\n1. The directory where the application (SQLite) is located\n2. The system directory\n3. The Windows directory\n4. The current directory\n5. The directories listed in the PATH environment variable\n\nBy copying the DLLs to the current directory and using the simple load command, you ensure that all required DLLs are found correctly.\n\n#### Method 2: Full Path Load (Not Recommended on Windows)\n\nAlternatively, you can load the extension using the full path, but this may cause issues with loading the required ICU DLLs:\n\n```sql\n.load ./build/Release/fts5_icu.dll\n\nCREATE VIRTUAL TABLE documents USING fts5(\n    content,\n    tokenize = 'icu'\n);\n```\n\nTo build for a specific locale (e.g., Thai), add the LOCALE parameter during CMake configuration:\n\n```powershell\ncmake -G \"Visual Studio 17 2022\" -T host=x64 -A x64 .. `\n  -DICU_ROOT=\"C:\\icu\" `\n  -DSQLite3_INCLUDE_DIR=\"C:\\sqlite\\include\" `\n  -DSQLite3_LIBRARY=\"C:\\sqlite\\sqlite3.lib\" `\n  -DLOCALE=th\n```\n\nThis will create `fts5_icu_th.dll` and register the tokenizer as `icu_th`.\n\n## Usage\n\nAfter compiling, you can load the specific tokenizer you built into SQLite.\n\n**Example for the Thai Tokenizer:**\n\n```sql\n-- Provide the path to the specific library in your build directory.\n.load ./build/fts5_icu_th.so\n\n-- Create a virtual table using the correctly named tokenizer\nCREATE VIRTUAL TABLE documents_th USING fts5(\n    content,\n    tokenize = 'icu_th'\n);\n\n-- Insert and query Thai text\nINSERT INTO documents_th(content) VALUES ('การทดสอบภาษาไทยในระบบค้นหา');\nSELECT * FROM documents_th WHERE documents_th MATCH 'ภาษา';\n```\n\n**Example for the Universal Tokenizer:**\n\n```sql\n-- Provide the path to the specific library in your build directory.\n.load ./build/fts5_icu.so\n\n-- Create a virtual table using the correctly named tokenizer\nCREATE VIRTUAL TABLE documents USING fts5(\n    content,\n    tokenize = 'icu'\n);\n\n-- Insert and query text\nINSERT INTO documents(content) VALUES ('甜蜜蜜,你笑得甜蜜蜜-หวานปานน้ำผึ้ง,ยิ้มของคุณช่างหวานปานน้ำผึ้ง');\nSELECT * FROM documents WHERE documents MATCH 'หวาน';\nSELECT * FROM documents WHERE documents MATCH '甜蜜蜜';\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcwt%2Ffts5-icu-tokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcwt%2Ffts5-icu-tokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcwt%2Ffts5-icu-tokenizer/lists"}