https://github.com/haifengkao/sqlitesubstringsearch

An open source tokenizer which supports fast substring search with sqlite FTS (full text search)

Host: GitHub
URL: https://github.com/haifengkao/sqlitesubstringsearch
Owner: haifengkao
Created: 2013-04-06T23:33:23.000Z (over 11 years ago)
Default Branch: master
Last Pushed: 2015-11-26T06:31:09.000Z (about 9 years ago)
Last Synced: 2023-05-27T19:01:09.565Z (over 1 year ago)
Language: C
Homepage:
Size: 15.6 KB
Stars: 83
Watchers: 3
Forks: 9
Open Issues: 2
Metadata Files:
- Readme: README.md

SqliteSubstringSearch
=====================

An open source tokenizer that supports fast substring search with SQLite FTS (full-text search).

If you think `LIKE '%text%'` is too slow, this is the right solution for you.

## How to use it

* register the "character_tokenizer" module

* create a FTS table with character_tokenizer. For example:

    CREATE VIRTUAL TABLE Book USING fts3(name TEXT NOT NULL, author TEXT, tokenize=character);

* to search for a substring, use [phrase queries](http://www.sqlite.org/fts3.html#section_3). For example, to match strings such as "Adrenalines", "Linux", or "Penicillin", use:

    SELECT * FROM Book WHERE Book MATCH '"lin"';

See SqliteSubstringSearchDemo for a complete example.

## Objective-C Example

To open a database whose FTS tables were created with the `character` tokenizer, register the tokenizer on the connection first:

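```objc
#import <FMDB/FMDB.h>  // assumed FMDB umbrella header; adjust to how FMDB is installed
#import "character_tokenizer.h"

FMDatabase *database = [[FMDatabase alloc] initWithPath:@"my_database.db"];
if ([database open]) {
    // Register the tokenizer under the name "character" so that FTS tables
    // created with tokenize=character can be opened and queried.
    const sqlite3_tokenizer_module *ptr;
    get_character_tokenizer_module(&ptr);
    registerTokenizer(database.sqliteHandle, "character", ptr);
}
```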

## Motivation

English uses a space to separate words, but Chinese and Japanese do not.

Because the built-in FTS tokenizers rely on spaces to separate words, they treat a whole Chinese or Japanese sentence as a single word, which makes FTS nearly useless for these languages.

The third-party Chinese and Japanese tokenizers ([mmseg](https://code.google.com/p/pymmseg-cpp/) for Chinese; [MeCab](http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html) and [ChaSen](http://chasen-legacy.sourceforge.jp/) for Japanese) use sophisticated, memory-intensive approaches to find the ambiguous boundaries between words. For simple applications such as looking up people's names, a plain substring search is a more reasonable choice than these tokenizers.

## How it works

The character tokenizer treats each character as an individual token.
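
As a rough illustration of per-character tokenization (this is not the code from this repository), the following C sketch walks a UTF-8 string and records each character as its own token. The names `utf8_char_len`, `char_token`, and `char_tokenize` are hypothetical, and the sketch assumes well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of the UTF-8 character whose lead byte is c.
   Assumes well-formed input; unexpected bytes are treated as length 1. */
static size_t utf8_char_len(unsigned char c) {
    if (c < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((c & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;
}

/* One token per character: byte offset and byte length within the input. */
typedef struct {
    size_t offset;
    size_t length;
} char_token;

/* Hypothetical helper: splits text (n_bytes long) into one token per character.
   Stores at most max_tokens tokens in out and returns the total token count. */
static size_t char_tokenize(const char *text, size_t n_bytes,
                            char_token *out, size_t max_tokens) {
    assert(text != NULL && out != NULL);
    size_t count = 0;
    size_t pos = 0;
    while (pos < n_bytes) {
        size_t len = utf8_char_len((unsigned char)text[pos]);
        if (pos + len > n_bytes) {
            len = n_bytes - pos;  /* clamp a truncated final character */
        }
        if (count < max_tokens) {
            out[count].offset = pos;
            out[count].length = len;
        }
        count++;
        pos += len;
    }
    return count;
}
```

With this sketch, the five-byte string "Linux" yields five tokens, and a three-character Japanese string encoded as nine UTF-8 bytes yields three tokens of three bytes each.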

Searching for a substring is then equivalent to finding a run of consecutive tokens in the document, which FTS supports through [phrase queries](http://www.sqlite.org/fts3.html#section_3).
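
To make the equivalence concrete, here is a minimal C sketch of what a phrase match over one-character tokens amounts to. It only models the idea and does not use FTS or its index. The names `utf8_len` and `phrase_matches` are hypothetical. The comparison is byte-exact and therefore case-sensitive, which may differ from how the real tokenizer handles case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 character whose lead byte is c.
   Assumes well-formed input; unexpected bytes are treated as length 1. */
static size_t utf8_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Returns 1 if the character tokens of phrase occur consecutively in doc,
   otherwise 0. Both strings are assumed to be NUL-terminated UTF-8. */
static int phrase_matches(const char *doc, const char *phrase) {
    assert(doc != NULL && phrase != NULL);
    size_t doc_len = strlen(doc);
    size_t phrase_len = strlen(phrase);
    if (phrase_len == 0 || phrase_len > doc_len) {
        return 0;
    }
    /* Every character boundary in doc is a token boundary, so try each one
       as the start of the run of consecutive tokens. */
    size_t start = 0;
    while (start + phrase_len <= doc_len) {
        if (memcmp(doc + start, phrase, phrase_len) == 0) {
            return 1;
        }
        start += utf8_len((unsigned char)doc[start]);
    }
    return 0;
}
```

In this model, `phrase_matches("Penicillin", "lin")` succeeds because the tokens `l`, `i`, `n` appear next to each other in the document. The sketch scans the whole document for each query, whereas FTS answers the same question from its token index, which is where the speed advantage over `LIKE '%text%'` comes from.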