Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/chakrit/fts

Fast fuzzy unicode-aware full-text-search engine for node.js
https://github.com/chakrit/fts

Last synced: about 20 hours ago
JSON representation

Fast fuzzy unicode-aware full-text-search engine for node.js

Host: GitHub
URL: https://github.com/chakrit/fts
Owner: chakrit
Created: 2012-09-25T11:37:57.000Z (about 12 years ago)
Default Branch: master
Last Pushed: 2012-10-02T12:44:24.000Z (about 12 years ago)
Last Synced: 2024-04-15T12:21:32.437Z (7 months ago)
Language: CoffeeScript
Homepage:
Size: 168 KB
Stars: 19
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        
# NODE-FTS

NODE-FTS is a full-text-search engine that tries to be:

* fast - using pre-filled redis zsets, searching is one ZREVRANGE command

* fuzzy - using some basic pre-computation

* unicode - using [node-stringprep](https://github.com/astro/node-stringprep) to normalize strings and

  [icu-wordsplit](https://github.com/chakrit/node-icu-wordsplit) for unicode word split

I had enough with all the search modules out there not having a hint of international characters

support in their implementation so I decided to just write one. If atleast to get other module

authors to pay more attention to people who don't speak English as their native language.

As I'm just using this for a few of my low-traffic personal side projects,

I will definitely need your help to submit enhancements and fix whatever bugs you may find.

If you don't like CoffeeScript, feel free to send me patches in pure JS, I'll happily convert it

for you.

## Install

Install with:

    npm install fts --save

and use with:

    var fts = require('fts')

      , doc =

        { id: 123

        , title: 'Hello World'

        , keywords: 'this, should, be, searchable, as, well' };

    fts.use(require('redis').createClient());

    fts.index(doc.id, [doc.title, doc.keywords], function(e) {

      fts.query('searchable', function(e, ids) {

        // ids == ['123']

      });

      fts.query('wrld', function(e, ids) {

        // ids == ['123']

      });

    });

### libicu dependency

FTS uses [node-stringprep](https://github.com/astro/node-stringprep) and

[icu-wordsplit](https://github.com/chakrit/node-icu-wordsplit) for unicode support. This means

that you will need to have a working libicu binaries installed on your machine. Depending on where you're developing node.js one of the following command will install a

working copy of libicu binaries and data files into your system:

    # ubuntu and debian-based systems

    apt-get install libicu-dev

    # gentoo

    emerge icu

    # os x

    port install icu +devel                 # with macports

    brew install icu4c && brew link icu4c   # homebrew

### How does it work?

Index in `fts` is actually a list of pre-sorted results for possible search keywords.

That is, when a new document is added, `fts` tries to build as many search keywords as possible

from it, give them weights, and add those to the "results list" for the keyword ready to be retrieved

on search.

When `index` is called, fts does the following:

1. Split up all the words

2. Index each word -

   Prefixes are weighted more than suffixes to makes searching for document titles and exact matches

   more effective.

3. Index "typo" variations for each word -

   For example "bngkok" for "Bangkok" is added with a less weight.

4. Index concatenated words subset for the entire string -

   This is required to effectively search in some language such as Thai where

   there are many ways to split a word (e.g. ตากลม => ตา | กลม or ตาก | ลม)

5. Index "typo" variations for the each concatenated word

## Main API

#### indexer.use( [prefix], redis-client )

Setup the indexer to use the specified `prefix` and `redis-client`.

You must call this function before using any of the fts module functionality.

* `prefix` - Prefix to use for all redis keys used by FTS.

* `redis-client` - The redis client to use. Any object with interface compatible with the de facto

  [redis module](https://github.com/mranney/node_redis) is fine.

#### indexer.index( id, items, callback )

Add one or more `items` to the index with identifier `id` and then calls `callback`.

* `id` - the document identifier which will be returned on queries

* `items` - Content string or array of strings to index

* `callback` - Standard callback with one error

#### indexer.clear( callback )

Removes all entries from the index effectively resetting it to initial state.

#### indexer.query( query, callback )

Queries the index using the string `query`.

* `query` - The string to search for. Spaces don't matter.

* `callback` - Callback function with signature `function(e, ids) { }`

  where `ids` is an array of document `ids` that matches the supplied query.

## Lower-level API

These APIs are provided in case you need more fine-grained control of the indexes

but should not need to to be used in most cases.

#### indexer.degree.words = 3

Get/set the degree to which fts will permute words. Defaults to 3.

Example: Permuting `"quick brown fox"` with degree 2 gives `["quickbrown", "brownfox"]`

querying with `"brownfox"` will returns the original string.

Higher degree enables broader search term matches but will require more memory

and CPU during `index()` calls.

#### indexer.degree.typos = 5

Get/set the degree to which fts will permute typos. Defaults to 5.

Example: Permuting `"search"` with degree 2 gives 22 results including `"serc"` and `"srch"`

querying with any 2 letters missing from the string will match the document.

Higher degree enables broader search term matches but will require more memory

and CPU during `index()` calls.

#### indexer.addKey( id, key, weight, callback )

Adds `id` to search key `key` with weight `weight` and then calls `callback` (optional).

* `id` - document identifier

* `key` - search key to add the document to, this will be normalized.

* `weight` - weight to give to this document for this particular search key.

#### indexer.removeKey( id, key, callback )

Removes `id` from the search key `key` regardless of weight.

* `id` - document identifier

* `key` - search key to add the document to, this will be normalized.

## License

BSD

## TODO / CONTRIBUTE

* Redis ZSETs can actually be replicated pretty easily using other kinds of databases such as MongoDB or Sqlite3

  so a different kind of backend store would probably benefits a lot of people.

* Reverse lookup (i.e. which search terms contains the document id)

* Ability to remove a document from the index.

* Search quality tests and/or more tests in general. I want this module to be rock solid.