https://github.com/dariasmyr/fts-engine

A modular full-text search engine in Go with instant indexing, pluggable indexers, and configurable pre-search filters.

# Full-Text Search Test Engine

Reusable full-text search engine in Go with configurable indexes, token pipeline, and snapshot support.

![Demo](docs/demo.gif)

## What this repository provides

- Public library API in `pkg/fts`.
- Public index implementations in `pkg/index/*`:
  - `radix`
  - `slicedradix`
  - `trigram`
  - `hamt`
  - `hamtpointered`
- Public text processing pipeline in `pkg/textproc`.
- Public key generators in `pkg/keygen`.
- Public probabilistic filters in `pkg/filter`.
- CLI entrypoint in `cmd/fts` with:
  - `prod` mode (run with configurable filters and interactive CUI)
  - `experiment` mode (collect indexing metrics)

## Library usage

### 1) Install

```bash
go get github.com/dariasmyr/fts-engine@latest
```

If you test against local source:

```go
replace github.com/dariasmyr/fts-engine => /absolute/path/to/fts-engine
```

### 2) Quickstart

```go
package main

import (
	"context"
	"fmt"

	"github.com/dariasmyr/fts-engine/pkg/fts"
	"github.com/dariasmyr/fts-engine/pkg/index/radix"
	"github.com/dariasmyr/fts-engine/pkg/keygen"
)

func main() {
	// Radix index with word-level keys.
	engine := fts.New(radix.New(), keygen.Word)

	// Secondary return values are discarded to keep the example short.
	_ = engine.IndexDocument(context.Background(), "doc-1", "Wikipedia: Rosa is a French hotel barge")
	res, _ := engine.SearchDocuments(context.Background(), "french hotel", 10)

	fmt.Println(res.TotalResultsCount)
}
```

### 3) Snapshots

#### Simple file snapshot (index + filter in separate files)

Use `pkg/ftsbuiltin` for built-in name-based codecs (`radix`, `bloom`, etc.) without manual codec registry wiring.

For a more advanced in-memory `io.Writer`/`io.Reader` example with one combined payload, see `examples/client-library/snapshot-buffer-filter/main.go`.

Save index + filter snapshots:

```go
package main

import (
	"context"
	"os"

	"github.com/dariasmyr/fts-engine/pkg/fts"
	"github.com/dariasmyr/fts-engine/pkg/ftsbuiltin"
	"github.com/dariasmyr/fts-engine/pkg/keygen"
)

func main() {
	opts := ftsbuiltin.FilterOptions{BloomExpectedItems: 1_000_000, BloomBitsPerItem: 10, BloomK: 7}

	idx, _ := ftsbuiltin.BuildIndex("radix")
	flt, _ := ftsbuiltin.BuildFilter("bloom", opts)
	svc := fts.New(idx, keygen.Word, fts.WithFilter(flt))
	_ = svc.IndexDocument(context.Background(), "doc-1", "file snapshot demo")

	// os.Create does not create parent directories, so make sure they exist.
	if err := os.MkdirAll("./data/segments", 0o755); err != nil {
		panic(err)
	}

	indexOut, err := os.Create("./data/segments/default.fidx")
	if err != nil {
		panic(err)
	}
	defer indexOut.Close()
	_ = ftsbuiltin.SaveIndexSnapshot(indexOut, "radix", idx)

	filterOut, err := os.Create("./data/segments/default.fflt")
	if err != nil {
		panic(err)
	}
	defer filterOut.Close()
	_ = ftsbuiltin.SaveFilterSnapshot(filterOut, "bloom", flt)
}
```

Load snapshots from files:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/dariasmyr/fts-engine/pkg/fts"
	"github.com/dariasmyr/fts-engine/pkg/ftsbuiltin"
	"github.com/dariasmyr/fts-engine/pkg/keygen"
)

func main() {
	// Stop early if a snapshot file is missing instead of reading from a nil file.
	indexIn, err := os.Open("./data/segments/default.fidx")
	if err != nil {
		panic(err)
	}
	defer indexIn.Close()
	loadedIndex, _ := ftsbuiltin.LoadIndexSnapshot(indexIn)

	filterIn, err := os.Open("./data/segments/default.fflt")
	if err != nil {
		panic(err)
	}
	defer filterIn.Close()
	loadedFilter, _ := ftsbuiltin.LoadFilterSnapshot(filterIn)

	restored := fts.New(loadedIndex.Index, keygen.Word, fts.WithFilter(loadedFilter.Filter))
	res, _ := restored.SearchDocuments(context.Background(), "snapshot", 10)
	fmt.Println(res.TotalResultsCount)
}
```

### 4) Custom pipeline and language presets

Default preset shortcut:

```go
engine := fts.New(radix.New(), keygen.Word, ftspreset.English())
```

Available presets:

- `textproc.DefaultEnglishPipeline()`
- `textproc.DefaultRussianPipeline()`
- `textproc.DefaultMultilingualPipeline()`
- `ftspreset.English()` / `ftspreset.Russian()` / `ftspreset.Multilingual()`

Custom pipeline:

```go
pipe := textproc.NewPipeline(
	textproc.AlnumTokenizer{},
	textproc.LowercaseFilter{},
	textproc.MinLengthOrNumericFilter{MinLength: 2},
	textproc.EnglishStopwordFilter{},
	textproc.EnglishStemFilter{},
)

engine := fts.New(radix.New(), keygen.Word, fts.WithPipeline(pipe))
```

## Run main app (local testing via config)

Use this only when you want to test the repository app itself (`cmd/fts`), not when embedding the library into your service.

1) Create config from template:

```bash
cp ./config/config_local_example.yaml ./config/config_local.yaml
```

2) Run with config:

```bash
go run ./cmd/fts --config=./config/config_local.yaml
```

Important config fields:

```yaml
fts:
  engine: "trie" # trie|kv
  index: "radix" # radix|slicedradix|trigram|hamt|hamtpointered
  keygen: "word" # word|trigram
  filter: "none" # none|bloom|cuckoo|ribbon
  snapshot:
    enabled: true
    path: "./data/segments/default.fidx"
    split_files: true
    index_path: "./data/segments/local.index.fidx"
    filter_path: "./data/segments/local.filter.fidx"
    load_on_start: true
    save_on_build: true
    buffer_size: 1048576
    flush_threshold: 262144
    sync_file: true
  bloom:
    expected_items: 1000000
    bits_per_item: 10
    k: 7
  cuckoo:
    bucket_count: 262144
    bucket_size: 4
    max_kicks: 500
  ribbon:
    expected_items: 1000000
    extra_cells: 250000
    window_size: 24 # 1..32
    seed: 0
    max_attempts: 5
  pipeline:
    lowercase: true
    stopwords_en: true
    stopwords_ru: false
    stem_en: true
    stem_ru: false
    min_length: 3
mode:
  type: "prod" # prod|experiment
```
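
As a rough guide to what the `bloom` values above mean, the textbook Bloom filter estimate for the false positive rate is `(1 - e^(-k/b))^k`, where `b` is bits per item and `k` is the number of hash functions, and the rate is minimized near `k = b * ln 2`. The sketch below only evaluates that formula for `bits_per_item: 10` and `k: 7` (about 0.8%). The helper names are hypothetical, and the library's actual sizing may differ.

```go
package main

import (
	"fmt"
	"math"
)

// bloomFalsePositiveRate returns the textbook estimate (1 - e^(-k/b))^k
// for a filter with b bits per inserted item and k hash functions.
func bloomFalsePositiveRate(bitsPerItem float64, k int) float64 {
	return math.Pow(1-math.Exp(-float64(k)/bitsPerItem), float64(k))
}

// optimalK returns the hash count that minimizes the estimate: round(b * ln 2).
func optimalK(bitsPerItem float64) int {
	return int(math.Round(bitsPerItem * math.Ln2))
}

func main() {
	const (
		expectedItems = 1_000_000
		bitsPerItem   = 10
		k             = 7
	)

	totalBits := expectedItems * bitsPerItem
	fmt.Printf("bit array: %d bits (%.2f MiB)\n", totalBits, float64(totalBits)/8/1024/1024)
	fmt.Printf("optimal k for %d bits per item: %d\n", bitsPerItem, optimalK(bitsPerItem))
	fmt.Printf("estimated false positive rate: %.4f\n", bloomFalsePositiveRate(bitsPerItem, k))
}
```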

Snapshot fields (`fts.snapshot`):

- `enabled`: enable snapshot persistence flow in CLI prod mode.
- `path`: final snapshot artifact path.
- `split_files`: if true, save/load index and filter in separate files.
- `index_path`: optional explicit path for index snapshot file in split mode.
- `filter_path`: optional explicit path for filter snapshot file in split mode.
- `load_on_start`: if true and snapshot exists, load it and skip rebuild.
- `save_on_build`: if true, save snapshot after indexing finishes.
- `buffer_size`: writer buffer size used during save.
- `flush_threshold`: buffered flush threshold used by the built-in save helper.
- `sync_file`: fsync temp file before atomic rename.

## CLI modes

- `prod`:
  - runs the engine with a configurable pipeline and interactive CUI search,
  - if `fts.snapshot.enabled=true` and `load_on_start=true` and a snapshot exists: loads the snapshot and skips re-indexing,
  - otherwise indexes documents and (if `save_on_build=true`) persists the snapshot atomically.
- `experiment`:
  - always indexes the current input and prints memory/index stats,
  - does not run the CUI snapshot restore flow.
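
The `prod` startup decision described above can be summarized as a small pure function. This is a sketch of the documented behavior only, not the code in `cmd/fts`: the `snapshotConfig` and `plan` types and the `planProd` function are hypothetical, and it assumes that saving also requires `enabled=true`.

```go
package main

import "fmt"

// snapshotConfig mirrors the three fts.snapshot flags that drive startup.
type snapshotConfig struct {
	Enabled     bool
	LoadOnStart bool
	SaveOnBuild bool
}

// plan lists the steps prod mode would take on startup.
type plan struct {
	LoadSnapshot bool
	BuildIndex   bool
	SaveSnapshot bool
}

// planProd loads the snapshot when that is allowed and the file exists;
// otherwise it rebuilds the index and saves only if configured to.
func planProd(cfg snapshotConfig, snapshotExists bool) plan {
	if cfg.Enabled && cfg.LoadOnStart && snapshotExists {
		return plan{LoadSnapshot: true}
	}
	// Assumption: save_on_build has no effect when snapshots are disabled.
	return plan{BuildIndex: true, SaveSnapshot: cfg.Enabled && cfg.SaveOnBuild}
}

func main() {
	cfg := snapshotConfig{Enabled: true, LoadOnStart: true, SaveOnBuild: true}
	fmt.Printf("snapshot present: %+v\n", planProd(cfg, true))
	fmt.Printf("snapshot missing: %+v\n", planProd(cfg, false))
}
```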

## Ribbon filter usage

Ribbon is a static filter. In `fts` it is used via `BufferedStaticFilter`.

Build ribbon from file with a custom parser:

```go
opts := ftsbuiltin.FilterOptions{
	RibbonExpectedItems: 1_000_000, // estimated unique keys
	RibbonExtraCells:    250_000,
	RibbonWindowSize:    16,
	RibbonSeed:          0,
	RibbonMaxAttempts:   5,
}

rf, _ := filter.NewRibbonFilter(
	opts.RibbonExpectedItems,
	opts.RibbonExtraCells,
	opts.RibbonWindowSize,
	opts.RibbonSeed,
)

_ = rf.BuildWithRetriesFromFileWithParser("./data/keys.txt", parseKeysFile, opts.RibbonMaxAttempts)

out, _ := os.Create("./data/segments/ribbon.filter.fidx")
defer out.Close()
_ = rf.Serialize(out)
```

Minimal parser example (line-by-line keys):

```go
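// parseKeysFile needs the "bufio", "os" and "strings" imports.
// It emits one key per non-empty line and stops early if emit returns false.
func parseKeysFile(path string, emit func([]byte) bool) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		key := strings.TrimSpace(s.Text())
		if key == "" {
			continue
		}
		if !emit([]byte(key)) {
			break
		}
	}

	return s.Err()
}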
```

Load ribbon filter from file:

```go
in, _ := os.Open("./data/segments/ribbon.filter.fidx")
defer in.Close()

ribbonFilter, _ := filter.LoadRibbonFilter(in)

fmt.Println(ribbonFilter.Contains([]byte("market")))
```

Full runnable example (default parser save, custom parser save, load from file, normalized `Contains`) is in `examples/client-library/ribbon-file/main.go`.

### Standalone filter `Contains` with normalization

Use this when you store normalized keys in the filter and later want to check a raw word entered by the user.

Example: indexed key is `beauty`, user enters `beautiful`.
With stemming, both become `beauti`, so normalized check returns `true`.

```go
pipe := textproc.NewPipeline(
	textproc.AlnumTokenizer{},
	textproc.LowercaseFilter{},
	textproc.EnglishStemFilter{},
)

indexedTerms := []string{"beauty", "hotel"}
normalizedKeys := make([]string, 0, len(indexedTerms))
for _, term := range indexedTerms {
	keys, _ := fts.NormalizeToKeys(term, pipe, keygen.Word)
	normalizedKeys = append(normalizedKeys, keys...)
}

rf, _ := filter.NewRibbonFilter(uint32(len(normalizedKeys)), 32, 24, 0)
stream := func(emit func([]byte) bool) error {
	for _, key := range normalizedKeys {
		if !emit([]byte(key)) {
			break
		}
	}
	return nil
}

_ = rf.BuildWithRetriesFromKeyStream(stream, 5)

raw := rf.Contains([]byte("beautiful")) // false: filter stores normalized keys

normalized, _ := fts.ContainsNormalized(rf, "beautiful", pipe, keygen.Word)

fmt.Println("raw", raw, "normalized", normalized) // raw=false normalized=true
```

`ContainsNormalized` applies pipeline + keygen and checks all normalized keys via `Contains`.
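
The idea can be shown without the library. The following standard-library sketch normalizes the input and then checks every resulting key against a membership structure. All names are hypothetical: `setFilter` is an exact set standing in for a probabilistic filter, `toyStem` is a two-rule stand-in for a real stemmer, and requiring every key to be present is an assumption about the semantics.

```go
package main

import (
	"fmt"
	"strings"
)

// membership is the only capability the check needs from a filter.
type membership interface {
	Contains(key []byte) bool
}

// setFilter is an exact set used here in place of a probabilistic filter.
type setFilter map[string]struct{}

func (s setFilter) Contains(key []byte) bool {
	_, ok := s[string(key)]
	return ok
}

// toyStem is a deliberately tiny stand-in for a stemmer: it strips a "ful"
// suffix and turns a trailing "y" into "i", so that "beauty" and "beautiful"
// both map to "beauti".
func toyStem(word string) string {
	word = strings.TrimSuffix(word, "ful")
	if strings.HasSuffix(word, "y") {
		word = strings.TrimSuffix(word, "y") + "i"
	}
	return word
}

// normalize lowercases the text, splits it on whitespace and stems each word.
func normalize(text string) []string {
	fields := strings.Fields(strings.ToLower(text))
	keys := make([]string, 0, len(fields))
	for _, f := range fields {
		keys = append(keys, toyStem(f))
	}
	return keys
}

// containsNormalized reports whether every normalized key of text is in f.
// Input that normalizes to no keys is reported as absent.
func containsNormalized(f membership, text string) bool {
	keys := normalize(text)
	if len(keys) == 0 {
		return false
	}
	for _, k := range keys {
		if !f.Contains([]byte(k)) {
			return false
		}
	}
	return true
}

func main() {
	// Store normalized keys, as an index-side pipeline would.
	filter := setFilter{}
	for _, term := range []string{"beauty", "hotel"} {
		for _, k := range normalize(term) {
			filter[k] = struct{}{}
		}
	}

	raw := filter.Contains([]byte("beautiful"))
	normalized := containsNormalized(filter, "Beautiful")
	fmt.Println("raw", raw, "normalized", normalized)
}
```

The key point is that the same normalization must run on both sides: keys are normalized before they are stored, and the query is normalized before it is checked.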

## Tests

Run all tests:

```bash
go test ./...
```

Run only public packages:

```bash
go test ./pkg/...
```