https://github.com/dariasmyr/fts-engine
A modular full-text search engine in Go with instant indexing, pluggable indexers, and configurable pre-search filters.
- Host: GitHub
- URL: https://github.com/dariasmyr/fts-engine
- Owner: dariasmyr
- Created: 2025-01-29T19:54:15.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2026-03-19T08:59:15.000Z (3 months ago)
- Last Synced: 2026-03-19T11:39:51.503Z (3 months ago)
- Topics: fulltext-search, fuzzy-search, ngram-analysis, ngrams, stemming, trie
- Language: Go
- Homepage:
- Size: 72.5 MB
- Stars: 11
- Watchers: 2
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: readme.md
# Full-Text Search Test Engine
Reusable full-text search engine in Go with configurable indexes, token pipeline, and snapshot support.

## What this repository provides
- Public library API in `pkg/fts`.
- Public index implementations in `pkg/index/*`:
  - `radix`
  - `slicedradix`
  - `trigram`
  - `hamt`
  - `hamtpointered`
- Public text processing pipeline in `pkg/textproc`.
- Public key generators in `pkg/keygen`.
- Public probabilistic filters in `pkg/filter`.
- CLI entrypoint in `cmd/fts` with:
  - `prod` mode (run with configurable filters and interactive CUI)
  - `experiment` mode (collect indexing metrics)
## Library usage
### 1) Install
```bash
go get github.com/dariasmyr/fts-engine@latest
```
To test against a local checkout, add a `replace` directive to your `go.mod`:
```go
replace github.com/dariasmyr/fts-engine => /absolute/path/to/fts-engine
```
### 2) Quickstart
```go
package main

import (
	"context"
	"fmt"

	"github.com/dariasmyr/fts-engine/pkg/fts"
	"github.com/dariasmyr/fts-engine/pkg/index/radix"
	"github.com/dariasmyr/fts-engine/pkg/keygen"
)

func main() {
	ctx := context.Background()

	// Radix index with word-level keys.
	engine := fts.New(radix.New(), keygen.Word)

	// Return values discarded with `_` are ignored here for brevity.
	_ = engine.IndexDocument(ctx, "doc-1", "Wikipedia: Rosa is a French hotel barge")

	// Ask for at most 10 results.
	res, _ := engine.SearchDocuments(ctx, "french hotel", 10)
	fmt.Println(res.TotalResultsCount)
}
```
### 3) Snapshots
#### Simple file snapshot (index + filter in separate files)
Use `pkg/ftsbuiltin` for built-in name-based codecs (`radix`, `bloom`, etc.) without manual codec registry wiring.
For a more advanced in-memory `io.Writer`/`io.Reader` example with one combined payload, see `examples/client-library/snapshot-buffer-filter/main.go`.
Save index + filter snapshots:
```go
package main

import (
	"context"
	"os"

	"github.com/dariasmyr/fts-engine/pkg/fts"
	"github.com/dariasmyr/fts-engine/pkg/ftsbuiltin"
	"github.com/dariasmyr/fts-engine/pkg/keygen"
)

func main() {
	// Return values discarded with `_` are ignored here for brevity.
	opts := ftsbuiltin.FilterOptions{BloomExpectedItems: 1_000_000, BloomBitsPerItem: 10, BloomK: 7}
	idx, _ := ftsbuiltin.BuildIndex("radix")
	flt, _ := ftsbuiltin.BuildFilter("bloom", opts)

	svc := fts.New(idx, keygen.Word, fts.WithFilter(flt))
	_ = svc.IndexDocument(context.Background(), "doc-1", "file snapshot demo")

	// os.Create does not create parent directories, so make sure they exist.
	_ = os.MkdirAll("./data/segments", 0o755)

	// Index snapshot.
	indexOut, _ := os.Create("./data/segments/default.fidx")
	defer indexOut.Close()
	_ = ftsbuiltin.SaveIndexSnapshot(indexOut, "radix", idx)

	// Filter snapshot, stored in its own file.
	filterOut, _ := os.Create("./data/segments/default.fflt")
	defer filterOut.Close()
	_ = ftsbuiltin.SaveFilterSnapshot(filterOut, "bloom", flt)
}
```
Load snapshots from files:
```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/dariasmyr/fts-engine/pkg/fts"
	"github.com/dariasmyr/fts-engine/pkg/ftsbuiltin"
	"github.com/dariasmyr/fts-engine/pkg/keygen"
)

func main() {
	indexIn, _ := os.Open("./data/segments/default.fidx")
	defer indexIn.Close()
	loadedIndex, _ := ftsbuiltin.LoadIndexSnapshot(indexIn)

	filterIn, _ := os.Open("./data/segments/default.fflt")
	defer filterIn.Close()
	loadedFilter, _ := ftsbuiltin.LoadFilterSnapshot(filterIn)

	restored := fts.New(loadedIndex.Index, keygen.Word, fts.WithFilter(loadedFilter.Filter))
	res, _ := restored.SearchDocuments(context.Background(), "snapshot", 10)
	fmt.Println(res.TotalResultsCount)
}
```
### 4) Custom pipeline and language presets
Default preset shortcut:
```go
engine := fts.New(radix.New(), keygen.Word, ftspreset.English())
```
Available presets:
- `textproc.DefaultEnglishPipeline()`
- `textproc.DefaultRussianPipeline()`
- `textproc.DefaultMultilingualPipeline()`
- `ftspreset.English()` / `ftspreset.Russian()` / `ftspreset.Multilingual()`
Custom pipeline:
```go
pipe := textproc.NewPipeline(
	textproc.AlnumTokenizer{},
	textproc.LowercaseFilter{},
	textproc.MinLengthOrNumericFilter{MinLength: 2},
	textproc.EnglishStopwordFilter{},
	textproc.EnglishStemFilter{},
)
engine := fts.New(radix.New(), keygen.Word, fts.WithPipeline(pipe))
```
## Run main app (local testing via config)
Use this only when you want to test the repository app itself (`cmd/fts`), not when embedding the library into your service.
1) Create config from template:
```bash
cp ./config/config_local_example.yaml ./config/config_local.yaml
```
2) Run with config:
```bash
go run ./cmd/fts --config=./config/config_local.yaml
```
Important config fields:
```yaml
fts:
  engine: "trie" # trie|kv
  index: "radix" # radix|slicedradix|trigram|hamt|hamtpointered
  keygen: "word" # word|trigram
  filter: "none" # none|bloom|cuckoo|ribbon
  snapshot:
    enabled: true
    path: "./data/segments/default.fidx"
    split_files: true
    index_path: "./data/segments/local.index.fidx"
    filter_path: "./data/segments/local.filter.fidx"
    load_on_start: true
    save_on_build: true
    buffer_size: 1048576
    flush_threshold: 262144
    sync_file: true
  bloom:
    expected_items: 1000000
    bits_per_item: 10
    k: 7
  cuckoo:
    bucket_count: 262144
    bucket_size: 4
    max_kicks: 500
  ribbon:
    expected_items: 1000000
    extra_cells: 250000
    window_size: 24 # 1..32
    seed: 0
    max_attempts: 5
  pipeline:
    lowercase: true
    stopwords_en: true
    stopwords_ru: false
    stem_en: true
    stem_ru: false
    min_length: 3
  mode:
    type: "prod" # prod|experiment
```
Snapshot fields (`fts.snapshot`):
- `enabled`: enable snapshot persistence flow in CLI prod mode.
- `path`: final snapshot artifact path.
- `split_files`: if true, save/load index and filter in separate files.
- `index_path`: optional explicit path for index snapshot file in split mode.
- `filter_path`: optional explicit path for filter snapshot file in split mode.
- `load_on_start`: if true and snapshot exists, load it and skip rebuild.
- `save_on_build`: if true, save snapshot after indexing finishes.
- `buffer_size`: writer buffer size used during save.
- `flush_threshold`: buffered flush threshold used by the built-in save helper.
- `sync_file`: fsync temp file before atomic rename.
## CLI modes
- `prod`:
  - runs engine with configurable pipeline and interactive CUI search,
  - if `fts.snapshot.enabled=true` and `load_on_start=true` and snapshot exists: loads snapshot and skips re-index,
  - otherwise indexes documents and (if `save_on_build=true`) persists snapshot atomically.
- `experiment`:
  - always indexes current input and prints memory/index stats,
  - does not run CUI snapshot restore flow.
## Ribbon filter usage
Ribbon is a static filter: it is built once from a complete key set rather than updated key by key. In `fts` it is used via `BufferedStaticFilter`.
Build ribbon from file with a custom parser:
```go
opts := ftsbuiltin.FilterOptions{
	RibbonExpectedItems: 1_000_000, // estimated unique keys
	RibbonExtraCells:    250_000,
	RibbonWindowSize:    16,
	RibbonSeed:          0,
	RibbonMaxAttempts:   5,
}
rf, _ := filter.NewRibbonFilter(
	opts.RibbonExpectedItems,
	opts.RibbonExtraCells,
	opts.RibbonWindowSize,
	opts.RibbonSeed,
)
_ = rf.BuildWithRetriesFromFileWithParser("./data/keys.txt", parseKeysFile, opts.RibbonMaxAttempts)
out, _ := os.Create("./data/segments/ribbon.filter.fidx")
defer out.Close()
_ = rf.Serialize(out)
```
Minimal parser example (line-by-line keys):
```go
func parseKeysFile(path string, emit func([]byte) bool) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		key := strings.TrimSpace(s.Text())
		if key == "" {
			continue
		}
		if !emit([]byte(key)) {
			break
		}
	}
	return s.Err()
}
```
Load ribbon filter from file:
```go
in, _ := os.Open("./data/segments/ribbon.filter.fidx")
defer in.Close()
ribbonFilter, _ := filter.LoadRibbonFilter(in)
fmt.Println(ribbonFilter.Contains([]byte("market")))
```
Full runnable example (default parser save, custom parser save, load from file, normalized `Contains`) is in `examples/client-library/ribbon-file/main.go`.
### Standalone filter `Contains` with normalization
Use this when the filter stores normalized keys and you later need to check a raw user word.
Example: the indexed term is `beauty` and the user enters `beautiful`.
With stemming, both become `beauti`, so the normalized check returns `true`.
```go
pipe := textproc.NewPipeline(
	textproc.AlnumTokenizer{},
	textproc.LowercaseFilter{},
	textproc.EnglishStemFilter{},
)
indexedTerms := []string{"beauty", "hotel"}
normalizedKeys := make([]string, 0, len(indexedTerms))
for _, term := range indexedTerms {
	keys, _ := fts.NormalizeToKeys(term, pipe, keygen.Word)
	normalizedKeys = append(normalizedKeys, keys...)
}
rf, _ := filter.NewRibbonFilter(uint32(len(normalizedKeys)), 32, 24, 0)
stream := func(emit func([]byte) bool) error {
	for _, key := range normalizedKeys {
		if !emit([]byte(key)) {
			break
		}
	}
	return nil
}
_ = rf.BuildWithRetriesFromKeyStream(stream, 5)
raw := rf.Contains([]byte("beautiful")) // false: filter stores normalized keys
normalized, _ := fts.ContainsNormalized(rf, "beautiful", pipe, keygen.Word)
fmt.Println("raw", raw, "normalized", normalized) // raw=false normalized=true
```
`ContainsNormalized` applies pipeline + keygen and checks all normalized keys via `Contains`.
## Tests
Run all tests:
```bash
go test ./...
```
Run only public packages:
```bash
go test ./pkg/...
```