https://github.com/hmerritt/go-ngram

ngram indexing in Go. Create a reverse-index of strings, with custom indexes as values, and search through them.
https://github.com/hmerritt/go-ngram

Last synced: 2 months ago
JSON representation

ngram indexing in Go. Create a reverse-index of strings, with custom indexes as values, and search through them.

Host: GitHub
URL: https://github.com/hmerritt/go-ngram
Owner: hmerritt
License: apache-2.0
Created: 2021-05-06T00:01:20.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2021-05-14T17:36:51.000Z (about 4 years ago)
Last Synced: 2023-07-27T22:12:23.421Z (almost 2 years ago)
Language: Go
Homepage:
Size: 80.1 KB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # n-gram Index and Search

[![Go Reference](https://pkg.go.dev/badge/github.com/hmerritt/go-ngram.svg)](https://pkg.go.dev/github.com/hmerritt/go-ngram) [![Go](https://github.com/hmerritt/go-ngram/actions/workflows/go.yml/badge.svg)](https://github.com/hmerritt/go-ngram/actions/workflows/go.yml) [![Coverage Status](https://coveralls.io/repos/github/hmerritt/go-ngram/badge.svg?branch=master)](https://coveralls.io/github/hmerritt/go-ngram?branch=master&kill_cache=1)

Create n-grams, index items, and search through them.

## Install

```bash

go get github.com/hmerritt/go-ngram

```

## What is an N-gram reverse-index?

I highly recommend reading the following:

-   [Russ Cox - Google Code Search, Trigram Index](https://swtch.com/~rsc/regexp/regexp4.html)

The `N` in n-gram refers to a set number. The most common n-gram is the 3-gram, usually called a "tri-gram".

A tri-gram index stores each item in sets of 3.

```

Hello, World  ->  [Hel ell llo lo, o,_ _Wo Wor orl rld]

Search("rld") -> Hello, World

        ^^^ any 3 (or more) characters will find a match

```

#### Why use N-gram indexing?

-   Used to search through large data sets quickly.

-   Has built-in spelling correction (only 3 chars need to match)

## Usage

| Methods          | Description                                                                                                  |

| ---------------- | ------------------------------------------------------------------------------------------------------------ |

| `NewNgramIndex`  | Returns a new trigram index with the default settings                                                        |

| `NewIndexValue`  | Returns a new index value using custom input values (use this for `Add()`)                                   |

| `Add`            | Add a string AND your own index value                                                                        |

| `Search`         | Returns slice of all matched strings which contains your index value and the match strength, `[]*IndexValue` |

|                  |                                                                                                              |

| Helper functions |                                                                                                              |

| `StringToNgram`  | Returns an ngram of length N - Splits a string into groups of N length                                       |

| `GetMatches`     | Get all ngram matches, returns an unsorted map of indexes along with their Index Value struct                |

| `SortMatches`    | Sorts output from `GetMatches` into a slice, first index [0] = best match                                    |

### Example

```go

// Create a new index

// Default ngram size is '3' (tri-gram)

ni := ngram.NewNgramIndex()

// Add string + add your OWN index value

// (the index value is what will be returned when a search matches)

ni.Add(string, *IndexValue)

ni.Add("I just think they're neat.", NewIndexValue(0, "This additional index string can be anything"))

ni.Add("Don’t eat me. I have a wife and kids. Eat them", NewIndexValue(1, "It could be a filepath"))

ni.Add("Words to match above quotes... string eat neat", NewIndexValue(2, "Or maybe a URL"))

// Search

// Returns a sorted slice of matches (sorts based on number of matches)

search := ni.Search("Eat")

// -> [1, 2]

// -> [IndexValue{1, "It could be a filepath"}, IndexValue{2, "Or maybe a URL"}]

search = ni.Search("near") // near will match [nea]t

// -> [0, 2]

// Usually, you would loop your own data-set

// and add the index from that as the ngram index value

for index, file := myFiles {

	ni.Add(file.content, NewIndexValue(index, file.path)

}

```

### Long-Winded Example

```go

package main

import (

	"fmt"

	ngram "github.com/hmerritt/go-ngram"

)

// Example structure

type File struct {

	Path    string

	Content string

}

// Example data-set

var files = make([]File, 0)

func main() {

	// Add Files to example data-set

	files = append(files, File{

		Path:    "/home/me/nice.txt",

		Content: "Example file data, nice",

	})

	files = append(files, File{

		Path:    "/home/me/example-file.txt",

		Content: "More example file data, neat",

	})

	/*

	 * Create N-gram here!

	 */

	// Create a new index

	// Default ngram size is '3' (tri-gram)

	ni := ngram.NewNgramIndex()

	// Add each File in our example data-set

	for index, file := range files {

		// Add a string + the index value we want to store in assosiation with the input string

		ni.Add(file.Content, ngram.NewIndexValue(index, file.Path))

		// What this is actually doing

		ni.Add("More example file data, neat", ngram.NewIndexValue(1, "/home/me/example-file.txt"))

	}

	// Search indexed data

	// returns a sorted slice

	search := ni.Search("neat")

	fmt.Println("Search =", search)

	// -> [1]

	// -> [IndexValue{

	//      Index: 1,

	//      Data: "/home/me/example-file.txt"

	//    }]

}

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/hmerritt/go-ngram

Awesome Lists containing this project

README