https://github.com/ekzhu/go-set-similarity-search
Efficient set similarity search algorithms implemented in Go
- Host: GitHub
- URL: https://github.com/ekzhu/go-set-similarity-search
- Owner: ekzhu
- License: apache-2.0
- Created: 2019-01-10T20:56:22.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-08-27T14:16:02.000Z (over 2 years ago)
- Last Synced: 2024-10-14T07:33:41.463Z (2 months ago)
- Topics: all-pairs, set-similarity-search, similarity-search
- Language: Go
- Homepage:
- Size: 21.5 KB
- Stars: 29
- Watchers: 4
- Forks: 8
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Set Similarity Search in Go
[![Build Status](https://travis-ci.org/ekzhu/go-set-similarity-search.svg?branch=master)](https://travis-ci.org/ekzhu/go-set-similarity-search)
[![GoDoc](https://godoc.org/github.com/ekzhu/go-set-similarity-search?status.svg)](https://godoc.org/github.com/ekzhu/go-set-similarity-search)

This is a mirror implementation of the
Python [SetSimilaritySearch](https://github.com/ekzhu/SetSimilaritySearch)
library in Go, with better performance.

## Benchmarks
The `AllPairs` algorithm was run on a 3.5 GHz Intel Core i7,
using the similarity function `jaccard` and a similarity threshold of 0.5.

| Dataset | Input Sets | Avg. Size | `go-set-similarity-search` Runtime | `SetSimilaritySearch` Runtime |
|---------|------------|-----------|---|---|
| [Pokec social network (relationships)](https://snap.stanford.edu/data/soc-Pokec.html): from-nodes are set IDs; to-nodes are elements | 1432693 | 27.31 | 1m25s | 10m49s |
| [LiveJournal](https://snap.stanford.edu/data/soc-LiveJournal1.html): from-nodes are set IDs; to-nodes are elements | 4308452 | 16.01 | 4m11s | 28m51s |

## Library Usage
*All-Pairs* takes a list of sets as input and outputs the pairs that meet the
similarity threshold.

```go
package main

import (
	"fmt"

	// Import path taken from the repository URL; the package name is
	// SetSimilaritySearch.
	SetSimilaritySearch "github.com/ekzhu/go-set-similarity-search"
)

func main() {
	// Each raw set must be a slice of unique string tokens.
	rawSets := [][]string{
		[]string{"a"},
		[]string{"a", "b"},
		[]string{"a", "b", "c"},
		[]string{"a", "b", "c", "d"},
		[]string{"a", "b", "c", "d", "e"},
	}
	// Use frequency order transformation to replace the string tokens
	// with integers.
	sets, _ := SetSimilaritySearch.FrequencyOrderTransform(rawSets)
	// Run all-pairs algorithm, get a channel of pairs.
	pairs, _ := SetSimilaritySearch.AllPairs(sets,
		/*similarityFunctionName=*/ "jaccard",
		/*similarityThreshold=*/ 0.1)
	for pair := range pairs {
		// X and Y are indexes to the original rawSets and sets slices.
		fmt.Println(pair.X, pair.Y, pair.Similarity)
	}
}
```

*Query* takes a list of sets as input and builds a search
index that can answer any number of queries. Currently the search index
only supports a static collection of sets with no updates.

```go
package main

import (
	"fmt"

	// Import path taken from the repository URL; the package name is
	// SetSimilaritySearch.
	SetSimilaritySearch "github.com/ekzhu/go-set-similarity-search"
)

func main() {
	// Each raw set must be a slice of unique string tokens.
	rawSets := [][]string{
		[]string{"a"},
		[]string{"a", "b"},
		[]string{"a", "b", "c"},
		[]string{"a", "b", "c", "d"},
		[]string{"a", "b", "c", "d", "e"},
	}
	// Use frequency order transformation to replace the string tokens
	// with integers.
	sets, dict := SetSimilaritySearch.FrequencyOrderTransform(rawSets)
	// Build a search index.
	searchIndex, err := SetSimilaritySearch.NewSearchIndex(sets,
		/*similarityFunctionName=*/ "jaccard",
		/*similarityThreshold=*/ 0.1)
	if err != nil {
		// Go rejects unused variables, so the error must be handled.
		panic(err)
	}
	// Use dictionary to transform a query set.
	querySet := dict.Transform([]string{"a", "c", "d"})
	// Query the search index.
	searchResults := searchIndex.Query(querySet)
	for _, result := range searchResults {
		// X is the index to the original rawSets and sets slices.
		fmt.Println(result.X, result.Similarity)
	}
}
```

Supported similarity functions (more to come):

* [Jaccard](https://en.wikipedia.org/wiki/Jaccard_index): intersection size divided by union size; set `similarityFunctionName="jaccard"`.
* [Cosine](https://en.wikipedia.org/wiki/Cosine_similarity): intersection size divided by square root of the product of sizes; set `similarityFunctionName="cosine"`.
* [Containment](https://ekzhu.github.io/datasketch/lshensemble.html#containment): intersection size divided by the size of the first set (or query set); set `similarityFunctionName="containment"`.