Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ankane/neighbor
Nearest neighbor search for Rails and Postgres
nearest-neighbor-search
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/ankane/neighbor
- Owner: ankane
- License: mit
- Created: 2021-02-16T04:36:33.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2024-07-16T00:40:26.000Z (2 months ago)
- Last Synced: 2024-07-16T04:04:37.050Z (2 months ago)
- Topics: nearest-neighbor-search
- Language: Ruby
- Homepage:
- Size: 153 KB
- Stars: 463
- Watchers: 15
- Forks: 13
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
# Neighbor
Nearest neighbor search for Rails and Postgres
[![Build Status](https://github.com/ankane/neighbor/actions/workflows/build.yml/badge.svg)](https://github.com/ankane/neighbor/actions)
## Installation
Add this line to your application’s Gemfile:
```ruby
gem "neighbor"
```

## Choose An Extension
Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [vector](https://github.com/pgvector/pgvector). cube ships with Postgres, while vector supports more dimensions and approximate nearest neighbor search.
For cube, run:
```sh
rails generate neighbor:cube
rails db:migrate
```

For vector, [install pgvector](https://github.com/pgvector/pgvector#installation) and run:
```sh
rails generate neighbor:vector
rails db:migrate
```

## Getting Started
Create a migration
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[7.1]
  def change
    add_column :items, :embedding, :cube
    # or
    add_column :items, :embedding, :vector, limit: 3 # dimensions
  end
end
```

Add to your model
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding
end
```

Update the vectors
```ruby
item.update(embedding: [1.0, 1.2, 0.5])
```

Get the nearest neighbors to a record
```ruby
item.nearest_neighbors(:embedding, distance: "euclidean").first(5)
```

Get the nearest neighbors to a vector
```ruby
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)
```

## Distance
Supported values are:
- `euclidean`
- `cosine`
- `taxicab`
- `chebyshev` (cube only)
- `inner_product` (vector only)
- `hamming` (vector only)
- `jaccard` (vector only)

For cosine distance with cube, vectors must be normalized before being stored.
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, normalize: true
end
```

For inner product with cube, see [this example](examples/disco_user_recs_cube.rb).
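The reason normalization matters for cosine distance with cube can be sketched in plain Ruby. For unit-length vectors, squared Euclidean distance equals twice the cosine distance, so ordering by Euclidean distance returns the same neighbors as ordering by cosine distance. The helpers below (`normalize`, `euclidean`, `cosine_distance`) are illustrative names, not part of Neighbor, and they assume non-zero vectors; how Neighbor handles this internally is not shown here.

```ruby
# Illustrative helpers (not Neighbor's API); assumes non-zero vectors

# Scale a vector to unit length
def normalize(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  v.map { |x| x / norm }
end

def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1 - dot / (norm_a * norm_b)
end

a = normalize([1.0, 1.2, 0.5])
b = normalize([0.9, 1.3, 1.1])

# These two values agree up to floating-point error
squared_euclidean = euclidean(a, b)**2
twice_cosine = 2 * cosine_distance(a, b)
```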
Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
```ruby
nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
nearest_item.neighbor_distance
```

## Dimensions
The cube data type can have up to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this. The vector data type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end
```

## Indexing
For vector, add an approximate index to speed up queries. Create a migration with:
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.1]
  def change
    add_index :items, :embedding, using: :hnsw, opclass: :vector_l2_ops
    # or
    add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
  end
end
```

Use `:vector_cosine_ops` for cosine distance and `:vector_ip_ops` for inner product.
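If the distance is chosen in one place and the index is created in another, the pairing can be kept in a small lookup table. This is only a sketch: the `OPCLASSES` constant and the `opclass_for` helper are hypothetical names that Neighbor does not provide, and the table covers only the three operator classes named above.

```ruby
# Hypothetical lookup table (not part of Neighbor):
# distance name => pgvector operator class for the index
OPCLASSES = {
  "euclidean" => :vector_l2_ops,
  "cosine" => :vector_cosine_ops,
  "inner_product" => :vector_ip_ops
}.freeze

def opclass_for(distance)
  # fetch raises KeyError for a distance that has no entry in the table
  OPCLASSES.fetch(distance.to_s)
end

opclass_for("cosine") # => :vector_cosine_ops
```

In a migration, the result could then be passed as `opclass: opclass_for("cosine")`.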
Set the size of the dynamic candidate list with HNSW
```ruby
Item.connection.execute("SET hnsw.ef_search = 100")
```

Or the number of probes with IVFFlat
```ruby
Item.connection.execute("SET ivfflat.probes = 3")
```

## Half-Precision Vectors
Use the `halfvec` type to store half-precision vectors
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[7.1]
  def change
    add_column :items, :embedding, :halfvec, limit: 3 # dimensions
  end
end
```

## Half-Precision Indexing
Index vectors at half precision for smaller indexes
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.1]
  def change
    add_index :items, "(embedding::halfvec(3)) vector_l2_ops", using: :hnsw
  end
end
```

Get the nearest neighbors [unreleased]
```ruby
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)
```

## Binary Vectors
Use the `bit` type to store binary vectors
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[7.1]
  def change
    add_column :items, :embedding, :bit, limit: 3 # dimensions
  end
end
```

Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
```

## Binary Quantization
Use expression indexing for binary quantization
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.1]
  def change
    add_index :items, "(binary_quantize(embedding)::bit(3)) bit_hamming_ops", using: :hnsw
  end
end
```

## Sparse Vectors
Use the `sparsevec` type to store sparse vectors
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[7.1]
  def change
    add_column :items, :embedding, :sparsevec, limit: 3 # dimensions
  end
end
```

Get the nearest neighbors
```ruby
embedding = Neighbor::SparseVector.new({0 => 0.9, 1 => 1.3, 2 => 1.1}, 3)
Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
```

## Examples
- [OpenAI Embeddings](#openai-embeddings)
- [Cohere Embeddings](#cohere-embeddings)
- [Disco Recommendations](#disco-recommendations)

### OpenAI Embeddings
Generate a model
```sh
rails generate model Document content:text embedding:vector{1536}
rails db:migrate
```

And add `has_neighbors`
```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```

Create a method to call the [embeddings API](https://platform.openai.com/docs/guides/embeddings)
```ruby
require "json"
require "net/http"

def fetch_embeddings(input)
  url = "https://api.openai.com/v1/embeddings"
  headers = {
    "Authorization" => "Bearer #{ENV.fetch("OPENAI_API_KEY")}",
    "Content-Type" => "application/json"
  }
  data = {
    input: input,
    model: "text-embedding-3-small"
  }

  # value raises an error unless the response status is 2xx
  response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
  JSON.parse(response.body)["data"].map { |v| v["embedding"] }
end
```

Pass your input
```ruby
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = fetch_embeddings(input)
```

Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)
```

And get similar documents
```ruby
document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
```

See the [complete code](examples/openai_embeddings.rb)
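To exercise the parsing step without calling the API, the same extraction can be run against a hand-written payload. The sketch below makes two assumptions: `parse_embeddings` is a hypothetical helper name, and the sample body is an illustrative, shortened stand-in for a real response (real vectors from this model have 1536 values).

```ruby
require "json"

# Hypothetical helper: the same extraction fetch_embeddings applies to the response body
def parse_embeddings(body)
  JSON.parse(body)["data"].map { |v| v["embedding"] }
end

# Illustrative sample, shortened to 3 values per embedding
sample_body = {
  data: [
    {index: 0, embedding: [0.1, 0.2, 0.3]},
    {index: 1, embedding: [0.4, 0.5, 0.6]}
  ]
}.to_json

embeddings = parse_embeddings(sample_body)
embeddings.length # => 2
embeddings.first  # => [0.1, 0.2, 0.3]
```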
### Cohere Embeddings
Generate a model
```sh
rails generate model Document content:text embedding:bit{1024}
rails db:migrate
```

And add `has_neighbors`
```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```

Create a method to call the [embed API](https://docs.cohere.com/reference/embed)
```ruby
require "json"
require "net/http"

def fetch_embeddings(input, input_type)
  url = "https://api.cohere.com/v1/embed"
  headers = {
    "Authorization" => "Bearer #{ENV.fetch("CO_API_KEY")}",
    "Content-Type" => "application/json"
  }
  data = {
    texts: input,
    model: "embed-english-v3.0",
    input_type: input_type,
    embedding_types: ["ubinary"]
  }

  # value raises an error unless the response status is 2xx
  response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
  # each ubinary value is one byte; expand it to 8 bits and join into a bit string
  JSON.parse(response.body)["embeddings"]["ubinary"].map { |e| e.map { |v| v.chr.unpack1("B*") }.join }
end
```

Pass your input
```ruby
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = fetch_embeddings(input, "search_document")
```

Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)
```

Embed the search query
```ruby
query = "forest"
query_embedding = fetch_embeddings([query], "search_query")[0]
```

And search the documents
```ruby
Document.nearest_neighbors(:embedding, query_embedding, distance: "hamming").first(5).map(&:content)
```

See the [complete code](examples/cohere_embeddings.rb)
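The byte-to-bit conversion inside `fetch_embeddings` and the Hamming distance used for the search can be checked in plain Ruby. This is a minimal sketch with hypothetical helper names (`bytes_to_bits`, `hamming`) and made-up 3-byte inputs; in the app itself the distance is computed by Postgres, not in Ruby.

```ruby
# Illustrative helpers (not part of Neighbor)

# Convert an array of bytes (0-255) to a bit string, as fetch_embeddings does
def bytes_to_bits(bytes)
  bytes.map { |v| v.chr.unpack1("B*") }.join
end

# Count the positions where two equal-length bit strings differ
def hamming(a, b)
  raise ArgumentError, "length mismatch" unless a.length == b.length
  a.chars.zip(b.chars).count { |x, y| x != y }
end

doc = bytes_to_bits([255, 0, 170])   # => "111111110000000010101010"
query = bytes_to_bits([255, 1, 170]) # => "111111110000000110101010"
hamming(doc, query) # => 1
```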
### Disco Recommendations
You can use Neighbor for online item-based recommendations with [Disco](https://github.com/ankane/disco). We’ll use MovieLens data for this example.
Generate a model
```sh
rails generate model Movie name:string factors:cube
rails db:migrate
```

And add `has_neighbors`
```ruby
class Movie < ApplicationRecord
  has_neighbors :factors, dimensions: 20, normalize: true
end
```

Fit the recommender
```ruby
data = Disco.load_movielens
recommender = Disco::Recommender.new(factors: 20)
recommender.fit(data)
```

Store the item factors
```ruby
movies = []
recommender.item_ids.each do |item_id|
  movies << {name: item_id, factors: recommender.item_factors(item_id)}
end
Movie.insert_all!(movies)
```

And get similar movies
```ruby
movie = Movie.find_by(name: "Star Wars (1977)")
movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)
```

See the complete code for [cube](examples/disco_item_recs_cube.rb) and [vector](examples/disco_item_recs_vector.rb)
## History
View the [changelog](https://github.com/ankane/neighbor/blob/master/CHANGELOG.md)
## Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- [Report bugs](https://github.com/ankane/neighbor/issues)
- Fix bugs and [submit pull requests](https://github.com/ankane/neighbor/pulls)
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:
```sh
git clone https://github.com/ankane/neighbor.git
cd neighbor
bundle install
createdb neighbor_test
bundle exec rake test
```