# Neighbor
Nearest neighbor search for Rails
Supports:
- Postgres (cube and pgvector)
- SQLite (sqlite-vec) - experimental
- MariaDB 11.7 - experimental
- MySQL 9 (searching requires HeatWave) - experimental
## Installation
Add this line to your application’s Gemfile:
```ruby
gem "neighbor"
```

### For Postgres
Neighbor supports two extensions: [cube](https://www.postgresql.org/docs/current/cube.html) and [pgvector](https://github.com/pgvector/pgvector). cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.
For cube, run:
```sh
rails generate neighbor:cube
rails db:migrate
```

For pgvector, [install the extension](https://github.com/pgvector/pgvector#installation) and run:
```sh
rails generate neighbor:vector
rails db:migrate
```

### For SQLite
Add this line to your application’s Gemfile:
```ruby
gem "sqlite-vec"
```

And run:
```sh
rails generate neighbor:sqlite
```

## Getting Started
Create a migration
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    # cube
    add_column :items, :embedding, :cube

    # pgvector, MariaDB, and MySQL
    add_column :items, :embedding, :vector, limit: 3 # dimensions

    # sqlite-vec
    add_column :items, :embedding, :binary
  end
end
```

Add to your model
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding
end
```

Update the vectors
```ruby
item.update(embedding: [1.0, 1.2, 0.5])
```

Get the nearest neighbors to a record
```ruby
item.nearest_neighbors(:embedding, distance: "euclidean").first(5)
```

Get the nearest neighbors to a vector
```ruby
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)
```

Records returned from `nearest_neighbors` will have a `neighbor_distance` attribute
```ruby
nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
nearest_item.neighbor_distance
```

See the additional docs for:
- [cube](#cube)
- [pgvector](#pgvector)
- [sqlite-vec](#sqlite-vec)
- [MariaDB](#mariadb)
- [MySQL](#mysql)

Or check out some [examples](#examples)
## cube
### Distance
Supported values are:
- `euclidean`
- `cosine`
- `taxicab`
- `chebyshev`

For cosine distance with cube, vectors must be normalized before being stored.
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, normalize: true
end
```

For inner product with cube, see [this example](examples/disco/user_recs_cube.rb).
### Dimensions
The `cube` type can have up to 100 dimensions by default. See the [Postgres docs](https://www.postgresql.org/docs/current/cube.html) for how to increase this.
For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end
```

## pgvector
### Distance
Supported values are:
- `euclidean`
- `inner_product`
- `cosine`
- `taxicab`
- `hamming`
- `jaccard`

### Dimensions
The `vector` type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
The `halfvec` type can have up to 16,000 dimensions, and half vectors with up to 4,000 dimensions can be indexed.
The `bit` type can have up to 83 million dimensions, and bit vectors with up to 64,000 dimensions can be indexed.
The `sparsevec` type can have up to 16,000 non-zero elements, and sparse vectors with up to 1,000 non-zero elements can be indexed.
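As an illustration of these limits, an embedding too large for a plain `vector` index can be stored as `halfvec` instead (a hypothetical sketch; the 3,072-dimension column and table name are assumptions, and half-precision vectors are covered in more detail below):
```ruby
class AddLargeEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    # 3,072 dimensions exceeds the 2,000-dimension vector index limit,
    # but fits within the 4,000-dimension halfvec index limit
    add_column :items, :embedding, :halfvec, limit: 3072
    add_index :items, :embedding, using: :hnsw, opclass: :halfvec_l2_ops
  end
end
```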
### Indexing
Add an approximate index to speed up queries. Create a migration with:
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.0]
  def change
    add_index :items, :embedding, using: :hnsw, opclass: :vector_l2_ops
    # or
    add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
  end
end
```

Use `:vector_cosine_ops` for cosine distance and `:vector_ip_ops` for inner product.
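For example, an HNSW index for cosine distance might look like this (a minimal sketch reusing the table and column above):
```ruby
class AddCosineIndexToItemsEmbedding < ActiveRecord::Migration[8.0]
  def change
    add_index :items, :embedding, using: :hnsw, opclass: :vector_cosine_ops
  end
end
```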
Set the size of the dynamic candidate list with HNSW
```ruby
Item.connection.execute("SET hnsw.ef_search = 100")
```

Or the number of probes with IVFFlat
```ruby
Item.connection.execute("SET ivfflat.probes = 3")
```

### Half-Precision Vectors
Use the `halfvec` type to store half-precision vectors
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    add_column :items, :embedding, :halfvec, limit: 3 # dimensions
  end
end
```

### Half-Precision Indexing
Index vectors at half precision for smaller indexes
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.0]
  def change
    add_index :items, "(embedding::halfvec(3)) vector_l2_ops", using: :hnsw
  end
end
```

Get the nearest neighbors
```ruby
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)
```

### Binary Vectors
Use the `bit` type to store binary vectors
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    add_column :items, :embedding, :bit, limit: 3 # dimensions
  end
end
```

Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)
```

### Binary Quantization
Use expression indexing for binary quantization
```ruby
class AddIndexToItemsEmbedding < ActiveRecord::Migration[8.0]
  def change
    add_index :items, "(binary_quantize(embedding)::bit(3)) bit_hamming_ops", using: :hnsw
  end
end
```

### Sparse Vectors
Use the `sparsevec` type to store sparse vectors
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    add_column :items, :embedding, :sparsevec, limit: 3 # dimensions
  end
end
```

Get the nearest neighbors
```ruby
embedding = Neighbor::SparseVector.new({0 => 0.9, 1 => 1.3, 2 => 1.1}, 3)
Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)
```

## sqlite-vec
### Distance
Supported values are:
- `euclidean`
- `cosine`
- `taxicab`
- `hamming`

### Dimensions
For sqlite-vec, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end
```

### Virtual Tables
You can also use [virtual tables](https://alexgarcia.xyz/sqlite-vec/features/knn.html)
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
def change
# Rails 8+
create_virtual_table :items, :vec0, [
"id integer PRIMARY KEY AUTOINCREMENT NOT NULL",
"embedding float[3] distance_metric=L2"
]# Rails < 8
execute <<~SQL
CREATE VIRTUAL TABLE items USING vec0(
id integer PRIMARY KEY AUTOINCREMENT NOT NULL,
embedding float[3] distance_metric=L2
)
SQL
end
end
```Use `distance_metric=cosine` for cosine distance
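For instance, the Rails 8+ form with cosine distance could look like this (a sketch based on the migration above; the column size is an assumption):
```ruby
create_virtual_table :items, :vec0, [
  "id integer PRIMARY KEY AUTOINCREMENT NOT NULL",
  "embedding float[3] distance_metric=cosine"
]
```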
You can optionally ignore any shadow tables that are created
```ruby
ActiveRecord::SchemaDumper.ignore_tables += [
"items_chunks", "items_rowids", "items_vector_chunks00"
]
```Get the `k` nearest neighbors
```ruby
Item.where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)
```

Filter by primary key
```ruby
Item.where(id: [2, 3]).where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)
```

### Int8 Vectors
Use the `type` option for int8 vectors
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3, type: :int8
end
```

### Binary Vectors
Use the `type` option for binary vectors
```ruby
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 8, type: :bit
end
```

Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)
```

## MariaDB
### Distance
Supported values are:
- `euclidean`
- `cosine`
- `hamming`

### Indexing
Vector columns must use `null: false` to add a vector index
```ruby
class CreateItems < ActiveRecord::Migration[8.0]
  def change
    create_table :items do |t|
      t.vector :embedding, limit: 3, null: false
      t.index :embedding, type: :vector
    end
  end
end
```

### Binary Vectors
Use the `bigint` type to store binary vectors
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    add_column :items, :embedding, :bigint
  end
end
```

Note: Binary vectors can have up to 64 dimensions
Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, 5, distance: "hamming").first(5)
```

## MySQL
### Distance
Supported values are:
- `euclidean`
- `cosine`
- `hamming`

Note: The `DISTANCE()` function is [only available on HeatWave](https://dev.mysql.com/doc/refman/9.0/en/vector-functions.html)
### Binary Vectors
Use the `binary` type to store binary vectors
```ruby
class AddEmbeddingToItems < ActiveRecord::Migration[8.0]
  def change
    add_column :items, :embedding, :binary
  end
end
```

Get the nearest neighbors by Hamming distance
```ruby
Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)
```

## Examples
- [Embeddings](#openai-embeddings) with OpenAI
- [Binary embeddings](#cohere-embeddings) with Cohere
- [Sentence embeddings](#sentence-embeddings) with Informers
- [Hybrid search](#hybrid-search) with Informers
- [Sparse search](#sparse-search) with Transformers.rb
- [Recommendations](#disco-recommendations) with Disco

### OpenAI Embeddings
Generate a model
```sh
rails generate model Document content:text embedding:vector{1536}
rails db:migrate
```

And add `has_neighbors`
```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```

Create a method to call the [embeddings API](https://platform.openai.com/docs/guides/embeddings)
```ruby
def fetch_embeddings(input)
url = "https://api.openai.com/v1/embeddings"
headers = {
"Authorization" => "Bearer #{ENV.fetch("OPENAI_API_KEY")}",
"Content-Type" => "application/json"
}
data = {
input: input,
model: "text-embedding-3-small"
}response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
JSON.parse(response.body)["data"].map { |v| v["embedding"] }
end
```Pass your input
```ruby
input = [
"The dog is barking",
"The cat is purring",
"The bear is growling"
]
embeddings = fetch_embeddings(input)
```Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)
```

And get similar documents
```ruby
document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
```

See the [complete code](examples/openai/example.rb)
### Cohere Embeddings
Generate a model
```sh
rails generate model Document content:text embedding:bit{1024}
rails db:migrate
```

And add `has_neighbors`
```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```

Create a method to call the [embed API](https://docs.cohere.com/reference/embed)
```ruby
def fetch_embeddings(input, input_type)
url = "https://api.cohere.com/v1/embed"
headers = {
"Authorization" => "Bearer #{ENV.fetch("CO_API_KEY")}",
"Content-Type" => "application/json"
}
data = {
texts: input,
model: "embed-english-v3.0",
input_type: input_type,
embedding_types: ["ubinary"]
}response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
JSON.parse(response.body)["embeddings"]["ubinary"].map { |e| e.map { |v| v.chr.unpack1("B*") }.join }
end
```Pass your input
```ruby
input = [
"The dog is barking",
"The cat is purring",
"The bear is growling"
]
embeddings = fetch_embeddings(input, "search_document")
```Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)
```

Embed the search query
```ruby
query = "forest"
query_embedding = fetch_embeddings([query], "search_query")[0]
```

And search the documents
```ruby
Document.nearest_neighbors(:embedding, query_embedding, distance: "hamming").first(5).map(&:content)
```

See the [complete code](examples/cohere/example.rb)
### Sentence Embeddings
You can generate embeddings locally with [Informers](https://github.com/ankane/informers).
Generate a model
```sh
rails generate model Document content:text embedding:vector{384}
rails db:migrate
```

And add `has_neighbors`
```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```

Load a [model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
```ruby
model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")
```

Pass your input
```ruby
input = [
"The dog is barking",
"The cat is purring",
"The bear is growling"
]
embeddings = model.(input)
```Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)
```

And get similar documents
```ruby
document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)
```

See the [complete code](examples/informers/example.rb)
### Hybrid Search
You can use Neighbor for hybrid search with [Informers](https://github.com/ankane/informers).
Generate a model
```sh
rails generate model Document content:text embedding:vector{768}
rails db:migrate
```

And add `has_neighbors` and a scope for keyword search
```ruby
class Document < ApplicationRecord
  has_neighbors :embedding

  scope :search, ->(query) {
    where("to_tsvector(content) @@ plainto_tsquery(?)", query)
      .order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
  }
end
```

Create some documents
```ruby
Document.create!(content: "The dog is barking")
Document.create!(content: "The cat is purring")
Document.create!(content: "The bear is growling")
```

Generate an embedding for each document
```ruby
embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model

Document.find_each do |document|
  embedding = embed.(document.content, **embed_options)
  document.update!(embedding: embedding)
end
```

Perform keyword search
```ruby
query = "growling bear"
keyword_results = Document.search(query).limit(20).load_async
```And semantic search in parallel (the query prefix is specific to the [embedding model](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5))
```ruby
query_prefix = "Represent this sentence for searching relevant passages: "
query_embedding = embed.(query_prefix + query, **embed_options)
semantic_results =
  Document.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20).load_async
```

To combine the results, use Reciprocal Rank Fusion (RRF)
```ruby
Neighbor::Reranking.rrf(keyword_results, semantic_results).first(5)
```

Or a reranking model
```ruby
rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
results = (keyword_results + semantic_results).uniq
rerank.(query, results.map(&:content)).first(5).map { |v| results[v[:doc_id]] }
```

See the [complete code](examples/hybrid/example.rb)
### Sparse Search
You can generate sparse embeddings locally with [Transformers.rb](https://github.com/ankane/transformers-ruby).
Generate a model
```sh
rails generate model Document content:text embedding:sparsevec{30522}
rails db:migrate
```

And add `has_neighbors`
```ruby
class Document < ApplicationRecord
  has_neighbors :embedding
end
```

Load a [model](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-v1) to generate embeddings
```ruby
class EmbeddingModel
  def initialize(model_id)
    @model = Transformers::AutoModelForMaskedLM.from_pretrained(model_id)
    @tokenizer = Transformers::AutoTokenizer.from_pretrained(model_id)
    @special_token_ids = @tokenizer.special_tokens_map.map { |_, token| @tokenizer.vocab[token] }
  end

  def embed(input)
    feature = @tokenizer.(input, padding: true, truncation: true, return_tensors: "pt", return_token_type_ids: false)
    output = @model.(**feature)[0]
    values = Torch.max(output * feature[:attention_mask].unsqueeze(-1), dim: 1)[0]
    values = Torch.log(1 + Torch.relu(values))
    values[0.., @special_token_ids] = 0
    values.to_a
  end
end

model = EmbeddingModel.new("opensearch-project/opensearch-neural-sparse-encoding-v1")
```

Pass your input
```ruby
input = [
"The dog is barking",
"The cat is purring",
"The bear is growling"
]
embeddings = model.embed(input)
```Store the embeddings
```ruby
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: Neighbor::SparseVector.new(embedding)}
end
Document.insert_all!(documents)
```

Embed the search query
```ruby
query = "forest"
query_embedding = model.embed([query])[0]
```

And search the documents
```ruby
Document.nearest_neighbors(:embedding, Neighbor::SparseVector.new(query_embedding), distance: "inner_product").first(5).map(&:content)
```

See the [complete code](examples/sparse/example.rb)
### Disco Recommendations
You can use Neighbor for online item-based recommendations with [Disco](https://github.com/ankane/disco). We’ll use MovieLens data for this example.
Generate a model
```sh
rails generate model Movie name:string factors:cube
rails db:migrate
```

And add `has_neighbors`
```ruby
class Movie < ApplicationRecord
  has_neighbors :factors, dimensions: 20, normalize: true
end
```

Fit the recommender
```ruby
data = Disco.load_movielens
recommender = Disco::Recommender.new(factors: 20)
recommender.fit(data)
```

Store the item factors
```ruby
movies = []
recommender.item_ids.each do |item_id|
  movies << {name: item_id, factors: recommender.item_factors(item_id)}
end
Movie.create!(movies)
```

And get similar movies
```ruby
movie = Movie.find_by(name: "Star Wars (1977)")
movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)
```

See the complete code for [cube](examples/disco/item_recs_cube.rb) and [pgvector](examples/disco/item_recs_vector.rb)
## History
View the [changelog](https://github.com/ankane/neighbor/blob/master/CHANGELOG.md)
## Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- [Report bugs](https://github.com/ankane/neighbor/issues)
- Fix bugs and [submit pull requests](https://github.com/ankane/neighbor/pulls)
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:
```sh
git clone https://github.com/ankane/neighbor.git
cd neighbor
bundle install

# Postgres
createdb neighbor_test
bundle exec rake test:postgresql

# SQLite
bundle exec rake test:sqlite

# MariaDB
docker run -e MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 -e MARIADB_DATABASE=neighbor_test -p 3307:3306 mariadb:11.7-rc
bundle exec rake test:mariadb

# MySQL
docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=neighbor_test -p 3306:3306 mysql:9
bundle exec rake test:mysql
```