Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/asg017/sqlite-lembed
A SQLite extension for generating text embeddings from GGUF models using llama.cpp
- Host: GitHub
- URL: https://github.com/asg017/sqlite-lembed
- Owner: asg017
- Created: 2024-05-23T22:57:48.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-08-24T23:11:34.000Z (3 months ago)
- Last Synced: 2024-08-29T18:31:31.005Z (2 months ago)
- Topics: sqlite-extension
- Language: C
- Homepage:
- Size: 72.3 KB
- Stars: 78
- Watchers: 2
- Forks: 1
- Open Issues: 4
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-sqlite - asg017/sqlite-lembed: A SQLite extension for generating text embeddings from GGUF models using llama.cpp
README
# `sqlite-lembed`
A SQLite extension for generating text embeddings with [llama.cpp](https://github.com/ggerganov/llama.cpp). A sister project to [`sqlite-vec`](https://github.com/asg017/sqlite-vec) and [`sqlite-rembed`](https://github.com/asg017/sqlite-rembed). A work-in-progress!
## Usage
`sqlite-lembed` uses embedding models in the [GGUF format](https://huggingface.co/docs/hub/en/gguf) to generate embeddings. These can be hard to find or convert, so here's a sample model you can use:
```bash
curl -L -o all-MiniLM-L6-v2.e4ce9877.q8_0.gguf https://huggingface.co/asg017/sqlite-lembed-model-examples/resolve/main/all-MiniLM-L6-v2/all-MiniLM-L6-v2.e4ce9877.q8_0.gguf
```
This is the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model that I converted to the `.gguf` format, and quantized at `Q8_0` (made smaller at the expense of some quality).
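To give a feel for what `Q8_0` trades away, here is a rough Python sketch of block-wise 8-bit quantization in the style of ggml's `Q8_0`. It assumes blocks of 32 values that share one half-precision scale. It is an illustration under that assumption, not the code llama.cpp actually runs, and `quantize_q8_0` and `dequantize_q8_0` are hypothetical helper names.

```python
import struct

BLOCK_SIZE = 32  # assumption: Q8_0 groups 32 values under one shared scale


def quantize_q8_0(values):
    """Quantize floats into a list of (scale, int8 values) blocks."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of BLOCK_SIZE")
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        amax = max(abs(v) for v in block)
        # Round-trip the scale through float16, since the scale is assumed
        # to be stored in half precision.
        scale = struct.unpack("<e", struct.pack("<e", amax / 127.0))[0]
        if scale == 0.0:
            quants = [0] * BLOCK_SIZE
        else:
            # Clamp so every quantized value fits in a signed 8-bit integer.
            quants = [max(-127, min(127, round(v / scale))) for v in block]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Rebuild approximate floats from (scale, int8 values) blocks."""
    return [scale * q for scale, quants in blocks for q in quants]
```

Under these assumptions each block of 32 weights takes 34 bytes (a 2-byte scale plus 32 one-byte values) instead of 128 bytes as `float32`, and each restored value can differ from the original by up to about half a scale step. That rounding error is the "some quality" being given up.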
To load it into `sqlite-lembed`, register it with the `temp.lembed_models` table.
```sql
.load ./lembed0

INSERT INTO temp.lembed_models(name, model)
  select 'all-MiniLM-L6-v2', lembed_model_from_file('all-MiniLM-L6-v2.e4ce9877.q8_0.gguf');

select lembed(
  'all-MiniLM-L6-v2',
  'The United States Postal Service is an independent agency...'
);
```
The `temp.lembed_models` virtual table lets you "register" models with pure `INSERT INTO` statements. The `name` field is a unique identifier for a given model, and `model` is provided as a path to the `.gguf` model, on disk, with the `lembed_model_from_file()` function.
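Outside the `sqlite3` shell, the same steps (load the extension, register a model, call `lembed()`) can be driven from a host language. Below is a minimal Python sketch using the standard `sqlite3` module. It assumes the compiled extension sits at `./lembed0` as in the example above and that your Python build allows loading extensions. `try_load_extension`, `register_model`, and `embed` are hypothetical helper names, not part of `sqlite-lembed`.

```python
import sqlite3


def try_load_extension(db, path):
    """Load a SQLite extension; return True on success, False otherwise."""
    try:
        db.enable_load_extension(True)
        db.load_extension(path)
        db.enable_load_extension(False)
        return True
    except (AttributeError, sqlite3.Error):
        # AttributeError: this Python build lacks extension loading support.
        # sqlite3.Error: the file is missing or is not a loadable extension.
        return False


def register_model(db, name, gguf_path):
    """Register a GGUF model under `name`, mirroring the INSERT shown above."""
    db.execute(
        "INSERT INTO temp.lembed_models(name, model) "
        "SELECT ?, lembed_model_from_file(?)",
        (name, gguf_path),
    )


def embed(db, name, text):
    """Return the embedding BLOB for `text` using a registered model."""
    return db.execute("SELECT lembed(?, ?)", (name, text)).fetchone()[0]
```

A typical call sequence would be `try_load_extension(db, "./lembed0")`, then `register_model(db, "all-MiniLM-L6-v2", "all-MiniLM-L6-v2.e4ce9877.q8_0.gguf")`, then `embed(db, "all-MiniLM-L6-v2", "some text")`. Because the registration lives in the `temp` schema, expect to repeat it for every new connection.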
### Using with `sqlite-vec`
`sqlite-lembed` works well with [`sqlite-vec`](https://github.com/asg017/sqlite-vec), a SQLite extension for vector search. Embeddings generated with `lembed()` use the same BLOB format for vectors that `sqlite-vec` uses.
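As far as the `sqlite-vec` documentation describes it, that BLOB format is a packed array of 32-bit floats, so an embedding can be decoded in a host language without either extension loaded. Here is a small Python sketch, assuming little-endian `float32` values (the byte order on common x86 and ARM machines); `floats_to_blob`, `blob_to_floats`, and `roundtrip_through_sqlite` are hypothetical helper names.

```python
import sqlite3
import struct


def floats_to_blob(values):
    """Pack a list of floats into a compact float32 BLOB."""
    return struct.pack(f"<{len(values)}f", *values)


def blob_to_floats(blob):
    """Unpack a float32 BLOB back into a list of Python floats."""
    if len(blob) % 4 != 0:
        raise ValueError("BLOB length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def roundtrip_through_sqlite(values):
    """Store a vector in a plain BLOB column and read it back."""
    db = sqlite3.connect(":memory:")
    db.execute("create table vectors(embedding blob)")
    db.execute("insert into vectors values (?)", (floats_to_blob(values),))
    (blob,) = db.execute("select embedding from vectors").fetchone()
    db.close()
    return blob_to_floats(blob)
```

Under this layout a 384-dimensional embedding occupies 384 × 4 = 1536 bytes.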
Here's a sample "semantic search" application, made from a sample dataset of news article headlines.
```sql
create table articles(
  headline text
);

-- Random NPR headlines from 2024-06-04
insert into articles VALUES
  ('Shohei Ohtani''s ex-interpreter pleads guilty to charges related to gambling and theft'),
  ('The jury has been selected in Hunter Biden''s gun trial'),
  ('Larry Allen, a Super Bowl champion and famed Dallas Cowboy, has died at age 52'),
  ('After saying Charlotte, a lone stingray, was pregnant, aquarium now says she''s sick'),
  ('An Epoch Times executive is facing money laundering charge');

-- Build a vector table with embeddings of article headlines
create virtual table vec_articles using vec0(
  headline_embeddings float[384]
);

insert into vec_articles(rowid, headline_embeddings)
  select rowid, lembed('all-MiniLM-L6-v2', headline)
  from articles;
```
Now we have a regular `articles` table that stores text headlines, and a `vec_articles` virtual table that stores embeddings of the article headlines, using the `all-MiniLM-L6-v2` model.
To perform a "semantic search" on the embeddings, we can query the `vec_articles` table with an embedding of our query, and join the results back to our `articles` table to retrieve the original headlines.
```sql
.param set :query 'firearm courtroom'

with matches as (
  select
    rowid,
    distance
  from vec_articles
  where headline_embeddings match lembed('all-MiniLM-L6-v2', :query)
  order by distance
  limit 3
)
select
  headline,
  distance
from matches
left join articles on articles.rowid = matches.rowid;

/*
+--------------------------------------------------------------+------------------+
| headline                                                     | distance         |
+--------------------------------------------------------------+------------------+
| Shohei Ohtani's ex-interpreter pleads guilty to charges rela | 1.14812409877777 |
| ted to gambling and theft                                    |                  |
+--------------------------------------------------------------+------------------+
| The jury has been selected in Hunter Biden's gun trial       | 1.18380105495453 |
+--------------------------------------------------------------+------------------+
| An Epoch Times executive is facing money laundering charge   | 1.27715671062469 |
+--------------------------------------------------------------+------------------+
*/
```
Notice how "firearm courtroom" doesn't appear in any of these headlines, but the search can still figure out that "Hunter Biden's gun trial" is related, and the two other justice-related articles also rank at the top.
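Conceptually, the `match ... order by distance limit 3` query ranks the stored vectors by their distance to the query vector and keeps the closest few. The Python sketch below shows that ranking as a brute-force scan using Euclidean (L2) distance. The choice of metric is an assumption here, the vectors in the usage note are invented rather than real model output, and `knn` is a hypothetical helper that says nothing about how `sqlite-vec` is implemented internally.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query, rows, k):
    """Return the k nearest (rowid, distance) pairs, closest first.

    `rows` is an iterable of (rowid, vector) pairs.
    """
    scored = [(rowid, l2_distance(query, vector)) for rowid, vector in rows]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]
```

For example, with `rows = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [0.9, 0.1])]`, calling `knn([1.0, 0.0], rows, 2)` returns rowids 1 and 3 in that order, since those two vectors lie closest to the query.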
## Embedding Models in `.gguf` format
Most embedding models out there are provided as PyTorch/ONNX models, but `sqlite-lembed` uses models in the [GGUF file format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md). Since ggml/GGUF is relatively new, embedding models in that format can be hard to find. You can always [convert models yourself](https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py), or use one of these pre-converted embedding models already in GGUF format:
| Model Name | Link |
| ----------------------- | ---------------------------------------------------------- |
| `nomic-embed-text-v1.5` | https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF |
| `mxbai-embed-large-v1` | https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1 |
## Drawbacks
1. **No batch support yet.** `llama.cpp` has support for batch processing multiple inputs, but I haven't figured that out yet. Add a :+1: to [Issue #2](https://github.com/asg017/sqlite-lembed/issues/2) if you want to see this fixed.
2. **Pre-compiled versions of `sqlite-lembed` don't use the GPU.** This was done to make compiling/distribution easier, but that means it will likely take a long time to generate embeddings. If you need it to go faster, try compiling `sqlite-lembed` yourself (docs coming soon).