{"id":14383661,"url":"https://github.com/asg017/sqlite-lembed","last_synced_at":"2025-04-05T01:03:48.436Z","repository":{"id":241646992,"uuid":"805113086","full_name":"asg017/sqlite-lembed","owner":"asg017","description":"A SQLite extension for generating text embeddings from GGUF models using llama.cpp","archived":false,"fork":false,"pushed_at":"2024-11-24T06:10:43.000Z","size":85,"stargazers_count":178,"open_issues_count":12,"forks_count":8,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-29T00:04:40.683Z","etag":null,"topics":["sqlite-extension"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/asg017.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-23T22:57:48.000Z","updated_at":"2025-03-22T12:52:34.000Z","dependencies_parsed_at":"2024-06-04T06:23:48.230Z","dependency_job_id":"0eba182d-adb3-4057-ad44-115aadb66d4f","html_url":"https://github.com/asg017/sqlite-lembed","commit_stats":{"total_commits":55,"total_committers":1,"mean_commits":55.0,"dds":0.0,"last_synced_commit":"23fe65121d9a440bccc5f46ff89e33f81d02fcb4"},"previous_names":["asg017/sqlite-lembed"],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-lembed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-lembed/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-lembed/releases","manifests_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-lembed/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/asg017","download_url":"https://codeload.github.com/asg017/sqlite-lembed/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247271515,"owners_count":20911587,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["sqlite-extension"],"created_at":"2024-08-28T18:00:56.394Z","updated_at":"2025-04-05T01:03:48.418Z","avatar_url":"https://github.com/asg017.png","language":"C","readme":"# `sqlite-lembed`\n\nA SQLite extension for generating text embeddings with [llama.cpp](https://github.com/ggerganov/llama.cpp). A sister project to [`sqlite-vec`](https://github.com/asg017/sqlite-vec) and [`sqlite-rembed`](https://github.com/asg017/sqlite-rembed). A work-in-progress!\n\n## Usage\n\n`sqlite-lembed` uses embeddings models that are in the [GGUF format](https://huggingface.co/docs/hub/en/gguf) to generate embeddings. 
These are a bit hard to find or convert, so here's a sample model you can use:\n\n```bash\ncurl -L -o all-MiniLM-L6-v2.e4ce9877.q8_0.gguf https://huggingface.co/asg017/sqlite-lembed-model-examples/resolve/main/all-MiniLM-L6-v2/all-MiniLM-L6-v2.e4ce9877.q8_0.gguf\n```\n\nThis is the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model that I converted to the `.gguf` format, and quantized at `Q8_0` (made smaller at the expense of some quality).\n\nTo load it into `sqlite-lembed`, register it with the `temp.lembed_models` table.\n\n```sql\n.load ./lembed0\n\nINSERT INTO temp.lembed_models(name, model)\n  select 'all-MiniLM-L6-v2', lembed_model_from_file('all-MiniLM-L6-v2.e4ce9877.q8_0.gguf');\n\nselect lembed(\n  'all-MiniLM-L6-v2',\n  'The United States Postal Service is an independent agency...'\n);\n```\n\nThe `temp.lembed_models` virtual table lets you \"register\" models with pure `INSERT INTO` statements. The `name` field is a unique identifier for a given model, and `model` is provided as a path to the `.gguf` model, on disk, with the `lembed_model_from_file()` function.\n\n### Using with `sqlite-vec`\n\n`sqlite-lembed` works well with [`sqlite-vec`](https://github.com/asg017/sqlite-vec), a SQLite extension for vector search. 
Embeddings generated with `lembed()` use the same BLOB format for vectors that `sqlite-vec` uses.\n\nHere's a sample \"semantic search\" application, made from a sample dataset of news article headlines.\n\n```sql\ncreate table articles(\n  headline text\n);\n\n-- Random NPR headlines from 2024-06-04\ninsert into articles VALUES\n  ('Shohei Ohtani''s ex-interpreter pleads guilty to charges related to gambling and theft'),\n  ('The jury has been selected in Hunter Biden''s gun trial'),\n  ('Larry Allen, a Super Bowl champion and famed Dallas Cowboy, has died at age 52'),\n  ('After saying Charlotte, a lone stingray, was pregnant, aquarium now says she''s sick'),\n  ('An Epoch Times executive is facing money laundering charge');\n\n\n-- Build a vector table with embeddings of article headlines\ncreate virtual table vec_articles using vec0(\n  headline_embeddings float[384]\n);\n\ninsert into vec_articles(rowid, headline_embeddings)\n  select rowid, lembed('all-MiniLM-L6-v2', headline)\n  from articles;\n\n```\n\nNow we have a regular `articles` table that stores text headlines, and a `vec_articles` virtual table that stores embeddings of the article headlines, using the `all-MiniLM-L6-v2` model.\n\nTo perform a \"semantic search\" on the embeddings, we can query the `vec_articles` table with an embedding of our query, and join the results back to our `articles` table to retrieve the original headlines.\n\n```sql\n.param set :query 'firearm courtroom'\n\nwith matches as (\n  select\n    rowid,\n    distance\n  from vec_articles\n  where headline_embeddings match lembed('all-MiniLM-L6-v2', :query)\n  order by distance\n  limit 3\n)\nselect\n  headline,\n  distance\nfrom matches\nleft join articles on articles.rowid = matches.rowid;\n\n/*\n+--------------------------------------------------------------+------------------+\n|                           headline                           |     distance     
|\n+--------------------------------------------------------------+------------------+\n| Shohei Ohtani's ex-interpreter pleads guilty to charges rela | 1.14812409877777 |\n| ted to gambling and theft                                    |                  |\n+--------------------------------------------------------------+------------------+\n| The jury has been selected in Hunter Biden's gun trial       | 1.18380105495453 |\n+--------------------------------------------------------------+------------------+\n| An Epoch Times executive is facing money laundering charge   | 1.27715671062469 |\n+--------------------------------------------------------------+------------------+\n*/\n```\n\nNotice how \"firearm courtroom\" doesn't appear in any of these headlines, but the search can still figure out that \"Hunter Biden's gun trial\" is related, and the other two justice-related articles appear on top.\n\n## Embedding Models in `.gguf` format\n\nMost embedding models out there are provided as PyTorch/ONNX models, but `sqlite-lembed` uses models in the [GGUF file format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md). However, since ggml/GGUF is relatively new, GGUF embedding models can be hard to find. You can always [convert models yourself](https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py), or here are a few pre-converted embedding models already in GGUF format:\n\n| Model Name              | Link                                                       |\n| ----------------------- | ---------------------------------------------------------- |\n| `nomic-embed-text-v1.5` | https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF |\n| `mxbai-embed-large-v1`  | https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1  |\n\n## Drawbacks\n\n1. **No batch support yet.** `llama.cpp` has support for batch processing multiple inputs, but I haven't figured that out yet. 
Add a :+1: to [Issue #2](https://github.com/asg017/sqlite-lembed/issues/2) if you want to see this fixed.\n2. **Pre-compiled versions of `sqlite-lembed` don't use the GPU.** This was done to make compiling/distribution easier, but it means generating embeddings will likely take a long time. If you need it to go faster, try compiling `sqlite-lembed` yourself (docs coming soon).\n","funding_links":[],"categories":["C","others","extentions"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasg017%2Fsqlite-lembed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fasg017%2Fsqlite-lembed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasg017%2Fsqlite-lembed/lists"}