{"id":14563686,"url":"https://github.com/asg017/sqlite-rembed","last_synced_at":"2025-04-06T12:08:06.792Z","repository":{"id":242270231,"uuid":"809132912","full_name":"asg017/sqlite-rembed","owner":"asg017","description":"A SQLite extension for generating text embeddings from remote APIs (OpenAI, Nomic, Ollama, llamafile...)","archived":false,"fork":false,"pushed_at":"2024-11-04T06:50:46.000Z","size":106,"stargazers_count":112,"open_issues_count":13,"forks_count":7,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-30T11:06:40.964Z","etag":null,"topics":["sqlite-extension"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/asg017.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-01T19:31:12.000Z","updated_at":"2025-03-05T05:03:50.000Z","dependencies_parsed_at":"2024-11-15T03:04:40.242Z","dependency_job_id":"b1b5a2de-a528-41c9-8256-4bd86bfaad49","html_url":"https://github.com/asg017/sqlite-rembed","commit_stats":{"total_commits":33,"total_committers":1,"mean_commits":33.0,"dds":0.0,"last_synced_commit":"571b5943c235382d43356552a2b5d665b7b29037"},"previous_names":["asg017/sqlite-rembed"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-rembed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-rembed/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-rembed/r
eleases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asg017%2Fsqlite-rembed/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/asg017","download_url":"https://codeload.github.com/asg017/sqlite-rembed/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247478321,"owners_count":20945266,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["sqlite-extension"],"created_at":"2024-09-07T02:04:20.623Z","updated_at":"2025-04-06T12:08:06.773Z","avatar_url":"https://github.com/asg017.png","language":"Rust","readme":"# `sqlite-rembed`\n\nA SQLite extension for generating text embeddings from remote APIs (OpenAI, Nomic, Cohere, llamafile, Ollama, etc.). A sister project to [`sqlite-vec`](https://github.com/asg017/sqlite-vec) and [`sqlite-lembed`](https://github.com/asg017/sqlite-lembed). A work-in-progress!\n\n## Usage\n\n```sql\n.load ./rembed0\n\nINSERT INTO temp.rembed_clients(name, options)\n VALUES ('text-embedding-3-small', 'openai');\n\nselect rembed(\n  'text-embedding-3-small',\n  'The United States Postal Service is an independent agency...'\n);\n```\n\nThe `temp.rembed_clients` virtual table lets you \"register\" clients with pure `INSERT INTO` statements. 
The `name` field is a unique identifier for a given client, and `options` allows you to specify which 3rd party embedding service you want to use.\n\nIn this case, `openai` is a pre-defined client that will default to OpenAI's `https://api.openai.com/v1/embeddings` endpoint and will source your API key from the `OPENAI_API_KEY` environment variable. The name of the client, `text-embedding-3-small`, will be used as the embeddings model.\n\nOther pre-defined clients include:\n\n| Client name  | Provider                                                                             | Endpoint                                       | API Key              |\n| ------------ | ------------------------------------------------------------------------------------ | ---------------------------------------------- | -------------------- |\n| `openai`     | [OpenAI](https://platform.openai.com/docs/guides/embeddings)                         | `https://api.openai.com/v1/embeddings`         | `OPENAI_API_KEY`     |\n| `nomic`      | [Nomic](https://docs.nomic.ai/reference/endpoints/nomic-embed-text)                  | `https://api-atlas.nomic.ai/v1/embedding/text` | `NOMIC_API_KEY`      |\n| `cohere`     | [Cohere](https://docs.cohere.com/reference/embed)                                    | `https://api.cohere.com/v1/embed`              | `CO_API_KEY`         |\n| `jina`       | [Jina](https://api.jina.ai/redoc#tag/embeddings)                                     | `https://api.jina.ai/v1/embeddings`            | `JINA_API_KEY`       |\n| `mixedbread` | [MixedBread](https://www.mixedbread.ai/api-reference#quick-start-guide)              | `https://api.mixedbread.ai/v1/embeddings/`     | `MIXEDBREAD_API_KEY` |\n| `llamafile`  | [llamafile](https://github.com/Mozilla-Ocho/llamafile)                               | `http://localhost:8080/embedding`              | None                 |\n| `ollama`     | [Ollama](https://github.com/ollama/ollama/blob/main/docs/api.md#generate-embeddings) 
| `http://localhost:11434/api/embeddings`        | None                 |\n\nDifferent client options can be specified with `rembed_client_options()`. For example, if you have a different OpenAI-compatible service you want to use, then you can use:\n\n```sql\nINSERT INTO temp.rembed_clients(name, options) VALUES\n  (\n    'xyz-small-1',\n    rembed_client_options(\n      'format', 'openai',\n      'url', 'https://api.xyz.com/v1/embeddings',\n      'key', 'xyz-ca865ece65-hunter2'\n    )\n  );\n```\n\nOr to use a llamafile server that's on a different port:\n\n```sql\nINSERT INTO temp.rembed_clients(name, options) VALUES\n  (\n    'xyz-small-1',\n    rembed_client_options(\n      'format', 'llamafile',\n      'url', 'http://localhost:9999/embedding'\n    )\n  );\n```\n\n### Using with `sqlite-vec`\n\n`sqlite-rembed` works well with [`sqlite-vec`](https://github.com/asg017/sqlite-vec), a SQLite extension for vector search. Embeddings generated with `rembed()` use the same BLOB format for vectors that `sqlite-vec` uses.\n\nHere's a sample \"semantic search\" application, made from a sample dataset of news article headlines.\n\n```sql\ncreate table articles(\n  headline text\n);\n\n-- Random NPR headlines from 2024-06-04\ninsert into articles VALUES\n  ('Shohei Ohtani''s ex-interpreter pleads guilty to charges related to gambling and theft'),\n  ('The jury has been selected in Hunter Biden''s gun trial'),\n  ('Larry Allen, a Super Bowl champion and famed Dallas Cowboy, has died at age 52'),\n  ('After saying Charlotte, a lone stingray, was pregnant, aquarium now says she''s sick'),\n  ('An Epoch Times executive is facing money laundering charge');\n\n\n-- Build a vector table with embeddings of article headlines, using OpenAI's API\ncreate virtual table vec_articles using vec0(\n  headline_embeddings float[1536]\n);\n\ninsert into vec_articles(rowid, headline_embeddings)\n  select rowid, rembed('text-embedding-3-small', headline)\n  from articles;\n\n```\n\nNow we have 
a regular `articles` table that stores text headlines, and a `vec_articles` virtual table that stores embeddings of the article headlines, using OpenAI's `text-embedding-3-small` model.\n\nTo perform a \"semantic search\" on the embeddings, we can query the `vec_articles` table with an embedding of our query, and join the results back to our `articles` table to retrieve the original headlines.\n\n```sql\n.param set :query 'firearm courtroom'\n\nwith matches as (\n  select\n    rowid,\n    distance\n  from vec_articles\n  where headline_embeddings match rembed('text-embedding-3-small', :query)\n  order by distance\n  limit 3\n)\nselect\n  headline,\n  distance\nfrom matches\nleft join articles on articles.rowid = matches.rowid;\n\n/*\n+--------------------------------------------------------------+------------------+\n|                           headline                           |     distance     |\n+--------------------------------------------------------------+------------------+\n| The jury has been selected in Hunter Biden's gun trial       | 1.05906391143799 |\n+--------------------------------------------------------------+------------------+\n| Shohei Ohtani's ex-interpreter pleads guilty to charges rela | 1.2574303150177  |\n| ted to gambling and theft                                    |                  |\n+--------------------------------------------------------------+------------------+\n| An Epoch Times executive is facing money laundering charge   | 1.27144026756287 |\n+--------------------------------------------------------------+------------------+\n*/\n```\n\nNotice that \"firearm courtroom\" doesn't appear in any of these headlines, yet the search still ranks \"Hunter Biden's gun trial\" first, followed by the other two justice-related articles.\n\n## Drawbacks\n\n1. **No batch support yet.** If you use `rembed()` in a batch UPDATE or INSERT on 1,000 rows, then 1,000 HTTP requests will be made. 
Add a :+1: to [Issue #1](https://github.com/asg017/sqlite-rembed/issues/1) if you want to see this fixed.\n2. **No built-in rate limiting.** Requests are sent sequentially, so this may not come up in small demos, but `sqlite-rembed` could add features that handle rate limiting and retries implicitly. Add a :+1: to [Issue #2](https://github.com/asg017/sqlite-rembed/issues/2) if you want to see this implemented.\n","funding_links":[],"categories":["others","extentions"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasg017%2Fsqlite-rembed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fasg017%2Fsqlite-rembed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasg017%2Fsqlite-rembed/lists"}