{"id":15432909,"url":"https://github.com/simonw/datasette-faiss","last_synced_at":"2025-08-19T11:32:28.864Z","repository":{"id":65206697,"uuid":"587578070","full_name":"simonw/datasette-faiss","owner":"simonw","description":"Maintain a FAISS index for specified Datasette tables","archived":false,"fork":false,"pushed_at":"2024-06-17T18:12:08.000Z","size":34,"stargazers_count":34,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-10-18T07:53:15.909Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simonw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-11T04:28:21.000Z","updated_at":"2024-06-19T05:42:42.000Z","dependencies_parsed_at":"2024-10-20T20:19:11.526Z","dependency_job_id":null,"html_url":"https://github.com/simonw/datasette-faiss","commit_stats":{"total_commits":15,"total_committers":1,"mean_commits":15.0,"dds":0.0,"last_synced_commit":"4fe395607bd6e5a7ec63b737bd787535501b589d"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fdatasette-faiss","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fdatasette-faiss/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fdatasette-faiss/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonw%2Fdatasette-faiss/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simonw","download_url":"https://codeload.github.com/simonw/datasette-faiss/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230351171,"owners_count":18212789,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-01T18:29:11.989Z","updated_at":"2024-12-18T23:09:38.220Z","avatar_url":"https://github.com/simonw.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# datasette-faiss\n\n[![PyPI](https://img.shields.io/pypi/v/datasette-faiss.svg)](https://pypi.org/project/datasette-faiss/)\n[![Changelog](https://img.shields.io/github/v/release/simonw/datasette-faiss?include_prereleases\u0026label=changelog)](https://github.com/simonw/datasette-faiss/releases)\n[![Tests](https://github.com/simonw/datasette-faiss/workflows/Test/badge.svg)](https://github.com/simonw/datasette-faiss/actions?query=workflow%3ATest)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/datasette-faiss/blob/main/LICENSE)\n\nMaintain a [FAISS index](https://github.com/facebookresearch/faiss) for specified Datasette tables\n\nSee [Semantic search answers: Q\u0026A against documentation with GPT3 + OpenAI embeddings](https://simonwillison.net/2023/Jan/13/semantic-search-answers/) for background on this project.\n\n## Installation\n\nInstall this plugin in the same environment as Datasette.\n```bash\ndatasette install datasette-faiss\n```\n## Usage\n\nThis plugin creates in-memory 
FAISS indexes for specified tables on startup, using an `IndexFlatL2` [FAISS index type](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes).\n\nIf the tables are modified after the server has started, the indexes will not (yet) pick up those changes.\n\n### Configuration\n\nThe tables to be indexed must have `id` and `embedding` columns. The `embedding` column must be a `blob` containing an embedding: an array of floating point numbers encoded using the following Python function:\n\n```python\nimport struct\n\ndef encode(vector):\n    return struct.pack(\"f\" * len(vector), *vector)\n```\nYou can import that function from this package like so:\n```python\nfrom datasette_faiss import encode\n```\nYou can specify which tables should have indexes created for them by adding this to `metadata.json`:\n```json\n{\n    \"plugins\": {\n        \"datasette-faiss\": {\n            \"tables\": [\n                [\"blog\", \"embeddings\"]\n            ]\n        }\n    }\n}\n```\nEach table is an array listing the database name and the table name.\n\nIf you are using `metadata.yml`, the configuration should look like this:\n\n```yaml\nplugins:\n  datasette-faiss:\n    tables:\n    - [\"blog\", \"embeddings\"]\n```\n\n### SQL functions\n\nThe plugin makes six new SQL functions available within Datasette:\n\n#### faiss_search(database, table, embedding, k)\n\nReturns the `k` nearest neighbors to the `embedding` found in the specified database and table. For example:\n```sql\nselect faiss_search('blog', 'embeddings', (select embedding from embeddings where id = 3), 5)\n```\nThis will return a JSON array of the five IDs of the records in the `embeddings` table in the `blog` database that are closest to the specified embedding. 
The returned value looks like this:\n\n```json\n[\"1\", \"1249\", \"1011\", \"5\", \"10\"]\n```\nYou can use the SQLite `json_each()` function to turn that into a table-like sequence that you can join against.\n\nHere's an example query that does that:\n\n```sql\nwith related as (\n  select value from json_each(\n    faiss_search(\n      'blog',\n      'embeddings',\n      (select embedding from embeddings limit 1),\n      5\n    )\n  )\n)\nselect * from blog_entry, related\nwhere id = value\n```\n#### faiss_search_with_scores(database, table, embedding, k)\n\nTakes the same arguments as `faiss_search()`, but the return value is a JSON array of pairs, each with an ID and a score - something like this:\n\n```json\n[\n    [\"1\", 0.0],\n    [\"1249\", 0.21042244136333466],\n    [\"1011\", 0.29391372203826904],\n    [\"5\", 0.29505783319473267],\n    [\"10\", 0.31554925441741943]\n]\n```\n\n#### faiss_encode(json_vector)\n\nGiven a JSON array of floats, returns the binary embedding blob that can be used with the other functions:\n\n```sql\nselect faiss_encode('[2.4, 4.1, 1.8]')\n-- Returns a 12 byte blob\nselect hex(faiss_encode('[2.4, 4.1, 1.8]'))\n-- Returns 9A991940333383406666E63F\n```\n\n#### faiss_decode(vector_blob)\n\nThe opposite of `faiss_encode()`: given a binary embedding blob, returns a JSON array of floats.\n\n```sql\nselect faiss_decode(X'9A991940333383406666E63F')\n```\nReturns:\n```json\n[2.4000000953674316, 4.099999904632568, 1.7999999523162842]\n```\nNote that the values are stored as 32-bit floats, so they don't round-trip to exactly the same numbers.\n\n#### faiss_agg(id, embedding, compare_embedding, k)\n\nThis aggregate function can be used to find the `k` rows nearest to `compare_embedding` among the rows being aggregated, each identified by its `id`. 
For example:\n\n```sql\nselect faiss_agg(\n    id, embedding, (select embedding from embeddings where id = 3), 5\n) from embeddings\n```\nUnlike the `faiss_search()` function, this does not depend on the per-table index that the plugin creates when it first starts running. Instead, an index is built every time the aggregate function is run.\n\nThis means that it should only be used on smaller sets of values - once you get above 10,000 or so, this function is likely to become prohibitively slow.\n\nThe function returns a JSON array of IDs representing the `k` rows with the closest distance scores, like this:\n\n```json\n[1324, 344, 5562, 553, 2534]\n```\nYou can use the `json_each()` function to turn that into a table-like sequence that you can join against.\n\n[Try an example faiss_agg() query](https://datasette.simonwillison.net/simonwillisonblog?sql=with+last_500+as+%28%0D%0A++select%0D%0A++++id%2C%0D%0A++++embedding%0D%0A++from%0D%0A++++blog_entry_embeddings%0D%0A++order+by%0D%0A++++id+desc%0D%0A++limit%0D%0A++++500%0D%0A%29%2C+faiss+as+%28%0D%0A++select%0D%0A++++faiss_agg%28%0D%0A++++++id%2C%0D%0A++++++embedding%2C%0D%0A++++++%28%0D%0A++++++++select%0D%0A++++++++++embedding%0D%0A++++++++from%0D%0A++++++++++blog_entry_embeddings%0D%0A++++++++where%0D%0A++++++++++id+%3D+%3Aid%0D%0A++++++%29%2C%0D%0A++++++10%0D%0A++++%29+as+results%0D%0A++from%0D%0A++++last_500%0D%0A%29%2C%0D%0Aids+as+%28%0D%0A++select%0D%0A++++value+as+id%0D%0A++from%0D%0A++++json_each%28faiss.results%29%2C%0D%0A++++faiss%0D%0A%29%0D%0Aselect%0D%0A++blog_entry.id%2C%0D%0A++blog_entry.title%2C%0D%0A++blog_entry.created%0D%0Afrom%0D%0A++ids%0D%0A++join+blog_entry+on+ids.id+%3D+blog_entry.id\u0026id=8214).\n\n#### faiss_agg_with_scores(id, embedding, compare_embedding, k)\n\nThis is similar to the `faiss_agg()` aggregate function but it returns a list of pairs, each with an ID and the corresponding score - something that looks like this (if `k` was 2):\n\n```json\n[[2412, 
0.25], [1245, 24.25]]\n```\n[Try an example faiss_agg_with_scores() query](https://datasette.simonwillison.net/simonwillisonblog?sql=with+last_500+as+%28%0D%0A++select%0D%0A++++id%2C%0D%0A++++embedding%0D%0A++from%0D%0A++++blog_entry_embeddings%0D%0A++order+by%0D%0A++++id+desc%0D%0A++limit%0D%0A++++500%0D%0A%29%2C+ids_and_scores+as+%28%0D%0A++select%0D%0A++++faiss_agg_with_scores%28%0D%0A++++++id%2C%0D%0A++++++embedding%2C%0D%0A++++++%28%0D%0A++++++++select%0D%0A++++++++++embedding%0D%0A++++++++from%0D%0A++++++++++blog_entry_embeddings%0D%0A++++++++where%0D%0A++++++++++id+%3D+%3Aid%0D%0A++++++%29%2C+10%0D%0A++++%29+as+s%0D%0A++from%0D%0A++++last_500%0D%0A%29%2C%0D%0Aresults+as+%28%0D%0A++select%0D%0A++++json_extract%28value%2C+%27%24%5B0%5D%27%29+as+id%2C%0D%0A++++json_extract%28value%2C+%27%24%5B1%5D%27%29+as+score%0D%0A++from%0D%0A++++json_each%28ids_and_scores.s%29%2C%0D%0A++++ids_and_scores%0D%0A%29%0D%0Aselect%0D%0A++results.score%2C%0D%0A++blog_entry.id%2C%0D%0A++blog_entry.title%2C%0D%0A++blog_entry.created%0D%0Afrom%0D%0A++results%0D%0A++join+blog_entry+on+results.id+%3D+blog_entry.id\u0026id=8214).\n\n## Development\n\nTo set up this plugin locally, first check out the code. Then create a new virtual environment:\n```bash\ncd datasette-faiss\npython3 -m venv venv\nsource venv/bin/activate\n```\nNow install the dependencies and test dependencies:\n```bash\npip install -e '.[test]'\n```\nTo run the tests:\n```bash\npytest\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonw%2Fdatasette-faiss","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimonw%2Fdatasette-faiss","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonw%2Fdatasette-faiss/lists"}