{"id":13531645,"url":"https://github.com/netrasys/pgANN","last_synced_at":"2025-04-01T19:32:30.518Z","repository":{"id":157515659,"uuid":"194873113","full_name":"netrasys/pgANN","owner":"netrasys","description":"Fast Approximate Nearest Neighbor (ANN) searches with a PostgreSQL database. ","archived":false,"fork":false,"pushed_at":"2024-01-22T16:58:18.000Z","size":22,"stargazers_count":288,"open_issues_count":2,"forks_count":15,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-01-27T23:06:58.466Z","etag":null,"topics":["ann","approximate-nearest-neighbor-search","nearest-neighbor-search","nearest-neighbors","postgres","vectors"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/netrasys.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-07-02T13:53:56.000Z","updated_at":"2024-01-27T23:06:59.106Z","dependencies_parsed_at":"2024-01-27T23:06:59.010Z","dependency_job_id":"f240e7e6-315c-4bdd-a8b9-d04e942daf08","html_url":"https://github.com/netrasys/pgANN","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netrasys%2FpgANN","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netrasys%2FpgANN/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netrasys%2FpgANN/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netrasys%2FpgANN/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/netrasys","download_url":"https://codeload.github.com/netrasys/pgANN/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246700720,"owners_count":20819923,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ann","approximate-nearest-neighbor-search","nearest-neighbor-search","nearest-neighbors","postgres","vectors"],"created_at":"2024-08-01T07:01:04.562Z","updated_at":"2025-04-01T19:32:29.629Z","avatar_url":"https://github.com/netrasys.png","language":"Python","funding_links":[],"categories":["Python","Awesome Vector Search Engine","Libraries and Tools"],"sub_categories":["Standalone Service","2023"],"readme":"# pgANN\n\nApproximate Nearest Neighbor (ANN) searches using a PostgreSQL backend. \n\n## Background\n\nApproximate Nearest Neighbor approaches are a powerful tool for various AI/ML tasks. However, many existing tools ([faiss](https://github.com/facebookresearch/faiss), [Annoy](https://github.com/spotify/annoy), [Nearpy](https://github.com/pixelogik/NearPy) etc.) are \"in memory\", i.e. the vectors need to be loaded into RAM and a model trained from them. Furthermore, such models cannot be updated once trained, i.e. any CRUD operation necessitates a fresh training run.\n\nThe challenge for us was to: \n- hold extremely large datasets in memory and \n- support CRUDs, which is challenging in an \"online\" environment where fresh data is continuously accumulated.\n\nWe are open-sourcing a simple but effective approach that provides ANN searches using the powerful PostgreSQL database. 
At [Netra](http://netra.io) we use this tool internally for managing \u0026 searching our image collections for further processing and/or feeding them into our deep learning models. We consistently see `sub-second` response times on tables of a few million rows on a 32 GB / 8 vCPU Ubuntu 16 box. We hope this is of use to the AI community. \n\nFeedback and PRs very welcome!\n\n## Advantages\n\n- leverages postgres database queries \u0026 indexes (no additional tooling needed)\n- associated metadata is fetched along with the \"neighbors\" (i.e. fewer moving parts)\n- no re-training needed for CRUDs due to query time ranking (\"online mode\")\n- should scale with tried \u0026 tested database scaling techniques (partitioning etc.)\n\n## Challenges\n\n- `cube` type doesn't seem to work for \u003e [100 dimensions](https://www.postgresql.org/docs/current/cube.html#AEN176262), so we need to perform dimensionality reduction. An example of dimensionality reduction is included in the sample code\n- haven't tested with sparse vectors, but in theory it should work decently with appropriate dimensionality reduction techniques\n- pgANN might *not* perform as accurately as some of the better known approaches, but you can use pgANN to fetch a subset of (say) a few thousand rows and then `rerank` based on your favorite metric. Unfortunately, there are no easy wins in ANN approaches; hopefully pgANN gets you a \"good enough\" subset for your reranking.\n\n**Update Oct 2, 2019: There is a docker instance available [here](https://hub.docker.com/r/expert/postgresql-large-cube) that claims \"Postgres DB docker image with increased limit for CUBE datatype (set to 2048)\". I haven't had a chance to try it, but this seems pretty useful (Credit due to @andrey-avdeev for this suggestion.)**\n\n## Requirements\n- Postgres 10.x or higher (we haven't tested on PG 9.6, but `cube`, `GIST` and distance operators are available on 9.6+, so it *might* work)\n- Cube extension from Postgres\n\n## Setup\n\n1. 
Make sure you are logged into pg as a superuser and run:\n`create extension cube;`\n\n2. We shall use the example of an `images` table to illustrate the approach. The images table stores the url, any metadata tags, the vectors and the embeddings for those vectors. You can of course modify the table structure to your needs.\n\n```\nCREATE TABLE images(\n   id serial PRIMARY KEY,\n   image_url text UNIQUE NOT NULL,\n   tags text,\n   vectors double precision[],\n   embeddings cube   \n);\n```\n3. Create a GIST index on the embeddings column, which stores a 100-dimensional embedding of the original vector:\n\n`create index ix_gist on images using GIST (embeddings);`\n\n_Note: you might need to create other indexes (b-tree etc.) on other fields for efficient searching \u0026 sorting, but that's outside our scope_\n\n## Populating db\nNow we are ready to populate the database with vectors and associated embeddings. \n\n_Note: we are using the [dataset](https://dataset.readthedocs.io/en/latest/) library for interfacing with postgres, but this should work just as well with your favorite driver (psycopg2 etc.)_\n\n```\nimport dataset\n\ndb = dataset.connect('postgresql://user:pass@localhost:5432/yourdb')\ntbl = db['images']\n\nfor image in images:\n   vect = get_image_vector(image) # \u003c-- use imagenet or such model to generate a vector from one of the final layers\n   emb_vect = get_embedding(vect) # \u003c-- pg fails beyond 100 dimensions for cube, so reduce dimensionality\n   emb_string = \"({0})\".format(','.join(\"%10.8f\" % x for x in emb_vect)) # \u003c-- text form of the cube value\n   row_dict = {\"image_url\": image} # \u003c-- assumption: each item in images is the image's url\n   row_dict[\"embeddings\"] = emb_string\n   row_dict[\"vectors\"] = vect.tolist()\n   row_id = tbl.insert(row_dict)\n```\n\n## Querying\nWe can start querying the database even as population is in progress\n\n```\n    query_vector = [...]\n    query_emb = get_embedding(query_vector)\n    thresh = 0.25 # \u003c-- a larger threshold admits more candidate rows at the expense of query time\n\t\n\n    print (\"[+] doing ANN 
search:\")\n    emb_string = \"'({0})'\".format(','.join(\"%10.8f\" % x for x in query_emb))\n    # substitute the threshold into the query; the url column is named image_url in the images table\n    sql = \"select id,image_url,tags from images where embeddings \u003c-\u003e cube({0}) \u003c {1} order by embeddings \u003c-\u003e cube({0}) asc limit 10\".format(emb_string, thresh)\n    results = db.query(sql)\n    for result in results:\n      print(result['image_url'], result['tags'])\n  \n  ```\n  \n  Note: the pgsql cube extension supports multiple distance operators. Here is a quick summary:\n  \n - `\u003c-\u003e`: Euclidean distance, \n - `\u003c=\u003e`: Chebyshev (L-inf metric) distance, and \n - `\u003c#\u003e`: Taxicab distance.\n  \n  More details [here](https://www.postgresql.org/docs/10/cube.html).\n  \n ## Improving Performance\n \n Here are some steps you can use to squeeze more performance out of pgANN:\n \n- Reduce dimensionality (using `UMAP`, `t-SNE` or `\u003cinsert your favorite approach\u003e`) and use the result as the embedding\n- Horizontally partition data across multiple tables and use parallelism to combine results\n- Use some hashing technique (`LSH/MinHash` for example) to create a common signature for each row and use it as a filter during your query (reducing the lookup space)\n- Try different distance operators (`Euclidean` vs `Chebyshev` vs `Taxicab`)\n- Remove sorting from your query, e.g. `sql = \"select id,embeddings from images where embeddings \u003c-\u003e cube({0}) \u003c0.5\".format((emb_string))`\n\nIn these cases, you will need to fetch a significant N from the DB query and then re-rank based on your favorite similarity metric. Some combination of these might get you to query times you can live with. 
Unfortunately ANN is largely ignored by the AI/DL community and there is significant research that needs to happen.\n \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetrasys%2FpgANN","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnetrasys%2FpgANN","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetrasys%2FpgANN/lists"}