{"id":14383683,"url":"https://github.com/guenthermi/postgres-word2vec","last_synced_at":"2025-04-14T13:31:57.055Z","repository":{"id":119962535,"uuid":"109417199","full_name":"guenthermi/postgres-word2vec","owner":"guenthermi","description":"utils to use word embedding models like word2vec vectors in a PostgreSQL database ","archived":false,"fork":false,"pushed_at":"2021-10-06T12:47:01.000Z","size":939,"stargazers_count":143,"open_issues_count":0,"forks_count":19,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-03-28T03:11:57.313Z","etag":null,"topics":["inverted-index","knn-search","postgresql","product-quantization","similarity-search","word-embeddings","word2vec"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/guenthermi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-11-03T16:19:28.000Z","updated_at":"2025-01-20T10:39:09.000Z","dependencies_parsed_at":"2023-11-13T23:45:07.442Z","dependency_job_id":null,"html_url":"https://github.com/guenthermi/postgres-word2vec","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guenthermi%2Fpostgres-word2vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guenthermi%2Fpostgres-word2vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guenthermi%2Fpostgres-word2vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/guenthermi%2Fpostgres-word2vec/m
anifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/guenthermi","download_url":"https://codeload.github.com/guenthermi/postgres-word2vec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248888720,"owners_count":21178097,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["inverted-index","knn-search","postgresql","product-quantization","similarity-search","word-embeddings","word2vec"],"created_at":"2024-08-28T18:00:57.506Z","updated_at":"2025-04-14T13:31:56.512Z","avatar_url":"https://github.com/guenthermi.png","language":"C","funding_links":[],"categories":["C"],"sub_categories":[],"readme":"#  FREDDY: Fast Word Embeddings in Database Systems\n\nFREDDY is a system based on Postgres that uses word embeddings to exploit the rich information encoded in textual values. Database systems often contain many textual values that carry latent semantic information which cannot be exploited by standard SQL queries. We developed a Postgres extension which provides UDFs for word embedding operations to compare textual values according to their syntactic and semantic meaning.
\n\n## Word Embedding operations\n\n### Similarity Queries\n```\ncosine_similarity(float[], float[])\n```\n**Example**\n```\nSELECT keyword\nFROM keywords AS k\nINNER JOIN word_embeddings AS v ON k.keyword = v.word\nINNER JOIN word_embeddings AS w ON w.word = 'comedy'\nORDER BY cosine_similarity(w.vector, v.vector) DESC;\n```\n\n### Analogy Queries\n```\nanalogy(varchar, varchar, varchar)\n```\n**Example**\n```\nSELECT *\nFROM analogy('Francis_Ford_Coppola', 'Godfather', 'Christopher_Nolan');\n\n```\n### K Nearest Neighbour Queries\n\n```\nk_nearest_neighbour_ivfadc(float[], int)\nk_nearest_neighbour_ivfadc(varchar, int)\n```\n**Example**\n```\nSELECT m.title, t.word, t.squaredistance\nFROM movies AS m, k_nearest_neighbour_ivfadc(m.title, 3) AS t\nORDER BY m.title ASC, t.squaredistance DESC;\n```\n\n### K Nearest Neighbour Queries with Specific Output Set\n\n```\nknn_in_pq(varchar, int, varchar[]);\n```\n**Example**\n```\nSELECT * FROM\nknn_in_pq('Godfather', 5, ARRAY(SELECT title FROM movies));\n```\n\n### K Nearest Neighbour Join Queries\n\n```\nknn_join(varchar[], int, varchar[]);\n```\n**Example**\n```\nSELECT *\nFROM knn_join(ARRAY(SELECT title FROM movies), 5, ARRAY(SELECT title FROM movies));\n```\n\n### Grouping\n\n```\ngroups(varchar[], varchar[])\n```\n**Example**\n```\nSELECT *\nFROM groups(ARRAY(SELECT title FROM movies), '{Europe,America}');\n```\n\n## Indexes\n\nWe implemented several index structures to accelerate word embedding operations. One index is based on [product quantization](http://ieeexplore.ieee.org/abstract/document/5432202/) and one on IVFADC (inverted file system with asymmetric distance calculation). Product quantization provides a fast approximated distance calculation. 
IVFADC is even faster and provides a non-exhaustive approach which also uses product quantization.\nIn addition to that, an inverted product quantization index for kNN-Join operations can be created.\n\n\u003c!-- ![time measurement](evaluation/time_measurment.png) --\u003e\n\n### Post verification\n\nThe results of kNN queries can be improved by using post verification. The idea is to obtain a larger result set (more than k results) with an approximated kNN search and then run an exact search on those results.\n\nTo use post verification within a search process, use `k_nearest_neighbour_pq_pv` and `k_nearest_neighbour_ivfadc_pv`.\n\n**Example**\n```\nSELECT m.title, t.word, t.squaredistance\nFROM movies AS m, k_nearest_neighbour_ivfadc_pv(m.title, 3, 500) AS t\nORDER BY m.title ASC, t.squaredistance DESC;\n```\n\nThe effect of post verification on the response time and the precision of the results is shown below.\n\n![post verification](evaluation/postverification.png)\n\n### Parameters of the kNN-Join operation\nPrecision and execution time of the kNN-Join operation depend on the parameters `alpha` and `pvf`.\nThe selectivity `alpha` determines the factor of pre-filtering. 
Higher values correspond to higher execution time and higher precision.\nThe kNN-Join can also use post verification, which is configurable by the post verification factor `pvf`.\nTo enable post verification, set the method flag (0: approximated distance calculation; 1: exact distance calculation; 2: post verification).\nThis can be done as follows:\n```\nSELECT set_method_flag(2);\n```\nThe parameters `alpha` and `pvf` can be set in a similar way:\n```\nSELECT set_pvf(20);\nSELECT set_alpha(100);\n```\n\n### Evaluation of PQ and IVFADC\n| Method                           | Response Time | Precision     |\n| ---------------------------------| ------------- | ------------- |\n| Exact Search                     | 8.79s         | 1.0           |\n| Product Quantization             | 1.06s         | 0.38          |\n| IVFADC                           | 0.03s         | 0.35          |\n| IVFADC (batchwise)               | 0.01s         | 0.35          |\n| Product Quantization (postverif.)| 1.29s         | 0.87          |\n| IVFADC (postverif.)              | 0.26s         | 0.65          |\n\n**Parameters:**\n* Number of subvectors per vector: 12\n* Number of centroids for fine quantization (PQ and IVFADC): 1024\n* Number of centroids for coarse quantization: 1000\n\n### Evaluation of kNN-Join\nAn evaluation of the kNN-Join performance is shown below. 
The baseline in the diagram is a kNN-Join based on product quantization search which is implemented in the `pq_search_in_batch` function.\nThe measurements are done with different alpha values and different post verification factors (pvf).\nEvery color encodes a different alpha value.\nThe different symbols encode the different distance calculation methods.\nFor post verification, measurements are done with different values of pvf.\n\n ![kNN Join Evaluation](evaluation/time_precision_eval_gn.png)\n\n **Parameters:**  \n Query Vector Size: 5,000  \n Target Vector Size: 100,000  \n K: 5  \n PVF-Values: 10, 20, ..., 100  \n\n## Setup\nFirst, you need to set up a [Postgres server](https://www.postgresql.org/). You have to install [faiss](https://github.com/facebookresearch/faiss) and a few other Python libraries to run the import scripts.\n\nTo build the extension, you first need to install the postgresql-server-dev package via your package manager. Then, you can switch to the \"freddy_extension\" folder. Here you can run `sudo make install` to build the shared library and install the extension into the Postgres server. Afterwards, you can add the extension in PSQL by running `CREATE EXTENSION freddy;`\n\n## Index creation\nTo use the extension you have to provide word embeddings. The recommendation here is the [word2vec dataset from Google News](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing). The scripts for the index creation process are in the \"index_creation\" folder. You have to download the dataset and put it into a \"vectors\" folder, which should be created in the root folder of the repository. After that, you can transform it into a text format by running the \"transform_vecs.py\" script. 
If you want to use another vector dataset, you have to change the path constants in the script.\nPlease note also that you have to create the extension before you can execute the index creation scripts.\n\n```\nmkdir vectors\nwget -c \"https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz\" -P vectors\ngzip --decompress vectors/GoogleNews-vectors-negative300.bin.gz\ncd index_creation\npython3 transform_vecs.py\n```\n\nThen you can fill the database with the vectors using the \"vec2database.py\" script. First, however, you need to provide information like database name, username, password etc. To do so, change the properties in the \"config/db_config.json\" file.\n\nAfter that, you can use the \"vec2database.py\" script to add the word vectors to the database. You might have to adapt the configuration files \"word_vecs.json\" and \"word_vecs_norm.json\" for the word vector tables.\nExecute the following code (this can take a while):\n\n```\npython3 vec2database.py config/vecs_config.json\npython3 vec2database.py config/vecs_norm_config.json\n```\n\nTo create the product quantization index you have to execute \"pq_index.py\":\n\n```\npython3 pq_index.py config/pq_config.json\n```\n\nThe IVFADC index tables can be created with \"ivfadc.py\":\n\n```\npython3 ivfadc.py config/ivfadc_config.json\n```\n\nFor the kNN-Join operation, an index structure can be created with \"ivpq.py\":\n\n```\npython3 ivpq.py config/ivpq_config.json\n```\n\nAfter all index tables are created, you might execute `CREATE EXTENSION freddy;` a second time. 
To provide the table names of the index structures for the extension, you can use the `init` function in the PSQL console (if you used the default names, this might not be necessary). Replace the default names with the names defined in the JSON configuration files:\n\n```\nSELECT init('google_vecs', 'google_vecs_norm', 'pq_quantization', 'pq_codebook', 'fine_quantization', 'coarse_quantization', 'residual_codebook', 'fine_quantization_ivpq', 'codebook_ivpq', 'coarse_quantization_ivpq')\n```\n\n\n**Statistics:**\nIn addition to the index structures, the kNN-Join operation uses statistics about the distribution of the index vectors over index partitions.\nThis statistical information is essential for the search operation.\nFor the `word` column of the `google_vecs_norm` table (table with normalized word vectors) statistics can be created by the following SQL command:\n```\nSELECT create_statistics('google_vecs_norm', 'word', 'coarse_quantization_ivpq')\n```\nThis will produce a table `stat_google_vecs_norm_word` with statistical information.\nIn addition to that, one can create statistics for other text columns in the database, which can improve the performance of the kNN-Join operation.\nThe statistics table used by the operation can be selected with the `set_statistics_table` function:\n```\nSELECT set_statistics_table('stat_google_vecs_norm_word')\n```\n\n## Troubleshooting\n\nThe current version of the extension is updated to work with PostgreSQL 12.\nAn older version was implemented for version 10.\nFor this version check out [commit 705c1c62e83a32cba837a167ec7aabfbf7c097d9](https://github.com/guenthermi/postgres-word2vec/tree/5e469aa59d0f322980ae37683d390b0457119300).\nIf you run into problems while setting up the extension, you can create an issue in the repository.\n\n\n## References\n[FREDDY: Fast Word Embeddings in Database Systems](https://dl.acm.org/citation.cfm?id=3183717)\n```\n@inproceedings{gunther2018freddy,\n  title={FREDDY: Fast Word Embeddings in Database 
Systems},\n  author={G{\\\"u}nther, Michael},\n  booktitle={Proceedings of the 2018 International Conference on Management of Data},\n  pages={1817--1819},\n  year={2018},\n  organization={ACM}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguenthermi%2Fpostgres-word2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fguenthermi%2Fpostgres-word2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fguenthermi%2Fpostgres-word2vec/lists"}