{"id":20478685,"url":"https://github.com/do-me/embedding-algebra","last_synced_at":"2025-09-13T16:52:51.033Z","repository":{"id":230943481,"uuid":"780520615","full_name":"do-me/embedding-algebra","owner":"do-me","description":"Test scripts for common word embedding falsehoods like King - Man + Woman = Queen for state-of-the-art embedding model","archived":false,"fork":false,"pushed_at":"2024-04-02T06:01:18.000Z","size":26,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-13T13:15:43.126Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/do-me.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-01T16:49:48.000Z","updated_at":"2024-11-18T07:06:49.000Z","dependencies_parsed_at":"2024-11-15T15:40:33.314Z","dependency_job_id":"aab29015-94fc-4462-bb4c-fc8c07de4383","html_url":"https://github.com/do-me/embedding-algebra","commit_stats":null,"previous_names":["do-me/embedding-algebra"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fembedding-algebra","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fembedding-algebra/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fembedding-algebra/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/do-me%2Fembedding-algebra/manifests
","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/do-me","download_url":"https://codeload.github.com/do-me/embedding-algebra/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248717238,"owners_count":21150389,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T15:38:41.433Z","updated_at":"2025-04-13T13:15:47.721Z","avatar_url":"https://github.com/do-me.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Embedding Algebra\n\n## Background \nWith the rise of Word2Vec it's reduction to the formula `King - Man + Woman = Queen` fueled common falsehoods about embedding algebra.\nThe idea is that you can add and substract vectors to obtain a new embedding reflecting the semantic change, like `King - Man = Royal` or `Woman + Royal = Queen`. The bigger picture is right that the vector operations reflect the semantic changes but only to a certain degree. \n\nApparently there are certain exceptions to the rule and some analogies work better (=as a human would expect) than others. \nReading this [Medium Article](https://blog.esciencecenter.nl/king-man-woman-king-9a7fd2935a85) from a couple of years ago I thought I'd give it a go with current SOTA models.\n\n## Findings \n**tl;dr** confirmed again: `King - Man + Woman = Queen` is pretty much never true!\n\n### General Findings\n- As the Medium article claims, it's always `King` that is most similar. `Queen` doesn't even come second always! 
\n- What's most interesting is that negative embeddings have the biggest impact, so the word `Man` always ranks last. \n- Unsurprisingly, the instruction has a high impact on these results: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), for example, performs better with an instruction (which is not mentioned in its repo) than without.\n- `King - Man` leads to `King` too, but `Royal` always comes second ✔️\n- `Queen - Woman` leads to `Queen` too, but `Prince` comes second ✖️\n- Averaging doesn't change much: `Woman + Royal` and `(Woman + Royal) / 2` lead to roughly the same results.\n\n### Gender Bias\nI expected to see a gender bias when testing for `King + Queen`, i.e. that `King` would be more similar to the resulting embedding than `Queen` due to a bias in the training data (such as more mentions of kings than queens in our history books), but apparently that doesn't hold. Instead, it **highly depends on the model**:\n- `mixedbread-ai/mxbai-embed-large-v1`:\n```\nCosine similarity between 'queen' and analogy vector: 0.9102759957313538\nCosine similarity between 'king'  and analogy vector: 0.909360408782959\n```\n- `BAAI/bge-base-en-v1.5`\n```\nCosine similarity between 'king'  and analogy vector: 0.9067744016647339\nCosine similarity between 'queen' and analogy vector: 0.9067744016647339\n```\nSo while `BAAI/bge-base-en-v1.5` behaves as one would expect mathematically, with the summed vector equally similar to both of its summands (which holds whenever the two summands have the same norm), that's not the case for `mixedbread-ai/mxbai-embed-large-v1`.\n\n## Scripts\nSee the notebook in this repo to reproduce the results with any model and any equation. I included all three metrics, `Euclidean Distance`, `Dot Product` and `Cosine Similarity`, but keep in mind that most models have a preferred distance metric (often cosine distance). 
These are the [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) results for `King - Man + Woman`:\n\n```\nDot Product:\nDot product between 'king' and analogy vector: 206.02333068847656\nDot product between 'woman' and analogy vector: 179.44287109375\nDot product between 'princess' and analogy vector: 177.8961181640625\nDot product between 'queen' and analogy vector: 177.6486053466797\nDot product between 'castle' and analogy vector: 116.86325073242188\nDot product between 'prince' and analogy vector: 113.52368927001953\nDot product between 'horse' and analogy vector: 113.3372802734375\nDot product between 'person' and analogy vector: 110.6568374633789\nDot product between 'apple' and analogy vector: 107.81037139892578\nDot product between 'banana' and analogy vector: 103.31510925292969\nDot product between 'basketball' and analogy vector: 101.27586364746094\nDot product between 'clown' and analogy vector: 97.28660583496094\nDot product between 'football' and analogy vector: 96.44972229003906\nDot product between 'man' and analogy vector: 47.41835021972656\n\n--------------------------------------------------------------------------------\n\nCosine Similarity:\nCosine similarity between 'king' and analogy vector: 0.7420865297317505\nCosine similarity between 'woman' and analogy vector: 0.6679535508155823\nCosine similarity between 'queen' and analogy vector: 0.6367943286895752\nCosine similarity between 'princess' and analogy vector: 0.6064033508300781\nCosine similarity between 'person' and analogy vector: 0.4240642786026001\nCosine similarity between 'castle' and analogy vector: 0.41255974769592285\nCosine similarity between 'horse' and analogy vector: 0.39906030893325806\nCosine similarity between 'prince' and analogy vector: 0.3888675272464752\nCosine similarity between 'apple' and analogy vector: 0.3804171681404114\nCosine similarity between 'banana' and analogy vector: 0.3553932309150696\nCosine similarity between 
'basketball' and analogy vector: 0.3550359904766083\nCosine similarity between 'football' and analogy vector: 0.3462764620780945\nCosine similarity between 'clown' and analogy vector: 0.3232797086238861\nCosine similarity between 'man' and analogy vector: 0.17841237783432007\n\n--------------------------------------------------------------------------------\n\nEuclidean Distance (sorted by smallest distance, which indicates highest similarity):\nEuclidean distance between 'king' and analogy vector: 12.40994930267334\nEuclidean distance between 'woman' and analogy vector: 13.87999439239502\nEuclidean distance between 'queen' and analogy vector: 14.593585968017578\nEuclidean distance between 'princess' and analogy vector: 15.389604568481445\nEuclidean distance between 'person' and analogy vector: 17.837038040161133\nEuclidean distance between 'castle' and analogy vector: 18.484573364257812\nEuclidean distance between 'horse' and analogy vector: 18.707866668701172\nEuclidean distance between 'apple' and analogy vector: 18.974042892456055\nEuclidean distance between 'prince' and analogy vector: 19.055479049682617\nEuclidean distance between 'football' and analogy vector: 19.355772018432617\nEuclidean distance between 'basketball' and analogy vector: 19.39596176147461\nEuclidean distance between 'banana' and analogy vector: 19.529788970947266\nEuclidean distance between 'clown' and analogy vector: 20.282350540161133\nEuclidean distance between 'man' and analogy vector: 21.264333724975586\n```\n\n## PRs\nHighly appreciated! Some automation would be good to create a nicely formatted markdown table for this readme, listing the behavior of the most used embedding models. 
Would this even be something for MTEB?\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdo-me%2Fembedding-algebra","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdo-me%2Fembedding-algebra","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdo-me%2Fembedding-algebra/lists"}