https://github.com/do-me/embedding-algebra

Test scripts for common word embedding falsehoods like King - Man + Woman = Queen for state-of-the-art embedding models

Host: GitHub
URL: https://github.com/do-me/embedding-algebra
Owner: do-me
License: mit
Created: 2024-04-01T16:49:48.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-04-02T06:01:18.000Z (about 2 years ago)
Last Synced: 2025-04-13T13:15:43.126Z (about 1 year ago)
Language: Jupyter Notebook
Size: 25.4 KB
Stars: 6
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Embedding Algebra

## Background
With the rise of Word2Vec, its reduction to the formula `King - Man + Woman = Queen` fueled common falsehoods about embedding algebra.
The idea is that you can add and subtract vectors to obtain a new embedding reflecting the semantic change, like `King - Man = Royal` or `Woman + Royal = Queen`. The bigger picture is right: vector operations do reflect semantic changes, but only to a certain degree.
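
To make the idea concrete, here is a minimal sketch of the arithmetic using hand-made 3-D toy vectors. The `toy` dictionary and its axes (royalty, maleness, femaleness) are purely hypothetical and chosen so that the analogy works perfectly; real embeddings have no such clean axes, which is exactly why the formula breaks down in practice.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors (axes: royalty, maleness, femaleness).
# These are NOT real embeddings, they only illustrate the arithmetic.
toy = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "royal": np.array([1.0, 0.0, 0.0]),
}

# The classic analogy: remove "maleness", add "femaleness".
analogy = toy["king"] - toy["man"] + toy["woman"]

# Rank every word by cosine similarity to the analogy vector.
ranking = sorted(toy, key=lambda w: cosine(toy[w], analogy), reverse=True)
print(ranking)
```

In this idealized space the analogy vector coincides with `queen` and `man` ranks last. The findings below show how far real models deviate from this picture.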

Apparently there are certain exceptions to the rule, and some analogies work better (that is, as a human would expect) than others.
After reading this [Medium Article](https://blog.esciencecenter.nl/king-man-woman-king-9a7fd2935a85) from a couple of years ago, I thought I'd give it a go with current SOTA models.

## Findings
**tl;dr** confirmed again: `King - Man + Woman = Queen` is pretty much never true!

### General Findings
- As the Medium article claims, it's always `King` that is most similar. `Queen` doesn't even always come second!
- What's most interesting is that subtracted (negative) embeddings have the biggest impact, so the word `Man` will always rank last.
- Unsurprisingly, the instruction has a high impact on these results. For example, [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) performs better with an instruction (one that is not mentioned in the repo) than without.
- `King - Man` leads to `King` too, but second always comes `Royal` ✔️
- `Queen - Woman` leads to `Queen` too, but second comes `Prince` ✖️
- Averaging doesn't change much: `Woman + Royal` and `(Woman + Royal) / 2` lead to roughly the same results.
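
The averaging observation has a simple explanation for cosine similarity, which ignores vector length: dividing by 2 only rescales the vector. The sketch below uses random stand-in vectors (not real embeddings, all names are illustrative) to show that the cosine score is unchanged while the dot product is halved. Euclidean distance is not scale-invariant, so rankings under that metric can shift.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Random stand-ins for the embeddings of "woman", "royal" and one candidate word.
woman, royal, candidate = rng.normal(size=(3, 8))

summed = woman + royal
averaged = (woman + royal) / 2

# Cosine similarity only depends on direction, so both scores agree.
sim_sum = cosine(candidate, summed)
sim_avg = cosine(candidate, averaged)

# The dot product scales with vector length: halved values, same ordering.
dot_sum = float(np.dot(candidate, summed))
dot_avg = float(np.dot(candidate, averaged))

print(sim_sum, sim_avg)
print(dot_sum, dot_avg)
```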

### Gender Bias
I expected to see a gender bias when testing for `King + Queen`, for example `King` being more similar to the resulting embedding than `Queen` due to a bias in the training data (such as more mentions of kings than queens in our history books), but apparently that doesn't hold. Instead, it **highly depends on the model**:
- `mixedbread-ai/mxbai-embed-large-v1`:
```
Cosine similarity between 'queen' and analogy vector: 0.9102759957313538
Cosine similarity between 'king' and analogy vector: 0.909360408782959
```
- `BAAI/bge-base-en-v1.5`:
```
Cosine similarity between 'king' and analogy vector: 0.9067744016647339
Cosine similarity between 'queen' and analogy vector: 0.9067744016647339
```
So with `BAAI/bge-base-en-v1.5` the summed vector is equally similar to both of its summands, which is what you would expect mathematically whenever both embeddings have the same length (for example, because they are normalized). That's not the case for `mixedbread-ai/mxbai-embed-large-v1`.
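
A small numerical check of this effect, as a sketch with random stand-in vectors rather than real model output: when two vectors have the same norm, their sum has identical cosine similarity to both, and as soon as one is slightly longer, the sum leans towards it. Whether differing vector norms actually explain the `mixedbread-ai/mxbai-embed-large-v1` numbers above is an assumption that is not verified here.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
# Random stand-ins for the "king" and "queen" embeddings, scaled to unit length.
king, queen = rng.normal(size=(2, 16))
king /= np.linalg.norm(king)
queen /= np.linalg.norm(queen)

# Equal norms: the sum is exactly as similar to one summand as to the other.
total = king + queen
equal_king = cosine(king, total)
equal_queen = cosine(queen, total)

# Unequal norms: make one vector 1% longer, as an unnormalized model might.
longer_queen = 1.01 * queen
total_unequal = king + longer_queen
unequal_king = cosine(king, total_unequal)
unequal_queen = cosine(longer_queen, total_unequal)

print(equal_king, equal_queen)
print(unequal_king, unequal_queen)
```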

## Scripts
See the notebook in this repo to reproduce the results with any model and any equation. I included all three metrics, `Euclidean Distance`, `Dot Product` and `Cosine Similarity`, but keep in mind that most models have a preferred metric (often cosine similarity). These are the [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) results for `King - Man + Woman`:

```
Dot Product:
Dot product between 'king' and analogy vector: 206.02333068847656
Dot product between 'woman' and analogy vector: 179.44287109375
Dot product between 'princess' and analogy vector: 177.8961181640625
Dot product between 'queen' and analogy vector: 177.6486053466797
Dot product between 'castle' and analogy vector: 116.86325073242188
Dot product between 'prince' and analogy vector: 113.52368927001953
Dot product between 'horse' and analogy vector: 113.3372802734375
Dot product between 'person' and analogy vector: 110.6568374633789
Dot product between 'apple' and analogy vector: 107.81037139892578
Dot product between 'banana' and analogy vector: 103.31510925292969
Dot product between 'basketball' and analogy vector: 101.27586364746094
Dot product between 'clown' and analogy vector: 97.28660583496094
Dot product between 'football' and analogy vector: 96.44972229003906
Dot product between 'man' and analogy vector: 47.41835021972656

--------------------------------------------------------------------------------

Cosine Similarity:
Cosine similarity between 'king' and analogy vector: 0.7420865297317505
Cosine similarity between 'woman' and analogy vector: 0.6679535508155823
Cosine similarity between 'queen' and analogy vector: 0.6367943286895752
Cosine similarity between 'princess' and analogy vector: 0.6064033508300781
Cosine similarity between 'person' and analogy vector: 0.4240642786026001
Cosine similarity between 'castle' and analogy vector: 0.41255974769592285
Cosine similarity between 'horse' and analogy vector: 0.39906030893325806
Cosine similarity between 'prince' and analogy vector: 0.3888675272464752
Cosine similarity between 'apple' and analogy vector: 0.3804171681404114
Cosine similarity between 'banana' and analogy vector: 0.3553932309150696
Cosine similarity between 'basketball' and analogy vector: 0.3550359904766083
Cosine similarity between 'football' and analogy vector: 0.3462764620780945
Cosine similarity between 'clown' and analogy vector: 0.3232797086238861
Cosine similarity between 'man' and analogy vector: 0.17841237783432007

--------------------------------------------------------------------------------

Euclidean Distance (sorted by smallest distance, which indicates highest similarity):
Euclidean distance between 'king' and analogy vector: 12.40994930267334
Euclidean distance between 'woman' and analogy vector: 13.87999439239502
Euclidean distance between 'queen' and analogy vector: 14.593585968017578
Euclidean distance between 'princess' and analogy vector: 15.389604568481445
Euclidean distance between 'person' and analogy vector: 17.837038040161133
Euclidean distance between 'castle' and analogy vector: 18.484573364257812
Euclidean distance between 'horse' and analogy vector: 18.707866668701172
Euclidean distance between 'apple' and analogy vector: 18.974042892456055
Euclidean distance between 'prince' and analogy vector: 19.055479049682617
Euclidean distance between 'football' and analogy vector: 19.355772018432617
Euclidean distance between 'basketball' and analogy vector: 19.39596176147461
Euclidean distance between 'banana' and analogy vector: 19.529788970947266
Euclidean distance between 'clown' and analogy vector: 20.282350540161133
Euclidean distance between 'man' and analogy vector: 21.264333724975586
```
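
For reference, the three metrics in the listing above can be computed along these lines. This is a minimal NumPy sketch and not necessarily how the notebook does it: `rank_candidates` and the 2-D toy vectors are hypothetical. For a real model the vectors would come from the model itself, for example from the `encode` method of a sentence-transformers `SentenceTransformer`.

```python
import numpy as np

def rank_candidates(analogy, candidates):
    """Rank candidate vectors against an analogy vector with three metrics.

    Returns a dict mapping metric name to a list of (word, score) pairs,
    ordered from most to least similar.
    """
    analogy = np.asarray(analogy, dtype=float)
    scores = {"dot": [], "cosine": [], "euclidean": []}
    for word, vec in candidates.items():
        vec = np.asarray(vec, dtype=float)
        dot = float(np.dot(vec, analogy))
        norms = float(np.linalg.norm(vec) * np.linalg.norm(analogy))
        scores["dot"].append((word, dot))
        scores["cosine"].append((word, dot / norms))
        scores["euclidean"].append((word, float(np.linalg.norm(vec - analogy))))
    # Higher is better for dot product and cosine similarity ...
    scores["dot"].sort(key=lambda pair: pair[1], reverse=True)
    scores["cosine"].sort(key=lambda pair: pair[1], reverse=True)
    # ... while a smaller Euclidean distance means higher similarity.
    scores["euclidean"].sort(key=lambda pair: pair[1])
    return scores

# Hypothetical 2-D toy vectors, not real embeddings.
toy_candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 2.0]}
toy_ranks = rank_candidates([1.0, 0.0], toy_candidates)
for metric, pairs in toy_ranks.items():
    print(metric, [word for word, _ in pairs])
```

Even on this tiny example the three metrics produce three different orderings, which is why the listing above reports all of them.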

## PRs
Highly appreciated! Some automation would be good to create a nicely formatted markdown table for this README, listing the behavior of the most used embedding models. Would this even be something for MTEB?
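
As a possible starting point for such a PR, here is a rough sketch of a table generator. The function name `rankings_to_markdown` and the table layout are only suggestions. The example row reuses the cosine-similarity ranking for `King - Man + Woman` from the Scripts section.

```python
def rankings_to_markdown(rankings, top_k=3):
    """Format {model_name: [words ordered by similarity]} as a markdown table."""
    header = ["Model"] + [f"#{i}" for i in range(1, top_k + 1)]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * len(header)) + " |",
    ]
    for model, words in rankings.items():
        cells = [f"`{model}`"] + list(words[:top_k])
        # Pad with empty cells if a ranking has fewer than top_k entries.
        cells += [""] * (len(header) - len(cells))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)

# Cosine ranking for `King - Man + Woman`, taken from the results above.
table = rankings_to_markdown(
    {"mixedbread-ai/mxbai-embed-large-v1": ["king", "woman", "queen", "princess"]}
)
print(table)
```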