{"id":13462929,"url":"https://github.com/koaning/whatlies","last_synced_at":"2025-03-25T06:31:25.349Z","repository":{"id":39712525,"uuid":"242405960","full_name":"koaning/whatlies","owner":"koaning","description":"Toolkit to help understand \"what lies\" in word embeddings. Also benchmarking! ","archived":true,"fork":false,"pushed_at":"2023-02-06T15:59:26.000Z","size":11933,"stargazers_count":472,"open_issues_count":4,"forks_count":50,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-03-24T11:14:02.361Z","etag":null,"topics":["embeddings","nlp","visualisations"],"latest_commit_sha":null,"homepage":"https://koaning.github.io/whatlies/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/koaning.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-22T20:19:30.000Z","updated_at":"2025-03-14T09:39:51.000Z","dependencies_parsed_at":"2023-02-19T09:46:21.710Z","dependency_job_id":null,"html_url":"https://github.com/koaning/whatlies","commit_stats":null,"previous_names":["rasahq/whatlies"],"tags_count":26,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koaning%2Fwhatlies","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koaning%2Fwhatlies/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koaning%2Fwhatlies/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koaning%2Fwhatlies/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/koaning","download_url":"https://codeload.github.com/koaning/whatlies/tar.gz/re
fs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245413754,"owners_count":20611353,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","nlp","visualisations"],"created_at":"2024-07-31T13:00:41.272Z","updated_at":"2025-03-25T06:31:23.335Z","avatar_url":"https://github.com/koaning.png","language":"Python","funding_links":[],"categories":["Uncategorized"],"sub_categories":["Uncategorized"],"readme":"![](https://img.shields.io/pypi/v/whatlies)\n![](https://img.shields.io/pypi/pyversions/whatlies)\n![](https://img.shields.io/github/license/koaning/whatlies)\n[![Downloads](https://pepy.tech/badge/whatlies)](https://pepy.tech/project/whatlies)\n\n# archival notice\n\nThis was a fun project for a while, but it's become a pain to maintain all the different backends. If you're looking for visualisation tools, check https://github.com/koaning/cluestar and consider https://github.com/koaning/cluestar if you're interested in the embeddings going forward. \n\n# whatlies \n\nA library that tries to help you to understand (note the pun).\n\n\u003e \"What lies in word embeddings?\"\n\nThis small library offers tools to make visualisation easier of both\nword embeddings as well as operations on them.\n\n## Produced\n\n\u003cimg src=\"docs/square-logo.svg\" width=75 height=75 align=\"right\"\u003e\n\nThis project was initiated at [Rasa](https://rasa.com) as a by-product of\nour efforts in the developer advocacy and research teams. 
The project is \nmaintained by [koaning](https://github.com/koaning) in order to support more use-cases. \n\n## Features\n\nThis library has tools to help you understand what lies in word embeddings. This includes:\n\n- simple tools to create (interactive) visualisations\n- support for many language backends including spaCy, fasttext, tfhub, huggingface and bpemb\n- lightweight scikit-learn featurizer support for all these backends\n\n## Installation\n\nYou can install the package via pip:\n\n```bash\npip install whatlies\n```\n\nThis will install the base dependencies. Depending on the\ntransformers and language backends that you'll be using you\nmay want to install more. Here are some of the possible installation\noptions you could go for.\n\n```bash\npip install whatlies[spacy]\npip install whatlies[tfhub]\npip install whatlies[transformers]\n```\n\nIf you want it all you can also install via:\n\n```bash\npip install whatlies[all]\n```\n\nNote that this will install dependencies but it\n**will not** install all the language models you might\nwant to visualise. For example, you might still\nneed to manually download spaCy models if you intend\nto use that backend.\n\n## Getting Started\n\nMore in-depth getting started guides can be found on the [documentation page](https://koaning.github.io/whatlies/).\n\n## Examples\n\nThe idea is that you can load embeddings from a language backend\nand use mathematical operations on them.\n\n```python\nfrom whatlies import EmbeddingSet\nfrom whatlies.language import SpacyLanguage\n\nlang = SpacyLanguage(\"en_core_web_md\")\nwords = [\"cat\", \"dog\", \"fish\", \"kitten\", \"man\", \"woman\",\n         \"king\", \"queen\", \"doctor\", \"nurse\"]\n\nemb = EmbeddingSet(*[lang[w] for w in words])\nemb.plot_interactive(x_axis=emb[\"man\"], y_axis=emb[\"woman\"])\n```\n\n![](docs/gif-zero.gif)\n\nYou can even do fancy operations, like projecting onto and away\nfrom vector embeddings! 
You can perform these on embeddings as\nwell as sets of embeddings.  In the example below we attempt\nto filter away gender bias using linear algebra operations.\n\n```python\norig_chart = emb.plot_interactive('man', 'woman')\n\nnew_ts = emb | (emb['king'] - emb['queen'])\nnew_chart = new_ts.plot_interactive('man', 'woman')\n```\n\n![](docs/gif-one.gif)\n\nThere are also transformations like **pca** and **umap**.\n\n```python\nfrom whatlies.transformers import Pca, Umap\n\norig_chart = emb.plot_interactive('man', 'woman')\npca_plot = emb.transform(Pca(2)).plot_interactive()\numap_plot = emb.transform(Umap(2)).plot_interactive()\n\npca_plot | umap_plot\n```\n\n![](docs/gif-two.gif)\n\n## Scikit-Learn Support\n\nEvery language backend in this library is available as a scikit-learn featurizer as well.\n\n```python\nimport numpy as np\nfrom whatlies.language import BytePairLanguage\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.linear_model import LogisticRegression\n\npipe = Pipeline([\n    (\"embed\", BytePairLanguage(\"en\")),\n    (\"model\", LogisticRegression())\n])\n\nX = [\n    \"i really like this post\",\n    \"thanks for that comment\",\n    \"i enjoy this friendly forum\",\n    \"this is a bad post\",\n    \"i dislike this article\",\n    \"this is not well written\"\n]\n\ny = np.array([1, 1, 1, 0, 0, 0])\n\npipe.fit(X, y)\n```\n\n## Documentation\n\nTo learn more and for a getting started guide, check out the [documentation](https://koaning.github.io/whatlies/).\n\n## Similar Projects\n\nThere are some similar projects out there, and we figured it fair to mention and compare them here.\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003cb\u003eJulia Bazińska \u0026 Piotr Migdal Web App\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003eThe original inspiration for this project came from \u003ca href=\"https://lamyiowce.github.io/word2viz/\"\u003ethis web app\u003c/a\u003e\n    and \u003ca href=\"https://www.youtube.com/watch?v=AGgCqpouKSs\"\u003ethis pydata 
talk\u003c/a\u003e. It is a web app that takes a\n    while to load but it is really fun to play with. The goal of this project is to make it easier to make similar\n    charts from jupyter using different language backends.\u003c/p\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003eTensorflow Projector\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003eFrom google there's the \u003ca href=\"https://projector.tensorflow.org/\"\u003etensorflow projector project\u003c/a\u003e. It offers\n    highly interactive 3d visualisations as well as some transformations via tensorboard.\u003c/p\u003e\n    \u003cul\u003e\n    \u003cli\u003eThe tensorflow projector will create projections in tensorboard, which you can also load\n    into a jupyter notebook, but whatlies makes visualisations directly.\u003c/li\u003e\n    \u003cli\u003eThe tensorflow projector supports interactive 3d visuals, which whatlies currently doesn't.\u003c/li\u003e\n    \u003cli\u003eWhatlies offers lego bricks that you can chain together to get a visualisation started. This\n    also means that you're more flexible when it comes to transforming data before visualising it.\u003c/li\u003e\n    \u003c/ul\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n    \u003csummary\u003e\u003cb\u003eParallax\u003c/b\u003e\u003c/summary\u003e\n    \u003cp\u003eFrom Uber AI Labs there's \u003ca href=\"https://github.com/uber-research/parallax\"\u003eparallax\u003c/a\u003e which is described\n    in a paper \u003ca href=\"https://arxiv.org/abs/1905.12099\"\u003ehere\u003c/a\u003e. There's a common mindset in the two tools;\n    the goal is to use arbitrary user-defined projections to understand embedding spaces better.\n    That said, some differences are worth mentioning.\u003c/p\u003e\n    \u003cul\u003e\n    \u003cli\u003eParallax relies on bokeh as a visualisation backend and offers a lot of visualisation types\n    (like radar plots). 
Whatlies uses altair and tries to stick to simple scatter charts.\n    Altair can export interactive html/svg but it will not scale as well if you're drawing\n    many points at the same time.\u003c/li\u003e\n    \u003cli\u003eParallax is meant to be run as a stand-alone app from the command line while Whatlies is\n    meant to be run from the jupyter notebook.\u003c/li\u003e\n    \u003cli\u003eParallax gives a full user interface while Whatlies offers lego bricks that you can chain\n    together to get a visualisation started.\u003c/li\u003e\n    \u003cli\u003eWhatlies relies on language backends (like spaCy, huggingface) to fetch word embeddings.\n    Parallax allows you to instead fetch raw files on disk.\u003c/li\u003e\n    \u003cli\u003eParallax has been around for a while; Whatlies is newer and therefore more experimental.\u003c/li\u003e\n    \u003c/ul\u003e\n\u003c/details\u003e\n\n## Local Development\n\nIf you want to develop locally you can start by running this command.\n\n```bash\nmake develop\n```\n\n### Documentation\n\nThis is generated via\n\n```\nmake docs\n```\n\n### Citation\n\nPlease use the following citation if you found `whatlies` helpful in your work (find the `whatlies` paper [here](https://www.aclweb.org/anthology/2020.nlposs-1.8)):\n```\n@inproceedings{warmerdam-etal-2020-going,\n    title = \"Going Beyond {T}-{SNE}: Exposing whatlies in Text Embeddings\",\n    author = \"Warmerdam, Vincent  and\n      Kober, Thomas  and\n      Tatman, Rachael\",\n    booktitle = \"Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)\",\n    month = nov,\n    year = \"2020\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/2020.nlposs-1.8\",\n    doi = \"10.18653/v1/2020.nlposs-1.8\",\n    pages = \"52--60\",\n    abstract = \"We introduce whatlies, an open source toolkit for visually inspecting word and sentence embeddings. 
The project offers a unified and extensible API with current support for a range of popular embedding backends including spaCy, tfhub, huggingface transformers, gensim, fastText and BytePair embeddings. The package combines a domain specific language for vector arithmetic with visualisation tools that make exploring word embeddings more intuitive and concise. It offers support for many popular dimensionality reduction techniques as well as many interactive visualisations that can either be statically exported or shared via Jupyter notebooks. The project documentation is available from https://koaning.github.io/whatlies/.\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoaning%2Fwhatlies","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkoaning%2Fwhatlies","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoaning%2Fwhatlies/lists"}