Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/veekaybee/what_are_embeddings
A deep dive into embeddings starting from fundamentals
- Host: GitHub
- URL: https://github.com/veekaybee/what_are_embeddings
- Owner: veekaybee
- Created: 2023-05-23T10:19:37.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-24T22:13:34.000Z (8 months ago)
- Last Synced: 2024-03-24T23:22:34.681Z (8 months ago)
- Topics: embeddings, machine-learning, machine-learning-algorithms, nlp-machine-learning
- Language: Jupyter Notebook
- Homepage: http://vickiboykis.com/what_are_embeddings/
- Size: 47.4 MB
- Stars: 811
- Watchers: 10
- Forks: 67
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Citation: CITATION.cff
README
# What are embeddings?
This repository contains the generated LaTeX document, website, and complementary notebook code for ["What are Embeddings?"](https://vickiboykis.com/what_are_embeddings/). [![DOI](https://zenodo.org/badge/644343479.svg)](https://zenodo.org/badge/latestdoi/644343479)
## Abstract
Over the past decade, embeddings (numerical representations of non-tabular machine learning features used as input to deep learning models) have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have long been key tools in machine learning systems for compressing and making sense of large amounts of textual data. However, these traditional approaches were limited in the amount of context they could reason about as data grew. As the volume, velocity, and variety of data captured by modern applications has exploded, creating approaches specifically tailored to scale has become increasingly important.
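To make the contrast concrete, here is a minimal sketch of the kind of sparse representation the abstract refers to (assuming scikit-learn is installed; the toy corpus is invented for illustration). Each document becomes a vector with one dimension per vocabulary term, so dimensionality grows with the vocabulary and the vectors carry no notion of semantic similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Fit a TF-IDF vectorizer: one sparse row of term weights per document.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # vocabulary terms (one dimension each)
print(tfidf_matrix.toarray())              # TF-IDF weights, one row per document
```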
[Google's Word2Vec paper](https://arxiv.org/abs/1301.3781) marked an important step in moving from simple statistical representations to the semantic meaning of words. The subsequent rise of the [Transformer architecture](https://arxiv.org/abs/1706.03762) and transfer learning, as well as the latest surge in generative methods, has enabled the growth of embeddings as a foundational machine learning data structure. This survey paper aims to provide a deep dive into what embeddings are, their history, and their usage patterns in industry.
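As a small sketch of that shift (assuming the gensim library is available; the toy corpus and hyperparameters below are invented for illustration, not taken from the paper), training a Word2Vec model maps each word to a dense vector whose geometry reflects co-occurrence patterns, so related words end up near each other:

```python
from gensim.models import Word2Vec

# Toy corpus of pre-tokenized sentences, invented for illustration.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]

# Train a small skip-gram model (sg=1); each word gets a 50-dimensional
# dense vector learned from its surrounding context windows.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)              # (50,) dense embedding for "cat"
print(model.wv.similarity("cat", "dog"))  # cosine similarity between word vectors
```

Unlike the TF-IDF rows above, these vectors have a fixed, small dimensionality regardless of vocabulary size, which is part of what makes embeddings practical at industrial scale.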
## Running
The [LaTeX document](https://github.com/veekaybee/what_are_embeddings/blob/main/.github/workflows/main.yaml) is written in Overleaf and deployed to GitHub, where it's compiled via Actions. The site is likewise generated via Actions from the `site` branch. The notebooks are flying fast and free and are not under any kind of CI whatsoever.
## Contributing
If you have any changes that you'd like to make to the document, including clarifications or typo fixes, you'll need to build the LaTeX artifact. I use GitHub to track issues and feature requests, as well as accept pull requests. Pull requests are the best way to propose changes to the codebase:
1. Fork the repo and create your branch from `main`.
2. Make your changes in your fork.
3. Make sure that your LaTeX document compiles. The GitHub Action that builds the PDF is set to run on PRs into `main`.
4. Ensure that the document compiles to a PDF correctly and inspect the output.
5. Make sure your code lints.
6. Issue that pull request!
## Citing
```bibtex
@software{Boykis_What_are_embeddings_2023,
  author  = {Boykis, Vicki},
  doi     = {10.5281/zenodo.8015029},
  month   = jun,
  title   = {{What are embeddings?}},
  url     = {https://github.com/veekaybee/what_are_embeddings},
  version = {1.0.1},
  year    = {2023}
}
```