https://github.com/veldhub/veld_chain__demo_wordembeddings_multiarch
A VELD demonstration, aggregating heterogeneous modular workflows into a cohesive reproducible pipeline.
analysis etl evaluation fasttext glove nlp word-embeddings word2vec wordembeddings
- Host: GitHub
- URL: https://github.com/veldhub/veld_chain__demo_wordembeddings_multiarch
- Owner: veldhub
- License: mit
- Created: 2024-12-07T16:51:57.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-01-20T19:24:25.000Z (9 months ago)
- Last Synced: 2025-01-30T10:18:32.276Z (8 months ago)
- Topics: analysis, etl, evaluation, fasttext, glove, nlp, word-embeddings, word2vec, wordembeddings
- Language: Jupyter Notebook
- Homepage:
- Size: 6.71 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
#  veld_chain__demo_wordembeddings_multiarch
## TL;DR
**Note: older versions of docker require `docker-compose` instead of `docker compose`.**

Run it all with:
```
git clone --recurse-submodules https://github.com/veldhub/veld_chain__demo_wordembeddings_multiarch.git
cd veld_chain__demo_wordembeddings_multiarch
docker compose -f veld_step_all.yaml up
```

## about
This is a [VELD](https://zenodo.org/records/13322913) demonstration.

It contains several chain velds, based on different isolated code stacks, which train word
embeddings from scratch. The Bible is used as training data and is preprocessed first. The
underlying word embedding architectures are [fastText](https://fasttext.cc/),
[GloVe](https://nlp.stanford.edu/projects/glove/), and
[word2vec](https://radimrehurek.com/gensim/models/word2vec.html). After training, a Jupyter
notebook is launched to compare the differently trained vectors on a few sample words.

The outcome of this training setup on such a small training data set is meant to illustrate
the reproducibility of the workflows, rather than to claim any deeper insight into the word
contexts of the Bible itself.

The final step, an analysis of the entire training, is encapsulated in
[./code/analyse_vectors/notebooks/analyse_vectors.ipynb](./code/analyse_vectors/notebooks/analyse_vectors.ipynb).

## requirements
- git
- docker compose (note: older docker compose versions require running `docker-compose` instead of
`docker compose`)

Clone this repo with all its submodules:

```
git clone --recurse-submodules https://github.com/veldhub/veld_chain__demo_wordembeddings_multiarch.git
```

## how to reproduce
The entirety of the veldified workflows in this repo can be executed in two ways:
- all together in one [multi chain](#multi-chain)
- sequentially by executing each [chain individually](#individual-chains)

See each respective veld yaml file for more details.
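The second option, running each chain in sequence, could be scripted roughly as follows. This is a minimal sketch and not part of the repo: the helper names (`has_compose_plugin`, `compose_base_command`, `build_commands`, `run_all`) are hypothetical, the step file names are the ones listed under individual chains, and the script assumes it is started from the repo root.

```python
import shutil
import subprocess

# Step files as listed under "individual chains", in execution order.
STEP_FILES = [
    "veld_step_1_download.yaml",
    "veld_step_2_preprocess.yaml",
    "veld_step_3_train_fasttext.yaml",
    "veld_step_4_train_glove.yaml",
    "veld_step_5_train_word2vec.yaml",
    "veld_step_6_analyse_vectors.yaml",
]


def has_compose_plugin():
    """Return True if `docker compose` (the plugin form) is usable here."""
    if shutil.which("docker") is None:
        return False
    result = subprocess.run(["docker", "compose", "version"], capture_output=True)
    return result.returncode == 0


def compose_base_command(plugin_available):
    """Pick `docker compose` or the older standalone `docker-compose`."""
    return ["docker", "compose"] if plugin_available else ["docker-compose"]


def build_commands(base):
    """Build one `up` command per step file, in order."""
    return [base + ["-f", step_file, "up"] for step_file in STEP_FILES]


def run_all():
    """Run every step in sequence and stop at the first failing one."""
    base = compose_base_command(has_compose_plugin())
    for command in build_commands(base):
        print("running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_all()` would then execute the six steps one after another. Note that the last step launches a Jupyter notebook server, so that final call keeps running until the server is stopped.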
### multi chain
**[./veld_step_all.yaml](./veld_step_all.yaml)**
Runs all chains in one multi chain. It simply references the individual chains with docker compose's
`extends` functionality.

```
docker compose -f veld_step_all.yaml up
```

### individual chains
**[./veld_step_1_download.yaml](./veld_step_1_download.yaml)**
Downloads the Bible from https://raw.githubusercontent.com/mxw/grmr/master/src/finaltests/bible.txt.

```
docker compose -f veld_step_1_download.yaml up
```

**[./veld_step_2_preprocess.yaml](./veld_step_2_preprocess.yaml)**
Cleans the downloaded Bible and transforms it into a format suitable for training with the three
word embedding architectures.

```
docker compose -f veld_step_2_preprocess.yaml up
```

**[./veld_step_3_train_fasttext.yaml](./veld_step_3_train_fasttext.yaml)**
Trains a fastText model. Also exports its vectors into a pickle as a Python dict, with each key
being a word and each value being that word's multidimensional vector.

```
docker compose -f veld_step_3_train_fasttext.yaml up
```

**[./veld_step_4_train_glove.yaml](./veld_step_4_train_glove.yaml)**
Trains a GloVe model. Also exports its vectors into a pickle as a Python dict, with each key
being a word and each value being that word's multidimensional vector.

```
docker compose -f veld_step_4_train_glove.yaml up
```

**[./veld_step_5_train_word2vec.yaml](./veld_step_5_train_word2vec.yaml)**
Trains a word2vec model. Also exports its vectors into a pickle as a Python dict, with each key
being a word and each value being that word's multidimensional vector.

```
docker compose -f veld_step_5_train_word2vec.yaml up
```

**[./veld_step_6_analyse_vectors.yaml](./veld_step_6_analyse_vectors.yaml)**
Launches a Jupyter notebook at http://localhost:8888/ which loads the previously exported word
vectors and compares them numerically and visually on some sample words. The notebook is persisted
at:
[./code/analyse_vectors/notebooks/analyse_vectors.ipynb](./code/analyse_vectors/notebooks/analyse_vectors.ipynb).

After reproducing the entire previous sequence yourself and executing the notebook, feel free to
save the notebook and compare the resulting differences with
`git diff ./code/analyse_vectors/notebooks/analyse_vectors.ipynb`. The reproduced vector
similarities should differ only slightly from the previously recorded ones. This difference is due
to randomization within the training, but should be small enough to indicate approximate
reproduction.

```
docker compose -f veld_step_6_analyse_vectors.yaml up
```
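To make the numerical comparison and the "approximate reproduction" check more concrete, here is a minimal sketch of how the exported pickles could be inspected outside the notebook. It is not the notebook's actual code: the function names are hypothetical, the vectors are assumed to be plain sequences of floats (if a training step stores NumPy arrays, NumPy must be installed for `pickle.load` to succeed), and the toy words and values are made up for illustration.

```python
import math
import pickle


def load_vectors(path):
    """Load a pickled dict mapping word -> vector, the format exported by the training steps."""
    with open(path, "rb") as f:
        return pickle.load(f)


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pair_similarities(vectors, word_pairs):
    """Similarity per word pair, skipping pairs with out-of-vocabulary words."""
    return {
        (w1, w2): cosine_similarity(vectors[w1], vectors[w2])
        for w1, w2 in word_pairs
        if w1 in vectors and w2 in vectors
    }


def max_drift(recorded, reproduced):
    """Largest absolute similarity difference over pairs present in both runs."""
    shared = recorded.keys() & reproduced.keys()
    return max((abs(recorded[p] - reproduced[p]) for p in shared), default=0.0)


# Toy stand-ins for a recorded and a reproduced run (made-up values).
recorded_run = {"alpha": [1.0, 0.0], "beta": [0.9, 0.1], "gamma": [0.0, 1.0]}
reproduced_run = {"alpha": [1.0, 0.05], "beta": [0.9, 0.12], "gamma": [0.02, 1.0]}
pairs = [("alpha", "beta"), ("alpha", "gamma")]

drift = max_drift(
    pair_similarities(recorded_run, pairs),
    pair_similarities(reproduced_run, pairs),
)
print(f"largest similarity drift between runs: {drift:.4f}")
```

Similarities are compared per model rather than comparing raw vectors across models, because each architecture learns its own vector space and coordinates from different models are not directly comparable. As with any pickle, only load files you produced yourself.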