https://github.com/jamesdconley/embeddageddon-organized
- Host: GitHub
- URL: https://github.com/jamesdconley/embeddageddon-organized
- Owner: JamesDConley
- Created: 2025-10-21T02:25:41.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-10-31T04:22:54.000Z (8 months ago)
- Last Synced: 2025-10-31T06:14:01.014Z (8 months ago)
- Language: Python
- Size: 78.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Embeddageddon
Good artists copy, great artists steal!
## Pull Embeddings
`python src/extract_embeddings.py`
## Setup Dataset
`python src/preprocess_embeddings.py --embedding_dir data/embedding_extraction/embedding_dicts --output_dir data/autoencoder/datasets/full`
## Train Autoencoder
Train on the preprocessed dataset with:
`python src/ae_train_memmap.py --preprocessed_dir data/autoencoder/datasets/full --batch_size 512 --num_workers 2 --epochs 100 --device cuda --output_dir data/autoencoder/trained_models/epochs_100_$(date +%Y%m%d_%H%M%S)`
You can also use `ae_train_preloaded.py` if you have enough memory to fit all of the dictionaries into RAM. Note that the preloading happens per worker, and multiple workers are needed to keep up with a single RTX Pro 6000, even with everything preloaded into memory.
I also find that the memmap script gets poor GPU utilization. There's definitely optimization to be done in how the data is fed to the training code.
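For anyone unfamiliar with the memmap approach, here is a minimal sketch of the general idea using `numpy.memmap`. It is not the code in `ae_train_memmap.py`. The flat float32 file layout, the `NUM_ROWS`/`DIM` sizes, and the helper names `write_demo_file` and `iter_batches` are all assumptions made up for illustration. The point is that only the rows requested for a batch are read from disk, so the full dataset never has to sit in RAM.

```python
import os
import tempfile

import numpy as np

# Hypothetical layout: one flat float32 file holding (NUM_ROWS, DIM) vectors.
# The real preprocessed format in this repo may differ.
NUM_ROWS, DIM = 1000, 64


def write_demo_file(path):
    """Write random vectors to disk so the sketch is self-contained."""
    rng = np.random.default_rng(0)
    data = rng.standard_normal((NUM_ROWS, DIM)).astype(np.float32)
    data.tofile(path)
    return data


def iter_batches(path, batch_size, seed=0):
    """Yield shuffled batches, reading only the requested rows from disk."""
    table = np.memmap(path, dtype=np.float32, mode="r", shape=(NUM_ROWS, DIM))
    order = np.random.default_rng(seed).permutation(NUM_ROWS)
    for start in range(0, NUM_ROWS, batch_size):
        # Sorted indices keep disk reads closer to sequential.
        idx = np.sort(order[start:start + batch_size])
        # Fancy indexing copies the selected rows into an in-memory array.
        yield np.asarray(table[idx])


demo_path = os.path.join(tempfile.mkdtemp(), "vectors.f32")
reference = write_demo_file(demo_path)
batches = list(iter_batches(demo_path, batch_size=256))
print(len(batches), batches[0].shape, batches[-1].shape)
```

The trade-off is the one described above: memmap keeps RAM usage low but every batch pays for disk reads, while preloading avoids the reads at the cost of holding a full copy in memory.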
## Generate Embeddageddon Embeddings
`python src/generate_embeddings.py --embedding_dicts_dir data/embedding_extraction/embedding_dicts --encoder_model_path data/autoencoder/trained_models/epochs_100_20251019_160346/models/embeddageddon_model_final.pth --output_dir data/embeddageddon_embeddings/xl --bottleneck_dim 7168`
`python src/generate_embeddings.py --embedding_dicts_dir data/embedding_extraction/embedding_dicts --encoder_model_path data/autoencoder/trained_models/no_dropout_epochs_100_20251021_092845/models/embeddageddon_model_final.pth --output_dir data/embeddageddon_embeddings/no_dropout --bottleneck_dim 7168`
`python src/generate_embeddings.py --embedding_dicts_dir data/embedding_extraction/embedding_dicts --encoder_model_path data/autoencoder/trained_models/no_dropout_epochs_100_20251021_092845/checkpoints/embeddageddon_model_epoch_50.pth --output_dir data/embeddageddon_embeddings/50_epochs_no_dropout --bottleneck_dim 7168`
## Setup a Training Dataset
`python src/generate_dataset.py --output_dir data/llm_datasets/redpajama_small`
## Parameter Counts
- S: 298_447_744
- M: 1_147_412_224
- L: 5_597_896_192
- XL: 24_173_026_304
## Chinchilla Token Counts
- S: 5_968_954_880
- M: 22_948_244_480
- L: 111_957_923_840
- XL: 483_460_526_080
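
Each token count above is exactly 20 times the matching parameter count, which matches the common Chinchilla rule of thumb of roughly 20 training tokens per parameter. A small sketch to reproduce the numbers (the names `PARAM_COUNTS`, `TOKENS_PER_PARAM`, and `chinchilla_tokens` are made up for this example and are not part of the repo):

```python
# Parameter counts copied from the Parameter Counts list above.
PARAM_COUNTS = {
    "S": 298_447_744,
    "M": 1_147_412_224,
    "L": 5_597_896_192,
    "XL": 24_173_026_304,
}

# Assumption: the ~20 tokens-per-parameter rule of thumb.
TOKENS_PER_PARAM = 20


def chinchilla_tokens(num_params, tokens_per_param=TOKENS_PER_PARAM):
    """Return the approximate compute-optimal training token budget."""
    return num_params * tokens_per_param


token_budgets = {name: chinchilla_tokens(n) for name, n in PARAM_COUNTS.items()}
for name, tokens in token_budgets.items():
    # The "_" format spec prints thousands separators as underscores.
    print(f"{name}: {tokens:_}")
```

Change `TOKENS_PER_PARAM` if you want to budget for a different tokens-per-parameter ratio.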