https://github.com/jamesdconley/embeddageddon-organized

Host: GitHub
URL: https://github.com/jamesdconley/embeddageddon-organized
Owner: JamesDConley
Created: 2025-10-21T02:25:41.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-10-31T04:22:54.000Z (8 months ago)
Last Synced: 2025-10-31T06:14:01.014Z (8 months ago)
Language: Python
Size: 78.1 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Embeddageddon
Good artists copy, great artists steal!

## Pull Embeddings
`python src/extract_embeddings.py`

## Set Up Dataset
`python src/preprocess_embeddings.py --embedding_dir data/embedding_extraction/embedding_dicts --output_dir data/autoencoder/datasets/full`

## Train Autoencoder

Train on the preprocessed dataset with:
`python src/ae_train_memmap.py --preprocessed_dir data/autoencoder/datasets/full --batch_size 512 --num_workers 2 --epochs 100 --device cuda --output_dir data/autoencoder/trained_models/epochs_100_$(date +%Y%m%d_%H%M%S)`
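
The `$(date +%Y%m%d_%H%M%S)` suffix is shell command substitution: it stamps each run's output directory with the launch time so runs don't overwrite each other. If you launch training from Python rather than a shell, the same naming scheme can be sketched as below. `run_output_dir` is a hypothetical helper for illustration, not something in this repo.

```python
from datetime import datetime
from pathlib import Path


def run_output_dir(base="data/autoencoder/trained_models", prefix="epochs_100", now=None):
    """Build a timestamped run directory path like the shell command produces."""
    now = now or datetime.now()
    # Same format string as `date +%Y%m%d_%H%M%S`.
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(base) / f"{prefix}_{stamp}"


# A fixed time is passed here only to make the example reproducible.
print(run_output_dir(now=datetime(2025, 10, 19, 16, 3, 46)).as_posix())
# → data/autoencoder/trained_models/epochs_100_20251019_160346
```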

If you have enough RAM to hold all of the embedding dictionaries, you can use `ae_train_preloaded.py` instead. Note that the data is loaded once per worker, and multiple workers are needed to keep up with a single RTX Pro 6000, even with everything preloaded into memory.

I also find that the memmap script gets poor GPU utilization. There is definitely room to optimize how data is fed to the training code.
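
For context, memory-mapped loading means batches are sliced out of a file on disk on demand instead of being held in RAM once per worker. The sketch below shows the idea using only the standard library. It is not the code in `ae_train_memmap.py`; the file layout (rows of native floats stored back to back), the file name, and the functions `write_rows` and `iter_batches` are all assumptions made for illustration.

```python
import mmap
import os
import tempfile
from array import array


def write_rows(path, rows):
    """Write rows as back-to-back native floats (row-major), an assumed layout."""
    with open(path, "wb") as f:
        for row in rows:
            array("f", row).tofile(f)


def iter_batches(path, dim, batch_size):
    """Yield batches of rows by mapping the file rather than reading it all into RAM."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # View the mapped bytes as floats; the context manager releases
            # the view before the map itself is closed.
            with memoryview(mm).cast("f") as view:
                n_rows = len(view) // dim
                for start in range(0, n_rows, batch_size):
                    stop = min(start + batch_size, n_rows)
                    # Only the rows of this batch are copied out of the mapping.
                    yield [view[r * dim:(r + 1) * dim].tolist() for r in range(start, stop)]


# Tiny hypothetical dataset: 3 rows of width 4, read in batches of 2.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.f32")
    write_rows(path, [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])
    for batch in iter_batches(path, dim=4, batch_size=2):
        print(len(batch), "rows:", batch)
```

Each batch here is a copy made at read time, so the per-batch slicing cost lands on the data-loading side. That is one plausible place to look when the GPU sits idle, though I have not profiled it.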

## Generate Embeddageddon Embeddings
`python src/generate_embeddings.py --embedding_dicts_dir data/embedding_extraction/embedding_dicts --encoder_model_path data/autoencoder/trained_models/epochs_100_20251019_160346/models/embeddageddon_model_final.pth --output_dir data/embeddageddon_embeddings/xl --bottleneck_dim 7168`

`python src/generate_embeddings.py --embedding_dicts_dir data/embedding_extraction/embedding_dicts --encoder_model_path data/autoencoder/trained_models/no_dropout_epochs_100_20251021_092845/models/embeddageddon_model_final.pth --output_dir data/embeddageddon_embeddings/no_dropout --bottleneck_dim 7168`

`python src/generate_embeddings.py --embedding_dicts_dir data/embedding_extraction/embedding_dicts --encoder_model_path data/autoencoder/trained_models/no_dropout_epochs_100_20251021_092845/checkpoints/embeddageddon_model_epoch_50.pth --output_dir data/embeddageddon_embeddings/50_epochs_no_dropout --bottleneck_dim 7168`
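
The three invocations above differ only in `--encoder_model_path` and `--output_dir`. As a convenience, they can be generated from a small table. This is a minimal sketch: the flags and paths are copied from the commands above, while `RUNS`, `build_command`, and `main` are hypothetical names, and nothing is launched unless `dry_run=False` is passed.

```python
import subprocess

EMBEDDING_DICTS_DIR = "data/embedding_extraction/embedding_dicts"
MODELS_DIR = "data/autoencoder/trained_models"
BOTTLENECK_DIM = 7168

# (output subdirectory, checkpoint path relative to MODELS_DIR)
RUNS = [
    ("xl", "epochs_100_20251019_160346/models/embeddageddon_model_final.pth"),
    ("no_dropout", "no_dropout_epochs_100_20251021_092845/models/embeddageddon_model_final.pth"),
    ("50_epochs_no_dropout",
     "no_dropout_epochs_100_20251021_092845/checkpoints/embeddageddon_model_epoch_50.pth"),
]


def build_command(output_name, checkpoint):
    """Assemble one generate_embeddings.py invocation as an argument list."""
    return [
        "python", "src/generate_embeddings.py",
        "--embedding_dicts_dir", EMBEDDING_DICTS_DIR,
        "--encoder_model_path", f"{MODELS_DIR}/{checkpoint}",
        "--output_dir", f"data/embeddageddon_embeddings/{output_name}",
        "--bottleneck_dim", str(BOTTLENECK_DIM),
    ]


def main(dry_run=True):
    for name, checkpoint in RUNS:
        cmd = build_command(name, checkpoint)
        print(" ".join(cmd))
        if not dry_run:
            # Stop at the first failing run instead of continuing silently.
            subprocess.run(cmd, check=True)


main()
```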

## Set Up a Training Dataset
`python src/generate_dataset.py --output_dir data/llm_datasets/redpajama_small`

## Parameter Counts

- S: 298_447_744
- M: 1_147_412_224
- L: 5_597_896_192
- XL: 24_173_026_304

## Chinchilla Token Counts

Training-token budgets at 20 tokens per parameter:

- S: 5_968_954_880
- M: 22_948_244_480
- L: 111_957_923_840
- XL: 483_460_526_080
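
As a quick check on the arithmetic, each token count is the matching parameter count times 20, the commonly cited Chinchilla rule of thumb. The sketch below reproduces the numbers; `TOKENS_PER_PARAM` and `chinchilla_tokens` are names chosen for illustration, and the ratio of 20 is inferred from the figures listed here rather than read from the repo's code.

```python
# Parameter counts copied from the "Parameter Counts" section.
PARAMS = {
    "S": 298_447_744,
    "M": 1_147_412_224,
    "L": 5_597_896_192,
    "XL": 24_173_026_304,
}

TOKENS_PER_PARAM = 20  # assumed ratio, consistent with the listed token counts


def chinchilla_tokens(n_params, ratio=TOKENS_PER_PARAM):
    """Return the training-token budget for a model with n_params parameters."""
    return n_params * ratio


for name, n_params in PARAMS.items():
    # The `_` format spec groups digits the same way the README writes them.
    print(f"{name} - {chinchilla_tokens(n_params):_}")
```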
