https://github.com/hugojarkoff/rapgpt
SLM (Small Language Model) trained on French Rap Lyrics
llm-training rap slm
- Host: GitHub
- URL: https://github.com/hugojarkoff/rapgpt
- Owner: hugojarkoff
- Created: 2024-03-07T17:54:11.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-11-17T11:20:34.000Z (over 1 year ago)
- Last Synced: 2024-11-17T11:37:34.481Z (over 1 year ago)
- Topics: llm-training, rap, slm
- Language: Python
- Homepage:
- Size: 101 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# rapGPT
Train a GPT-like model to generate French rap lyrics.
This is essentially a fun, educational personal project for learning how to design and train a GPT-like architecture from scratch.
## 0. Dependency management
This project uses [Rye](https://rye-up.com/). Make sure it's installed on your system.
To install **all** dependencies (for downloading data, training, etc.), run `rye sync --all-features` in the project directory.
## 1. Data
This project uses [French Rap Lyrics Kaggle dataset](https://www.kaggle.com/datasets/adibhabbou/french-rap-lyrics?resource=download).
To download it, register your Kaggle API token (see the instructions [here](https://www.kaggle.com/docs/api)). In short, download your `kaggle.json` token and move it to `~/.kaggle/kaggle.json`.
Then run `python scripts/download_data.py`.
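Before running the download script, it can help to confirm that the token is where the Kaggle client expects it. The following is a minimal sketch using only the standard library. The helper name `check_kaggle_token` is hypothetical (it is not part of this repo), and it only checks that the file exists and contains the `username` and `key` fields of a Kaggle token.

```python
import json
from pathlib import Path

# Location described above; the Kaggle client reads the token from here.
DEFAULT_TOKEN_PATH = Path.home() / ".kaggle" / "kaggle.json"


def check_kaggle_token(path: Path = DEFAULT_TOKEN_PATH) -> str:
    """Return "missing", "invalid", or "ok" for the token file at `path`.

    Hypothetical helper for illustration; it does not contact Kaggle.
    """
    if not path.is_file():
        return "missing"
    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        return "invalid"
    # A Kaggle token is a JSON object with "username" and "key" fields.
    if not isinstance(token, dict) or not {"username", "key"} <= token.keys():
        return "invalid"
    return "ok"


if __name__ == "__main__":
    print(f"Kaggle token at {DEFAULT_TOKEN_PATH}: {check_kaggle_token()}")
```

If this reports `missing` or `invalid`, re-download the token from your Kaggle account page before running the download script.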
## 2. Train
Make sure you have access to a decent GPU, as the default model config is pretty VRAM-heavy.
From the repo root, run `python scripts/train.py` with an optional `config` arg (defaults to `configs/config.toml`).
The best model is selected according to [`torcheval.metrics.Perplexity`](https://pytorch.org/torcheval/main/generated/torcheval.metrics.Perplexity.html) and saved to disk. By default, checkpoints are saved in `checkpoints/`.
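For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better. The sketch below illustrates that definition and a "keep the best" rule in plain Python. It is not the project's training loop: the project relies on `torcheval`, and the names `perplexity` and `BestCheckpointTracker` are assumptions made for this example.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood), natural log."""
    return math.exp(sum(token_nlls) / len(token_nlls))


class BestCheckpointTracker:
    """Remembers the lowest perplexity seen so far (hypothetical helper)."""

    def __init__(self):
        self.best = math.inf

    def update(self, value):
        """Return True when `value` improves on the best value so far."""
        if value < self.best:
            self.best = value
            return True
        return False


if __name__ == "__main__":
    tracker = BestCheckpointTracker()
    # Made-up per-token losses for three evaluation rounds.
    for epoch, nlls in enumerate([[2.0, 2.2], [1.5, 1.7], [1.6, 1.8]]):
        ppl = perplexity(nlls)
        if tracker.update(ppl):
            print(f"epoch {epoch}: new best perplexity, would save a checkpoint")
```

A model that assigns uniform probability over 4 tokens has a per-token loss of `ln 4`, hence a perplexity of 4, which is a handy sanity check.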
**NOTE**: This project uses [WandB](https://wandb.ai/) to log and record experiments. If your training config specifies `wandb.mode = online`, make sure you've registered your account with your API key.
## 3. Pushing to HF Hub
Once your model is trained, you can push the checkpoint to [HF](https://huggingface.co/) using `scripts/push_to_hf_hub.py` with the appropriate arguments. It will push the following three components:
- `model.pt` (specified argument), converted to `model.safetensors` (using the `rapgpt.model.HFHubTransformerModel` mixin) for ease of inference on HF Space;
- `config.toml` (specified argument);
- `artists_tokens.txt` (specified argument).
These three components are required for inference (see next section).
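As a rough illustration of what pushing these files involves, the sketch below checks that all components exist locally and then uploads them with `huggingface_hub`'s `HfApi.upload_file`. It is not the repo's `scripts/push_to_hf_hub.py`: the function names and the repo id are hypothetical, and the files are uploaded as-is, without the `model.safetensors` conversion described above.

```python
from pathlib import Path


def missing_components(paths):
    """Return the subset of `paths` that do not exist as files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


def push_components(paths, repo_id):
    """Upload each file to the Hub under its own file name.

    Hypothetical helper: requires `huggingface_hub` and a logged-in token.
    """
    missing = missing_components(paths)
    if missing:
        raise FileNotFoundError(f"Missing components: {missing}")
    # Imported lazily so the pre-flight check works without the package.
    from huggingface_hub import HfApi

    api = HfApi()
    for path in map(Path, paths):
        api.upload_file(
            path_or_fileobj=str(path),
            path_in_repo=path.name,
            repo_id=repo_id,
        )


if __name__ == "__main__":
    # Placeholder paths; adjust to wherever your files actually live.
    files = ["model.pt", "config.toml", "artists_tokens.txt"]
    print("Missing before push:", missing_components(files))
```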
## 4. Local Inference
This project uses [Gradio](https://www.gradio.app/) for local and online inference.
Run local inference with `python app/app.py`. Additional arguments can be passed, mainly to indicate whether to use the [default checkpoint on HF Hub](https://huggingface.co/hugojarkoff/rapGPT/tree/main) or a local checkpoint.
## 5. Online Inference
Online inference is served [on HF](https://huggingface.co/spaces/hugojarkoff/rapGPT) through the (more or less) same Gradio `app`. It automatically calls the [default checkpoint on HF Hub](https://huggingface.co/hugojarkoff/rapGPT/tree/main) for inference.
## Future Work / Ideas
Since this project is mostly personal / educational (and since I'm GPU poor), it is probably not production-ready in its current state, and there are no plans to make it so. However, here are some interesting leads I plan on exploring:
- The style of each rapper isn't sufficiently marked. To enforce it more strongly during training, I want to try adding a classification head and backpropagating a combined loss (next-token loss on the logits + artist classification loss);
- Clean up the code / use more production-ready modules (e.g. [FlashAttention](https://github.com/Dao-AILab/flash-attention));
- Train in fp16;
- Find a way to select multiple artist tokens (for mixing styles, could be fun).
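The first idea in the list (an auxiliary artist-classification head) amounts to summing two cross-entropy terms. The sketch below shows that arithmetic in plain Python for a single position. It is an assumption about how such a loss could be combined, not code from this project: `combined_loss` and the `cls_weight` parameter are hypothetical names, and a real implementation would use batched tensors.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def combined_loss(token_logits, next_token, artist_logits, artist_id,
                  cls_weight=0.5):
    """Next-token loss plus a weighted artist-classification loss.

    `cls_weight` (hypothetical) balances the two terms.
    """
    lm_loss = cross_entropy(token_logits, next_token)
    cls_loss = cross_entropy(artist_logits, artist_id)
    return lm_loss + cls_weight * cls_loss


if __name__ == "__main__":
    # Made-up logits: 3-token vocabulary, 2 artists.
    loss = combined_loss([2.0, 0.5, -1.0], 0, [0.3, 1.2], 1)
    print(f"combined loss: {loss:.4f}")
```

Setting `cls_weight` to 0 recovers plain language-model training, which makes it easy to compare runs with and without the extra head.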
## Credits
Inspired by the great [nanoGPT](https://github.com/karpathy/nanoGPT)