{"id":30799353,"url":"https://github.com/hugojarkoff/rapgpt","last_synced_at":"2025-09-05T19:11:37.022Z","repository":{"id":263258525,"uuid":"768775080","full_name":"hugojarkoff/rapGPT","owner":"hugojarkoff","description":"SLM (Small Language Model) trained on French Rap Lyrics","archived":false,"fork":false,"pushed_at":"2024-11-17T11:20:34.000Z","size":103,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-17T11:37:34.481Z","etag":null,"topics":["llm-training","rap","slm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hugojarkoff.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-07T17:54:11.000Z","updated_at":"2024-11-17T11:20:38.000Z","dependencies_parsed_at":"2024-11-18T01:35:11.722Z","dependency_job_id":null,"html_url":"https://github.com/hugojarkoff/rapGPT","commit_stats":null,"previous_names":["hugojarkoff/rapgpt"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hugojarkoff/rapGPT","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hugojarkoff%2FrapGPT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hugojarkoff%2FrapGPT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hugojarkoff%2FrapGPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hugojarkoff%2FrapGPT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hugojarko
ff","download_url":"https://codeload.github.com/hugojarkoff/rapGPT/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hugojarkoff%2FrapGPT/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273806204,"owners_count":25171568,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-05T02:00:09.113Z","response_time":402,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm-training","rap","slm"],"created_at":"2025-09-05T19:11:23.515Z","updated_at":"2025-09-05T19:11:37.005Z","avatar_url":"https://github.com/hugojarkoff.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# rapGPT\n\nTrain a GPT-like model to generate French rap lyrics.\n\nEssentially a fun and educational personal project, learning how to design and train a GPT-like architecture from scratch.\n\n## 0. Dependencies management\nThis project uses [Rye](https://rye-up.com/). Make sure it's installed in your system.\n\nTo install **all** dependencies (downloading data, training etc.), run `rye sync --all-features` in project directory.\n\n## 1. Data\nThis project uses [French Rap Lyrics Kaggle dataset](https://www.kaggle.com/datasets/adibhabbou/french-rap-lyrics?resource=download).\n\nTo download it, register your kaggle API token. See instructions [here](https://www.kaggle.com/docs/api). 
In short, download your `kaggle.json` token and move it to `~/.kaggle/kaggle.json`.\n\nThen run `python scripts/download_data.py`.\n\n## 2. Train\nMake sure you have access to a decent GPU, as the default model config is pretty VRAM-heavy.\n\nFrom repo root, run `python scripts/train.py` with an optional `config` arg (which defaults to `configs/config.toml`).\n\nThe best model, as measured by [`torcheval.metrics.Perplexity`](https://pytorch.org/torcheval/main/generated/torcheval.metrics.Perplexity.html), is tracked and saved on disk. By default, checkpoints are saved in `checkpoints/\u003crun_name\u003e`.\n\n**NOTE**: This project uses [WandB](https://wandb.ai/) to log and record experiments. If your training config specifies `wandb.mode = online`, make sure you've registered your account with your API key.\n\n## 3. Pushing to HF Hub\nOnce your model is trained, you can push the checkpoint to [HF](https://huggingface.co/) using `scripts/push_to_hf_hub.py` with the appropriate arguments. It will push the following three components:\n- `model.pt` (specified argument), converted to `model.safetensors` (using the `rapgpt.model.HFHubTransformerModel` mixin) for ease of inference on HF Space;\n- `config.toml` (specified argument);\n- `artists_tokens.txt` (specified argument).\n\nThese three components are required for inference (see next section).\n\n## 4. Local Inference\nThis project uses [Gradio](https://www.gradio.app/) for local and online inference.\n\nRun local inference with the `python app/app.py` script. Some additional arguments can be passed, essentially indicating whether to use the [default checkpoint on HF Hub](https://huggingface.co/hugojarkoff/rapGPT/tree/main) or some local checkpoint.\n\n## 5. Online Inference\nOnline inference is served [on HF](https://huggingface.co/spaces/hugojarkoff/rapGPT) through the (more or less) same Gradio `app`. 
It automatically loads the [default checkpoint on HF Hub](https://huggingface.co/hugojarkoff/rapGPT/tree/main) for inference.\n\n## Future Work / Ideas\n\nSince this project is mostly personal / educational (and since I'm GPU poor), it is probably not production-ready in its current state (and is not intended to become so in the foreseeable future). However, here are some interesting leads I plan on exploring:\n\n- I noticed the style of each rapper isn't sufficiently marked. To enforce this more in model training, I want to try adding a classification head and backpropagating using logits + classification losses;\n- Clean up the code / use more production-ready modules (e.g. [FlashAttention](https://github.com/Dao-AILab/flash-attention))\n- Train in fp16\n- Find a way to select multiple artist tokens (for mixing styles, could be fun)\n\n## Credits\n\nInspired by the great [nanoGPT](https://github.com/karpathy/nanoGPT)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhugojarkoff%2Frapgpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhugojarkoff%2Frapgpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhugojarkoff%2Frapgpt/lists"}