{"id":23046779,"url":"https://github.com/0xnu/tiny_llm_trainer","last_synced_at":"2025-04-03T02:43:39.548Z","repository":{"id":246497699,"uuid":"820524151","full_name":"0xnu/tiny_llm_trainer","owner":"0xnu","description":"The experiment implements a tiny language model trainer using PyTorch.","archived":false,"fork":false,"pushed_at":"2024-06-30T10:26:09.000Z","size":35,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-25T18:12:14.775Z","etag":null,"topics":["large-language-model","large-language-models","llm","llm-training","pytorch","text-generation","text-to-speech","tts","visual-question-answering","vqa","wiki","wikipedia"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/0xnu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-26T16:31:49.000Z","updated_at":"2024-06-30T10:26:12.000Z","dependencies_parsed_at":"2024-06-28T10:14:02.766Z","dependency_job_id":null,"html_url":"https://github.com/0xnu/tiny_llm_trainer","commit_stats":null,"previous_names":["0xnu/tiny_llm_trainer"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xnu%2Ftiny_llm_trainer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xnu%2Ftiny_llm_trainer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xnu%2Ftiny_llm_trainer/releases","manifests_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/0xnu%2Ftiny_llm_trainer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/0xnu","download_url":"https://codeload.github.com/0xnu/tiny_llm_trainer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246927809,"owners_count":20856193,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-language-model","large-language-models","llm","llm-training","pytorch","text-generation","text-to-speech","tts","visual-question-answering","vqa","wiki","wikipedia"],"created_at":"2024-12-15T22:29:14.205Z","updated_at":"2025-04-03T02:43:39.520Z","avatar_url":"https://github.com/0xnu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Tiny LLM Trainer\n\nThe experiment implements a tiny language model trainer using [PyTorch](https://pytorch.org/). 
I designed it to train on Wikipedia data and generate text based on the learned patterns.\n\n### Features\n\n- PyTorch-based implementation\n- Transformer architecture\n- Configurable model size and training parameters\n- Text generation with temperature and top-k sampling\n\n### Requirements\n\n- Python 3.7+\n- PyTorch\n- NumPy\n- Pillow\n\n### Project Structure\n\n```sh\n.\n├── data\n├── models\n├── wikipedia_data.py\n├── tiny_llm_trainer.py\n├── flickr_data.py\n├── tiny_llm_trainer_vqa.py\n├── cvc_data.py\n└── tiny_llm_trainer_cvc.py\n```\n\n### Files\n\n- `data/`: Directory where preprocessed training data from Wikipedia is saved.\n- `models/`: Directory where trained models are saved.\n- `wikipedia_data.py`: Script for downloading and preprocessing [Wikipedia](https://www.wikipedia.org) data.\n- `tiny_llm_trainer.py`: The main script for training the model.\n- `flickr_data.py`: Script for downloading and preprocessing [Flickr](https://flickr.com) image data.\n- `tiny_llm_trainer_vqa.py`: Script for training the model on Visual Question Answering (VQA) tasks using Flickr data.\n- `cvc_data.py`: Script for downloading and preprocessing [Common Voice Corpus 1](https://commonvoice.mozilla.org/en/datasets) data.\n- `tiny_llm_trainer_cvc.py`: Script for training a TTS model using Common Voice Corpus 1 data.\n\n### Usage\n\n1. Python Package Installer:\n\n   ```sh\n   pip3 install uv\n   ```\n\n2. Prerequisites:\n\n   ```sh\n   python3 -m venv .venv\n   source .venv/bin/activate\n   uv pip install -r requirements.txt\n   python3 -m pip install --upgrade pip\n   deactivate # deactivate virtual environment\n   ```\n\n### Text Generation\n\n1. Prepare Data:\n\n   ```sh\n   python3 wikipedia_data.py\n   ```\n\n2. Train LLM:\n\n   ```sh\n   python3 tiny_llm_trainer.py\n   ```\n\n### Visual Question Answering (VQA)\n\n1. Prepare Data:\n\n   ```sh\n   python3 flickr_data.py\n   ```\n\n2. 
Train VQA — Multimodal:\n\n   ```sh\n   python3 tiny_llm_trainer_vqa.py\n   ```\n\n### Text-to-Speech (TTS)\n\n1. Prepare Data:\n\n   ```sh\n   python3 cvc_data.py\n   ```\n\n2. Train TTS:\n\n   ```sh\n   python3 tiny_llm_trainer_cvc.py\n   ```\n\n### References\n\n+ [Large Language Model (LLM) AI text generation detection based on transformer deep learning algorithm](https://arxiv.org/abs/2405.06652)\n+ [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044)\n+ [From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models](https://arxiv.org/abs/2212.10846)\n+ [Enhancing Image Caption Generation Using Reinforcement Learning with Human Feedback](https://arxiv.org/abs/2403.06735)\n+ [VQA: Visual Question Answering](https://arxiv.org/abs/1505.00468)\n+ [Meta Learning Text-to-Speech Synthesis in over 7000 Languages](https://arxiv.org/abs/2406.06403)\n+ [Text to Speech Synthesis](https://arxiv.org/abs/2401.13891)\n\n### License\n\nThis project is licensed under the [Apache License 2.0](./LICENSE).\n\n### Citation\n\n```tex\n@misc{tlt2024,\n  author       = {Oketunji, A.F.},\n  title        = {Tiny LLM Trainer},\n  year         = 2024,\n  version      = {0.0.6},\n  publisher    = {Zenodo},\n  doi          = {10.5281/zenodo.12593929},\n  url          = {https://doi.org/10.5281/zenodo.12593929}\n}\n```\n\n### Copyright\n\n(c) 2024 [Finbarrs Oketunji](https://finbarrs.eu). All Rights Reserved.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0xnu%2Ftiny_llm_trainer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F0xnu%2Ftiny_llm_trainer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0xnu%2Ftiny_llm_trainer/lists"}