{"id":51221063,"url":"https://github.com/huggon1/llm-from-scratch","last_synced_at":"2026-06-28T07:03:30.425Z","repository":{"id":344404790,"uuid":"1181691511","full_name":"huggon1/llm-from-scratch","owner":"huggon1","description":"Small, readable experiments from tokenizer training to LoRA and DPO.","archived":false,"fork":false,"pushed_at":"2026-03-14T14:00:37.000Z","size":180,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-15T01:12:38.803Z","etag":null,"topics":["dpo","llm","lora","pretraining","tokenizer"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/huggon1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-14T13:49:47.000Z","updated_at":"2026-03-14T14:01:09.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/huggon1/llm-from-scratch","commit_stats":null,"previous_names":["huggon1/llm-from-scratch"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/huggon1/llm-from-scratch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggon1%2Fllm-from-scratch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggon1%2Fllm-from-scratch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggon1%2Fllm-from-scratch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggon1%2Fllm-from-scratch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/huggon1","download_url":"https://codeload.github.com/huggon1/llm-from-scratch/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/huggon1%2Fllm-from-scratch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34880191,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-28T02:00:05.809Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dpo","llm","lora","pretraining","tokenizer"],"created_at":"2026-06-28T07:03:30.022Z","updated_at":"2026-06-28T07:03:30.418Z","avatar_url":"https://github.com/huggon1.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# llm-from-scratch\n\nSmall, readable experiments that walk through a mini LLM pipeline:\n\n1. train a tokenizer\n2. pretrain a compact language model\n3. run supervised fine-tuning\n4. try DPO\n5. try LoRA fine-tuning\n\n## Highlights\n\n- Follows a clear learning path from tokenizer training to preference optimization\n- Keeps each stage in a separate folder so experiments stay easy to inspect\n- Uses tiny public-safe samples so the repository remains runnable and lightweight\n\n## Layout\n\n```text\nllm-from-scratch/\n  tokenizer/\n  pretrain/\n  sft/\n  dpo/\n  lora/\n  docs/\n```\n\n## Requirements\n\n- Python 3.10+\n- PyTorch\n- Transformers\n- Tokenizers\n- Pandas and NumPy\n\nInstall:\n\n```bash\npip install -r requirements.txt\n```\n\n## What's Included\n\n- Training scripts and model definitions for each stage\n- Small sample datasets for demonstration\n- Notes copied from the original study folders under `docs/`\n- Minimal public-safe data samples that keep the repository lightweight\n\n## What's Omitted\n\n- Large checkpoints and `.pth` weights\n- Full training corpora and large raw datasets\n- Temporary training outputs\n- Cache files and Python bytecode\n\nSome scripts expect base model weights produced by an earlier stage. Those weights are intentionally not committed, so you should place them in the expected local path before training or inference.\n\n## Suggested Order\n\n### 1. Tokenizer\n\n```bash\ncd tokenizer\npython main.py\n```\n\n### 2. Pretrain\n\nUse the tokenizer artifacts under `pretrain/tokenizer/`, then run:\n\n```bash\ncd pretrain\npython main.py\n```\n\n### 3. SFT / DPO / LoRA\n\nThese stages depend on locally available base weights from prior training.\n\n```bash\ncd sft\npython main.py\n```\n\n```bash\ncd dpo\npython main.py\n```\n\n```bash\ncd lora\npython main.py\n```\n\nIn practice, the easiest way to explore the repo is to read `docs/` first, then run the tokenizer and pretrain stages before looking at SFT, DPO, and LoRA.\n\n## Notes\n\n- The repository keeps each stage self-contained so the training flow stays easy to follow.\n- The original Chinese notes are preserved as markdown files in `docs/`.\n- Sample datasets are intentionally small and mainly serve as runnable examples.\n- The original larger datasets used during experimentation came from open-source or publicly available materials.\n- Only a few sanitized sample rows are committed here so the repository stays lightweight and easy to share.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggon1%2Fllm-from-scratch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhuggon1%2Fllm-from-scratch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhuggon1%2Fllm-from-scratch/lists"}