{"id":31809262,"url":"https://github.com/shekswess/tiny-reasoning-language-model","last_synced_at":"2025-10-11T05:25:11.813Z","repository":{"id":317075093,"uuid":"1061410448","full_name":"Shekswess/tiny-reasoning-language-model","owner":"Shekswess","description":"Code repository dedicated to experimenting and research with tiny reasoning language model","archived":false,"fork":false,"pushed_at":"2025-09-28T16:14:13.000Z","size":935,"stargazers_count":0,"open_issues_count":5,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-28T18:18:32.988Z","etag":null,"topics":["llm","post-training","reasoning","research","sft","slm","transformers","trl"],"latest_commit_sha":null,"homepage":"https://huggingface.co/collections/Shekswess/tiny-reasoning-language-model-68d924929c17ad8300544ae4","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Shekswess.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-21T21:07:22.000Z","updated_at":"2025-09-28T16:14:16.000Z","dependencies_parsed_at":"2025-10-01T06:46:13.769Z","dependency_job_id":null,"html_url":"https://github.com/Shekswess/tiny-reasoning-language-model","commit_stats":null,"previous_names":["shekswess/tiny-reasoning-language-model"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Shekswess/tiny-reasoning-language-model","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shekswess%2Ftiny-re
asoning-language-model","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shekswess%2Ftiny-reasoning-language-model/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shekswess%2Ftiny-reasoning-language-model/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shekswess%2Ftiny-reasoning-language-model/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Shekswess","download_url":"https://codeload.github.com/Shekswess/tiny-reasoning-language-model/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shekswess%2Ftiny-reasoning-language-model/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279006362,"owners_count":26084084,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-11T02:00:06.511Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","post-training","reasoning","research","sft","slm","transformers","trl"],"created_at":"2025-10-11T05:25:10.510Z","updated_at":"2025-10-11T05:25:11.807Z","avatar_url":"https://github.com/Shekswess.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg width=\"1536\" height=\"1024\" alt=\"image\" 
src=\"https://github.com/user-attachments/assets/5f453496-8180-4cf4-94da-26ebbe1159d4\" /\u003e\n\u003c/p\u003e\n\n# Tiny Reasoning Language Model (trlm)\n\nTiny Reasoning Language Model (trlm) is an open pipeline for teaching a 135M-parameter SmolLM2 based model to handle step-by-step reasoning. The repository captures every stage of the project: \n- sourcing and curating task-specific datasets\n- supervised fine-tuning\n- preference alignment\n- evaluation. \n\nThe code is mainly designed to run on AMD (ROCm) hardware and to publish intermediate artefacts to the Hugging Face Hub, but can be adapted to other setups.\nThe goal of this project is to demonstrate that even small models can learn to reason with the right training strategy and data.\n\n\u003e [!IMPORTANT]  \n\u003e The current model is not intended to be used in everyday applications. It is a research prototype and should be treated as such, mostly because of its limited capabilities, hallucination tendencies, etc.\n\n## [Hugging Face Collection](https://huggingface.co/collections/Shekswess/tiny-reasoning-language-model-68d924929c17ad8300544ae4)\n\n\u003ca href=\"https://huggingface.co/collections/Shekswess/tiny-reasoning-language-model-68d924929c17ad8300544ae4\"\u003e\n  \u003cimg width=\"912\" height=\"786\" alt=\"image\" src=\"https://github.com/user-attachments/assets/35367115-27dd-4262-9f9a-ddc104f4dc31\" /\u003e\n\u003c/a\u003e\n\n\n## Project Stages\n\n| Stage | Objective | Artefact on Hugging Face | Status |\n| --- | --- | --- | --- |\n| Stage 1 SFT | Teach general intelligence without chain-of-thought | [`Shekswess/trlm-stage-1-sft-final-2`](https://huggingface.co/Shekswess/trlm-stage-1-sft-final-2) | complete (final weights + dataset) |\n| Stage 2 SFT | Introduce structured chain-of-thought reasoning with `\u003cthink\u003e` tags | [`Shekswess/trlm-stage-2-sft-final-2`](https://huggingface.co/Shekswess/trlm-stage-2-sft-final-2) | complete (final weights + dataset) |\n| Stage 3 DPO | Align 
reasoning style with preference data | [`Shekswess/trlm-stage-3-dpo-final-2`](https://huggingface.co/Shekswess/trlm-stage-3-dpo-final-2) | complete (final weights + dataset) |\n\n\n## Post-Training Pipeline\n\u003cimg width=\"1014\" height=\"563\" alt=\"image\" src=\"https://github.com/user-attachments/assets/195ef389-6aa9-4527-b4f0-bea68c0841ae\" /\u003e\n\n## Installation and Usage\n\n```bash\n# Install uv (Linux/macOS)\ncurl -LsSf https://astral.sh/uv/install.sh | sh\n\n# Install uv (Windows - PowerShell)\npowershell -c \"irm https://astral.sh/uv/install.ps1 | iex\"\n\n# Check installation\nuv --version\n\n# create \u0026 sync environment from pyproject.toml\nuv sync\n\n# activate virtualenv (if not auto-activated)\nsource .venv/bin/activate   # Linux/macOS\n.venv\\Scripts\\activate      # Windows PowerShell\n```\n\nOptional extras live in `pyproject.toml` (`cpu`, `rocm`, `dev`). Install ROCm wheels when targeting AMD accelerators.\n\nSet up credentials before training:\n\n```bash\nhuggingface-cli login\nwandb login \n```\n\nCreate a `.env` file if you want to store secrets (loaded automatically by `post_training` scripts).\n\n## Dataset Pipeline\n\nEach stage uses a YAML spec under `data/config/`. 
The builder streams source datasets, cleans them, logs per-source counts, and can push the shuffled result to the Hub together with an auto-generated dataset card.\n\n### Usage\n\n```bash\nuv run data/data_collection.py \\\n --config-path data/config/stage_1.yaml \\\n --output-dir data/artefacts/stage_1 \\\n --upload-to-hub\n```\n\nSome specifics that can be controlled via the config:\n- Streaming download with optional entry caps per source.\n- Column drops / renames (`drop_columns`, `rename_columns`) applied per dataset.\n- Automatic `_source_dataset`, `_source_subset`, `_source_split` traceability columns.\n- Adaptive strategy: JSON chunking for small corpora, parquet shards for \u003e50k rows.\n- Metadata summarising requested vs actual counts and percentage contribution.\n- Optional Hugging Face push with README generation (`generate_dataset_card`).\n\n### Stage 1 - Non-Reasoning SFT Blend ([`Shekswess/trlm-sft-stage-1-final-2`](https://huggingface.co/datasets/Shekswess/trlm-sft-stage-1-final-2))\n\n| Source (subset/split) | Samples | Share |\n| --- | ---: | ---: |\n| HuggingFaceTB/smoltalk2 / smoltalk_smollm3_smol_magpie_ultra_no_think | 33,500 | 57.8% |\n| HuggingFaceTB/smoltalk2 / smoltalk_smollm3_smol_summarize_no_think | 7,500 | 12.9% |\n| HuggingFaceTB/smoltalk2 / smoltalk_smollm3_smol_rewrite_no_think | 7,500 | 12.9% |\n| HuggingFaceTB/smoltalk2 / smoltalk_smollm3_systemchats_30k_no_think | 2,500 | 4.3% |\n| HuggingFaceTB/smoltalk2 / smoltalk_smollm3_explore_instruct_rewriting_no_think | 2,500 | 4.3% |\n| HuggingFaceTB/smoltalk2 / tulu_3_sft_personas_instruction_following_no_think | 2,500 | 4.3% |\n| HuggingFaceTB/smoltalk2 / smoltalk_smollm3_everyday_conversations_no_think | 2,000 | 3.5% |\n\nFocus: everyday conversations, rewrites, summarisation - no chain-of-thought.\n\n### Stage 2 - Reasoning SFT Blend ([`Shekswess/trlm-sft-stage-2-final-2`](https://huggingface.co/datasets/Shekswess/trlm-sft-stage-2-final-2))\n\n| Source (subset/split) | Samples | 
Share |\n| --- | ---: | ---: |\n| HuggingFaceTB/smoltalk2 / Llama_Nemotron_Post_Training_Dataset_reasoning_r1 | 40,200 | 51.5% |\n| HuggingFaceTB/smoltalk2 / OpenThoughts3_1.2M | 20,000 | 25.6% |\n| HuggingFaceTB/smoltalk2 / multi_turn_reasoning_if_think | 10,000 | 12.8% |\n| HuggingFaceTB/smoltalk2 / aya_dataset_Qwen3_32B_think | 5,000 | 6.4% |\n| HuggingFaceTB/smoltalk2 / smoltalk_everyday_convs_reasoning_Qwen3_32B_think | 2,000 | 2.6% |\n| HuggingFaceTB/smoltalk2 / s1k_1.1_think | 800 | 1.0% |\n\nFocus: multi-step reasoning traces with `\u003cthink\u003e` delimiters and explicit thought sections. All sources drop `chat_template_kwargs` to keep a uniform schema.\n\n### Stage 3 - Preference Alignment ([`Shekswess/trlm-dpo-stage-3-final-2`](https://huggingface.co/datasets/Shekswess/trlm-dpo-stage-3-final-2))\n\n| Source | Samples | Notes |\n| --- | ---: | --- |\n| scottgeng00/olmo-3-preference-mix-deltas_reasoning-yolo_scottmix-DECON-chfiltered / train | 50,000 | Drops legacy metadata, renames `dataset` -\u003e `source`. |\n\nFocus: pairwise chosen/rejected reasoning completions for Direct Preference Optimization.\n\n## Training Pipeline\n\nAll training scripts accept a `--config-path` pointing to the relevant YAML. 
Modify hyperparameters by editing the YAML file rather than the Python script.\n\nThe whole training process was run on a single AMD MI300x Instance with these specs:\n1x 192GB MI300x Virtual Machine\nCPU: 8 or 13 cores • RAM: 224GB • Disk: 12288GB NVMe\n\nThe following Docker image was used for all stages:\n\n```\ndocker pull rocm/pytorch:rocm7.0_ubuntu24.04_py3.12_pytorch_release_2.7.1\n```\n\n### Stage 1 - Supervised Fine-Tuning (non-reasoning)\n\n```bash\nuv run post_training/sft.py \\\n --config-path post_training/config/stage_1.yaml\n```\n\n- Initial weights: `HuggingFaceTB/SmolLM2-135M-Instruct` (chat tuned).\n- Dataset: `Shekswess/trlm-sft-stage-1-final-2` (58k dialogues).\n- Chat template: system preamble for `Tiny Reasoning Language Model`, no `\u003cthink\u003e` injection.\n- Training: 3 epochs, `per_device_train_batch_size=32`, grad accumulation 4, cosine LR (`3e-4` peak), `neftune_noise_alpha=0.01`, BF16/gradient checkpointing.\n- Artefacts: checkpoints every 1,500 steps, auto push to `Shekswess/trlm-stage-1-sft-final-2`.\n\n### Stage 2 - Supervised Fine-Tuning (reasoning)\n\n```bash\nuv run post_training/sft.py \\\n --config-path post_training/config/stage_2.yaml\n```\n\n- Initial weights: `Shekswess/trlm-stage-1-sft`.\n- Dataset: `Shekswess/trlm-sft-stage-2-final-2` (78k reasoning traces).\n- Chat template: forces `\u003cthink\u003e...\u003c/think\u003e` segments when the system prompt or data indicates reasoning; adds special tokens to the tokenizer before training.\n- Training: 1 epoch, same batch/accumulation schedule, stronger `neftune_noise_alpha=0.02`, LR=3e-4, cosine decay.\n- Artefacts: saved to `Shekswess/trlm-stage-2-sft-final-2` and pushed on every save.\n\n### Stage 3 - Direct Preference Optimization\n\n```bash\nuv run post_training/dpo.py \\\n --config-path post_training/config/stage_3.yaml\n```\n\n- Initial weights: `Shekswess/trlm-stage-2-sft-final-2`.\n- Dataset: `Shekswess/trlm-dpo-stage-3-final-2` (50k chosen/rejected examples).\n- 
Training: 1 epoch, LR=1e-5 with cosine schedule + floor (`min_lr_rate=0.1`), beta=0.1, `apo_zero` objective, grad norm clipped to 0.2.\n- Artefacts: output dir `outputs/stage_3_final`, auto-push to `Shekswess/trlm-stage-3-dpo-final-2`.\n\n### Environment \u0026 Monitoring\n\n- Mixed precision defaults to BF16.\n- Gradient checkpointing and `dataloader_persistent_workers=true` keep memory in check; adjust worker counts to fit your CPU.\n- WANDB logging is enabled by default (`report_to=\"wandb\"`); set `WANDB_PROJECT` etc. in your environment or `.env`.\n\n## Evaluation \u0026 Results\n\n- Baseline comparisons with `lm-eval-harness`:\n  ```bash\n  uv run lm_eval \\\n  --model_args pretrained=Shekswess/trlm-stage-3-dpo-final-2,trust_remote_code=True,dtype=bfloat16 \\\n  --tasks gsm8k,bbh,arc_challenge,boolq,piqa,ifeval,mmlu \\\n  --apply_chat_template \\\n  --fewshot_as_multiturn \\\n  --batch_size auto \\\n  --output_path ./results_tiny_reasoning\n  ```\n\n| **Benchmark**        | **Tiny Reasoning Language Model (trlm-135M)**  | **SmolLM2-135M-Instruct** | **Improvements** |\n| -------------------- | ---------------------------- | ------------------------- | ---------------------------- |\n| **ARC Challenge**    | **40.61** (avg)              | 37.3 (avg)                | **+3.31**                    |\n| **BBH**              | **36.80** (3-shot)           | 28.2 (3-shot)             | **+8.6**                     |\n| **BoolQ**            | **62.17**                    | –                         | N/A                          |\n| **GSM8K**            | **2.59** (5-shot)            | 1.4 (5-shot)              | **+1.19**                    |\n| **IFEval**           | **35.49** (avg)              | 29.9 (avg)                | **+5.59**                    |\n| **MMLU**             | **34.95**                    | 29.3                      | **+5.65**                    |\n| **PIQA**             | **64.91**                    | 66.3                      | **–1.39**  
                  |\n| **HellaSwag**        | –                            | 40.9                      | N/A                          |\n| **MT-Bench**         | –                            | 19.8                      | N/A                          |\n\n\n## Potential Improvements \u0026 Next Steps\n\nThis project is a research prototype, but there are several directions that could strengthen reasoning performance and robustness:\n\n1. **Scale Up Model Size**  \n   Train larger backbones in the 250M–300M parameter range, or experiment with architectures optimized for reasoning (e.g., mixture-of-experts, deeper attention layers).\n\n2. **Longer or Multi-Epoch Reasoning SFT**  \n   Stage 2 currently runs for a single epoch. Increasing epochs or experimenting with curriculum-style SFT could improve reasoning consistency.\n\n3. **Reinforcement Learning Extensions**  \n   Explore GRPO (Group Relative Policy Optimization) or other reinforcement learning methods to refine step-by-step reasoning fidelity beyond DPO.\n\n4. 
**Continued Pretraining with Reasoning Data**  \n   Pretrain on synthetic or curated reasoning-heavy corpora before alignment stages, to strengthen inductive biases for reasoning traces.\n\n\n## Repository Structure\n\n```\n.\n├── .github\n│   ├── workflows\n│   │   └── uv_ci.yaml\n│   └── dependabot.yml\n├── data\n│   ├── config\n│   │   ├── stage_1.yaml\n│   │   ├── stage_2.yaml\n│   │   └── stage_3.yaml\n│   └── data_collection.py\n├── post_training\n│   ├── config\n│   │   ├── stage_1.yaml\n│   │   ├── stage_2.yaml\n│   │   └── stage_3.yaml\n│   ├── dpo.py\n│   └── sft.py\n├── .gitignore\n├── .pre-commit-config.yaml\n├── .python-version\n├── interesting_examples.md\n├── pyproject.toml\n├── README.md\n└── uv.lock\n```\n\n## Acknowledgements\n- [@HotAisle](https://x.com/HotAisle) for providing the compute resources to train all three stages on an awesome AMD MI300x setup.\n- [@mkurman88](https://x.com/mkurman88) for ideas, feedback and code samples.\n- [HuggingFaceTB team](https://huggingface.co/HuggingFaceTB) for the SmolLM2-135M-Instruct model and the Smoltalk2 dataset collection.\n- [@scottgeng00](https://huggingface.co/scottgeng00) for the OLmO-3-Preference-Mix-Deltas dataset.\n- [@eliebakouchi](https://x.com/eliebakouch) for help with the tokenization.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshekswess%2Ftiny-reasoning-language-model","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshekswess%2Ftiny-reasoning-language-model","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshekswess%2Ftiny-reasoning-language-model/lists"}