{"id":34776906,"url":"https://github.com/cloudwithax/autolearn-llm","last_synced_at":"2026-05-21T12:33:37.953Z","repository":{"id":326300005,"uuid":"1104973350","full_name":"cloudwithax/autolearn-llm","owner":"cloudwithax","description":null,"archived":false,"fork":false,"pushed_at":"2025-11-27T01:30:36.000Z","size":57,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-29T18:51:39.979Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cloudwithax.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-27T00:40:23.000Z","updated_at":"2025-11-27T01:30:40.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/cloudwithax/autolearn-llm","commit_stats":null,"previous_names":["cloudwithax/autolearn-llm"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/cloudwithax/autolearn-llm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudwithax%2Fautolearn-llm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudwithax%2Fautolearn-llm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudwithax%2Fautolearn-llm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudwithax%2Fautolearn-llm/manifests","owner_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cloudwithax","download_url":"https://codeload.github.com/cloudwithax/autolearn-llm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudwithax%2Fautolearn-llm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33300849,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-21T12:23:38.849Z","status":"ssl_error","status_checked_at":"2026-05-21T12:22:11.673Z","response_time":62,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-25T08:35:06.372Z","updated_at":"2026-05-21T12:33:37.947Z","avatar_url":"https://github.com/cloudwithax.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AutoLearn-LLM: FP8 GRPO Training\n\nTrain LLMs with Group Relative Policy Optimization (GRPO) using FP8 precision on consumer GPUs.\n\n**Now with execution-based code rewards!** Train models that actually pass tests, not just pattern-match.\n\n## Why FP8 GRPO?\n\n| Feature | Benefit |\n|---------|---------|\n| **60% less VRAM** | Train larger models on consumer GPUs |\n| **1.4x faster inference** | vLLM FP8 kernels via TorchAO |\n| **96% inference** | Training overhead is only 4% |\n| **Memory sharing** | vLLM and training share weight buffers |\n\n### VRAM Requirements (Approximate)\n\n| Model | BF16 | FP8 
|\n|-------|------|-----|\n| Qwen3-1.7B | 8GB | **5GB** |\n| Llama-3.2-3B | 12GB | **8GB** |\n| Qwen3-8B | 24GB | **16GB** |\n| Qwen3-14B | 40GB | **24GB** |\n\n## Quick Start\n\n### 1. Install Dependencies\n\n```bash\n# Create virtual environment\npython -m venv .venv\n.venv\\Scripts\\activate  # Windows\n# source .venv/bin/activate  # Linux/Mac\n\n# Install base packages\npip install unsloth vllm trl transformers datasets pyyaml\n\n# Install FP8 support (CUDA 12.8)\npip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu128 --force-reinstall\npip install --pre fbgemm-gpu fbgemm-gpu-genai --index-url https://download.pytorch.org/whl/cu128 --force-reinstall\npip install --upgrade numba numpy\n```\n\n### 2. Configure Training\n\nEdit `config.yaml`:\n\n```yaml\nmodel:\n  name: \"unsloth/Qwen3-1.7B\"  # Change based on your VRAM\n  load_in_fp8: true\n\ntraining:\n  num_generations: 4\n  learning_rate: 5.0e-6\n  \ndataset:\n  name: \"openai/gsm8k\"\n  max_samples: 1000\n```\n\n### 3. Train\n\n```bash\npython train_fp8_grpo.py --config config.yaml\n```\n\nOr with command-line overrides:\n\n```bash\npython train_fp8_grpo.py \\\n    --model unsloth/Qwen3-4B \\\n    --dataset openai/gsm8k \\\n    --max_samples 500\n```\n\n### 4. Inference\n\n```bash\n# Interactive mode\npython inference.py --model ./outputs/merged --mode interactive\n\n# Benchmark mode  \npython inference.py --model ./outputs/merged --mode benchmark\n```\n\n## How GRPO Works\n\nGRPO (Group Relative Policy Optimization) is DeepSeek's RL algorithm:\n\n1. **Generate** multiple candidate completions per prompt\n2. **Score** each completion with reward functions\n3. **Rank** completions within each group (relative rewards)\n4. 
**Update** policy to favor higher-ranked completions\n\n```\nPrompt: \"What is 2 + 2?\"\n  ├── Completion A: \"4\" → reward: 1.0\n  ├── Completion B: \"The answer is 4\" → reward: 0.9  \n  ├── Completion C: \"22\" → reward: 0.0\n  └── Completion D: \"2+2=4\" → reward: 0.8\n\nPolicy update: Increase P(A), P(B), P(D); Decrease P(C)\n```\n\n## Reward Functions\n\nBuilt-in reward functions in `rewards.py`:\n\n| Function | Description |\n|----------|-------------|\n| `correctness` | Checks if extracted answer matches ground truth |\n| `format` | Rewards step-by-step reasoning structure |\n| `reasoning` | Rewards appropriate length (not too short/long) |\n| `xml_format` | Rewards DeepSeek-R1 style `\u003cthink\u003e` tags |\n| `combined` | Weighted combination of above |\n\n### Custom Rewards\n\n```python\ndef my_reward(prompts, completions, **kwargs):\n    rewards = []\n    for completion in completions:\n        # Your scoring logic\n        score = 1.0 if \"correct\" in completion else 0.0\n        rewards.append(score)\n    return rewards\n```\n\n## Project Structure\n\n```\nautolearn-llm/\n├── train_fp8_grpo.py   # Math/reasoning GRPO training\n├── train_code_grpo.py  # Code generation GRPO training\n├── rewards.py          # Math reward functions\n├── code_rewards.py     # Execution-based code rewards\n├── eval_code.py        # Benchmark evaluation (HumanEval, MBPP)\n├── inference.py        # Model inference/testing\n├── config.yaml         # Math training config\n├── code_config.yaml    # Code training config\n├── requirements.txt    # Python dependencies\n├── examples/\n│   └── custom_tests.json  # Example test cases\n└── README.md\n```\n\n## Code Training (NEW!)\n\n### Execution-Based Rewards\n\nUnlike pattern-matching rewards, these use **actual code execution**:\n\n| Reward | Signal | Weight |\n|--------|--------|--------|\n| `test_pass` | Unit tests pass | 0.50 |\n| `execution` | Code runs without error | 0.20 |\n| `syntax` | Valid Python AST | 0.15 |\n| 
`lint` | Ruff linter score | 0.10 |\n| `complexity` | Low cyclomatic complexity | 0.05 |\n| `type_safety` | Mypy compliance | (optional) |\n| `performance` | Execution time delta | (optional) |\n\n### Train on Code\n\n```bash\n# HumanEval\npython train_code_grpo.py --model unsloth/Qwen3-1.7B --dataset humaneval\n\n# MBPP\npython train_code_grpo.py --model unsloth/Qwen3-4B --dataset mbpp\n\n# With custom config\npython train_code_grpo.py --config code_config.yaml\n```\n\n### Evaluate\n\n```bash\n# HumanEval pass@1\npython eval_code.py --model ./outputs/code/merged --benchmark humaneval\n\n# pass@10 (sample 10 solutions)\npython eval_code.py --model ./outputs/code/merged --benchmark humaneval --n_samples 10\n\n# Custom test suite\npython eval_code.py --model ./outputs/code/merged --benchmark custom --test_file examples/custom_tests.json\n```\n\n## SWE-Bench \u0026 Terminal-Bench Training\n\n### Real-World Benchmarks\n\n| Benchmark | Task | Size | Reward |\n|-----------|------|------|--------|\n| **SWE-Bench Lite** | Fix GitHub issues | 300 | Patch similarity + format |\n| **SWE-Bench Verified** | Fix GitHub issues | 500 | Patch similarity + format |\n| **Terminal-Bench** | Terminal tasks | 100+ | Command execution + safety |\n\n### Train on SWE-Bench\n\n```bash\n# SWE-Bench Lite (easier, 300 samples)\npython train_bench_grpo.py --benchmark swe-lite --max_samples 100\n\n# SWE-Bench Verified (harder, human-verified)\npython train_bench_grpo.py --benchmark swe-verified --max_samples 100\n```\n\n### Train on Terminal-Bench\n\n```bash\npython train_bench_grpo.py --benchmark terminal --max_samples 25\n```\n\n### Reward Signals\n\n**SWE-Bench rewards:**\n- `format` (0.3) — Valid unified diff patch\n- `similarity` (0.5) — Similar to gold patch\n- `files` (0.2) — Targets correct files\n\n**Terminal-Bench rewards:**\n- `format` (0.2) — Extractable commands\n- `safety` (0.3) — No dangerous patterns (rm -rf /, etc.)\n- `execution` (0.5) — Commands run successfully\n\n## 
Tips\n\n1. **Start small**: Run with `max_samples: 100` first to confirm the pipeline works end to end before a full run\n2. **Monitor rewards**: Watch for reward hacking, such as the `format` reward rising while `correctness` stays flat\n3. **Adjust weights**: Tune reward weights in `config.yaml`\n4. **Log to wandb**: Set `report_to: \"wandb\"` to log training metrics to Weights \u0026 Biases\n\n## References\n\n- [Unsloth FP8 RL Docs](https://docs.unsloth.ai/new/fp8-reinforcement-learning)\n- [DeepSeek R1 Paper](https://arxiv.org/abs/2501.12948)\n- [TRL GRPO Trainer](https://huggingface.co/docs/trl/main/en/grpo_trainer)\n- [TorchAO FP8](https://github.com/pytorch/ao/blob/main/torchao/float8/README.md)\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcloudwithax%2Fautolearn-llm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcloudwithax%2Fautolearn-llm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcloudwithax%2Fautolearn-llm/lists"}