{"id":51281859,"url":"https://github.com/aliivaezii/math-llm-poc","last_synced_at":"2026-06-30T02:05:56.622Z","repository":{"id":358945231,"uuid":"1243835800","full_name":"aliivaezii/math-llm-poc","owner":"aliivaezii","description":"Proof-of-concept arithmetic LLM: synthetic data, SFT training, REST API, Docker deployment","archived":false,"fork":false,"pushed_at":"2026-05-19T18:16:49.000Z","size":4640,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-19T21:21:04.898Z","etag":null,"topics":["deep-learning","docker","fastapi","python","pytorch","transformer"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aliivaezii.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-19T17:58:35.000Z","updated_at":"2026-05-19T18:16:53.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/aliivaezii/math-llm-poc","commit_stats":null,"previous_names":["aliivaezii/math-llm-poc"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/aliivaezii/math-llm-poc","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aliivaezii%2Fmath-llm-poc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aliivaezii%2Fmath-llm-poc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aliivaezii%2Fmath-ll
m-poc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aliivaezii%2Fmath-llm-poc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aliivaezii","download_url":"https://codeload.github.com/aliivaezii/math-llm-poc/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aliivaezii%2Fmath-llm-poc/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34949256,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-30T02:00:05.919Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","docker","fastapi","python","pytorch","transformer"],"created_at":"2026-06-30T02:05:53.690Z","updated_at":"2026-06-30T02:05:56.613Z","avatar_url":"https://github.com/aliivaezii.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Math-LLM-PoC\n\n## 1. Project Overview\n\nA proof-of-concept pipeline for training and serving a small decoder-only\nTransformer that performs integer arithmetic from scratch. The system covers\nthe full ML lifecycle: synthetic data generation, supervised fine-tuning,\noffline evaluation, and REST inference. Built using only PyTorch, FastAPI, and the\nPython standard library. 
Every stage runs on CPU and is independently\nexecutable from a clean clone.\n\n### Results\n\n| Metric | Value |\n|---|---|\n| Exact-match accuracy | **98.44%** |\n| Addition accuracy | **98.96%** |\n| Subtraction accuracy | **97.92%** |\n| Hallucination rate | **0.00%** |\n| Infinite generation | **0.00%** |\n| Out-of-range predictions | **0.00%** |\n\n---\n\n## 2. Repository Structure\n\n```\nmath-llm-poc/\n├── artifacts/\n│   ├── model.pt               # Trained model state_dict (4.1 MB, 1.07M params)\n│   └── training_logs.txt      # Epoch-level loss, accuracy, and LR log (60 epochs)\n├── data/\n│   ├── train.csv              # ~40,000 training equations (stratified by answer length)\n│   ├── val.csv                # ~5,000 validation equations\n│   └── test.csv               # ~5,000 held-out test equations\n├── docs/\n│   ├── system_design.md       # Architecture, SFT/RL design, metrics, drift strategy\n│   ├── audit_report.md        # Full improvement history and compliance tables\n│   ├── accuracy_phases.png    # Bar chart: accuracy across training phases\n│   ├── accuracy_by_length.png # Bar chart: accuracy by answer digit length\n│   └── loss_curve.png         # Line chart: train/val loss over 60 epochs\n├── src/\n│   ├── generate_dataset.py    # Synthetic dataset generator (stratified sampling)\n│   ├── tokenizer.py           # Character-level tokenizer (vocab size 16)\n│   ├── model.py               # TinyDecoderLM: decoder-only Transformer definition\n│   ├── train.py               # SFT training loop with AdamW and checkpoint saving\n│   ├── evaluate.py            # Greedy decoding, exact-match accuracy, hallucination report\n│   └── api.py                 # FastAPI server: /health and /predict endpoints\n├── tools/\n│   └── plot_metrics.py        # Generates the three charts under docs/\n├── Dockerfile                 # Single-stage CPU-only inference image\n├── requirements.txt           # Pinned dependencies (CPU torch wheel)\n├── .dockerignore    
          # Excludes data/, logs, caches from image\n├── .gitignore                 # Python + ML defaults\n└── README.md\n```\n\n---\n\n## 3. Quick Verification\n\nThe trained checkpoint is committed. No retraining is required to verify the results.\n\n```bash\n# 1. Evaluate the committed model against the held-out test set (~30 seconds on CPU)\npython src/evaluate.py\n\n# 2. Start the REST API locally and query it\nuvicorn src.api:app --port 8000\ncurl -s -X POST http://localhost:8000/predict \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"equation\": \"951+11=\"}' | python -m json.tool\n\n# 3. Build and run the Docker image (requires Docker Desktop)\ndocker build -t math-llm . \u0026\u0026 docker run --rm -p 8000:8000 math-llm\n```\n\n---\n\n## 4. Setup (Full Reproduction)\n\n```bash\npython -m venv .venv\nsource .venv/bin/activate          # Windows: .venv\\Scripts\\activate\npip install -r requirements.txt\n```\n\n\u003e **Note:** `requirements.txt` pulls the CPU-only PyTorch wheel (~200 MB) from\n\u003e `download.pytorch.org/whl/cpu`. Installation takes 2–3 minutes on a fresh\n\u003e environment.\n\n---\n\n## 5. Usage\n\nRun all commands from the repository root.\n\n### a. Generate dataset\n\n```bash\npython src/generate_dataset.py\n```\n\nWrites `data/train.csv`, `data/val.csv`, and `data/test.csv`. The generator\nuses stratified sampling by answer digit length so that short answers\n(1- and 2-digit) are well represented in training — a critical fix for\nsmall-number accuracy. Dataset is balanced 50/50 across `+` and `-`,\ndeterministic at seed 42.\n\n### b. Train\n\n```bash\npython src/train.py\n```\n\nTrains for up to 60 epochs on `data/train.csv`, validates each epoch against\n`data/val.csv`, and saves the best checkpoint to `artifacts/model.pt`.\nThe LR schedule uses a 5-epoch linear warmup (1e-4 → 1e-3), a flat phase\nthrough epoch 40, then cosine decay to 1e-5. 
Early stopping (patience=8)\nrestores the best checkpoint before writing the final artifact. Expected\nruntime: ~25–30 minutes on CPU. Example log:\n\n```\nepoch  train_loss  val_loss  val_token_acc          lr\n------------------------------------------------------\n    1      2.0156    1.8037         0.3065    1.00e-04\n    5      1.6614    1.6171         0.3696    1.00e-03\n   10      1.3515    1.2696         0.5035    1.00e-03\n   40      1.2306    1.2094         0.5239    1.00e-03\n   55      1.2114    1.2061         0.5254    1.55e-04\n   60      1.2095    1.2062         0.5253    1.00e-05\n```\n\n### c. Evaluate\n\n```bash\npython src/evaluate.py\n```\n\nRuns greedy decoding over `data/test.csv` and prints a report covering\nexact-match accuracy (overall and per-operation), hallucination rate,\ninfinite-generation rate, and out-of-range predictions.\n\nCurrent results on the 5,006-sample test set:\n\n```\nexact_match_accuracy  : 0.9844  (4928 / 5006)\n[+] exact_match       : 0.9896  (2480 / 2506)\n[-] exact_match       : 0.9792  (2448 / 2500)\nhallucinations        : 0.0000  (   0 / 5006)\ninfinite_generation   : 0.0000  (   0 / 5006)\nout_of_range          : 0.0000  (   0 / 5006)\n```\n\nAccuracy by answer digit length:\n\n| Answer length | Accuracy |\n|---|---|\n| 1-digit | 99.06% (105/106) |\n| 2-digit | 95.57% (669/700) |\n| 3-digit | 98.67% (2960/3000) |\n| 4-digit | 99.50% (1194/1200) |\n\n### Visual Summary\n\nThree charts are committed under `docs/` and generated by `tools/plot_metrics.py`.\n\n**Accuracy progression across training phases** — shows the jump from 4.22% (baseline)\nto 86.00% (T2-A) to 98.44% (final model with stratified data and LR schedule).\n\n![Accuracy by phase](docs/accuracy_phases.png)\n\n**Accuracy by answer digit length** — breaks down the final model's exact-match rate\nfor 1-, 2-, 3-, and 4-digit answers, demonstrating that the stratified dataset\neliminated the previous near-zero performance on short answers.\n\n![Accuracy 
by answer length](docs/accuracy_by_length.png)\n\n**Train vs val loss — 60-epoch run** — full loss curve with LR phase shading\n(warmup / flat / cosine decay) and the best-checkpoint marker at epoch 55.\n\n![Train vs val loss](docs/loss_curve.png)\n\n### d. Run API locally\n\n```bash\nuvicorn src.api:app --host 0.0.0.0 --port 8000\n```\n\nThe server loads `artifacts/model.pt` at startup. Visit\n`http://localhost:8000/docs` for the auto-generated OpenAPI UI.\n\n---\n\n## 6. Docker\n\n### Build\n\n```bash\ndocker build -t math-llm .\n```\n\n\u003e Requires `artifacts/model.pt` to exist. The committed checkpoint satisfies\n\u003e this; run `train.py` first only if the file has been removed.\n\n### Run\n\n```bash\ndocker run --rm -p 8000:8000 math-llm\n```\n\n---\n\n## 7. API Reference\n\n### `GET /health`\n\nReturns server and model status.\n\n```bash\ncurl http://localhost:8000/health\n```\n\n```json\n{\n  \"status\": \"ok\",\n  \"model_loaded\": true\n}\n```\n\n### `POST /predict`\n\nAccepts an arithmetic equation (must end with `=`) and returns the model's\npredicted answer.\n\n**Request body**\n\n| Field    | Type   | Constraint                        |\n|----------|--------|-----------------------------------|\n| equation | string | Matches `^\\d{1,3}[+-]\\d{1,3}=$`  |\n\n```bash\ncurl -X POST http://localhost:8000/predict \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"equation\": \"951+11=\"}'\n```\n\n**Response**\n\n```json\n{\n  \"equation\": \"951+11=\",\n  \"predicted_answer\": \"962\",\n  \"full_output\": \"951+11=962\",\n  \"latency_ms\": 36.82\n}\n```\n\n**Error responses**\n\n| Status | Cause |\n|--------|-------|\n| 422    | Equation fails regex validation (wrong format, 4-digit operand, missing `=`) |\n| 503    | Model checkpoint failed to load at startup |\n\n---\n\n## 8. 
Engineering Assumptions\n\n- **Operand range is fixed at [0, 999].** The model is not expected to\n  generalise to larger numbers; out-of-range inputs are rejected at the API\n  boundary by the Pydantic validator.\n- **Subtraction results are non-negative.** The dataset enforces `op1 ≥ op2`,\n  keeping the answer vocabulary purely numeric and avoiding a sign token.\n- **Character-level tokenisation is sufficient.** With a vocabulary of 16\n  tokens (digits, operators, `=`, three specials), a subword tokeniser adds\n  complexity without benefit for this fixed domain.\n- **CPU-only training and inference.** The model (1.07M parameters) trains in\n  ~25–30 minutes on a modern laptop CPU and serves requests in ~35 ms.\n- **The model checkpoint is committed to the repository.** This allows\n  `docker build` to run without a training step and keeps the artefact\n  co-located with the code for assessment purposes. In production, the\n  checkpoint would live in a model registry.\n- **No authentication or rate limiting on the API.** The server is intended\n  for local and container-internal use. Adding an API key header or a reverse\n  proxy is the first hardening step before any public exposure.\n- **Greedy decoding is the only inference strategy.** Beam search or\n  temperature sampling would add complexity for no measurable gain on a\n  deterministic arithmetic task.\n- **`max_new_tokens=8` is sufficient for all valid answers.** The longest\n  possible answer is `1998` (4 digits + `\u003cEOS\u003e` = 5 tokens), leaving 3 tokens\n  of headroom.\n\n---\n\n## 9. Known Limitations\n\n- **No generalisation beyond training range.** Operands outside [0, 999] are\n  rejected at the API level; operands inside the range but rare in training\n  (e.g., boundary values) may produce incorrect answers.\n- **Carry/borrow cascade edge cases.** The model still fails on certain\n  multi-step carry/borrow patterns such as `99+1=100` and `100-99=1`. 
These\n  require consecutive digit-position interactions that are rare in the training\n  distribution; targeted oversampling of these patterns or increased model\n  capacity would reduce the remaining gap.\n- **No RLHF or outcome-supervised training.** The model is SFT-only. Errors\n  are not penalised beyond next-token loss, so the model can generate plausible\n  but wrong sequences with high confidence.\n- **Sequential inference.** The API handles one request at a time on a single\n  CPU thread. Concurrent requests queue behind the GIL and the synchronous\n  FastAPI route handler.\n- **No model versioning.** `artifacts/model.pt` is overwritten on each\n  training run. Retraining without branching destroys the previous checkpoint.\n\n---\n\n## 10. Future Improvements\n\n- **Hard-case oversampling.** Explicitly oversample carry/borrow cascade\n  patterns (e.g., `X9+1`, `X00-X99`) in the training dataset to close the\n  remaining accuracy gap on those edge cases.\n- **GRPO/PPO fine-tuning.** Add an RL stage after SFT using a rule-based\n  verifier (exact-match reward) to optimise for answer correctness rather than\n  token likelihood. 
See `docs/system_design.md` § 5 for the plug-in design.\n- **Extend to multiplication and division.** Requires adding `*` and `/`\n  tokens to the vocabulary and regenerating the dataset; the model architecture\n  and training loop are unchanged.\n- **Beam search decoding.** A width-2 or width-4 beam could improve accuracy\n  on the remaining carry/borrow cases at negligible latency cost at this model size.\n- **Async FastAPI routes + request batching.** Replace the synchronous route\n  handler with `async def` and a batching queue to serve concurrent requests\n  efficiently.\n- **Model registry integration.** Replace the committed checkpoint with a\n  versioned artefact store (MLflow, W\u0026B, or S3 + DVC) so training runs are\n  tracked and rollback is possible.\n- **Structured JSON logging.** Emit per-request logs as JSON to stdout so a\n  log aggregator can compute rolling latency percentiles, hallucination rates,\n  and answer-distribution histograms without code changes.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faliivaezii%2Fmath-llm-poc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faliivaezii%2Fmath-llm-poc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faliivaezii%2Fmath-llm-poc/lists"}