{"id":49758671,"url":"https://github.com/pumacp/puma","last_synced_at":"2026-05-16T03:28:43.524Z","repository":{"id":349899325,"uuid":"1204479057","full_name":"pumacp/puma","owner":"pumacp","description":"Reproducible local LLM benchmarking framework for Project Management Office (PMO) tasks","archived":false,"fork":false,"pushed_at":"2026-05-11T00:11:53.000Z","size":2581,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-11T01:36:13.163Z","etag":null,"topics":["benchmarking","llm-evaluation","local-llm","machine-learning","ollama","pmo","project-management","python","reproducible-research","streamlit","sustainability"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pumacp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-08T03:36:46.000Z","updated_at":"2026-05-10T23:57:45.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/pumacp/puma","commit_stats":null,"previous_names":["pumacp/puma"],"tags_count":17,"template":false,"template_full_name":null,"purl":"pkg:github/pumacp/puma","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pumacp%2Fpuma","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pumacp%2Fpuma/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pumacp%2Fpuma/releases",
"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pumacp%2Fpuma/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pumacp","download_url":"https://codeload.github.com/pumacp/puma/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pumacp%2Fpuma/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32965875,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-12T23:30:32.555Z","status":"online","status_checked_at":"2026-05-13T02:00:07.132Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmarking","llm-evaluation","local-llm","machine-learning","ollama","pmo","project-management","python","reproducible-research","streamlit","sustainability"],"created_at":"2026-05-11T01:28:43.050Z","updated_at":"2026-05-16T03:28:43.517Z","avatar_url":"https://github.com/pumacp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PUMA — Project Understanding and Management with Agents\n\n![PUMA Logo](https://raw.githubusercontent.com/pumacp/puma/main/assets/img/PUMA.png)\n\n\u003e **PUMA — Project Understanding and Management with Agents**\n\u003e\n\u003e *Can language models manage ICT projects? 
An empirical benchmark of local LLM agents for issue triage and effort estimation in ICT projects.*\n\nPUMA is a research-driven platform that benchmarks autonomous AI agents on practical project management tasks. This repository contains the **evaluation platform**: orchestrator, scenarios, metrics, sustainability tracking, dashboard, and reproducible specifications. All inference runs locally via [Ollama](https://ollama.ai) — no external API calls, no data leaves your machine.\n\n![Tests](https://img.shields.io/badge/tests-402%20passing-brightgreen)\n![Python](https://img.shields.io/badge/python-3.11%2B-blue)\n![License](https://img.shields.io/badge/license-MIT-lightgrey)\n![Docker](https://img.shields.io/badge/runs%20on-Docker-2496ED)\n![Version](https://img.shields.io/badge/version-v2.6.0-blue)\n\n## Related resources\n\n- 📚 [PUMA Research Vault](https://github.com/pumacp/puma-vault) — Unified knowledge management for the PUMA project\n- 🌐 [Vault Published](https://pumacp.github.io/puma-vault/) — Published knowledge garden\n- 📦 [Releases](https://github.com/pumacp/puma/releases) — All published versions\n- 📋 [Project Index](INDEX.md) — Phase status, releases, debt tracking\n- 🏗️ [Project Overview](docs/overview.md) — Architecture, scenarios, model catalog\n\n## Status\n\n| Item | Value |\n|------|-------|\n| Current release | [v2.6.0](https://github.com/pumacp/puma/releases/tag/v2.6.0) |\n| Tests | 402 passing |\n| Coverage | 58% |\n| CI | ✓ green on main + develop |\n| Phases A, B, C, D, E | ✓ All complete |\n| Debt | 15 resolved; 7 open (0 critical, 5 medium, 2 low) |\n\n\u003e **PUMA is an independent benchmarking framework, fully self-contained, designed specifically for evaluating local LLMs on Project Management Office (PMO) tasks. 
All evaluation methodology, scenarios, and metrics are developed and maintained independently as part of this work.**\n\n---\n\n## Requirements\n\n| Requirement | Minimum |\n|-------------|---------|\n| Docker Engine | 24+ |\n| Docker Compose | v2 |\n| RAM | 8 GB (16 GB recommended) |\n| Disk | 10 GB free (for models + data) |\n| GPU | Optional — NVIDIA (validated). Apple Silicon M3/M4/M5 detected in v2.6.0 (native mode via `./start_puma.sh --native`; empirical validation pending). AMD ROCm not yet detected. See [`docs/MACOS_NOTES.md`](docs/MACOS_NOTES.md) and [`docs/CROSS_ARCH_REPRODUCIBILITY.md`](docs/CROSS_ARCH_REPRODUCIBILITY.md). |\n\n\u003e No Python installation needed on the host. Everything runs inside Docker.\n\n---\n\n## Quickstart\n\n```bash\n# Clone and provision (downloads models, datasets, applies DB schema)\ngit clone \u003crepo-url\u003e \u0026\u0026 cd puma\n./start_puma.sh\n\n# Enable the project's commit-msg hook (strips Co-Authored-By trailers)\ngit config core.hooksPath .githooks\n\n# Run a benchmark (dry-run — no Ollama needed)\ndocker compose run --rm puma_runner puma run specs/runs/smoke_triage.yaml --dry-run\n\n# Run a live benchmark (requires Ollama + model)\ndocker compose run --rm puma_runner puma run specs/runs/smoke_triage.yaml\n\n# Open the dashboard\nopen http://localhost:8501\n```\n\nFor host-only pre-commit setup and the cross-container development\nworkflow, see [docs/CONTRIBUTING.md](docs/CONTRIBUTING.md). 
For the\nreference hardware specification (RTX 2060 Mobile, 6 GB VRAM) and\nreproducibility scope, see [docs/HARDWARE.md](docs/HARDWARE.md).\n\n---\n\n## CLI Reference\n\nAll commands run inside the `puma_runner` container:\n\n```bash\ndocker compose run --rm puma_runner puma \u003ccommand\u003e\n```\n\nOr use the shorthand after `./start_puma.sh`:\n\n```bash\nalias puma='docker compose run --rm puma_runner puma'\n```\n\n### `puma preflight`\n\nDetect hardware, select an execution profile, and check provisioning readiness.\n\n```bash\npuma preflight\npuma preflight --profile cpu-standard     # override auto-detection\npuma preflight --no-write-config          # skip writing config/runtime_profile.yaml\n```\n\nProfiles: `cpu-lite`, `cpu-standard`, `gpu-entry`, `gpu-mid`, `gpu-high`, `auto`.\n\n---\n\n### `puma models`\n\nList or pull models from the catalog.\n\n```bash\npuma models list                   # show all catalog models with size and compatible profiles\npuma models pull qwen2.5:3b        # pull a specific model via Ollama\n```\n\n---\n\n### `puma datasets`\n\nVerify dataset integrity and show statistics.\n\n```bash\npuma datasets verify               # check checksums and row counts for all datasets\n```\n\n---\n\n### `puma run`\n\nExecute a benchmark defined by a run-spec YAML.\n\n```bash\npuma run specs/runs/smoke_triage.yaml\npuma run specs/runs/smoke_triage.yaml --dry-run          # skip Ollama, test pipeline\npuma run specs/runs/smoke_triage.yaml --ollama-host http://puma_ollama:11434\npuma run specs/runs/smoke_triage.yaml --db data/puma.db\n```\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `--dry-run` | false | Build prompts and persist results without calling Ollama |\n| `--ollama-host` | `http://localhost:11434` | Ollama API base URL (env: `OLLAMA_HOST`) |\n| `--db` | `data/puma.db` | SQLite database path |\n\n---\n\n### `puma compare`\n\nCompare metrics across two or more runs.\n\n```bash\npuma compare run_id_1 
run_id_2\npuma compare run_id_1 run_id_2 --output comparison.json\npuma compare run_id_1 run_id_2 run_id_3\n```\n\nOutputs a Markdown table and, for exactly two runs, shows `run2 − run1` differences per metric.\n\n---\n\n### `puma report`\n\nGenerate a Markdown report for a completed run.\n\n```bash\npuma report \u003crun_id\u003e\npuma report \u003crun_id\u003e --format pdf          # convert via Pandoc (if installed)\npuma report \u003crun_id\u003e --db data/puma.db\n```\n\nThe report is written to `results/\u003crun_id\u003e/report.md` and includes: executive summary, metrics table, per-model breakdown, perturbation analysis, sustainability section, and latency percentiles.\n\n---\n\n### `puma dashboard`\n\nLaunch the interactive Streamlit dashboard.\n\n```bash\npuma dashboard\npuma dashboard --port 8502\n```\n\nThe dashboard is also available as a persistent Docker service:\n\n```bash\ndocker compose up -d puma_dashboard\nopen http://localhost:8501\n```\n\n---\n\n### `puma db`\n\nManage the SQLite database schema.\n\n```bash\npuma db migrate               # create or update tables\npuma db status                # show database file size\n```\n\n---\n\n### `puma cache`\n\nManage the inference response cache.\n\n```bash\npuma cache stats              # show entry count and cache size\npuma cache clear              # delete all cached responses\n```\n\n---\n\n### `puma validate-baseline`\n\nReproducibility guard for CI: runs the canonical baseline spec and\nexits 0 only if F1-macro is within tolerance of the expected value.\n\n```bash\npuma validate-baseline                                        # uses defaults\npuma validate-baseline --expected-f1 0.5867 --tolerance 0.01\npuma validate-baseline --spec specs/runs/baseline_triage.yaml\n```\n\nDefaults: `--spec specs/runs/baseline_triage.yaml`,\n`--expected-f1 0.5867`, `--tolerance 0.01`. 
Exit code is non-zero on\ndrift; useful as a release gate before tagging.\n\n---\n\n## Run-Spec Format\n\nEvery benchmark is fully described by a YAML run-spec. Example:\n\n```yaml\nid: my_benchmark_v1\ndescription: \"Triage with few-shot and typo perturbations\"\nscenario: triage_jira           # triage_jira | estimation_tawos | prioritization_jira\nsample_size: 50\nmodels:\n  - qwen2.5:3b\n  - qwen2.5:1.5b\nadaptation:\n  strategy:\n    - zero-shot\n    - few-shot-3\ninference:\n  temperature: 0.0\n  seed: 42\n  max_tokens: 256\n  logprobs: false\nperturbations:\n  - typos_5pct                  # also: case_upper, case_lower, truncate_50pct, tech_noise\nmetrics:\n  - f1_macro\nsustainability:\n  codecarbon: false\nrepeat: 1\n```\n\nRun it:\n\n```bash\npuma run my_benchmark_v1.yaml --dry-run    # validate pipeline\npuma run my_benchmark_v1.yaml              # live run\n```\n\n---\n\n## Scenarios\n\n| Scenario | Task | Dataset | Primary Metric |\n|----------|------|---------|----------------|\n| `triage_jira` | Assign priority (Critical/Major/Minor/Trivial) to a Jira issue | Jira balanced (200 issues) | F1 macro |\n| `estimation_tawos` | Estimate story points (Fibonacci) for a user story | TAWOS (9 020 items) | MAE |\n| `prioritization_jira` | Given two issues A/B, which has higher priority? | Jira pairwise | Accuracy |\n\n---\n\n## Models\n\nThe catalog (`config/models_catalog.yaml`) is the single source of\ntruth for `(model → hardware profile)` compatibility. 
`puma models\nlist` prints the live catalog; the table below summarises the v2.1.0\nstate.\n\n| Model | Params (B) | GGUF (GB) | Compatible profiles | Notes |\n|-------|-----------:|----------:|---------------------|-------|\n| `qwen2.5:0.5b` | 0.5 | 0.4 | cpu-lite, cpu-standard, gpu-entry, gpu-mid, gpu-high | |\n| `qwen2.5:1.5b` | 1.5 | 1.0 | cpu-standard, gpu-entry, gpu-mid, gpu-high | |\n| `qwen2.5:3b` | 3.0 | 1.9 | gpu-entry, gpu-mid, gpu-high | **canonical baseline model** |\n| `qwen2.5:7b` | 7.0 | 4.7 | cpu-lite, cpu-standard, gpu-entry, gpu-mid, gpu-high | |\n| `qwen2.5:14b` | 14.0 | 9.0 | gpu-mid, gpu-high | |\n| `gemma3:1b` | 1.0 | 0.8 | cpu-standard, gpu-entry, gpu-mid, gpu-high | |\n| `gemma3:4b` | 4.0 | 3.3 | gpu-entry, gpu-mid, gpu-high | |\n| `gemma3:12b` | 12.0 | 8.1 | gpu-high | `timeout_s=1800` (B.3 evidence) |\n| `gemma4:e2b` | 2.0 effective / 7.2 full | 7.2 | cpu-standard, gpu-mid, gpu-high | MoE; **excluded from gpu-entry (D18)**: Ollama detokenizer breaks under CPU offload on structured prompts |\n| `gemma4:e4b` | 4.0 effective | 4.0 (est.) | gpu-mid, gpu-high | MoE; gpu-entry exclusion by analogy with e2b |\n| `gemma4:26b-a4b` | 26.0 total / 4.0 active | 16.0 (est.) | gpu-high | MoE |\n| `llama3.1:8b` | 8.0 | 4.9 | gpu-entry, gpu-mid, gpu-high | |\n| `mistral:7b` | 7.0 | 4.4 | gpu-entry, gpu-mid, gpu-high | |\n| `deepseek-r1:7b` | 7.0 | 4.7 | gpu-entry, gpu-mid, gpu-high | reasoning model (`\u003cthink\u003e` blocks); `timeout_s=300` |\n| `deepseek-r1:14b` | 14.0 | 9.0 | gpu-mid, gpu-high | reasoning model |\n\nProfile detection runs automatically via `puma preflight`. Override\nwith `--profile \u003cname\u003e`.\n\n---\n\n## Baseline \u0026 Results\n\n**Canonical baseline:** `qwen2.5:3b` + `contextual-anchoring` + `seed=42`\n+ `temperature=0.0` on `triage_jira` with 200 instances yields\n**F1-macro = 0.5867 ± 0.01** on the reference hardware. Reproducible\nend-to-end via `puma validate-baseline`. 
The spec lives at\n`specs/runs/baseline_triage.yaml`; the tolerance is verified in CI.\n\n**Multi-model evaluation:** 9 models compatible with `gpu-entry`\nevaluated across the 3 PMO scenarios at N=100 instances per cell\n(2,700 inferences; ~67.5 Wh / 11.75 g CO₂ total compute budget). Best\nperformers vary by task; small models (1B–3B) are competitive with\nlarger ones on several PMO scenarios. Full comparative analysis,\nper-scenario tables, sustainability efficiency, and the\n\"60.5 % wasted compute\" finding (gemma4 family on `gpu-entry`,\nresolved as debt D18) in\n[docs/results/phase_b_analysis.md](docs/results/phase_b_analysis.md).\nPlots are reproducible via `scripts/generate_phase_b_plots.py`.\n\n**Statistical analysis (v2.2.0):** calibration via Expected Calibration\nError (Guo et al. 2017) — the canonical baseline shows ECE=0.39 against\nits 200 logprob-enabled predictions, surfacing significant\nmiscalibration typical of out-of-the-box LLMs. Pairwise model\ncomparison via the Wilcoxon signed-rank test (Demšar 2006); see\n[docs/results/wilcoxon_demo.md](docs/results/wilcoxon_demo.md). Three\nseeds {42, 123, 456} confirm bit-exact reproducibility under\ntemperature=0.0\n([docs/results/multi_seed_baseline.md](docs/results/multi_seed_baseline.md)).\n\n**Bias evaluation (v2.2.0):** the `triage_jira` corpus contains 0 %\ngendered terms, so a textbook gender_swap would be a no-op. Sprint 5\nadopts the **signal-injection** methodology of Caliskan et al. (2017)\nand Bolukbasi et al. (2016): identity prefixes (`John Smith reported:`\nvs `Mary Smith reported:`) are prepended to instances and the\nprediction flips counted. 
qwen2.5:3b exhibits ~3× less directional\ngender bias than qwen2.5:1.5b at the same prediction-flip rate; both\nmodels are robust to register variation (`register_shift_informal`).\nFull report in\n[docs/results/bias_evaluation.md](docs/results/bias_evaluation.md).\n\n---\n\n## Prompting Strategies\n\n| Strategy | Key | Description |\n|----------|-----|-------------|\n| Zero-shot | `zero-shot` | Direct question, no examples |\n| Zero-shot CoT | `zero-shot-cot` | Ask model to think step-by-step |\n| One-shot | `one-shot` | Single example |\n| Few-shot (k=3) | `few-shot-3` | Three stratified examples |\n| Few-shot (k=5) | `few-shot-5` | Five stratified examples |\n| Few-shot (k=8) | `few-shot-8` | Eight stratified examples |\n| Chain-of-Thought few-shot | `cot-few-shot` | Few-shot with CoT rationales |\n| RCOIF | `rcoif` | Role + Context + Output + Instruction + Format |\n| Contextual Anchoring | `contextual-anchoring` | Grounds prediction to project context |\n| Self-Consistency | `self-consistency` | Multiple samples + majority vote (requires temperature \u003e 0) |\n| EGI | `egi` | Example-Guided Inference |\n\n---\n\n## Dashboard Views\n\nOpen `http://localhost:8501` after `docker compose up -d puma_dashboard`.\nPUMA-themed sidebar with logo and a runtime dark-mode toggle.\n\n| View | Description |\n|------|-------------|\n| **Overview** | Cohort cards (total runs, total CO₂, kWh, avg ECE, avg F1, avg latency.p95) plus per-run expanders with model / F1 / ECE / CO₂ / parse-fail. Sidebar filters applied. |\n| **Model Comparison** | Mean±std aggregation over seeds, run × metric heatmap, and inline Wilcoxon signed-rank results when `docs/results/wilcoxon_demo.md` is present. |\n| **Reliability** | Per-model ECE and reliability diagram computed from real logprob-derived confidences (Guo et al. 2017). Falls back to a message when no logprob runs exist. |\n| **Robustness** | Disparity / flip-rate table and bar chart for every perturbation in the cohort. 
|\n| **Fairness** | Gender-prefix injection bias (Caliskan et al. 2017) — disparity vs baseline plus directional male-vs-female comparison. |\n| **Sustainability Frontier** | Pareto scatter: F1 vs CO₂ (g) using the `emissions` table populated by CodeCarbon (Sprint 2 D15). |\n| **Instance Drill-down** | Per-prediction inspection: gold/parsed labels, outcome filter (correct / incorrect / parse failure), confidence, top-K logprobs, raw response, prompt hash. |\n\n---\n\n## Development\n\n```bash\nmake build    # build the puma_runner Docker image\nmake lint     # ruff check + format check (src/puma/ and tests/)\nmake test     # run unit + integration tests inside Docker\nmake smoke    # smoke tests (AppTest, no Ollama required)\n```\n\nAll dev tooling runs inside the container. See [CONTRIBUTING.md](CONTRIBUTING.md).\n\n---\n\n## Roadmap\n\nWork identified for releases after v2.3.0:\n\n- **Cross-strategy comparison at scale** — pairwise comparisons\n  across `zero-shot`, `few-shot-{3,5,8}`, and CoT variants on the\n  same model cohort, with the new Wilcoxon driver applied.\n- **Hardware tier extension** — re-run the B.3 sweep on `gpu-mid` to\n  empirically validate `gemma3:12b`, the `gemma4` family, and 14B\n  reasoning models with adequate VRAM (resolves D16 verification).\n- **TAWOS SHA-256 end-to-end** — checksum-verified fetch path for the\n  upstream TAWOS dump (D14) and the bash regeneration test (Gate D\n  criterion 3).\n- **Multi-backend GPU detection** — AMD ROCm and Apple Metal\n  detection alongside the existing NVIDIA path in\n  `puma.preflight.detect`.\n- **`triage_jira` ticket-text persistence (D22)** — populate\n  `instances.input_text` so the Dashboard Instance Drill-down can\n  show the original ticket description rather than the empty-text\n  placeholder.\n\nClosed in v2.3.0 (no longer in roadmap):\n\n- ✓ Phase C polish (Sprint 6 → `app.py` 803→168 LOC refactor +\n  10 polish improvements + guided tour, all five Gate-C criteria 
met)\n\nClosed in v2.2.0:\n\n- ✓ ECE / calibration metrics completion (Sprint 3 → Reliability view)\n- ✓ Multi-seed validation (Sprint 3 → bit-exact under T=0.0)\n- ✓ Bias perturbation suite (Sprint 5 → `gender_swap_prefix`,\n  `register_shift`, `fairness.perturbation_disparity`)\n- ✓ Phase C — Dashboard core (Sprint 4 → 5 functional views)\n\nDetailed debt inventory in\n[docs/known_debt.md](docs/known_debt.md).\n\n---\n\n## Documentation\n\n| Document | Contents |\n|----------|---------|\n| [docs/architecture.md](docs/architecture.md) | Data flow, package map, Docker services, design decisions |\n| [docs/user_guide.md](docs/user_guide.md) | Step-by-step guide: provision → run → compare → report → dashboard |\n| [docs/metrics_reference.md](docs/metrics_reference.md) | Formula for every metric (classification, regression, calibration, efficiency) |\n| [docs/scenarios_reference.md](docs/scenarios_reference.md) | Scenario specs, parse logic, gold label definition |\n| [docs/adding_models.md](docs/adding_models.md) | How to add a model to the catalog |\n| [docs/adding_scenarios.md](docs/adding_scenarios.md) | How to implement a new benchmark scenario |\n| [docs/troubleshooting.md](docs/troubleshooting.md) | Common problems and fixes |\n| [docs/HARDWARE.md](docs/HARDWARE.md) | Reference hardware spec, profile detection, thermal/VRAM observations, CodeCarbon accuracy, reproducibility scope |\n| [docs/results/phase_b_analysis.md](docs/results/phase_b_analysis.md) | Comparative analysis of the 9 models × 3 PMO scenarios sweep |\n| [docs/results/multi_seed_baseline.md](docs/results/multi_seed_baseline.md) | Bit-exact reproducibility across seeds {42, 123, 456} (Sprint 3) |\n| [docs/results/wilcoxon_demo.md](docs/results/wilcoxon_demo.md) | Wilcoxon signed-rank pairwise comparison empirical demo (Sprint 3) |\n| [docs/results/bias_evaluation.md](docs/results/bias_evaluation.md) | Bias evaluation empirical findings (Sprint 5) |\n| [docs/known_debt.md](docs/known_debt.md) | 
Open and resolved technical debt with diagnostic write-ups |\n| [docs/RELEASES/v2.6.0.md](docs/RELEASES/v2.6.0.md) | v2.6.0 release notes |\n| [docs/RELEASES/v2.5.0.md](docs/RELEASES/v2.5.0.md) | v2.5.0 release notes |\n| [docs/RELEASES/v2.4.0.md](docs/RELEASES/v2.4.0.md) | v2.4.0 release notes |\n| [docs/RELEASES/v2.3.0.md](docs/RELEASES/v2.3.0.md) | v2.3.0 release notes |\n| [docs/RELEASES/v2.2.0.md](docs/RELEASES/v2.2.0.md) | v2.2.0 release notes |\n| [docs/RELEASES/v2.1.0.md](docs/RELEASES/v2.1.0.md) | v2.1.0 release notes |\n| [docs/anexo_F_cli_reference.md](docs/anexo_F_cli_reference.md) | Anexo F: CLI command catalog (source of truth) |\n| [docs/MACOS_NOTES.md](docs/MACOS_NOTES.md) | macOS operational modes — Docker (CPU) vs native Ollama (Metal); v2.6.0 native mode |\n| [docs/CROSS_ARCH_REPRODUCIBILITY.md](docs/CROSS_ARCH_REPRODUCIBILITY.md) | x86_64 ↔ arm64 reproducibility — open question, theoretical expectations, testing protocol |\n| [docs/CATALOG_HISTORY.md](docs/CATALOG_HISTORY.md) | Models-catalog version history |\n| [docs/baseline_references.md](docs/baseline_references.md) | Canonical empirical baselines for `validate-baseline` |\n| [docs/TESTING.md](docs/TESTING.md) | Test layout, markers, per-module coverage breakdown |\n| [CONTRIBUTING.md](CONTRIBUTING.md) | Code conventions, commit format, PR process |\n| [docs/CONTRIBUTING.md](docs/CONTRIBUTING.md) | Host-only pre-commit setup, hooks |\n| [CHANGELOG.md](CHANGELOG.md) | Version history |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpumacp%2Fpuma","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpumacp%2Fpuma","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpumacp%2Fpuma/lists"}