{"id":50359292,"url":"https://github.com/rahulsamant37/meta-hackathon","last_synced_at":"2026-05-30T00:03:31.722Z","repository":{"id":350090589,"uuid":"1200489608","full_name":"rahulsamant37/meta-hackathon","owner":"rahulsamant37","description":"A complete, runnable ITSM benchmark environment with 181 deterministic tasks, graded scoring, dense rewards, and a standard API plus baseline runner for structured execution.","archived":false,"fork":false,"pushed_at":"2026-04-08T20:19:53.000Z","size":288,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-08T22:08:41.195Z","etag":null,"topics":["docker","fastapi","grpo","hackathon","openai","openenv-environment","rl-algorithms"],"latest_commit_sha":null,"homepage":"https://huggingface.co/spaces/rahulsamant37/itsm-openenv-benchmark/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rahulsamant37.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-03T13:26:52.000Z","updated_at":"2026-04-08T20:19:57.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/rahulsamant37/meta-hackathon","commit_stats":null,"previous_names":["rahulsamant37/meta-hackathon"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/rahulsamant37/meta-hackathon","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulsamant37%2Fmeta-hackathon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulsamant37%2Fmeta-hackathon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulsamant37%2Fmeta-hackathon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulsamant37%2Fmeta-hackathon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rahulsamant37","download_url":"https://codeload.github.com/rahulsamant37/meta-hackathon/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rahulsamant37%2Fmeta-hackathon/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33675019,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-29T02:00:06.066Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","fastapi","grpo","hackathon","openai","openenv-environment","rl-algorithms"],"created_at":"2026-05-30T00:03:31.642Z","updated_at":"2026-05-30T00:03:31.712Z","avatar_url":"https://github.com/rahulsamant37.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntitle: itsm-openenv-benchmark\ncolorFrom: blue\ncolorTo: green\nsdk: docker\napp_port: 8000\n---\n\n# ITSM OpenEnv Benchmark\n\n[![Python](https://img.shields.io/badge/python-3.11%2B-3776AB?logo=python\u0026logoColor=white)](https://www.python.org/)\n[![FastAPI](https://img.shields.io/badge/FastAPI-0.111%2B-009688?logo=fastapi\u0026logoColor=white)](https://fastapi.tiangolo.com/)\n[![Docker](https://img.shields.io/badge/Docker-ready-2496ED?logo=docker\u0026logoColor=white)](https://www.docker.com/)\n[![Tasks](https://img.shields.io/badge/Benchmark%20Tasks-181-2E7D32)](tasks.jsonl)\n[![Deterministic](https://img.shields.io/badge/Scoring-Deterministic-1565C0)](openenv.yaml)\n\nDeterministic enterprise benchmark for IT Service Management (ITSM) agents.  \nThe environment is designed for reproducible evaluation of multi-step operational workflows across incident handling, SLA updates, problem management, and incident-knowledge linking.\n\n## Why This Repository\n\nMost enterprise agent benchmarks suffer from hidden transitions, non-deterministic scoring, and weak alignment between objectives and backend state.\n\nThis repository addresses those gaps with:\n\n- A canonical SQL seed and a fixed 181-task manifest.\n- Deterministic transition and grading logic.\n- Typed action/observation/reward contracts.\n- End-to-end reproducibility from server start to metrics plots.\n\n## Key Features\n\n- 181 deterministic ITSM tasks.\n- Four task families: incident, incident_sla, problem, incident_knowledge.\n- Dense reward shaping with explicit components.\n- Family-specific deterministic graders with bounded scores.\n- OpenEnv-compatible HTTP surface: health, reset, step, state, web/docs.\n- Baseline inference script with structured START/STEP/END logs.\n- Reproducible analytics artifacts and dashboard generation.\n\n## Benchmark Snapshot\n\n| Metric | Value |\n|---|---:|\n| Total tasks | 181 |\n| Success rate (baseline) | 100.0% |\n| Avg steps per episode | 3.000 |\n| Avg return per episode | 1.467 |\n| Avg terminal reward | 0.651 |\n\nTask-family composition:\n\n| Family | Count |\n|---|---:|\n| incident | 100 |\n| incident_sla | 26 |\n| problem | 10 |\n| incident_knowledge | 45 |\n\n## Repository Structure\n\n```text\n.\n├── itsm_openenv_benchmark/                # Canonical package\n│   ├── models.py                          # Typed contracts\n│   ├── client.py                          # OpenEnv client\n│   ├── environment.py                     # Canonical environment export\n│   └── env/\n│       ├── core.py                        # Transition + reward logic\n│       ├── loaders.py                     # Task/SQL loading\n│       └── graders/                       # Deterministic family graders\n├── server/\n│   ├── app.py                             # FastAPI service\n│   └── env/                               # Compatibility wrappers for legacy imports\n├── itsm/dbs/                              # SQL seeds (canonical + historical snapshots)\n├── assets/                                # Metrics artifacts\n├── scripts/generate_metrics_plot.py       # Plot + summary generator\n├── inference.py                           # Full benchmark inference runner\n├── sample-inference.py                    # Single-task sample runner\n├── tasks.jsonl                            # Canonical task manifest\n├── openenv.yaml                           # Benchmark metadata\n└── README.md\n```\n\nDesign note: root-level compatibility modules are intentionally retained for OpenEnv deployment compatibility while canonical logic now lives under itsm_openenv_benchmark.\n\nArchitecture details: docs/architecture.md\n\n## Installation\n\n### Prerequisites\n\n- Python 3.11+\n- uv (recommended)\n- Docker (optional)\n\n### Option A: Local\n\n```bash\nuv venv\nsource .venv/bin/activate\nuv pip install -r requirements.txt\n\n# Optional: developer tooling (tests, lint, notebook extras)\nuv pip install -e '.[dev]'\n```\n\n### Option B: Docker\n\n```bash\ndocker build -t itsm-openenv -f server/Dockerfile .\ndocker run --rm -p 8000:8000 itsm-openenv\n```\n\n## Quickstart\n\nStart the server locally:\n\n```bash\nuv run uvicorn server.app:app --host 0.0.0.0 --port 8000\n```\n\nHealth check:\n\n```bash\ncurl -s http://localhost:8000/health\n```\n\nReset one task:\n\n```bash\ncurl -s -X POST http://localhost:8000/reset \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"task_id\":\"ITSM-001\"}'\n```\n\nTake one step:\n\n```bash\ncurl -s -X POST http://localhost:8000/step \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"action_type\":\"query\",\"target_id\":\"ITSM-001\",\"payload\":{}}'\n```\n\n## API Contract\n\n### Endpoints\n\n- GET /health\n- POST /reset\n- POST /step\n- GET /state\n- GET /web\n- GET /docs\n\n### Typed Models\n\n- ITSMAction: action envelope with action_type, target_id, payload, reasoning.\n- ITSMObservation: task metadata, snapshots, allowed actions, progress signals.\n- ITSMReward: total reward + decomposed components.\n- ITSMInfo: completion, grader score, violations, state delta.\n\nThe canonical types are defined in itsm_openenv_benchmark/models.py and exposed via compatibility modules for OpenEnv deployment paths.\n\n## Running Baseline Inference\n\n```bash\nexport ENV_BASE_URL=http://localhost:8000\nexport API_BASE_URL=https://router.huggingface.co/v1\nexport MODEL_NAME=Qwen/Qwen2.5-7B-Instruct\nexport HF_TOKEN=\u003cyour_hf_token\u003e\n\nuv run python inference.py\n```\n\nExpected protocol logs:\n\n- [START] task=... env=... model=...\n- [STEP] step=... action=... reward=... done=... error=...\n- [END] success=... steps=... score=... rewards=...\n\n## Reproducing Metrics\n\n```bash\nuv run python scripts/generate_metrics_plot.py \\\n  --tasks tasks.jsonl \\\n  --log full_run.log \\\n  --out assets/metrics.png \\\n  --line-out assets/metrics_line.png \\\n  --summary-out assets/metrics_summary.json\n```\n\nGenerated artifacts:\n\n- assets/metrics.png\n- assets/metrics_line.png\n- assets/metrics_summary.json\n\n## Developer Workflow\n\nCommon commands via Makefile:\n\n```bash\nmake install\nmake dev\nmake run\nmake test\nmake validate\nmake metrics\n```\n\n## Visual Results\n\n![ITSM Benchmark Metrics](assets/metrics.png)\n\nFigure: multi-panel view of return profile, interaction cost, and reward composition by task family.\n\n![ITSM Benchmark Line Chart](assets/metrics_line.png)\n\nFigure: trend comparison across task families for reward and efficiency metrics.\n\n## Determinism and Reproducibility\n\nFor strict reproducibility:\n\n- Keep task ordering fixed.\n- Keep canonical seed and task manifest unchanged.\n- Keep grader logic unchanged across runs.\n- Re-run inference and metrics with the same runtime configuration.\n\n## Quality and Validation\n\nRecommended checks:\n\n```bash\nopenenv validate\nbash pre-validation-script.sh \u003cspace_url\u003e\npytest -q\n```\n\nThe test suite includes a smoke test that validates reset/step behavior with typed actions.\n\n## Notebook Workflow (Optional)\n\nThe notebook itsm_grpo_torchforge.ipynb provides an optional GRPO/QLoRA training workflow for policy fine-tuning and quick evaluation.\n\n## References\n\n- Meta PyTorch OpenEnv: https://github.com/meta-pytorch/OpenEnv\n- Hugging Face OpenEnv Course: https://github.com/huggingface/openenv-course\n- Meta PyTorch TorchForge: https://github.com/meta-pytorch/torchforge\n- FastAPI docs: https://fastapi.tiangolo.com/\n- Transformers: https://github.com/huggingface/transformers\n- TRL: https://github.com/huggingface/trl\n- PEFT: https://github.com/huggingface/peft\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frahulsamant37%2Fmeta-hackathon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frahulsamant37%2Fmeta-hackathon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frahulsamant37%2Fmeta-hackathon/lists"}