{"id":34599492,"url":"https://github.com/pattabhia/dataset-generator","last_synced_at":"2025-12-24T12:08:42.439Z","repository":{"id":326896749,"uuid":"1106565803","full_name":"pattabhia/dataset-generator","owner":"pattabhia","description":"A flexible, template-based dataset generator for creating high-quality training data for enterprise AI and RAG (Retrieval-Augmented Generation) systems.","archived":false,"fork":false,"pushed_at":"2025-12-02T05:30:04.000Z","size":149,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-02T22:25:10.029Z","etag":null,"topics":["dataset-generation","fine-tuning","knowledge-graph","llama","llama-factory","machine-learning","python","python3","rag","training-data","vector-database"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pattabhia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-29T14:05:21.000Z","updated_at":"2025-12-02T05:30:07.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/pattabhia/dataset-generator","commit_stats":null,"previous_names":["pattabhia/dataset-generator"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/pattabhia/dataset-generator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pattabhia%2Fdataset-generator","tags_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/pattabhia%2Fdataset-generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pattabhia%2Fdataset-generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pattabhia%2Fdataset-generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pattabhia","download_url":"https://codeload.github.com/pattabhia/dataset-generator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pattabhia%2Fdataset-generator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28002250,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-24T02:00:07.193Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset-generation","fine-tuning","knowledge-graph","llama","llama-factory","machine-learning","python","python3","rag","training-data","vector-database"],"created_at":"2025-12-24T12:08:40.777Z","updated_at":"2025-12-24T12:08:42.434Z","avatar_url":"https://github.com/pattabhia.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 📦 LLaMAFactory Dataset Generator\n\nA CLI utility for producing high-quality, domain-aware training datasets for LLM fine-tuning. 
Configure your domains in YAML, generate datasets in one command, and keep quality high with validation, deduplication, and statistics files.\n\n## Table of Contents\n- [Features](#features)\n- [Prerequisites](#prerequisites)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Output Structure](#output-structure)\n- [Project Structure](#project-structure)\n- [Adding a New Domain](#adding-a-new-domain)\n- [Development](#development)\n\n## Features\n### Multi-domain configuration\nSelect any domain using CLI arguments without editing code:\n\n```bash\n--domain haiintel_core\n--domain expense\n--domain \u003cyour-custom-domain\u003e\n```\n\nAll domain details (company, agents, regions, currencies, and more) live in `config.yaml`.\n\n### YAML-driven pipeline\nNo Python edits required to add or adjust a domain. Update `config.yaml` and regenerate datasets.\n\n### Dataset quality safeguards\n- Validation for every generated example\n- Automatic deduplication\n- Companion `*_stats.json` files with totals, token estimates, and section breakdowns\n\n### Entity classification\nRule-based keyword classifier to produce meaningful entity labels:\n\n```json\n{\n  \"system\": \"You are HAIIndexer classification module. Classify the given string into one or more entity types.\",\n  \"instruction\": \"What type of entity is Global Invoice for CFO 001?\",\n  \"input\": \"Global Invoice for CFO 001\",\n  \"output\": \"This entity belongs to the following types: Person, Invoice\",\n  \"metadata\": {\n    \"section\": \"entity_classification\",\n    \"classified_as\": [\"Person\", \"Invoice\"],\n    \"possible_labels\": [\"Person\", \"CostCenter\", \"ExpensePolicy\", \"Vendor\", ...]\n  }\n}\n```\n\n## Prerequisites\n- Python 3.8 or higher\n- pip\n- Optional: `jq` for JSON formatting\n\n## Installation\nInstall dependencies directly:\n\n```bash\npip install -r requirements.txt\n```\n\nOr use the Makefile shortcut:\n\n```bash\nmake install\n```\n\n## Usage\n1. 
**Prepare `config.yaml`** — define one or more domains. Example:\n   ```yaml\n   domains:\n     - id: expense\n       company_name: \"\u003cCompany Name\u003e\"\n       agent_name: \"\u003cAgent Name\u003e\"\n       chat_agent_name: \"HAI Expense Agent\"\n       domain_name: \"Expense Management\"\n       kb_label: \"HaiIntel Expense Knowledge Base\"\n       primary_products: [\"HAIExpenseLens\", \"HAIIndexer\"]\n       primary_roles: [\"CFO\", \"Finance Controller\"]\n       primary_regions: [\"Global\", \"UAE\", \"India\"]\n       entity_types: [\"Invoice\", \"Receipt\", \"ExpensePolicy\", \"Vendor\"]\n       expense_doc_types: [\"Invoice\", \"Bill\", \"Receipt\"]\n       currencies: [\"INR\", \"USD\", \"AED\"]\n   ```\n\n2. **Generate datasets**\n   - Direct Python command:\n     ```bash\n     python -m src.cli --config config.yaml --domain expense --out-dir ./training-jsons\n     ```\n   - Using the Makefile:\n     ```bash\n     make generate DOMAIN=expense  # Single domain\n     make generate-all             # All domains in config.yaml\n     ```\n\n## Output Structure\nThe generator writes JSON datasets plus per-section statistics:\n\n```\ntraining-jsons/\n├─ intro-training.json                        # Greetings and introductions\n├─ operator-training.json                     # Operator logic examples\n├─ rag_context_training.json                  # RAG context handling\n├─ entity-classification-training.json        # Entity type classification\n├─ safety_guardrails_training.json            # Safety and guardrails\n├─ hard_negatives_hallucinations.json         # Hard negative examples\n├─ company_kb_training.json                   # Company knowledge base Q\u0026A\n├─ company_kb_no_hallucinations_training.json # Anti-hallucination KB\n├─ business_integration_training.json         # Business integration scenarios\n├─ expense_documents_training.json            # Domain-specific: Expense docs (if configured)\n└─ *_stats.json                               # 
Stats for each dataset above\n```\n\n## Project Structure\n```\ndataset-generator/\n├── config.yaml              # Multi-domain configuration\n├── requirements.txt         # Python dependencies\n├── Makefile                 # Build automation\n├── README.md                # Project overview and usage\n└── src/\n    ├── cli.py               # Command-line interface\n    ├── domain_config.py     # Domain configuration data class\n    ├── factory.py           # Section builder factory\n    ├── generator.py         # Main dataset generator\n    ├── utils.py             # Shared utilities and entity classifier\n    └── sections/            # Section builders (one per training type)\n        ├── base.py\n        ├── intro.py\n        ├── operator.py\n        ├── entity_classification.py\n        ├── rag_context.py\n        ├── safety.py\n        └── ...\n```\n\n## Adding a New Domain\n1. **Edit `config.yaml`** to add a domain entry:\n   ```yaml\n   domains:\n     - id: my_new_domain\n       company_name: \"MyCompany\"\n       agent_name: \"MyAgent\"\n       domain_name: \"My Domain\"\n       entity_types: [\"TypeA\", \"TypeB\"]\n       # ... other configuration\n   ```\n2. **Extend the entity classifier (optional)** — add keyword patterns in `src/utils.py:classify_entity_name()`.\n3. 
**Generate datasets** with `make generate DOMAIN=my_new_domain`.\n\n## Development\n### Code quality tools\nThe project supports common Python tooling:\n\n```bash\npip install -r requirements.txt  # Dev dependencies included\nblack src/\nmypy src/\nruff check src/\npytest\n```\n\n### Design principles\n- **Dependency inversion** — the generator depends on factories rather than concrete builders.\n- **DRY utilities** — shared helpers live in `utils.py`.\n- **Extensibility** — add new JSON schemas or builders with minimal changes.\n- **Testability** — builders are pure functions that return examples, which makes them easy to validate and unit test.\n- **Quality-first** — deduplication, validation, and statistics are built into the generation pipeline.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpattabhia%2Fdataset-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpattabhia%2Fdataset-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpattabhia%2Fdataset-generator/lists"}