{"id":30910248,"url":"https://github.com/copyleftdev/specmint","last_synced_at":"2026-04-19T02:01:21.684Z","repository":{"id":310393261,"uuid":"1039684513","full_name":"copyleftdev/specmint","owner":"copyleftdev","description":"🎯 Production-ready synthetic dataset generator with local LLM integration. Create realistic, schema-compliant test data for healthcare, fintech, and e-commerce applications. Privacy-first with deterministic generation.","archived":false,"fork":false,"pushed_at":"2025-08-18T05:49:00.000Z","size":8957,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-09T18:46:30.630Z","etag":null,"topics":["cli-tool","dataset-generator","deterministic","ecommerce","fintech","golang","healthcare","json-schema","llm-integration","ollama","privacy-first","synthetic-data","test-data"],"latest_commit_sha":null,"homepage":"https://specmint-n7.vercel.app/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/copyleftdev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"docs/SECURITY_AUDIT_REPORT.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-17T19:05:12.000Z","updated_at":"2025-08-28T10:38:01.000Z","dependencies_parsed_at":"2025-08-17T21:14:15.324Z","dependency_job_id":"6d0aef81-c7ca-401c-90c1-7d6860a0b62a","html_url":"https://github.com/copyleftdev/specmint","commit_stats":null,"previous_names":["copyleftdev/specmint"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/copyleftdev/specmint","repository_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/copyleftdev%2Fspecmint","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/copyleftdev%2Fspecmint/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/copyleftdev%2Fspecmint/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/copyleftdev%2Fspecmint/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/copyleftdev","download_url":"https://codeload.github.com/copyleftdev/specmint/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/copyleftdev%2Fspecmint/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31991720,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T20:23:30.271Z","status":"online","status_checked_at":"2026-04-19T02:00:07.110Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli-tool","dataset-generator","deterministic","ecommerce","fintech","golang","healthcare","json-schema","llm-integration","ollama","privacy-first","synthetic-data","test-data"],"created_at":"2025-09-09T17:03:28.555Z","updated_at":"2026-04-19T02:01:21.670Z","avatar_url":"https://github.com/copyleftdev.png","language":"Go","readme":"# SpecMint: Synthetic Dataset Generator\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"media/specmint.png\" alt=\"SpecMint Logo\" 
width=\"200\"/\u003e\n\u003c/div\u003e\n\n[![CI/CD Pipeline](https://github.com/copyleftdev/specmint/actions/workflows/ci.yml/badge.svg)](https://github.com/copyleftdev/specmint/actions/workflows/ci.yml)\n[![Security Audit](https://github.com/copyleftdev/specmint/actions/workflows/security.yml/badge.svg)](https://github.com/copyleftdev/specmint/actions/workflows/security.yml)\n[![Go Report Card](https://goreportcard.com/badge/github.com/copyleftdev/specmint)](https://goreportcard.com/report/github.com/copyleftdev/specmint)\n[![codecov](https://codecov.io/gh/copyleftdev/specmint/branch/main/graph/badge.svg)](https://codecov.io/gh/copyleftdev/specmint)\n[![Go Version](https://img.shields.io/badge/Go-1.25.0-blue.svg)](https://golang.org/)\n[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](LICENSE)\n[![GitHub release](https://img.shields.io/github/release/copyleftdev/specmint.svg)](https://github.com/copyleftdev/specmint/releases)\n[![GitHub stars](https://img.shields.io/github/stars/copyleftdev/specmint.svg)](https://github.com/copyleftdev/specmint/stargazers)\n[![GitHub issues](https://img.shields.io/github/issues/copyleftdev/specmint.svg)](https://github.com/copyleftdev/specmint/issues)\n\n**SpecMint** is an intelligent synthetic dataset generator that transforms business scenarios into realistic datasets. 
Instead of manually configuring schemas and record counts, simply describe your business context (e.g., \"500-bed hospital\", \"community bank with 12 branches\") and SpecMint automatically calculates realistic record counts, relationships, and generates comprehensive datasets.\n\n## 🎯 Population-Based Intelligence\n\nSpecMint's breakthrough feature is **population-based simulation** - analyze real-world business scenarios and automatically generate realistic datasets:\n\n```bash\n# Hospital simulation - automatically calculates patients, claims, prescriptions, etc.\n./bin/specmint simulate --population \"100-bed regional hospital\" --execute --output ./hospital-data\n\n# Banking simulation - generates customers, accounts, transactions, loans\n./bin/specmint simulate --population \"community bank with 5 branches\" --execute --output ./bank-data\n\n# E-commerce simulation - creates users, products, orders, reviews\n./bin/specmint simulate --population \"e-commerce platform with 50K users\" --execute --output ./ecommerce-data\n\n# Retail simulation - generates stores, products, customers, inventory\n./bin/specmint simulate --population \"retail chain with 10 stores\" --execute --output ./retail-data\n```\n\n## 🚀 Traditional Schema-Based Generation\n\n```bash\n# Generate specific record types with custom counts\n./bin/specmint generate -s test/schemas/ecommerce/products.json -o output -c 1000\n\n# Generate healthcare claims with LLM enrichment\n./bin/specmint generate -s test/schemas/medical/healthcare-claims-837.json -o claims --count 100 --llm-mode fields\n\n# Generate pharmacy claims\n./bin/specmint generate -s test/schemas/medical/rx-claims-ncpdp.json -o rx-data --count 500\n\n# Validate existing dataset\n./bin/specmint validate -s schema.json -d dataset.jsonl\n\n# System health check\n./bin/specmint doctor\n```\n\n## 📊 Project Metrics\n\n| Metric | Value | Details |\n|--------|-------|---------|\n| **Development Time** | ~6 hours | August 17, 2025 (05:00 - 11:00 
PST) |\n| **Total Lines of Code** | 3,186 | Pure Go implementation |\n| **Go Files** | 11 | Modular architecture |\n| **Security Rating** | A (Excellent) | Zero vulnerabilities |\n| **Test Coverage** | Comprehensive | Golden dataset validation |\n\n## 🏗️ Architecture\n\nSpecMint follows a clean, modular architecture designed for maintainability and extensibility:\n\n```\n┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐\n│   CLI Commands  │───▶│  Core Generator  │───▶│  Output Writer  │\n│  (Cobra-based)  │    │   (Deterministic │    │   (JSONL/JSON)  │\n└─────────────────┘    │   + LLM Enhanced)│    └─────────────────┘\n         │              └──────────────────┘              │\n         ▼                        │                       ▼\n┌─────────────────┐              ▼                ┌─────────────────┐\n│ Schema Parser   │    ┌──────────────────┐      │ Domain Validator│\n│ (JSON Schema)   │    │  LLM Integration │      │ (Business Rules)│\n└─────────────────┘    │  (Local Ollama)  │      └─────────────────┘\n                       └──────────────────┘\n```\n\n### Core Components\n\n- **`cmd/specmint/`** - CLI interface with 6 commands (generate, simulate, validate, inspect, doctor, benchmark)\n- **`pkg/generator/`** - Deterministic generation engine with optional LLM enrichment\n- **`pkg/population/`** - Population-based simulation and business scenario analysis\n- **`pkg/schema/`** - JSON Schema parsing and validation\n- **`pkg/llm/`** - Local Ollama integration for realistic data enhancement\n- **`pkg/validator/`** - Domain-specific business rule validation\n- **`pkg/writer/`** - Multi-format output handling\n- **`internal/config/`** - Configuration management\n- **`internal/logger/`** - Structured logging with zerolog\n\n## 🤝 Development Collaboration\n\nThis project represents a unique **Human-AI collaborative development** approach:\n\n### Human Role (Project Lead)\n- **Strategic Vision**: Defined requirements for privacy-focused 
synthetic data generation\n- **Architecture Guidance**: Directed modular design decisions and Go best practices\n- **Domain Expertise**: Provided business logic for healthcare, fintech, and e-commerce validation\n- **Quality Assurance**: Guided testing strategies and security requirements\n- **Project Management**: Managed scope, priorities, and deliverable timelines\n\n### AI Role (Cascade Assistant)\n- **Code Implementation**: Wrote 100% of the 3,186 lines of Go code\n- **Technical Architecture**: Implemented clean architecture patterns and interfaces\n- **Testing Strategy**: Developed comprehensive golden dataset testing approach\n- **Security Implementation**: Integrated security scanning and vulnerability management\n- **Documentation**: Created comprehensive technical documentation and reports\n\n### Collaborative Highlights\n- **Real-time Feedback Loop**: Immediate iteration on requirements and implementation\n- **Knowledge Transfer**: AI learned domain-specific validation rules through human guidance\n- **Quality Standards**: Human oversight ensured enterprise-grade code quality\n- **Problem Solving**: Combined human strategic thinking with AI implementation speed\n\n## 🧪 Testing Strategies\n\nSpecMint employs multiple testing methodologies for comprehensive quality assurance:\n\n### 1. Golden Dataset Testing\n```bash\n./test/golden-test-suite.sh\n```\n- **Purpose**: Regression testing with known-good datasets\n- **Coverage**: All three domains (healthcare, fintech, e-commerce)\n- **Validation**: Schema compliance + domain business rules\n- **Datasets**: 175 total records across domains\n\n### 2. 
Domain-Specific Validation\n- **Healthcare**: 837 Claims (ICD-10/CPT codes), NCPDP pharmacy claims, NPI validation, HIPAA compliance\n- **Fintech**: ABA routing numbers, transaction limits, risk scoring\n- **E-commerce**: SKU formats, inventory consistency, pricing validation\n- **X12 EDI**: Purchase order validation, party ID verification, business transaction compliance\n\n### 3. LLM Integration Testing\n- **Connectivity**: Automated Ollama health checks\n- **Fallback Logic**: Graceful degradation to deterministic generation\n- **Quality Assurance**: LLM output validation against schema constraints\n\n### 4. Security Testing\n- **Static Analysis**: gosec security scanner integration\n- **Vulnerability Scanning**: govulncheck for Go stdlib issues\n- **Dependency Auditing**: nancy for third-party package security\n\n### 5. Performance Benchmarking\n```bash\n./bin/specmint benchmark -s schema.json --counts 100,1000,10000\n```\n- **Scalability**: Multi-record generation performance\n- **Memory Usage**: Resource consumption monitoring\n- **Deterministic Verification**: Seed-based reproducibility testing\n\n## 🔧 Build System \u0026 CI/CD\n\n### Local Development\nComprehensive Makefile with 15+ targets for complete development lifecycle:\n\n```bash\n# Development\nmake build test lint\n\n# Security\nmake audit vulncheck\n\n# CI/CD Pipeline\nmake ci\n\n# Dependency Management\nmake deps-update deps-verify\n\n# System Diagnostics\nmake doctor\n```\n\n### Automated CI/CD Pipeline\nProduction-grade GitHub Actions workflows with expert separation of concerns:\n\n- **CI/CD Pipeline**: Multi-platform builds, test matrix, golden dataset validation\n- **Security Audit**: Daily automated security scanning with SARIF integration\n- **Release Automation**: Multi-platform binary builds with automated GitHub releases\n- **Coverage Reporting**: Automated code coverage via Codecov integration\n- **Quality Gates**: Go Report Card integration for code quality metrics\n\n## 🛡️ 
Security\n\nSpecMint maintains an **A-grade security rating** with:\n\n- ✅ **Zero vulnerabilities** (post Go 1.25.0 upgrade)\n- ✅ **Automated security scanning** in CI/CD pipeline\n- ✅ **Hardened file permissions** (0600 for logs, 0750 for directories)\n- ✅ **Clean dependency tree** with regular vulnerability monitoring\n- ✅ **Static code analysis** with 54% security issue reduction\n- ✅ **Daily security audits** via GitHub Actions\n- ✅ **SARIF integration** for GitHub Security tab\n\nSee [SECURITY_AUDIT_REPORT.md](./docs/SECURITY_AUDIT_REPORT.md) for the detailed security assessment.\n\n## 🎯 Key Features\n\n### Population-Based Intelligence\n- **Business Context Understanding**: Analyze real-world scenarios and suggest realistic data volumes\n- **Automatic Scaling**: Calculate appropriate record counts based on business size\n- **Domain Templates**: Built-in knowledge for Healthcare, Banking, Retail, E-commerce, Insurance\n- **Relationship Modeling**: Understand data dependencies and realistic proportions\n\n### Deterministic Generation\n- **Reproducible**: Same seed produces identical datasets\n- **Scalable**: Efficient generation of large datasets\n- **Schema-Compliant**: Strict adherence to JSON Schema specifications\n\n### LLM Enhancement\n- **Local Privacy**: Uses local Ollama instance (no data leaves your machine)\n- **Selective Enrichment**: Field-level LLM enhancement with fallback\n- **Configurable**: Adjustable workers, rate limiting, and model selection\n\n### Domain Intelligence\n- **Healthcare**: 837 Healthcare Claims (X12 EDI), NCPDP pharmacy claims with medical coding\n- **Fintech**: Transaction processing, ABA routing validation, risk scoring\n- **E-commerce**: Product catalogs, inventory management, SKU generation\n- **X12 EDI**: Purchase orders (850), business transactions with party validation\n- **Business Rules**: Industry-specific validation logic with cross-field constraints\n- **Medical Coding**: ICD-10 diagnosis codes, CPT procedure codes, 
NPI provider validation\n- **Realistic Data**: LLM-enhanced medical descriptions and contextually appropriate values\n\n### Production Ready\n- **CLI Interface**: Professional command-line tool with comprehensive help\n- **Multiple Formats**: JSON, JSONL output with manifest generation\n- **Monitoring**: Built-in health checks and system diagnostics\n- **Extensible**: Plugin-ready architecture for new domains\n\n## 📈 Performance\n\n- **Generation Speed**: 1000+ records/second (deterministic mode)\n- **Memory Efficiency**: Streaming output for large datasets\n- **LLM Integration**: Configurable rate limiting and worker pools\n- **Scalability**: Tested with datasets of 10,000+ records\n\n## 🏥 Healthcare \u0026 Medical Data\n\nSpecMint excels at generating **enterprise-grade healthcare datasets** with medical accuracy:\n\n### 837 Healthcare Claims (X12 EDI)\n- **Complete X12 837 structure**: Professional, institutional, and dental claims\n- **Medical coding compliance**: Valid ICD-10 diagnosis codes, CPT procedure codes\n- **Provider validation**: NPI identifiers, taxonomy codes, federal tax IDs\n- **LLM-enhanced realism**: Medical diagnoses and procedure descriptions\n- **Cross-field validation**: Medical logic enforcement across claim hierarchies\n- **Performance optimized**: 5x faster than generic tools (2 LLM calls vs 10+ per record)\n\n### NCPDP Pharmacy Claims\n- **Prescription accuracy**: NDC codes, DEA numbers, prior authorization\n- **Drug information**: Realistic medication names, strengths, quantities\n- **Insurance processing**: BIN/PCN numbers, copay calculations\n- **Regulatory compliance**: HIPAA-safe synthetic data generation\n\n### Key Healthcare Features\n- **Medical realism**: Clinically plausible diagnosis-procedure relationships\n- **Regulatory compliance**: No real PHI/PII in synthetic data\n- **Scalable generation**: Efficient generation of thousands of compliant claims\n- **Industry validation**: Healthcare-specific business rules and constraints\n\n## 🔮 
Future Enhancements\n\n- **Additional Medical**: 270/271 Eligibility, 835 Payment/Remittance, 856 ASN\n- **Additional Domains**: Legal, manufacturing, retail verticals\n- **Output Formats**: CSV, Parquet, database direct insertion\n- **Cloud LLM Support**: OpenAI, Anthropic, Google integration\n- **Web Interface**: Browser-based dataset generation UI\n- **API Mode**: REST API for programmatic access\n\n## 📄 License\n\nBSD 3-Clause License - see [LICENSE](LICENSE) for details.\n\n**Attribution Required**: When using SpecMint, please include attribution as specified in the LICENSE file.\n\n## 🙏 Acknowledgments\n\nThis project demonstrates the power of **Human-AI collaboration** in software development, combining human strategic vision with AI implementation capabilities to create enterprise-grade solutions in record time.\n\n---\n\n**Built with ❤️ using Go 1.25.0 and collaborative AI development**\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcopyleftdev%2Fspecmint","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcopyleftdev%2Fspecmint","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcopyleftdev%2Fspecmint/lists"}