https://github.com/copyleftdev/specmint
๐ฏ Production-ready synthetic dataset generator with local LLM integration. Create realistic, schema-compliant test data for healthcare, fintech, and e-commerce applications. Privacy-first with deterministic generation.
https://github.com/copyleftdev/specmint
cli-tool dataset-generator deterministic ecommerce fintech golang healthcare json-schema llm-integration ollama privacy-first synthetic-data test-data
Last synced: 7 days ago
JSON representation
๐ฏ Production-ready synthetic dataset generator with local LLM integration. Create realistic, schema-compliant test data for healthcare, fintech, and e-commerce applications. Privacy-first with deterministic generation.
- Host: GitHub
- URL: https://github.com/copyleftdev/specmint
- Owner: copyleftdev
- License: other
- Created: 2025-08-17T19:05:12.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-08-18T05:49:00.000Z (8 months ago)
- Last Synced: 2025-09-09T18:46:30.630Z (8 months ago)
- Topics: cli-tool, dataset-generator, deterministic, ecommerce, fintech, golang, healthcare, json-schema, llm-integration, ollama, privacy-first, synthetic-data, test-data
- Language: Go
- Homepage: https://specmint-n7.vercel.app/
- Size: 8.54 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: docs/SECURITY_AUDIT_REPORT.md
Awesome Lists containing this project
README
# SpecMint: Synthetic Dataset Generator
[](https://github.com/copyleftdev/specmint/actions/workflows/ci.yml)
[](https://github.com/copyleftdev/specmint/actions/workflows/security.yml)
[](https://goreportcard.com/report/github.com/copyleftdev/specmint)
[](https://codecov.io/gh/copyleftdev/specmint)
[](https://golang.org/)
[](LICENSE)
[](https://github.com/copyleftdev/specmint/releases)
[](https://github.com/copyleftdev/specmint/stargazers)
[](https://github.com/copyleftdev/specmint/issues)
**SpecMint** is an intelligent synthetic dataset generator that transforms business scenarios into realistic datasets. Instead of manually configuring schemas and record counts, simply describe your business context (e.g., "500-bed hospital", "community bank with 12 branches") and SpecMint automatically calculates realistic record counts, relationships, and generates comprehensive datasets.
## ๐ฏ Population-Based Intelligence
SpecMint's breakthrough feature is **population-based simulation** - analyze real-world business scenarios and automatically generate realistic datasets:
```bash
# Hospital simulation - automatically calculates patients, claims, prescriptions, etc.
./bin/specmint simulate --population "100-bed regional hospital" --execute --output ./hospital-data
# Banking simulation - generates customers, accounts, transactions, loans
./bin/specmint simulate --population "community bank with 5 branches" --execute --output ./bank-data
# E-commerce simulation - creates users, products, orders, reviews
./bin/specmint simulate --population "e-commerce platform with 50K users" --execute --output ./ecommerce-data
# Retail simulation - generates stores, products, customers, inventory
./bin/specmint simulate --population "retail chain with 10 stores" --execute --output ./retail-data
```
## ๐ Traditional Schema-Based Generation
```bash
# Generate specific record types with custom counts
./bin/specmint generate -s test/schemas/ecommerce/products.json -o output -c 1000
# Generate healthcare claims with LLM enrichment
./bin/specmint generate -s test/schemas/medical/healthcare-claims-837.json -o claims --count 100 --llm-mode fields
# Generate pharmacy claims
./bin/specmint generate -s test/schemas/medical/rx-claims-ncpdp.json -o rx-data --count 500
# Validate existing dataset
./bin/specmint validate -s schema.json -d dataset.jsonl
# System health check
./bin/specmint doctor
```
## ๐ Project Metrics
| Metric | Value | Details |
|--------|-------|---------|
| **Development Time** | ~6 hours | August 17, 2025 (05:00 - 11:00 PST) |
| **Total Lines of Code** | 3,186 | Pure Go implementation |
| **Go Files** | 11 | Modular architecture |
| **Security Rating** | A (Excellent) | Zero vulnerabilities |
| **Test Coverage** | Comprehensive | Golden dataset validation |
## ๐๏ธ Architecture
SpecMint follows a clean, modular architecture designed for maintainability and extensibility:
```
โโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโ
โ CLI Commands โโโโโถโ Core Generator โโโโโถโ Output Writer โ
โ (Cobra-based) โ โ (Deterministic โ โ (JSONL/JSON) โ
โโโโโโโโโโโโโโโโโโโ โ + LLM Enhanced)โ โโโโโโโโโโโโโโโโโโโ
โ โโโโโโโโโโโโโโโโโโโโ โ
โผ โ โผ
โโโโโโโโโโโโโโโโโโโ โผ โโโโโโโโโโโโโโโโโโโ
โ Schema Parser โ โโโโโโโโโโโโโโโโโโโโ โ Domain Validatorโ
โ (JSON Schema) โ โ LLM Integration โ โ (Business Rules)โ
โโโโโโโโโโโโโโโโโโโ โ (Local Ollama) โ โโโโโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโโโโโโโ
```
### Core Components
- **`cmd/specmint/`** - CLI interface with 6 commands (generate, simulate, validate, inspect, doctor, benchmark)
- **`pkg/generator/`** - Deterministic generation engine with optional LLM enrichment
- **`pkg/population/`** - Population-based simulation and business scenario analysis
- **`pkg/schema/`** - JSON Schema parsing and validation
- **`pkg/llm/`** - Local Ollama integration for realistic data enhancement
- **`pkg/validator/`** - Domain-specific business rule validation
- **`pkg/writer/`** - Multi-format output handling
- **`internal/config/`** - Configuration management
- **`internal/logger/`** - Structured logging with zerolog
## ๐ค Development Collaboration
This project represents a unique **Human-AI collaborative development** approach:
### Human Role (Project Lead)
- **Strategic Vision**: Defined requirements for privacy-focused synthetic data generation
- **Architecture Guidance**: Directed modular design decisions and Go best practices
- **Domain Expertise**: Provided business logic for healthcare, fintech, and e-commerce validation
- **Quality Assurance**: Guided testing strategies and security requirements
- **Project Management**: Managed scope, priorities, and deliverable timelines
### AI Role (Cascade Assistant)
- **Code Implementation**: Wrote 100% of the 3,186 lines of Go code
- **Technical Architecture**: Implemented clean architecture patterns and interfaces
- **Testing Strategy**: Developed comprehensive golden dataset testing approach
- **Security Implementation**: Integrated security scanning and vulnerability management
- **Documentation**: Created comprehensive technical documentation and reports
### Collaborative Highlights
- **Real-time Feedback Loop**: Immediate iteration on requirements and implementation
- **Knowledge Transfer**: AI learned domain-specific validation rules through human guidance
- **Quality Standards**: Human oversight ensured enterprise-grade code quality
- **Problem Solving**: Combined human strategic thinking with AI implementation speed
## ๐งช Testing Strategies
SpecMint employs multiple testing methodologies for comprehensive quality assurance:
### 1. Golden Dataset Testing
```bash
./test/golden-test-suite.sh
```
- **Purpose**: Regression testing with known-good datasets
- **Coverage**: All three domains (healthcare, fintech, e-commerce)
- **Validation**: Schema compliance + domain business rules
- **Datasets**: 175 total records across domains
### 2. Domain-Specific Validation
- **Healthcare**: 837 Claims (ICD-10/CPT codes), NCPDP pharmacy claims, NPI validation, HIPAA compliance
- **Fintech**: ABA routing numbers, transaction limits, risk scoring
- **E-commerce**: SKU formats, inventory consistency, pricing validation
- **X12 EDI**: Purchase order validation, party ID verification, business transaction compliance
### 3. LLM Integration Testing
- **Connectivity**: Automated Ollama health checks
- **Fallback Logic**: Graceful degradation to deterministic generation
- **Quality Assurance**: LLM output validation against schema constraints
### 4. Security Testing
- **Static Analysis**: gosec security scanner integration
- **Vulnerability Scanning**: govulncheck for Go stdlib issues
- **Dependency Auditing**: nancy for third-party package security
### 5. Performance Benchmarking
```bash
./bin/specmint benchmark -s schema.json --counts 100,1000,10000
```
- **Scalability**: Multi-record generation performance
- **Memory Usage**: Resource consumption monitoring
- **Deterministic Verification**: Seed-based reproducibility testing
## ๐ง Build System & CI/CD
### Local Development
Comprehensive Makefile with 15+ targets for complete development lifecycle:
```bash
# Development
make build test lint
# Security
make audit vulncheck
# CI/CD Pipeline
make ci
# Dependency Management
make deps-update deps-verify
# System Diagnostics
make doctor
```
### Automated CI/CD Pipeline
Production-grade GitHub Actions workflows with expert separation of concerns:
- **CI/CD Pipeline**: Multi-platform builds, test matrix, golden dataset validation
- **Security Audit**: Daily automated security scanning with SARIF integration
- **Release Automation**: Multi-platform binary builds with automated GitHub releases
- **Coverage Reporting**: Automated code coverage via Codecov integration
- **Quality Gates**: Go Report Card integration for code quality metrics
## ๐ก๏ธ Security
SpecMint maintains an **A-grade security rating** with:
- โ
**Zero vulnerabilities** (post Go 1.25.0 upgrade)
- โ
**Automated security scanning** in CI/CD pipeline
- โ
**Hardened file permissions** (0600 for logs, 0750 for directories)
- โ
**Clean dependency tree** with regular vulnerability monitoring
- โ
**Static code analysis** with 54% security issue reduction
- โ
**Daily security audits** via GitHub Actions
- โ
**SARIF integration** for GitHub Security tab
See [SECURITY_AUDIT_REPORT.md](./docs/SECURITY_AUDIT_REPORT.md) for detailed security assessment.
## ๐ฏ Key Features
### Population-Based Intelligence
- **Business Context Understanding**: Analyze real-world scenarios and suggest realistic data volumes
- **Automatic Scaling**: Calculate appropriate record counts based on business size
- **Domain Templates**: Built-in knowledge for Healthcare, Banking, Retail, E-commerce, Insurance
- **Relationship Modeling**: Understand data dependencies and realistic proportions
### Deterministic Generation
- **Reproducible**: Same seed produces identical datasets
- **Scalable**: Efficient generation of large datasets
- **Schema-Compliant**: Strict adherence to JSON Schema specifications
### LLM Enhancement
- **Local Privacy**: Uses local Ollama instance (no data leaves your machine)
- **Selective Enrichment**: Field-level LLM enhancement with fallback
- **Configurable**: Adjustable workers, rate limiting, and model selection
### Domain Intelligence
- **Healthcare**: 837 Healthcare Claims (NCPDP D.0), NCPDP pharmacy claims with medical coding
- **Fintech**: Transaction processing, ABA routing validation, risk scoring
- **E-commerce**: Product catalogs, inventory management, SKU generation
- **X12 EDI**: Purchase orders (850), business transactions with party validation
- **Business Rules**: Industry-specific validation logic with cross-field constraints
- **Medical Coding**: ICD-10 diagnosis codes, CPT procedure codes, NPI provider validation
- **Realistic Data**: LLM-enhanced medical descriptions and contextually appropriate values
### Production Ready
- **CLI Interface**: Professional command-line tool with comprehensive help
- **Multiple Formats**: JSON, JSONL output with manifest generation
- **Monitoring**: Built-in health checks and system diagnostics
- **Extensible**: Plugin-ready architecture for new domains
## ๐ Performance
- **Generation Speed**: 1000+ records/second (deterministic mode)
- **Memory Efficiency**: Streaming output for large datasets
- **LLM Integration**: Configurable rate limiting and worker pools
- **Scalability**: Tested up to 10,000+ record generation
## ๐ฅ Healthcare & Medical Data
SpecMint excels at generating **enterprise-grade healthcare datasets** with medical accuracy:
### 837 Healthcare Claims (X12 EDI)
- **Complete NCPDP D.0 structure**: Professional, institutional, and dental claims
- **Medical coding compliance**: Valid ICD-10 diagnosis codes, CPT procedure codes
- **Provider validation**: NPI identifiers, taxonomy codes, federal tax IDs
- **LLM-enhanced realism**: Medical diagnoses and procedure descriptions
- **Cross-field validation**: Medical logic enforcement across claim hierarchies
- **Performance optimized**: 5x faster than generic tools (2 LLM calls vs 10+ per record)
### NCPDP Pharmacy Claims
- **Prescription accuracy**: NDC codes, DEA numbers, prior authorization
- **Drug information**: Realistic medication names, strengths, quantities
- **Insurance processing**: BIN/PCN numbers, copay calculations
- **Regulatory compliance**: HIPAA-safe synthetic data generation
### Key Healthcare Features
- **Medical realism**: Clinically plausible diagnosis-procedure relationships
- **Regulatory compliance**: No real PHI/PII in synthetic data
- **Scalable generation**: Thousands of compliant claims efficiently
- **Industry validation**: Healthcare-specific business rules and constraints
## ๐ฎ Future Enhancements
- **Additional Medical**: 270/271 Eligibility, 835 Payment/Remittance, 856 ASN
- **Additional Domains**: Legal, manufacturing, retail verticals
- **Output Formats**: CSV, Parquet, database direct insertion
- **Cloud LLM Support**: OpenAI, Anthropic, Google integration
- **Web Interface**: Browser-based dataset generation UI
- **API Mode**: REST API for programmatic access
## ๐ License
BSD 3-Clause License - see [LICENSE](LICENSE) for details.
**Attribution Required**: When using SpecMint, please include attribution as specified in the LICENSE file.
## ๐ Acknowledgments
This project demonstrates the power of **Human-AI collaboration** in software development, combining human strategic vision with AI implementation capabilities to create enterprise-grade solutions in record time.
---
**Built with โค๏ธ using Go 1.25.0 and collaborative AI development**