{"id":33319288,"url":"https://github.com/codersacademy006/data-sanitizer","last_synced_at":"2026-06-06T20:32:02.784Z","repository":{"id":324658467,"uuid":"1097987396","full_name":"CodersAcademy006/Data-Sanitizer","owner":"CodersAcademy006","description":"AI-powered data sanitizer with schema detection, dedupe, outlier detection, and LLM enrichment.","archived":false,"fork":false,"pushed_at":"2025-11-17T05:38:23.000Z","size":3943,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-17T07:23:49.335Z","etag":null,"topics":["csv-cleaning","data-cleaning","data-engineering","data-enrichment","data-pipeline","data-quality","etl","jsonl","outlier-detection","sqlite","streaming-pipeline"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CodersAcademy006.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-17T05:33:12.000Z","updated_at":"2025-11-17T05:44:24.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/CodersAcademy006/Data-Sanitizer","commit_stats":null,"previous_names":["codersacademy006/data-sanitizer"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/CodersAcademy006/Data-Sanitizer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodersAcademy006%2FData-Sanitizer","tags_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/CodersAcademy006%2FData-Sanitizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodersAcademy006%2FData-Sanitizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodersAcademy006%2FData-Sanitizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CodersAcademy006","download_url":"https://codeload.github.com/CodersAcademy006/Data-Sanitizer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CodersAcademy006%2FData-Sanitizer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285319005,"owners_count":27151474,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-19T02:00:05.673Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv-cleaning","data-cleaning","data-engineering","data-enrichment","data-pipeline","data-quality","etl","jsonl","outlier-detection","sqlite","streaming-pipeline"],"created_at":"2025-11-19T20:00:30.707Z","updated_at":"2025-11-19T20:01:29.146Z","avatar_url":"https://github.com/CodersAcademy006.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Sanitizer: Production Data Cleaning Platform\n\n\u003e **Automatically dedupe, impute, normalize, and monitor data quality at scale with 
deterministic, auditable fixes.**\n\n[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)\n[![Python 3.11+](https://img.shields.io/badge/python-3.11+-brightgreen.svg)](https://www.python.org/)\n[![Code Coverage](https://img.shields.io/badge/coverage-%3E80%25-brightgreen.svg)]()\n\n## 🎯 Overview\n\nData Sanitizer is a **production-ready data cleaning platform** designed for:\n\n- **Data Engineers**: Automatically dedupe, impute, and normalize data at scale\n- **ML/Model Ops**: Reduce model retraining caused by bad upstream data\n- **Business/Analytics**: Cleaner data → fewer billing errors \u0026 faster BI insights\n\n### Key Features\n\n✅ **High-Quality Deduplication**  \n- 90%+ accuracy duplicate detection (MinHash + LSH)\n- Exact + near-duplicate detection\n- Deterministic, auditable fixes\n\n✅ **Multi-Format Ingestion**  \n- CSV, JSON, JSONL, Parquet, Excel\n- S3 / GCS / Azure Blob Storage\n- Streaming processing (O(chunk) memory)\n\n✅ **Intelligent Imputation**  \n- Median/mode-based fills\n- Confidence scoring (0.0–1.0)\n- Per-cell provenance tracking\n\n✅ **Production-Grade Architecture**  \n- Stateless, horizontally scalable workers\n- Postgres metadata + Milvus vector DB + Redis cache\n- REST API with authentication \u0026 rate limiting\n- Full audit trail \u0026 compliance-ready\n\n✅ **Enterprise Features**  \n- PII detection \u0026 redaction\n- Multi-tenant isolation\n- Customizable cleaning rules\n- Human-in-the-loop review flow\n\n---\n\n## 🚀 Quick Start (5 Minutes)\n\n### Prerequisites\n- Python 3.11+\n- Docker \u0026 Docker Compose\n- 4 GB RAM minimum\n\n### 1. Clone \u0026 Install\n\n```bash\ngit clone https://github.com/CodersAcademy006/Data-Sanitizer.git\ncd Data-Sanitizer  # git names the directory after the repository\n\n# Create virtual environment\npython3 -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n\n# Install dependencies\npip install -r requirements.txt\n```\n\n### 2. 
Start Infrastructure (Local)\n\n```bash\n# Start Postgres, Milvus, Redis, API server\ndocker-compose up -d\n\n# Verify health\ncurl http://localhost:8000/api/v1/health\n# Expected: {\"status\": \"healthy\", \"storage_backend\": \"ready\"}\n```\n\n### 3. Generate Test Data\n\n```bash\npython benchmark_generator.py --size 1m --output-dir ./test_data\n# Generates: test_data/benchmark_1000000_rows.csv (~500 MB)\n```\n\n### 4. Clean Your Data\n\n```bash\n# Option A: Via Python (passed to the interpreter through a heredoc)\npython - \u003c\u003c'EOF'\nfrom data_cleaning import run_full_cleaning_pipeline_two_pass_sqlite_batched\n\ncleaned_path, report_path = run_full_cleaning_pipeline_two_pass_sqlite_batched(\n    path=\"test_data/benchmark_1000000_rows.csv\",\n    output_dir=\"./output\",\n    chunksize=50_000\n)\nEOF\n\n# Option B: Via REST API\ncurl -X POST http://localhost:8000/api/v1/datasets/my-tenant/ingest \\\n  -H \"X-API-Key: my-tenant:key123\" \\\n  -F \"file=@test_data/benchmark_1000000_rows.csv\" \\\n  -F \"dataset_name=test_dataset\"\n\n# Response: {\"job_id\": \"abc-123-def\", \"status\": \"queued\"}\n\n# Check status\ncurl http://localhost:8000/api/v1/jobs/abc-123-def\n\n# Download report\ncurl http://localhost:8000/api/v1/jobs/abc-123-def/report \u003e report.json\n```\n\n### 5. 
View Results\n\n```bash\n# Cleaned data (CSV)\nhead output/cleaned_data.csv\n\n# Cleaning report (JSON)\ncat output/cleaning_report.json | jq '.summary'\n# Output:\n# {\n#   \"original_row_count\": 1000000,\n#   \"cleaned_row_count\": 950000,\n#   \"rows_dropped\": 50000,\n#   \"deduplication_rate\": 0.95\n# }\n```\n\n---\n\n## 📊 Architecture\n\n### System Diagram\n\n```\n┌─────────────────────────────────────────────────────────────┐\n│                      CLIENT LAYER                           │\n│  REST API (FastAPI) │ Admin UI │ Python/JS SDKs             │\n└────────────┬────────────────────────────────┬──────────────┘\n             │                                │\n┌────────────▼────────────────────────────────▼──────────────┐\n│             ORCHESTRATION LAYER                             │\n│  Job Scheduler (RabbitMQ/Redis)                             │\n│  - Job state machine (queued → running → complete)          │\n│  - Retries, idempotency, tenant quotas                      │\n└────────────┬────────────────────────────────┬──────────────┘\n             │                                │\n┌────────────▼──────────────────────────────────▼────────────┐\n│           COMPUTE WORKERS (Stateless, Scalable)             │\n│  Pass 1: Sampling → LSH index → Postgres                    │\n│  Pass 2: Dedupe → Impute → Clean → S3 (Parquet)             │\n└──────────────────────────────────────────────────────────────┘\n             │                    │\n┌────────────▼────────┐  ┌────────▼──────────┐\n│  Metadata Storage   │  │  Vector Storage   │\n│  Postgres           │  │  Milvus           │\n│  - Jobs, hashes     │  │  - LSH samples    │\n│  - Audit logs       │  │  - Similarity     │\n│  - Confidence       │  │    queries        │\n│  - Cell provenance  │  │                   │\n└─────────────────────┘  └───────────────────┘\n```\n\n### Data Flow\n\n```\n1. User uploads file (CSV, JSON, Parquet, etc.)\n   ↓\n2. 
API validates, stores to S3, creates Job record\n   ↓\n3. Pass 1 Worker:\n   - Streams file in chunks\n   - Samples columns (deterministic reservoir)\n   - Computes MinHash/LSH signatures\n   - Inserts samples to Milvus, stats to Postgres\n   ↓\n4. Pass 2 Worker:\n   - Streams file again\n   - Checks row hashes against Postgres (exact dedup)\n   - Queries Milvus for near-duplicates (LSH candidates)\n   - Applies imputation, normalization, cleaning\n   - Streams output to S3 (Parquet)\n   - Inserts confidence scores + audit logs to Postgres\n   ↓\n5. API serves cleaned data + report\n```\n\n---\n\n## 🏗️ Project Structure\n\n```\ndata_sanitizer/\n├── data_cleaning.py           # Core algorithm (Colab prototype upgraded)\n├── storage_backend.py         # Postgres + Milvus + Redis interface\n├── cloud_storage.py           # S3/GCS connectors, Parquet/CSV writers\n├── api_server.py              # FastAPI REST server\n├── benchmark_generator.py     # Realistic dirty data generation\n├── tests.py                   # 50+ unit, integration, property-based tests\n├── requirements.txt           # Python dependencies\n│\n├── docs/\n│   ├── ARCHITECTURE.md        # Full system design (2,000+ lines)\n│   ├── DEPLOYMENT.md          # Terraform, Docker, K8s, CI/CD\n│   ├── 30DAY_ROADMAP.md       # Week-by-week execution plan\n│   ├── IMPLEMENTATION_SUMMARY.md\n│   └── API.md                 # (TODO) OpenAPI reference\n│\n├── docker/\n│   ├── api/Dockerfile\n│   ├── worker-pass1/Dockerfile\n│   ├── worker-pass2/Dockerfile\n│   └── .dockerignore\n│\n├── k8s/\n│   ├── base/\n│   │   ├── api-deployment.yaml\n│   │   ├── api-service.yaml\n│   │   ├── worker-pass1-deployment.yaml\n│   │   ├── configmap.yaml\n│   │   └── hpa.yaml\n│   └── overlays/\n│       ├── dev/\n│       ├── staging/\n│       └── prod/\n│\n├── terraform/\n│   ├── main.tf\n│   ├── postgres.tf\n│   ├── milvus.tf\n│   ├── s3.tf\n│   ├── eks.tf\n│   └── variables.tf\n│\n└── docker-compose.yaml        # Local development 
stack\n```\n\n---\n\n## 📖 Documentation\n\n- **[ARCHITECTURE.md](docs/ARCHITECTURE.md)** - Complete system design, data models, API contracts\n- **[DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Production infrastructure, Kubernetes, Terraform, CI/CD\n- **[30DAY_ROADMAP.md](docs/30DAY_ROADMAP.md)** - Execution plan: Day 1 through Day 30\n- **[IMPLEMENTATION_SUMMARY.md](docs/IMPLEMENTATION_SUMMARY.md)** - Overview of deliverables\n- **[API.md](docs/API.md)** - (TODO) REST API reference, Swagger/OpenAPI\n\n---\n\n## 🧪 Testing\n\n### Run All Tests\n\n```bash\n# Install test dependencies\npip install -e \".[dev]\"\n\n# Run tests with coverage\npytest tests.py -v --cov=. --cov-report=html --cov-report=term\n\n# Expected: \u003e80% coverage\n```\n\n### Test Categories\n\n- **Unit Tests**: JSON flattening, MinHash, LSH, Reservoir sampling\n- **Integration Tests**: Full pipeline on small CSV/JSONL datasets\n- **Property-Based Tests**: Determinism validation with Hypothesis\n- **Performance Tests**: Throughput \u0026 latency benchmarks\n\n---\n\n## 📈 Performance Benchmarks\n\nBaseline metrics on modern hardware (AWS m5.xlarge):\n\n| Dataset | File Size | Pass 1 (sec) | Pass 2 (sec) | Throughput (rows/sec) | Memory (MB) |\n|---------|-----------|------------|------------|-----------------|-----------|\n| 1M CSV  | ~500 MB   | 8–15      | 12–20      | 40k–70k        | 200–400   |\n| 10M CSV | ~5 GB     | 80–150    | 120–200    | 40k–70k        | 300–500   |\n\n**SLA**: 10M rows/hour throughput\n\nTo run benchmarks:\n```bash\npython benchmark_generator.py --size 10m\npython data_cleaning.py  # Run interactive menu, option 4 (vehicles.csv)\n```\n\n---\n\n## 🔐 Security \u0026 Compliance\n\n### Privacy\n- ✅ PII detection (email, phone, SSN, credit card regex patterns)\n- ✅ Configurable PII strategies: redact, hash, exclude, tokenize\n- ✅ Encrypted at-rest (S3 SSE-KMS, Postgres TDE)\n- ✅ Encrypted in-transit (TLS 1.3)\n\n### Audit \u0026 Compliance\n- ✅ Immutable audit logs (every 
transformation recorded)\n- ✅ Cell-level provenance (original → cleaned value + confidence score)\n- ✅ GDPR/CCPA ready (data deletion support)\n- ✅ Row-level security (multi-tenant isolation via Postgres RLS)\n\n### Access Control\n- ✅ API key authentication (tenant-scoped)\n- ✅ Rate limiting (per-tenant quotas)\n- ✅ Role-based access (Admin, Engineer, Reviewer)\n\n---\n\n## 🚀 Production Deployment\n\n### Local Development\n\n```bash\ndocker-compose up -d\nuvicorn api_server:app --reload\n```\n\n### Cloud Deployment (AWS)\n\n```bash\n# 1. Initialize infrastructure\ncd terraform\nterraform init\nterraform plan -var-file=prod.tfvars\nterraform apply -var-file=prod.tfvars\n\n# 2. Build \u0026 push Docker images\n./scripts/build-and-push.sh\n\n# 3. Deploy via GitOps (ArgoCD)\nkubectl apply -f argocd/data-sanitizer-app.yaml\n```\n\n### Kubernetes\n\n```bash\n# Install Data Sanitizer\nkubectl apply -k k8s/overlays/prod\n\n# Check status\nkubectl get pods -l app=data-sanitizer-api\nkubectl logs deployment/data-sanitizer-api\n\n# Scale workers\nkubectl scale deployment data-sanitizer-pass1-worker --replicas=10\n```\n\nSee [DEPLOYMENT.md](docs/DEPLOYMENT.md) for full instructions.\n\n---\n\n## 📞 Support \u0026 Roadmap\n\n### MVP (Current Release)\n- ✅ Core deduplication \u0026 imputation\n- ✅ Multi-format ingestion (CSV, JSON, Parquet)\n- ✅ Confidence scoring \u0026 audit logs\n- ✅ REST API\n- ✅ Postgres + Milvus backend\n\n### Phase 2 (Month 2–4)\n- [ ] Admin UI (React)\n- [ ] Human review flow\n- [ ] LLM enrichment (OpenAI/Claude)\n- [ ] Advanced PII detection\n\n### Phase 3 (Month 5–12)\n- [ ] Multi-tenant SaaS\n- [ ] Billing \u0026 usage tracking\n- [ ] On-prem deployment\n- [ ] Custom connectors (Salesforce, etc.)\n\n---\n\n## 🤝 Contributing\n\nWe welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n### Local Development Setup\n\n```bash\n# 1. 
Fork \u0026 clone\ngit clone https://github.com/your-fork/data-sanitizer.git\ncd data-sanitizer\n\n# 2. Create feature branch\ngit checkout -b feat/your-feature\n\n# 3. Install dev dependencies\npip install -e \".[dev]\"\n\n# 4. Run tests (must pass)\npytest tests.py -v --cov=.\n\n# 5. Format, lint, and type-check\nblack .\nflake8 .\nmypy .\n\n# 6. Push the branch, then open a PR\ngit push origin feat/your-feature\n```\n\n---\n\n## 📄 License\n\nMIT License. See [LICENSE](LICENSE) for details.\n\n---\n\n## 🎓 Key Concepts\n\n### MinHash \u0026 LSH\n- **MinHash**: Compact probabilistic signature of a set (here, a row's tokens); the fraction of matching signature positions estimates Jaccard similarity\n- **LSH** (Locality-Sensitive Hashing): Hashing scheme that places similar items in the same bucket with high probability\n- **Purpose**: Efficiently find near-duplicate rows without O(n²) comparisons\n\n### Deterministic Reservoir Sampling\n- **Goal**: Sample a fixed-size subset of an unbounded stream\n- **Method**: Use hash(row_id + salt) as the priority; keep the k items with the smallest priorities\n- **Benefit**: Same input + same salt = same sample (reproducible)\n\n### Two-Pass Pipeline\n- **Pass 1**: Build indices (reservoirs, LSH) without modifying data\n- **Pass 2**: Clean data using the indices from Pass 1\n- **Benefit**: Deterministic; Pass 2 can be replayed with different rules\n\n---\n\n## 🙏 Acknowledgments\n\n- Built with [pandas](https://pandas.pydata.org/), [polars](https://www.pola-rs.com/), [pyarrow](https://arrow.apache.org/)\n- Storage: [PostgreSQL](https://www.postgresql.org/), [Milvus](https://milvus.io/), [Redis](https://redis.io/)\n- API: [FastAPI](https://fastapi.tiangolo.com/), [Pydantic](https://docs.pydantic.dev/)\n- Infrastructure: [Terraform](https://www.terraform.io/), [Kubernetes](https://kubernetes.io/)\n\n---\n\n## 📧 Get Started\n\n1. **Read**: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) (5 min overview)\n2. **Try**: Quick start above (10 min hands-on)\n3. **Explore**: [docs/30DAY_ROADMAP.md](docs/30DAY_ROADMAP.md) (plan for next month)\n4. 
**Deploy**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) (production setup)\n\n**Questions?** Open an issue or contact us at `srjnupadhyay@gmail.com`\n\n---\n\n**Happy cleaning! 🧹✨**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodersacademy006%2Fdata-sanitizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodersacademy006%2Fdata-sanitizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodersacademy006%2Fdata-sanitizer/lists"}