https://github.com/codersacademy006/data-sanitizer
AI-powered data sanitizer with schema detection, dedupe, outlier detection, and LLM enrichment.
csv-cleaning data-cleaning data-engineering data-enrichment data-pipeline data-quality etl jsonl outlier-detection sqlite streaming-pipeline
- Host: GitHub
- URL: https://github.com/codersacademy006/data-sanitizer
- Owner: CodersAcademy006
- Created: 2025-11-17T05:33:12.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-11-17T05:38:23.000Z (7 months ago)
- Last Synced: 2025-11-17T07:23:49.335Z (7 months ago)
- Topics: csv-cleaning, data-cleaning, data-engineering, data-enrichment, data-pipeline, data-quality, etl, jsonl, outlier-detection, sqlite, streaming-pipeline
- Language: Python
- Homepage:
- Size: 3.76 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Sanitizer: Production Data Cleaning Platform
> **Automatically dedupe, impute, normalize, and monitor data quality at scale with deterministic, auditable fixes.**
## Overview
Data Sanitizer is a **production-ready data cleaning platform** designed for:
- **Data Engineers**: Automatically dedupe, impute, normalize data at scale
- **ML/Model Ops**: Reduce model retraining from bad upstream data
- **Business/Analytics**: Cleaner data → fewer billing errors & faster BI insights
### Key Features
✅ **High-Quality Deduplication**
- 90%+ accuracy duplicate detection (MinHash + LSH)
- Exact + near-duplicate detection
- Deterministic, auditable fixes

✅ **Multi-Format Ingestion**
- CSV, JSON, JSONL, Parquet, Excel
- S3 / GCS / Azure Blob Storage
- Streaming processing (O(chunk) memory)

✅ **Intelligent Imputation**
- Median/mode-based fills
- Confidence scoring (0.0–1.0)
- Per-cell provenance tracking

✅ **Production-Grade Architecture**
- Stateless, horizontally scalable workers
- Postgres metadata + Milvus vector DB + Redis cache
- REST API with authentication & rate limiting
- Full audit trail & compliance-ready

✅ **Enterprise Features**
- PII detection & redaction
- Multi-tenant isolation
- Customizable cleaning rules
- Human-in-the-loop review flow
---
## Quick Start (5 Minutes)
### Prerequisites
- Python 3.11+
- Docker & Docker Compose
- 4GB RAM minimum
### 1. Clone & Install
```bash
git clone https://github.com/CodersAcademy006/Data-Sanitizer.git
cd Data-Sanitizer  # git names the directory after the URL, which is capitalized here
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
### 2. Start Infrastructure (Local)
```bash
# Start Postgres, Milvus, Redis, API server
docker-compose up -d
# Verify health
curl http://localhost:8000/api/v1/health
# Expected: {"status": "healthy", "storage_backend": "ready"}
```
### 3. Generate Test Data
```bash
python benchmark_generator.py --size 1m --output-dir ./test_data
# Generates: test_data/benchmark_1000000_rows.csv (~500 MB)
```
### 4. Clean Your Data
```bash
# Option A: Via Python (run inline through a heredoc so it works from the shell)
python - <<'EOF'
from data_cleaning import run_full_cleaning_pipeline_two_pass_sqlite_batched

cleaned_path, report_path = run_full_cleaning_pipeline_two_pass_sqlite_batched(
    path="test_data/benchmark_1000000_rows.csv",
    output_dir="./output",
    chunksize=50_000,
)
print(cleaned_path, report_path)
EOF
# Option B: Via REST API
curl -X POST http://localhost:8000/api/v1/datasets/my-tenant/ingest \
-H "X-API-Key: my-tenant:key123" \
-F "file=@test_data/benchmark_1000000_rows.csv" \
-F "dataset_name=test_dataset"
# Response: {"job_id": "abc-123-def", "status": "queued"}
# Check status
curl http://localhost:8000/api/v1/jobs/abc-123-def
# Download report
curl http://localhost:8000/api/v1/jobs/abc-123-def/report > report.json
```
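The status check above can also be scripted. The following is a minimal sketch, not part of the project: `fetch_json` and `wait_for_job` are hypothetical helper names, the `queued` and `running` states are taken from the job state machine in the Architecture section, and any other status is treated as terminal because the full set of states is not documented here.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"  # local API from the Quick Start


def fetch_json(url):
    """GET a URL and decode its JSON body (standard library only)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=fetch_json, poll_seconds=2.0, max_polls=300):
    """Poll the job endpoint until the job is no longer queued or running.

    Assumption: the response is a JSON object with a "status" field, as in
    the ingest response shown above.
    """
    url = f"{BASE_URL}/jobs/{job_id}"
    for _ in range(max_polls):
        job = fetch(url)
        if job.get("status") not in ("queued", "running"):
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


if __name__ == "__main__":
    # Placeholder job id copied from the curl example; replace with a real one.
    print(wait_for_job("abc-123-def"))
```

Passing `fetch` as a parameter keeps the polling loop testable without a running server.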
### 5. View Results
```bash
# Cleaned data (CSV)
head output/cleaned_data.csv
# Cleaning report (JSON)
cat output/cleaning_report.json | jq '.summary'
# Example output (here 0.95 matches cleaned_row_count / original_row_count):
# {
# "original_row_count": 1000000,
# "cleaned_row_count": 950000,
# "rows_dropped": 50000,
# "deduplication_rate": 0.95
# }
```
---
## Architecture
### System Diagram
```
┌─────────────────────────────────────────────────────────────┐
│                        CLIENT LAYER                         │
│  REST API (FastAPI) │ Admin UI │ Python/JS SDKs             │
└────────────┬────────────────────────────────┬───────────────┘
             │                                │
┌────────────┼────────────────────────────────┼───────────────┐
│                     ORCHESTRATION LAYER                     │
│  Job Scheduler (RabbitMQ/Redis)                             │
│  - Job state machine (queued → running → complete)          │
│  - Retries, idempotency, tenant quotas                      │
└────────────┬────────────────────────────────┬───────────────┘
             │                                │
┌────────────┼────────────────────────────────┼───────────────┐
│            COMPUTE WORKERS (Stateless, Scalable)            │
│  Pass 1: Sampling → LSH index → Postgres                    │
│  Pass 2: Dedupe → Impute → Clean → S3 (Parquet)             │
└────────────┬────────────────────────────────┬───────────────┘
             │                                │
┌────────────┼────────┐              ┌────────┼──────────┐
│ Metadata Storage    │              │ Vector Storage    │
│ Postgres            │              │ Milvus            │
│ - Jobs, hashes      │              │ - LSH samples     │
│ - Audit logs        │              │ - Similarity      │
│ - Confidence        │              │   queries         │
│ - Cell provenance   │              │                   │
└─────────────────────┘              └───────────────────┘
```
### Data Flow
```
1. User uploads file (CSV, JSON, Parquet, etc.)
   ↓
2. API validates, stores to S3, creates Job record
   ↓
3. Pass 1 Worker:
   - Streams file in chunks
   - Samples columns (deterministic reservoir)
   - Computes MinHash/LSH signatures
   - Inserts samples to Milvus, stats to Postgres
   ↓
4. Pass 2 Worker:
   - Streams file again
   - Checks row hashes against Postgres (exact dedup)
   - Queries Milvus for near-duplicates (LSH candidates)
   - Applies imputation, normalization, cleaning
   - Streams output to S3 (Parquet)
   - Inserts confidence scores + audit logs to Postgres
   ↓
5. API serves cleaned data + report
```
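The exact-dedup check in step 4 can be illustrated with a small sketch. It rests on assumptions: an in-memory `set` stands in for the Postgres hash table, rows are plain dictionaries, and `row_hash` and `drop_exact_duplicates` are hypothetical names rather than functions from `data_cleaning.py`.

```python
import hashlib
import json
from typing import Dict, Iterable, Iterator, List, Optional, Set


def row_hash(row: Dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_exact_duplicates(
    chunks: Iterable[List[Dict]], seen: Optional[Set[str]] = None
) -> Iterator[List[Dict]]:
    """Stream chunks and keep only the first occurrence of each row.

    `seen` plays the role of the hash store (Postgres in the flow above).
    """
    seen = set() if seen is None else seen
    for chunk in chunks:
        kept = []
        for row in chunk:
            digest = row_hash(row)
            if digest not in seen:
                seen.add(digest)
                kept.append(row)
        yield kept


chunks = [
    [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    [{"name": "a", "id": 1}, {"id": 3, "name": "c"}],  # first row repeats id 1
]
for cleaned_chunk in drop_exact_duplicates(chunks):
    print(cleaned_chunk)
```

Only the hashes are retained between chunks, so memory grows with the number of distinct rows rather than with the size of the rows themselves.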
---
## Project Structure
```
data_sanitizer/
├── data_cleaning.py           # Core algorithm (Colab prototype upgraded)
├── storage_backend.py         # Postgres + Milvus + Redis interface
├── cloud_storage.py           # S3/GCS connectors, Parquet/CSV writers
├── api_server.py              # FastAPI REST server
├── benchmark_generator.py     # Realistic dirty data generation
├── tests.py                   # 50+ unit, integration, property-based tests
├── requirements.txt           # Python dependencies
│
├── docs/
│   ├── ARCHITECTURE.md        # Full system design (2,000+ lines)
│   ├── DEPLOYMENT.md          # Terraform, Docker, K8s, CI/CD
│   ├── 30DAY_ROADMAP.md       # Week-by-week execution plan
│   ├── IMPLEMENTATION_SUMMARY.md
│   └── API.md                 # (TODO) OpenAPI reference
│
├── docker/
│   ├── api/Dockerfile
│   ├── worker-pass1/Dockerfile
│   ├── worker-pass2/Dockerfile
│   └── .dockerignore
│
├── k8s/
│   ├── base/
│   │   ├── api-deployment.yaml
│   │   ├── api-service.yaml
│   │   ├── worker-pass1-deployment.yaml
│   │   ├── configmap.yaml
│   │   └── hpa.yaml
│   └── overlays/
│       ├── dev/
│       ├── staging/
│       └── prod/
│
├── terraform/
│   ├── main.tf
│   ├── postgres.tf
│   ├── milvus.tf
│   ├── s3.tf
│   ├── eks.tf
│   └── variables.tf
│
└── docker-compose.yaml        # Local development stack
```
---
## Documentation
- **[ARCHITECTURE.md](docs/ARCHITECTURE.md)** - Complete system design, data models, API contracts
- **[DEPLOYMENT.md](docs/DEPLOYMENT.md)** - Production infrastructure, Kubernetes, Terraform, CI/CD
- **[30DAY_ROADMAP.md](docs/30DAY_ROADMAP.md)** - Execution plan: Day 1 through Day 30
- **[IMPLEMENTATION_SUMMARY.md](docs/IMPLEMENTATION_SUMMARY.md)** - Overview of deliverables
- **[API.md](docs/API.md)** - (TODO) REST API reference, Swagger/OpenAPI
---
## Testing
### Run All Tests
```bash
# Install test dependencies
pip install -e ".[dev]"
# Run tests with coverage
pytest tests.py -v --cov=. --cov-report=html --cov-report=term
# Expected: >80% coverage
```
### Test Categories
- **Unit Tests**: JSON flattening, MinHash, LSH, Reservoir sampling
- **Integration Tests**: Full pipeline on small CSV/JSONL datasets
- **Property-Based Tests**: Determinism validation with Hypothesis
- **Performance Tests**: Throughput & latency benchmarks
---
## Performance Benchmarks
Baseline metrics on modern hardware (AWS m5.xlarge):
| Dataset | File Size | Pass 1 (sec) | Pass 2 (sec) | Throughput (rows/sec) | Memory (MB) |
|---------|-----------|------------|------------|-----------------|-----------|
| 1M CSV | ~500 MB | 8–15 | 12–20 | 40k–70k | 200–400 |
| 10M CSV | ~5 GB | 80–150 | 120–200 | 40k–70k | 300–500 |
**SLA**: 10M rows/hour sustained throughput (about 2,800 rows/sec, well below the per-pass rates in the table)
To run benchmarks:
```bash
python benchmark_generator.py --size 10m
python data_cleaning.py # Run interactive menu, option 4 (vehicles.csv)
```
---
## Security & Compliance
### Privacy
- ✅ PII detection (email, phone, SSN, credit card regex patterns)
- ✅ Configurable PII strategies: redact, hash, exclude, tokenize
- ✅ Encrypted at rest (S3 SSE-KMS, Postgres TDE)
- ✅ Encrypted in transit (TLS 1.3)
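To make the regex-based detection and the `redact` and `hash` strategies concrete, here is a minimal sketch. The patterns and the `apply_pii_strategy` helper are illustrative assumptions, not the project's actual rules; only email and US SSN patterns are shown, and the `exclude` and `tokenize` strategies are left out.

```python
import hashlib
import re

# Illustrative patterns only; production rules would need to be broader.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def apply_pii_strategy(text, strategy="redact"):
    """Replace every detected PII match according to the chosen strategy."""
    if strategy not in ("redact", "hash"):
        raise ValueError(f"unsupported strategy: {strategy}")

    def make_replacer(kind):
        def replacer(match):
            if strategy == "redact":
                return f"[{kind.upper()}]"
            # "hash": short, deterministic digest of the matched value
            return hashlib.sha256(match.group().encode("utf-8")).hexdigest()[:12]
        return replacer

    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(make_replacer(kind), text)
    return text


sample = "Mail jane@example.com, SSN 123-45-6789"
print(apply_pii_strategy(sample, "redact"))  # Mail [EMAIL], SSN [SSN]
print(apply_pii_strategy(sample, "hash"))
```

An unsalted hash of a low-entropy value such as an SSN can be reversed by brute force, so a keyed hash (for example HMAC with a secret key) would be the safer choice for the `hash` strategy.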
### Audit & Compliance
- ✅ Immutable audit logs (every transformation recorded)
- ✅ Cell-level provenance (original → cleaned value + confidence score)
- ✅ GDPR/CCPA ready (data deletion support)
- ✅ Row-level security (multi-tenant isolation via Postgres RLS)
### Access Control
- ✅ API key authentication (tenant-scoped)
- ✅ Rate limiting (per-tenant quotas)
- ✅ Role-based access (Admin, Engineer, Reviewer)
---
## Production Deployment
### Local Development
```bash
docker-compose up -d
uvicorn api_server:app --reload
```
### Cloud Deployment (AWS)
```bash
# 1. Initialize infrastructure
cd terraform
terraform init
terraform plan -var-file=prod.tfvars
terraform apply -var-file=prod.tfvars
# 2. Build & push Docker images
./scripts/build-and-push.sh
# 3. Deploy via GitOps (ArgoCD)
kubectl apply -f argocd/data-sanitizer-app.yaml
```
### Kubernetes
```bash
# Install Data Sanitizer
kubectl apply -k k8s/overlays/prod
# Check status
kubectl get pods -l app=data-sanitizer-api
kubectl logs deployment/data-sanitizer-api
# Scale workers
kubectl scale deployment data-sanitizer-pass1-worker --replicas=10
```
See [DEPLOYMENT.md](docs/DEPLOYMENT.md) for full instructions.
---
## Support & Roadmap
### MVP (Current Release)
- ✅ Core deduplication & imputation
- ✅ Multi-format ingestion (CSV, JSON, Parquet)
- ✅ Confidence scoring & audit logs
- ✅ REST API
- ✅ Postgres + Milvus backend
### Phase 2 (Month 2–4)
- [ ] Admin UI (React)
- [ ] Human review flow
- [ ] LLM enrichment (OpenAI/Claude)
- [ ] Advanced PII detection
### Phase 3 (Month 5–12)
- [ ] Multi-tenant SaaS
- [ ] Billing & usage tracking
- [ ] On-prem deployment
- [ ] Custom connectors (Salesforce, etc.)
---
## Contributing
We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
### Local Development Setup
```bash
# 1. Fork & clone
git clone https://github.com/your-fork/data-sanitizer.git
cd data-sanitizer
# 2. Create feature branch
git checkout -b feat/your-feature
# 3. Install dev dependencies
pip install -e ".[dev]"
# 4. Run tests (must pass)
pytest tests.py -v --cov=.
# 5. Format code
black .
flake8 .
mypy .
# 6. Submit PR
git push origin feat/your-feature
```
---
## License
MIT License. See [LICENSE](LICENSE) for details.
---
## Key Concepts
### MinHash & LSH
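- **MinHash**: Probabilistic fingerprint of a set (for example, a row's tokens or character shingles); the fraction of matching signature slots estimates the Jaccard similarity of two sets
- **LSH** (Locality-Sensitive Hashing): Bucketing scheme that maps similar items to the same bucket with high probability
- **Purpose**: Efficiently find near-duplicate rows without O(n²) comparisons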
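The two ideas can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the project's implementation: character 3-gram shingles, 64 salted BLAKE2b hashes in place of random permutations, and 16 bands of 4 rows are all arbitrary choices, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Character k-grams of a lowercased string (the set MinHash works on)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(items, num_perm=64):
    """Keep the minimum hash value under each of num_perm salted hash functions."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidates(signatures, bands=16):
    """Rows whose signatures agree on at least one whole band become candidates."""
    rows_per_band = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for key, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[(b, band)].add(key)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


rows = {
    "r1": "john smith, 42 main street, springfield",
    "r2": "john smith, 42 main street, springfeild",  # typo: near-duplicate of r1
    "r3": "acme corp, po box 9000, shelbyville",
}
signatures = {key: minhash(shingles(text)) for key, text in rows.items()}
print(lsh_candidates(signatures))
print(round(estimated_jaccard(signatures["r1"], signatures["r2"]), 2))
```

Only rows that land in a shared bucket are compared, which is what avoids the all-pairs scan. More bands with fewer rows each catch weaker similarities at the cost of more false candidates.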
### Deterministic Reservoir Sampling
- **Goal**: Sample a fixed-size subset of an unbounded stream
- **Method**: Use hash(row_id + salt) as each row's priority; keep the k rows with the smallest priorities
- **Benefit**: Same input + same salt = same sample (reproducible)
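A minimal sketch of this method, assuming SHA-256 as the hash and a bounded heap to hold the current sample (the names `priority` and `deterministic_sample` are hypothetical):

```python
import hashlib
import heapq


def priority(row_id, salt):
    """Deterministic pseudo-random priority derived from the row id and salt."""
    digest = hashlib.sha256(f"{salt}:{row_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def deterministic_sample(row_ids, k, salt="demo-salt"):
    """Keep the k row ids with the smallest priorities, in one streaming pass."""
    heap = []  # max-heap via negated priority; holds the k smallest seen so far
    for row_id in row_ids:
        p = priority(row_id, salt)
        if len(heap) < k:
            heapq.heappush(heap, (-p, row_id))
        elif p < -heap[0][0]:  # beats the largest priority currently kept
            heapq.heapreplace(heap, (-p, row_id))
    return sorted(row_id for _, row_id in heap)


forward = deterministic_sample(range(1000), k=10)
backward = deterministic_sample(reversed(range(1000)), k=10)
print(forward == backward)  # True: the sample does not depend on arrival order
```

Because each row's priority depends only on its id and the salt, the sample is the same across runs, chunk sizes, and worker counts. Changing the salt draws a different sample.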
### Two-Pass Pipeline
- **Pass 1**: Build index (reservoirs, LSH) without modifying data
- **Pass 2**: Clean data using indices from Pass 1
- **Benefit**: Deterministic; Pass 2 can be replayed with different rules
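The split can be sketched with median imputation on a single column. This is an illustration only: the function names, the `impute` and `drop` rule names, and the fixed confidence of 0.5 are placeholder assumptions, and Pass 1 here keeps every observed value in memory where the real pipeline would use a bounded reservoir.

```python
import statistics


def pass1_build_index(chunks, column):
    """Read-only pass: collect statistics without modifying any row."""
    values = []
    for chunk in chunks:
        values.extend(row[column] for row in chunk if row.get(column) is not None)
    return {"median": statistics.median(values), "observed": len(values)}


def pass2_clean(chunks, column, index, rule="impute"):
    """Apply a cleaning rule using the index from Pass 1; return rows and an audit log."""
    if rule not in ("impute", "drop"):
        raise ValueError(f"unknown rule: {rule}")
    cleaned, audit = [], []
    for chunk in chunks:
        for row in chunk:
            if row.get(column) is not None:
                cleaned.append(dict(row))
            elif rule == "impute":
                cleaned.append({**row, column: index["median"]})
                audit.append({
                    "id": row["id"], "column": column, "original": None,
                    "cleaned": index["median"], "confidence": 0.5,  # placeholder score
                })
            else:  # rule == "drop"
                audit.append({"id": row["id"], "column": column, "action": "dropped"})
    return cleaned, audit


chunks = [
    [{"id": 1, "price": 10.0}, {"id": 2, "price": None}],
    [{"id": 3, "price": 30.0}, {"id": 4, "price": 20.0}],
]
index = pass1_build_index(chunks, "price")             # median of 10, 30, 20 is 20.0
imputed, audit = pass2_clean(chunks, "price", index)   # row 2 gets price 20.0
dropped, _ = pass2_clean(chunks, "price", index, rule="drop")  # replay, other rule
print(index, len(imputed), len(dropped))
```

The second `pass2_clean` call reuses the same index, which is the replay property described above.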
---
## Acknowledgments
- Built with [pandas](https://pandas.pydata.org/), [polars](https://www.pola-rs.com/), [pyarrow](https://arrow.apache.org/)
- Storage: [PostgreSQL](https://www.postgresql.org/), [Milvus](https://milvus.io/), [Redis](https://redis.io/)
- API: [FastAPI](https://fastapi.tiangolo.com/), [Pydantic](https://docs.pydantic.dev/)
- Infrastructure: [Terraform](https://www.terraform.io/), [Kubernetes](https://kubernetes.io/)
---
## Get Started
1. **Read**: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) (5 min overview)
2. **Try**: Quick start above (5 min hands-on)
3. **Explore**: [docs/30DAY_ROADMAP.md](docs/30DAY_ROADMAP.md) (plan for next month)
4. **Deploy**: [docs/DEPLOYMENT.md](docs/DEPLOYMENT.md) (production setup)
**Questions?** Open an issue or contact us at `srjnupadhyay@gmail.com`
---
**Happy cleaning! 🧹✨**