{"id":50454119,"url":"https://github.com/mizcausevic-dev/csv-data-quality-rs","last_synced_at":"2026-06-01T01:05:39.971Z","repository":{"id":358081308,"uuid":"1239348979","full_name":"mizcausevic-dev/csv-data-quality-rs","owner":"mizcausevic-dev","description":"Streaming CSV validator against a data-contract-registry contract. Async, line-by-line, structured violation report. The fourth cross-ecosystem hook in the Kinetic Gain portfolio.","archived":false,"fork":false,"pushed_at":"2026-05-15T15:53:50.000Z","size":14,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-15T18:20:23.122Z","etag":null,"topics":["csv","data-contract","data-engineering","data-governance","data-quality","kinetic-gain","rust","streaming"],"latest_commit_sha":null,"homepage":"https://kineticgain.com/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mizcausevic-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-15T02:23:17.000Z","updated_at":"2026-05-15T15:53:55.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mizcausevic-dev/csv-data-quality-rs","commit_stats":null,"previous_names":["mizcausevic-dev/csv-data-quality-rs"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/mizcausevic-dev/csv-data-quality-rs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fcsv-data-quality-rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fcsv-data-quality-rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fcsv-data-quality-rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fcsv-data-quality-rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mizcausevic-dev","download_url":"https://codeload.github.com/mizcausevic-dev/csv-data-quality-rs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fcsv-data-quality-rs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33755379,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","data-contract","data-engineering","data-governance","data-quality","kinetic-gain","rust","streaming"],"created_at":"2026-06-01T01:05:39.896Z","updated_at":"2026-06-01T01:05:39.958Z","avatar_url":"https://github.com/mizcausevic-dev.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# csv-data-quality\n\n[![CI](https://github.com/mizcausevic-dev/csv-data-quality-rs/actions/workflows/ci.yml/badge.svg)](https://github.com/mizcausevic-dev/csv-data-quality-rs/actions/workflows/ci.yml)\n[![Rust](https://img.shields.io/badge/rust-1.86%2B-orange)](https://www.rust-lang.org/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n\n**Streaming CSV validator against a [`data-contract-registry`](https://github.com/mizcausevic-dev/data-contract-registry) contract.** Reads a CSV row by row, checks each cell against the contract's field-type / required / enum rules, and emits a structured violation report.\n\nThe **fourth cross-ecosystem hook** in the Kinetic Gain portfolio.\n\n```rust\nuse csv_data_quality::{Validator, Contract};\n\n# async fn demo() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\nlet contract = Contract::from_json(r#\"{\n  \"dataset_id\": \"users.daily_active\",\n  \"version\": \"1.0.0\",\n  \"fields\": [\n    {\"name\": \"user_id\",     \"type\": \"string\"},\n    {\"name\": \"active_date\", \"type\": \"timestamp\"},\n    {\"name\": \"plan\",        \"type\": \"string\", \"enum\": [\"free\", \"pro\"]},\n    {\"name\": \"ltv\",         \"type\": \"number\", \"required\": false}\n  ]\n}\"#)?;\n\nlet validator = Validator::new(contract);\nlet report = validator.validate_file(\"daily_active_2026_05_15.csv\").await?;\nprintln!(\"{} violation(s)\", report.violation_count);\n# Ok(()) }\n```\n\n---\n\n## Why\n\nThe registry says \"the dataset must look like this.\" Producers have to be able to **prove** their output matches. CI runs that proof on every push: load the contract from the registry, validate the produced CSV, fail the build if violations show up. No drift, no surprise consumer outages.\n\n---\n\n## Violation kinds\n\n| Kind | Triggers when |\n| --- | --- |\n| `Required` | A required cell is empty. |\n| `BadType` | The cell doesn't parse as the declared `integer` / `number` / `boolean` / `timestamp`. |\n| `EnumMismatch` | The cell value isn't one of the contract's enum entries. |\n| `ColumnCountMismatch` | A row has a different number of columns than the header. |\n| `InvalidJson` | A `json`-typed cell isn't valid JSON. |\n\nSix primitive field types match the registry vocabulary: `string` · `integer` · `number` · `boolean` · `timestamp` · `json`.\n\n---\n\n## Report shape\n\n```json\n{\n  \"dataset_id\": \"users.daily_active\",\n  \"contract_version\": \"1.0.0\",\n  \"rows_scanned\": 12345,\n  \"violation_count\": 3,\n  \"valid\": false,\n  \"samples\": [\n    { \"row\": 7, \"column\": \"plan\", \"kind\": \"enum_mismatch\", \"message\": \"column \\\"plan\\\" value \\\"startup\\\" is not in declared enum\" },\n    { \"row\": 9, \"column\": \"ltv\", \"kind\": \"bad_type\", \"message\": \"column \\\"ltv\\\" value \\\"not-a-number\\\" is not a valid number\" },\n    { \"row\": 12, \"column\": \"user_id\", \"kind\": \"required\", \"message\": \"required column \\\"user_id\\\" is empty\" }\n  ]\n}\n```\n\n`samples` is capped (default 100, configurable via `.max_samples(0)` for unlimited). `violation_count` is the true total, even when samples are truncated.\n\n---\n\n## Streaming\n\nThe validator is row-by-row. Memory cost is proportional to `max_samples`, not the file size. A 10GB CSV with `max_samples(100)` peaks at ~100 violation records plus one row's worth of cells.\n\n---\n\n## Composes with\n\n- **[data-contract-registry](https://github.com/mizcausevic-dev/data-contract-registry)** — fetch contract by `dataset_id`; this crate validates against it. The fourth cross-ecosystem hook.\n- **[audit-stream-py](https://github.com/mizcausevic-dev/audit-stream-py)** — emit a `contract_compatibility_failed` event when validation lights up.\n- **[reliability-toolkit-rs](https://github.com/mizcausevic-dev/reliability-toolkit-rs)** — wrap the registry-fetch call in a circuit breaker.\n\n---\n\n## Example\n\n```bash\ncargo run --example validate\n```\n\nValidates a tiny in-memory CSV against an in-memory contract and prints the report. Useful for kicking the tyres without setting up a registry.\n\n---\n\n## Bench\n\n```bash\ncargo bench\n```\n\nBundled bench validates 10k clean rows so you can spot regressions in the streaming path.\n\n---\n\n## Tests\n\n```bash\ncargo test --all-targets\ncargo test --doc\ncargo clippy --all-targets -- -Dwarnings\ncargo fmt --all -- --check\n```\n\nCI matrix: `stable`, `beta`, `1.86.0` (MSRV). Fifteen tests cover the happy path, every violation kind, header mismatch, optional cells, JSON payload validation, sample cap, and the async file path.\n\n---\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmizcausevic-dev%2Fcsv-data-quality-rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmizcausevic-dev%2Fcsv-data-quality-rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmizcausevic-dev%2Fcsv-data-quality-rs/lists"}