{"id":50454132,"url":"https://github.com/mizcausevic-dev/data-quality-guardrail","last_synced_at":"2026-06-01T01:05:40.944Z","repository":{"id":357461482,"uuid":"1234967354","full_name":"mizcausevic-dev/data-quality-guardrail","owner":"mizcausevic-dev","description":"Python data validation backend for schema drift, freshness lag, null spikes, duplicate collisions, and range-based guardrails.","archived":false,"fork":false,"pushed_at":"2026-05-12T21:39:20.000Z","size":527,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-12T23:12:09.086Z","etag":null,"topics":["analytics-engineering","data-governance","data-quality","fastapi","pydantic","python"],"latest_commit_sha":null,"homepage":"https://kineticgain.com/","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mizcausevic-dev.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-10T21:44:00.000Z","updated_at":"2026-05-12T21:39:24.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mizcausevic-dev/data-quality-guardrail","commit_stats":null,"previous_names":["mizcausevic-dev/data-quality-guardrail"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/mizcausevic-dev/data-quality-guardrail","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fdata-quality-guardrail","tags
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fdata-quality-guardrail/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fdata-quality-guardrail/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fdata-quality-guardrail/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mizcausevic-dev","download_url":"https://codeload.github.com/mizcausevic-dev/data-quality-guardrail/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mizcausevic-dev%2Fdata-quality-guardrail/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33755379,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics-engineering","data-governance","data-quality","fastapi","pydantic","python"],"created_at":"2026-06-01T01:05:40.867Z","updated_at":"2026-06-01T01:05:40.935Z","avatar_url":"https://github.com/mizcausevic-dev.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Quality Guardrail\n\n\u003e **Python data-quality portfolio project** for schema drift, freshness, null, duplicate, and range validation across operator-owned 
datasets.\n\n**Portfolio takeaway:** *\"Data quality becomes operationally useful when failures are ranked, explained, and routed before they contaminate downstream decisions.\"*\n\n---\n\n## Project Overview\n\n| Attribute | Detail |\n|---|---|\n| **Language** | Python |\n| **Runtime Shape** | FastAPI + CLI |\n| **Domain** | Dataset health and validation workflows |\n| **Check Families** | schema drift · freshness lag · null spikes · duplicate collisions · range violations |\n| **Output Modes** | JSON API · terminal summary |\n| **Primary Users** | analytics engineering · revenue operations · data platform |\n\n---\n\n## Executive Summary\n\nData Quality Guardrail models the sort of service teams use when operational reporting is only as trustworthy as the pipelines feeding it. Instead of treating dataset checks as a passive notebook exercise, the project ingests a structured dataset contract, runs high-signal validations against fresh records, scores the severity of what it finds, and returns evidence-backed issues with next actions.\n\nThe repo is intentionally built as a Python service layer rather than a frontend artifact. 
It shows how dataset reliability can be treated as an operating system concern: validate the shape, inspect the drift, score the damage, and route action before bad data pollutes forecasting, attribution, customer intelligence, or executive briefings.\n\n---\n\n## Validation Flow\n\n```text\ndataset contract + records\n        |\n        v\ntyped validation request\n        |\n        +--\u003e schema drift checks\n        +--\u003e freshness lag checks\n        +--\u003e null spike checks\n        +--\u003e duplicate collision checks\n        +--\u003e range violation checks\n        |\n        v\nseverity-scored quality report\n```\n\n---\n\n## Validation Families\n\n### Schema Drift\n\n- unexpected columns\n- missing required columns\n- type expectations that no longer match the feed\n\n### Freshness Lag\n\n- stale loads that undermine current-state reporting\n- delayed ingestion windows on operational datasets\n\n### Null Spike\n\n- missing critical identifiers or metrics\n- sudden completeness regression\n\n### Duplicate Collision\n\n- repeated primary keys or event identifiers\n- inflated counts on downstream models\n\n### Range Violation\n\n- values outside accepted floors or ceilings\n- unrealistic conversions, revenue, or health signals\n\n---\n\n## Usage\n\n### Create a Virtual Environment\n\n```bash\npython -m venv .venv\n.venv\\Scripts\\activate\npip install -e .[dev]\n```\n\n### Run the API\n\n```bash\nuvicorn app.main:app --reload\n```\n\n### Open the Docs\n\n```text\nhttp://127.0.0.1:8000/docs\n```\n\n### Run the CLI Summary\n\n```bash\ndata-quality-guardrail\n```\n\n### Run the Tests\n\n```bash\npytest\n```\n\n---\n\n## Sample Output\n\n```text\nData Quality Guardrail\n======================\nDataset: revops_pipeline_snapshot\nRows analyzed: 12\nOverall score: 89\n\n[CRITICAL] freshness_lag (score 89)\nSummary: Dataset freshness is materially outside the allowed reporting window.\n```\n\n---\n\n## Screenshots\n\n### Hero 
Capture\n\n![Hero](screenshots/01-hero.png)\n\n### API Summary\n\n![API summary](screenshots/02-api-summary.png)\n\n### Validation Breakdown\n\n![Validation breakdown](screenshots/03-breakdown.png)\n\n### Proof Layer\n\n![Proof layer](screenshots/04-proof.png)\n\n---\n\n## Industry Applications\n\n### Revenue Operations\n\n- stop stale pipeline snapshots from distorting forecast and coverage calls\n- catch duplicate opportunity rows before they inflate board-facing numbers\n\n### Growth Analytics\n\n- surface conversion-rate anomalies before attribution models drift\n- flag null campaign fields before channel reporting is trusted too far\n\n### Customer Intelligence\n\n- prevent broken health-score feeds from contaminating churn or lifecycle views\n- validate freshness on intervention datasets before operators act on them\n\n---\n\n## What This Demonstrates\n\n- Python added meaningfully through a real validation service, not a token script\n- Pydantic models and FastAPI used for operational data checks\n- data quality modeled as a severity-ranked response problem\n- CLI and API outputs shaped for real operator use\n- evidence-backed reporting instead of vague “data looks off” summaries\n\n---\n\n## Future Enhancements\n\n- add historical comparison windows for trend-aware alerting\n- support CSV upload and object-store ingestion paths\n- export markdown incident summaries for data-quality review\n- add rule packs for SaaS revenue, lifecycle, and experimentation datasets\n- emit webhook-ready escalation payloads for orchestration systems\n\n---\n\n## Tech 
Stack\n\n[![Python](https://img.shields.io/badge/Python-3.14-1c2633?style=for-the-badge\u0026logo=python\u0026logoColor=F7E3A1\u0026labelColor=1c2633)](https://www.python.org/)\n[![FastAPI](https://img.shields.io/badge/API-FastAPI-13352f?style=for-the-badge\u0026logo=fastapi\u0026logoColor=9df8df\u0026labelColor=13352f)](https://fastapi.tiangolo.com/)\n[![Pydantic](https://img.shields.io/badge/Models-Pydantic-24384a?style=for-the-badge\u0026logo=pydantic\u0026logoColor=95d8ff\u0026labelColor=24384a)](https://docs.pydantic.dev/)\n[![Testing](https://img.shields.io/badge/Testing-pytest-30211a?style=for-the-badge\u0026logo=pytest\u0026logoColor=ffd7b3\u0026labelColor=30211a)](https://docs.pytest.org/)\n\n### Portfolio Links\n\n- [LinkedIn](https://www.linkedin.com/in/mirzacausevic)\n- [Kinetic Gain](https://kineticgain.com/)\n- [Skills Page](https://mizcausevic.com/skills/)\n- [GitHub](https://github.com/mizcausevic-dev)\n\n---\n\n*Part of [mizcausevic-dev's GitHub portfolio](https://github.com/mizcausevic-dev), with a focus on backend systems, growth operations, data reliability, and operational decision tooling.*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmizcausevic-dev%2Fdata-quality-guardrail","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmizcausevic-dev%2Fdata-quality-guardrail","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmizcausevic-dev%2Fdata-quality-guardrail/lists"}