{"id":28737251,"url":"https://github.com/ygor-j/sql-guard","last_synced_at":"2025-10-26T15:14:16.727Z","repository":{"id":287948735,"uuid":"936851887","full_name":"Ygor-J/sql-guard","owner":"Ygor-J","description":"A small package for data quality rules using Standard SQL","archived":false,"fork":false,"pushed_at":"2025-06-07T00:46:14.000Z","size":78,"stargazers_count":5,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-12T18:29:30.583Z","etag":null,"topics":["assertions","bigquery","data","data-cleaning","data-quality","data-quality-checks","duckdb","ducklake","gcp","in-memory","pure-sql","sql","testing-tools"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ygor-J.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-21T19:56:33.000Z","updated_at":"2025-06-07T00:46:17.000Z","dependencies_parsed_at":"2025-06-05T03:21:40.788Z","dependency_job_id":"69e34f0a-ea17-4d41-ac4e-7218ec6b2fde","html_url":"https://github.com/Ygor-J/sql-guard","commit_stats":null,"previous_names":["ygor-j/sql-guard"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Ygor-J/sql-guard","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ygor-J%2Fsql-guard","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ygor-J%2Fsql-guard/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ygor-J%2Fsql-guard/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ygor-J%2Fsql-guard/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ygor-J","download_url":"https://codeload.github.com/Ygor-J/sql-guard/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ygor-J%2Fsql-guard/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260083860,"owners_count":22956409,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["assertions","bigquery","data","data-cleaning","data-quality","data-quality-checks","duckdb","ducklake","gcp","in-memory","pure-sql","sql","testing-tools"],"created_at":"2025-06-16T02:09:26.542Z","updated_at":"2025-10-26T15:14:16.665Z","avatar_url":"https://github.com/Ygor-J.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n[![PyPI pyversions](https://pypi.org/static/images/logo-small.8998e9d1.svg)](https://pypi.python.org/pypi/sql-guard/)\n\n# SQL Guard - Guarding data rules through SQL\n\nA simple package for validating data using SQL. It currently supports only GoogleSQL (BigQuery) and DuckDB (PostgreSQL compatible).\n\n## Problem to be Solved\n\nI was using [Pandera](https://github.com/unionai-oss/pandera) for validating data in many tables located in BigQuery.  \n\nThe problem was that in order to validate data using Pandera, I first had to load those tables into in-memory Pandas, Polars or Dask DataFrames. 
\n\nBut even tools like Polars or Dask face problems handling large volumes of data, for example, above 100GB (my case).\n\nSo instead of bringing that data to local memory, I thought it was better to validate data inside BigQuery using SQL manipulations, but in an easy way so that I can create data quality rules using Python.\n\n## Conversion: SQL Guard x Pandera\n\nSo imagine I have a table like the one below.\n\nThe idea is that each row holds the grade of a student in a course.  \n```\nname: Name of the student  \nage: Age of the student  \nmajor: Major of the student (for example, Computer Science, Electrical Engineering etc.)  \nsemester: Semester of university, for example 1S/2024 indicating the first semester of 2024  \ncourse: Course, for example, Algorithms, Data Structures, Calculus I, Calculus II etc.  \ngrade: Grade for the course, for example: 0, 5, 10  \nfailed: Boolean to indicate if the student failed the course based on grade (\u003e=5)  \n```\n\n| Name            | Age | Major            | Semester | Course      | Grade | Failed |\n|-----------------|-----|------------------|----------|-------------|-------|--------|\n| Daniel Carter   | 19  | Computer Science | 1S/2024  | Algorithms  | 8.5   | False  |\n| Theo Hill       | 19  | Computer Science | 1S/2024  | Algorithms  | 9.0   | False  |\n| Jessica Hall    | 19  | Computer Science | 1S/2024  | Algorithms  | 8.0   | False  |\n| Liam Carter     | 19  | Computer Science | 1S/2024  | Algorithms  | 8.5   | False  |\n| Zackary Hill    | 19  | Computer Science | 1S/2024  | Algorithms  | 7.0   | False  |\n\n\n### I can either define data quality rules from scratch or use a pandera DataFrameSchema object:\n\nFrom Scratch:\n```\nfrom sqlguard.validator.CheckBase import ValidationCheck\n\n\ndata_rules = {\n\n    'name': [ValidationCheck(check_name='is_string',\n                            params=None,\n                            error_msg=None,\n                            ignore_nulls=False),\n            
ValidationCheck(check_name='regex_contains',\n                            params={'value': '^[A-Z].*'},\n                            error_msg=None,\n                            ignore_nulls=False)],\n    'age': [ValidationCheck(check_name='is_integer',\n                            params=None,\n                            error_msg=None,\n                            ignore_nulls=False),\n            ValidationCheck(check_name='between',\n                         params={'min': 15, 'max': 150},\n                         error_msg=None,\n                         ignore_nulls=False)]\n    }\n```\nFrom Pandera:\n```\nimport pandera as pa\n\npandera_schema = pa.DataFrameSchema({\n\n    \"name\": pa.Column(str, checks=pa.Check.str_matches(r\"^[A-Z].*\")), # Starting with capital letter\n    \"age\": pa.Column(int, checks=pa.Check.in_range(min_value=15, max_value=150)) # Age must be between 15 and 150\n})\n```\n\n### I can convert DataFrameSchema to a compatible dictionary of data rules\n```\nfrom sqlguard.translators import SchemaParsers\n\npanderaParser = SchemaParsers.SchemaParser.get_parser(\"pandera\")\ndata_rules = panderaParser.parse(pandera_schema)\n```\n\n## Validating: SQL Guard x Pandera\n\nOnce we have our `data_rules` dictionary, we can create a SQLValidator object that generates a SQL query with the rules applied.\n\n```\nfrom sqlguard.validator.SQLValidator import SQLValidator\n\nsql_schema = SQLValidator(data_rules)\nvalidation_query = sql_schema.generate_sql_report(from_source=TABLE_PATH)\n\nprint(validation_query)\n```\n\nOnce we have the query, we can run it with the BigQuery client for Python.\n\n```\nfrom google.cloud import bigquery\n\nclient = bigquery.Client()  # Assumes default credentials and project\nquery_job = client.query(validation_query)  # API request\nquery_result = query_job.result()  # Waits for query to finish\n\ndf = query_result.to_dataframe()\n\nprint(\"--------RUN_SQL_GUARD--------\")\nprint(df.to_string())\nprint()\n```\n\nIf the result is too large, you can pass `n_wrong_count=True` to group 
wrong values.\n```\nvalidation_query = sql_schema.generate_sql_report(from_source=TABLE_PATH, n_wrong_count=True)\n```\n\n## Comparison of Pandera Lazy Validation and SQL Guard Generated Report\n**SQL GUARD**\n\n| column_name | check_name | params                                             | error_msg | ignore_nulls | wrong_value            |\n|-------------|------------|---------------------------------------------------|-----------|--------------|------------------------|\n| course      | is_in      | {'value': ['Algorithms', 'Data Structures', 'Calculus I']} | \u003cNA\u003e      | False        | Calculus II            |\n| course      | is_in      | {'value': ['Algorithms', 'Data Structures', 'Calculus I']} | \u003cNA\u003e      | False        | Circuit Analysis       |\n| major       | is_in      | {'value': ['Computer Science']}                     | \u003cNA\u003e      | False        | Electrical Engineering |\n| age         | between    | {'min': 15, 'max': 21}                             | \u003cNA\u003e      | False        | 22                     |\n\n**PANDERA**\n```\n{\n  \"DATA\": {\n    \"DATAFRAME_CHECK\": [\n      {\n        \"schema\": null,\n        \"column\": \"age\",\n        \"check\": \"in_range(15, 21)\",\n        \"error\": \"Column 'age' failed element-wise validator number 0: in_range(15, 21) failure cases: 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22\"\n      },\n      {\n        \"schema\": null,\n        \"column\": \"major\",\n        \"check\": \"isin(['Computer Science'])\",\n        \"error\": \"Column 'major' failed element-wise validator number 0: isin(['Computer Science']) failure cases: Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, 
Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering, Electrical Engineering\"\n      },\n      {\n        \"schema\": null,\n        \"column\": \"course\",\n        \"check\": \"isin(['Algorithms', 'Data Structures', 'Calculus I'])\",\n        \"error\": \"Column 'course' failed element-wise validator number 0: isin(['Algorithms', 'Data Structures', 'Calculus I']) failure cases: Calculus II, Calculus II, Calculus II, Calculus II, Calculus II, Calculus II, Calculus II, Calculus II, Calculus II, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Calculus II, Calculus II, Calculus II, Calculus II, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis, Circuit Analysis\"\n      }\n    ]\n  }\n}\n```\n\n## Details\n\nFor detailed usage of package, visit `docs/` folder and take a look at two notebooks in 
order:\n- demo.ipynb\n- demo_duckdb.ipynb\n\n## Install\n\nJust create your virtual environment and run:  \n`pip install sql-guard`\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fygor-j%2Fsql-guard","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fygor-j%2Fsql-guard","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fygor-j%2Fsql-guard/lists"}