{"id":19278483,"url":"https://github.com/manesioz/bq_dq_plugin","last_synced_at":"2025-09-03T13:33:45.152Z","repository":{"id":53548114,"uuid":"226142920","full_name":"manesioz/bq_dq_plugin","owner":"manesioz","description":"Airflow plug-in that allows you to automate robust Data Quality checks for BigQuery","archived":false,"fork":false,"pushed_at":"2019-12-12T05:14:53.000Z","size":14,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-23T21:42:30.824Z","etag":null,"topics":["airflow","airflow-plugin","data-quality","data-quality-checks","google-bigquery"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/manesioz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-12-05T16:22:32.000Z","updated_at":"2024-03-12T12:45:25.000Z","dependencies_parsed_at":"2022-08-29T14:40:56.961Z","dependency_job_id":null,"html_url":"https://github.com/manesioz/bq_dq_plugin","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/manesioz/bq_dq_plugin","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manesioz%2Fbq_dq_plugin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manesioz%2Fbq_dq_plugin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manesioz%2Fbq_dq_plugin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manesioz%2Fbq_dq_plugin/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/manesioz","download_u
rl":"https://codeload.github.com/manesioz/bq_dq_plugin/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/manesioz%2Fbq_dq_plugin/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262705290,"owners_count":23351228,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","airflow-plugin","data-quality","data-quality-checks","google-bigquery"],"created_at":"2024-11-09T21:09:52.227Z","updated_at":"2025-06-30T03:36:46.014Z","avatar_url":"https://github.com/manesioz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# bq_dq_plugin\nAirflow plug-in that allows you to compare various metrics based on aggregated historical metrics and arbitrary thresholds.  \n\n### Why? \n\nPreviously, Airflow's `BigQueryIntervalCheckOperator` allowed you to compare the current day's metrics with the metrics \nof another day. This extends that functionality and allows you to compare it to aggregated metrics over an arbitrary window.\nThis is important for more robust data quality checks that are more resilient to noise and daily fluctuation. \n\nYou could run the following query: \n\n```sql\nselect count(*) as NumRecords, max(column1) as MaxColumn_1 \n  from table\n where date_column = '{{ds}}'\n```\n\nwhich computes the number of rows and the max of column1 for today. 
\nNow, compare it with the same metrics computed 7 days earlier:\n\n```sql\nselect count(*), max(column1) \n  from {{table}}\n where date_column = timestamp_sub('{{ds}}', interval 7 day) \n```\n\nHowever, if you want more robust data quality checks you likely want to compare today's metric\nwith an aggregated value (since that is less susceptible to daily fluctuations/noise). \n\nWith this library, you can compare: \n\n```sql\nselect count(*) as NumRecords, max(column1) as MaxColumn_1 \n  from table\n where timestamp_trunc(date_column, Day) = timestamp_trunc('2019-01-01', Day) \n```\n\nwhich computes the number of records and the max of column1 for records that have `date_column = '2019-01-01'`. \nNow, let's compare the aggregated values: \n\n```sql\nwith data as (\n    select count(*) as NumRecords, max(column1) as MaxColumn_1, timestamp_trunc(date_column, Day) as Time\n      from table \n     where date_column between '2018-01-01' and '2018-12-31' \n     group by Time\n)\n\nselect avg(NumRecords), avg(MaxColumn_1)\n  from data \n```\n\nLike the `BigQueryIntervalCheckOperator`, you can pass a `dict` which contains all metrics and their associated thresholds as key/value pairs. \n\nIn addition to this functionality, there will be some out-of-the-box checks including: \n\n### Numerical ###\n- `num_records` \n- `percent_null`\n- `mean`\n- `std_dev`\n- `num_zero`\n- `min`\n- `median`\n- `max`\n\n### Categorical ### \n- `num_records`\n- `percent_null`\n- `num_unique` \n- `top`\n- `top_freq`\n- `avg_str_len`\n\nThis plugin will allow you to create more complex custom checks as well. 
\n\n### Example usage (in a DAG) \n\n```python\nfrom airflow import DAG\nfrom datetime import datetime, timedelta\nfrom bq_dq_plugin.operators.big_query_aggregate_check_operator import BigQueryAggregateCheckOperator\n\ndefault_args = {\n    'owner': 'Zachary Manesiotis', \n    'depends_on_past': False, \n    'start_date': datetime(2019, 12, 10), \n    'email': ['zacl.manesiotis@gmail.com'], \n    'email_on_failure': True, \n    'email_on_retry': False, \n    'retries': 3, \n    'retry_delay': timedelta(minutes=1)\n} \n\nwith DAG('Dag_ID', schedule_interval='@weekly', max_active_runs=15, catchup=False, default_args=default_args) as dag: \n    data_quality_check = BigQueryAggregateCheckOperator(\n        task_id='data_quality_check', \n        table='`project.dataset.table`',\n        metrics_thresholds={'count(*)': 1.5, 'max(Column1)': 1.6}, \n        date_filter_column='DateTime', \n        agg_time_period='Day', \n        start_date='2019-01-01', \n        end_date='2019-12-01', \n        gcp_conn_id=CONNECTION_ID, \n        use_legacy_sql=False, \n        dag=dag\n    )\ndata_quality_check\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanesioz%2Fbq_dq_plugin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmanesioz%2Fbq_dq_plugin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmanesioz%2Fbq_dq_plugin/lists"}