{"id":19056848,"url":"https://github.com/provectus/data-quality-gate","last_synced_at":"2025-08-20T04:31:37.101Z","repository":{"id":53761004,"uuid":"498386633","full_name":"provectus/data-quality-gate","owner":"provectus","description":"Data Quality Gate based on AWS ","archived":false,"fork":false,"pushed_at":"2024-07-08T18:23:31.000Z","size":25138,"stargazers_count":57,"open_issues_count":1,"forks_count":4,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-12-11T10:37:16.066Z","etag":null,"topics":["athena","aws","aws-lambda","data-governance","data-quality","great-expectations","redshift","s3","terraform"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/provectus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-31T15:10:56.000Z","updated_at":"2024-11-11T16:31:14.000Z","dependencies_parsed_at":"2024-07-08T22:44:54.180Z","dependency_job_id":"d2eeef7b-b6bc-45dc-9f68-d15aa6f394fe","html_url":"https://github.com/provectus/data-quality-gate","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/provectus%2Fdata-quality-gate","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/provectus%2Fdata-quality-gate/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/provectus%2Fdata-quality-gate/releases","manifests_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/provectus%2Fdata-quality-gate/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/provectus","download_url":"https://codeload.github.com/provectus/data-quality-gate/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230394228,"owners_count":18218707,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["athena","aws","aws-lambda","data-governance","data-quality","great-expectations","redshift","s3","terraform"],"created_at":"2024-11-08T23:52:05.553Z","updated_at":"2024-12-19T07:05:46.994Z","avatar_url":"https://github.com/provectus.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Quality Gate\n\n## Description\nData Quality Gate is a Terraform module that enables data engineers and data QA professionals to effortlessly set up the Provectus DataQA solution within their infrastructure in a single click. It is AWS-based and built on the solid foundation of Great Expectations, YData Profiling (formerly Pandas Profiling), and Allure.\n\n### Data Test\nThe main engine, based on Great Expectations (GX), is used to profile, generate suites, and run tests.\n\n### Allure Report\nResults are mapped from the GX format into the Allure Test Report tool.\n\n### Report Push\nThe existing metadata and metrics are aggregated and pushed down the pipeline.\n\n## Solution Architecture\n![Preview Image](https://raw.githubusercontent.com/provectus/data-quality-gate/main/architecture.PNG)\n\n## Supported Features\n\n1. 
AWS Lambda Runtime: Utilizes Python 3.9.\n2. AWS Step Functions Pipeline: Incorporates the entire DataQA cycle, including profiling, test generation, and reporting.\n3. Notifications and Reporting: Offers support for Slack and Jira notifications and reporting.\n4. AWS SNS: Provides an output message bus, allowing for seamless integration with existing data pipelines.\n5. Web Reports Delivery: Provides report delivery via Nginx for company-specific VPN/IP settings.\n6. AWS DynamoDB and Athena Integration: Enables the construction of AWS QuickSight or Grafana dashboards.\n7. Configuration Management: Provides a flexible method for managing configurations of underlying technologies like Allure and Great Expectations.\n\n## Usage\n\n```hcl\nmodule \"data_qa\" {\n  source = \"github.com/provectus/data-quality-gate\"\n\n  data_test_storage_bucket_name = \"my-data-settings-dev\"\n  s3_source_data_bucket         = \"my-data-bucket\"\n  environment                   = \"example\"\n  project                       = \"my-project\"\n\n  allure_report_image_uri = \"xxxxxxxxxxxx.dkr.ecr.xx-xxxx-x.amazonaws.com/dqg-allure_report:latest\"\n  data_test_image_uri     = \"xxxxxxxxxxxx.dkr.ecr.xx-xxxx-x.amazonaws.com/dqg-data_test:latest\"\n  push_report_image_uri   = \"xxxxxxxxxxxx.dkr.ecr.xx-xxxx-x.amazonaws.com/dqg-push_report:latest\"\n\n  data_reports_notification_settings = {\n    channel     = \"DataReportSlackChannelName\"\n    webhook_url = \"https://hooks.slack.com/services/xxxxxxxxxxxxxxx\"\n  }\n\n  lambda_private_subnet_ids = [\"private_subnet_id\"]\n  lambda_security_group_ids = [\"security_group_id\"]\n\n  reports_vpc_id        = \"some_vpc_id\"\n  reports_subnet_id     = \"subnet_id\"\n  reports_whitelist_ips = [\"0.0.0.0/0\"]\n}\n```\n\n## Examples\n\nThe tool can be used as a standard Terraform module, with deployment examples provided in the `examples` directory.\n\n- [Data-QA-Basic](https://github.com/provectus/data-quality-gate/tree/main/examples/basic) - Creates a DataQA 
module that builds AWS infrastructure.\n\n## Local Development and Testing\n\nSee the [functions](https://github.com/provectus/data-quality-gate/tree/main/functions) directory for further details.\n\n## Pricing\n\nThis solution is completely free because it is open source. However, if you want to integrate it into a live/production environment, there will be associated costs due to its cloud-based nature. These costs can be divided into two parts: the required infrastructure (which you may already have in place, such as VPCs and subnets) and the AWS services necessary for data quality implementation.\n\n*Note: All the information provided below has been calculated using the maximum score strategy.*\n\n#### Pricing for required infrastructure\n\n| AWS Service | Approximate monthly cost | Description |\n| ------------- | ------------- | ------------- |\n| AWS S3 and DynamoDB endpoints | - | There is no extra charge for gateway-type endpoints. You only pay for the usage of S3 and DynamoDB itself. |\n| AWS Interface VPC endpoints (Secrets Manager, monitoring, SNS) | 3 endpoints * (30 days * 24 hours * 0.01 rate) = 21.6 USD | Interface endpoints are charged by the hour at $0.01 per hour. |\n| AWS ECRs (allure, data_test, reports, notifications) | 7 versions * (865 MB + 432 MB + 380 MB) =\u003e 11.3 GB * 0.1 rate per GB-month = 1.13 USD | allure image size = 865 MB, data_test image size = 432 MB, reports image size = 380 MB, notifications image size = 160 MB. For the purpose of our calculations, let's assume we are storing 7 versions of each image. |\n| AWS QuickSight | approx. $7.3 rate per user * 5 = 36.4 USD | Let's assume you have a team consisting of 5 individuals who are interested in the QuickSight data quality dashboard. They frequently check for changes, typically 2-3 times per day. |\n\n\u003cu\u003eMonthly total: 59.13 US$\u003c/u\u003e\n___\n\n#### Pricing for data quality specific infrastructure\nFor most of the services used by Data Quality, AWS offers a free tier. 
Additionally, the costs for these services are typically just a fraction of a cent. To provide further clarity, below you can find a basic cost formula and a few usage examples with cost estimations.\n\nWe are going to count:\n- number of AWS Lambda runs\n- number of AWS StepFunction transitions\n- web reports AWS EC2 instance running(720 hrs per month)\n\n| Description  | Formula |\n| ------------- | -------------  |\n| number of AWS Lambda runs for each | (number of data sources * number of changes * work_days_month) * lambda specific rate(depends on lambda duration and memory used) |\n| number of AWS StepFunction transitions  | number of lambda runs * 2 |\n\n##### Small\n\nLet's say we have 1000 data sources and half of them changed every day. Number of runs formula for any lambda is **(1000 data sources  * 0.5 changed * 30 days)**\n\n| AWS Service       | Number of runs | Price  |\n| ------------ | -------------- | ------ |\n| AWS Lambda AllureReport | 15000          | $8.33  |\n| AWS Lambda DataTest     | 15000          | $67.28 |\n| AWS Lambda Reports      | 15000          | $2.08  |\n| AWS StepFunctions      | 15000          | $0.65 |\n| AWS EC2 Reports S3 Gateway      | 720 hrs          | $7.25 |\n\n\u003cu\u003eMonthly total: 85.59 US$\u003c/u\u003e\n\n___\n\n##### Medium\n\nLet's say we have 10000 data sources and 70% of them changed every day.\nNumber of runs formula for any lambda is **(10000 data sources  * 0.7 changes * 30 days)**\n\n| AWS Service       | Number of runs | Price  |\n| ------------ | -------------- | ------ |\n| AWS Lambda AllureReport | 210k          | $203.33  |\n| AWS Lambda DataTest     | 210k          | $1028.57 |\n| AWS Lambda Reports      | 210k          | $115.83  |\n| AWS StepFunctions      | 210k          | $10.40 |\n| AWS EC2 Reports S3 Gateway      | 720 hrs          | $7.25 |\n\n\u003cu\u003eMonthly total: 1 365.38 US$\u003c/u\u003e\n\n___\n\n##### Large\n\nLet's say we have 30000 data sources and all of them changed every 
day.\nNumber of runs formula for any lambda is **(30000 data sources  * 1 changes * 30 days)**\n\n| AWS Service       | Number of runs | Price  |\n| ------------ | -------------- | ------ |\n| AWS Lambda AllureReport | 900k          | $893.34  |\n| AWS Lambda DataTest     | 900k         | $4430.06 |\n| AWS Lambda Reports      | 900k          | $518.33  |\n| AWS StepFunctions      | 900k          | $44.90 |\n| AWS EC2 Reports S3 Gateway      | 720 hrs          | $7.25 |\n\n\u003cu\u003eMonthly total: 5 893.88 US$\u003c/u\u003e\n___\n\n**Price per changed data source: 0.006 US$**\n\n\n## License\n\nApache 2 Licensed. See [LICENSE](https://github.com/provectus/data-quality-gate/tree/main/LICENSE) for full details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprovectus%2Fdata-quality-gate","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprovectus%2Fdata-quality-gate","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprovectus%2Fdata-quality-gate/lists"}