https://github.com/kwanUm/awesome-data-quality

Curated list of tools and frameworks assisting in monitoring data quality

# awesome-data-quality

A curated list of awesome tools for testing and monitoring data quality - typically at the data warehouse/lake or within running data pipelines.

_If you want to contribute to this list (please do), send me a pull request or [contact me](https://mobile.twitter.com/orikabeli)._

## Table of Contents
TBD
### Frameworks and Libraries

#### Open source
* [elementary](https://github.com/elementary-data/elementary) - Data monitoring and observability tailored to dbt.
* [mobydq](https://github.com/ubisoft/mobydq) - tool for data engineering teams to run & automate data quality checks on their data pipeline.
* [ydata-quality](https://github.com/ydataai/ydata-quality) - Python library for assessing data quality throughout the stages of data pipeline development.
* [great-expectations](https://github.com/great-expectations/great_expectations) - tool for data testing, documentation, and profiling.
* [deequ](https://github.com/awslabs/python-deequ) - library by Amazon for defining unit tests for data, with a focus on large datasets. Based on Apache Spark.
* [soda](https://github.com/sodadata/soda-core) - enables data testing through extended SQL queries.
* [dqm](https://github.com/piotr-kalanski/data-quality-monitoring) - another data quality monitoring tool implemented using Spark.
* [owl-sanitizer](https://github.com/ronald-smith-angel/owl-data-sanitizer) - yet another Spark-based lightweight data validation framework.
* [griffin](https://github.com/apache/griffin) - Data Quality solution for distributed data systems at any scale in both streaming and batch data contexts.
* [drunken-data-quality](https://github.com/FRosner/drunken-data-quality)
* [DataQuality for BigData](https://github.com/agile-lab-dev/DataQuality)
* [TopNotch](https://github.com/blackrock/TopNotch)
* [Phasor Data Quality Tracker](https://github.com/GridProtectionAlliance/pdqtracker)
* [DataCleaner](https://github.com/datacleaner/DataCleaner)
* [data-quality](https://github.com/Talend/data-quality)
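
Several of the libraries above describe their checks as "unit tests for data". As a rough, library-agnostic sketch of that idea (the `orders` rows, column names, and helper functions below are hypothetical and do not reflect the API of any tool listed here), a check is a small predicate evaluated over a dataset:

```python
from typing import Any

Row = dict[str, Any]


def check_not_null(rows: list[Row], column: str) -> bool:
    """Completeness: no row has a missing value in `column`."""
    return all(row.get(column) is not None for row in rows)


def check_unique(rows: list[Row], column: str) -> bool:
    """Uniqueness: no value of `column` appears more than once."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))


def check_in_range(rows: list[Row], column: str, low: float, high: float) -> bool:
    """Validity: every non-null value of `column` lies in [low, high]."""
    return all(
        low <= row[column] <= high
        for row in rows
        if row.get(column) is not None
    )


# Hypothetical sample data: one duplicated id and one negative amount.
orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": 990.5},
    {"order_id": 2, "amount": -3.0},
]

results = {
    "order_id is never null": check_not_null(orders, "order_id"),
    "order_id is unique": check_unique(orders, "order_id"),
    "amount is within [0, 10000]": check_in_range(orders, "amount", 0, 10_000),
}

for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

The tools in this list add what this sketch leaves out, such as running checks at scale, scheduling, profiling, and reporting.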

##### Geared for ML
* [deepchecks](https://github.com/deepchecks/deepchecks) - tool for validating your machine learning models and data. Includes test suites tailored to ML models, datasets, and outputs.
* [evidently](https://github.com/evidentlyai/evidently) - analyze and track data and ML model output quality.

##### Pipelines with data quality included
* [dbt](https://docs.getdbt.com/docs/building-a-dbt-project/tests), [dataform](https://dataform.co/blog/data-assertions) - ELT tools that come with a handy utility for defining tests as SQL queries.
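
To illustrate what "tests as SQL queries" can look like, here is a minimal sketch that follows the common convention of writing each test as a query selecting the rows that violate a rule, so a test passes when the query returns nothing. It uses Python's built-in `sqlite3` as a stand-in for a warehouse. The `orders` table, the test names, and the small runner loop are assumptions made for illustration, not dbt or dataform code:

```python
import sqlite3

# In-memory database standing in for a warehouse table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 25.0), (2, 990.5), (3, -3.0);
    """
)

# Each test is a query that returns the offending rows.
tests = {
    "amount_is_non_negative": (
        "SELECT order_id, amount FROM orders WHERE amount < 0"
    ),
    "order_id_is_unique": (
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
    ),
}

# Run every test and keep the failing rows for the ones that do not pass.
failures = {}
for name, query in tests.items():
    rows = conn.execute(query).fetchall()
    if rows:
        failures[name] = rows

for name in tests:
    status = "FAIL" if name in failures else "PASS"
    print(f"{status}: {name} {failures.get(name, '')}")
```

Returning the failing rows, rather than a boolean, makes a failed test directly actionable because the offending records are already in hand.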

#### Paid
Offerings range from data testing to pipeline testing, with a focus on real-time monitoring, automated test creation and threshold setting, and additional enterprise features.
* [Bigeye](https://bigeye.com)
* [Soda](https://soda.io)
* [Databand](https://databand.ai)
* [Monte Carlo](https://montecarlodata.com)
* [Great Expectations](https://greatexpectations.io)
* [Sifflet](https://siffletapp.com)
* [Validio](https://validio.io)
* [Lightup](https://lightup.ai)
* [Lantern](https://lantern.so)
* [Metaplane](https://metaplane.dev)
* [Datafold](https://datafold.com)
* [Acceldata](https://acceldata.io)
* [Anomalo](https://anomalo.com)
* [Marquez](https://marquezproject.github.io)

## TODOs
* Add tools for unstructured data (Arthur, Robust)