Projects in Awesome Lists tagged with dataquality
A curated list of projects in awesome lists tagged with dataquality.
https://github.com/cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
active-learning annotation data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation dataops dataquality datasets exploratory-data-analysis labeling llms noisy-labels out-of-distribution-detection outlier-detection weak-supervision
Last synced: 08 Jan 2026
https://github.com/great-expectations/great_expectations
Always know what to expect from your data.
cleandata data-engineering data-profilers data-profiling data-quality data-science data-unit-tests datacleaner datacleaning dataquality dataunittest eda exploratory-analysis exploratory-data-analysis exploratorydataanalysis mlops pipeline pipeline-debt pipeline-testing pipeline-tests
Last synced: 16 Jan 2026
https://github.com/open-metadata/openmetadata
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake
Last synced: 22 Feb 2026
https://github.com/open-metadata/OpenMetadata
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake
Last synced: 15 Mar 2025
https://github.com/awslabs/deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
dataquality scala spark unit-testing
Last synced: 13 May 2025
https://github.com/datafold/data-diff
Compare tables within or across databases
data data-diffing data-engineering data-quality data-quality-monitoring data-science database databricks-sql dataengineering dataquality dbt mysql oracle-database postgres postgresql python rdbms snowflake sql trino
Last synced: 24 Mar 2025
https://github.com/sodadata/soda-core
Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
data-contracts data-engineering data-governance data-monitoring data-observability data-profiling data-quality data-quality-checks data-quality-monitoring data-quality-testing data-reliability data-testing data-unit-tests data-validation dataquality datatesting dbt pipeline-testing python snowflake
Last synced: 14 May 2025
https://github.com/re-data/re-data
re_data - fix data issues before your users & CEO would discover them 😊
data-analysis data-monitoring data-observability data-quality data-quality-checks data-quality-monitoring data-reliability data-testing dataquality dbt dbt-packages open-source-tooling
Last synced: 14 May 2025
https://github.com/zinggAI/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
analytics cdp customer-data-platform data-science databricks dataengineering datalake dataquality dedupe deduplication entity-resolution fuzzy-matching fuzzymatch identity-resolution master-data-management masterdata mdm ml snowflake spark
Last synced: 16 Nov 2025
https://github.com/zinggai/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
analytics analytics-engineering data-science data-transformation data-transformations dataengineering datalake dataquality dedupe deduplication entity-resolution etl fuzzy-matching fuzzymatch identity identity-resolution masterdata ml modern-data-stack spark
Last synced: 14 May 2025
https://github.com/chaos-genius/chaos_genius
ML powered analytics engine for outlier detection and root cause analysis.
ai alert alert-messages analytics anomaly-detection business-intelligence data-visualization dataquality deep-learning hacktoberfest machine-learning ml monitoring monitoring-tool observability outlier-detection python rootcauseanalysis seasonality time-series
Last synced: 26 Mar 2025
https://github.com/datacleaner/DataCleaner
The premier open source Data Quality solution
data data-analysis data-science database datacleaner dataquality desktop etl mdm profiling
Last synced: 27 Mar 2025
https://github.com/datavane/datavines
Know your data better! Datavines is a next-gen data observability platform that supports metadata management and data quality.
dataobservability dataprofile dataquality datascience doris metadata spark
Last synced: 09 Apr 2025
https://github.com/MigoXLab/dingo
Dingo: A Comprehensive AI Data Quality Evaluation Tool
common-crawl data-evaluation data-quality data-quality-assessment data-quality-report data-science data-validation dataquality datascience deepseek gpt hallucination hallucination-detection llm openai opencompass qwen spark vlm
Last synced: 29 Aug 2025
https://github.com/ibm/lale
Library for Semi-Automated Data Science
artificial-intelligence automated-machine-learning automl data-science dataquality hyperparameter-optimization hyperparameter-search hyperparameter-tuning ibm-research ibm-research-ai interoperability machine-learning pipeline-testing pipeline-tests python scikit-learn
Last synced: 14 May 2025
https://github.com/IBM/lale
Library for Semi-Automated Data Science
artificial-intelligence automated-machine-learning automl data-science dataquality hyperparameter-optimization hyperparameter-search hyperparameter-tuning ibm-research ibm-research-ai interoperability machine-learning pipeline-testing pipeline-tests python scikit-learn
Last synced: 09 May 2025
https://github.com/datachecks/dcs-core
Open Source Data Quality Monitoring.
data-engineering data-governance data-observability data-ops data-quality-monitor data-quality-monitoring data-validation database dataops dataquality elasticsearch etl metrics mlops monitoring mysql postgres postgresql python sql
Last synced: 03 Mar 2026
https://github.com/osmcha/osmcha-frontend
Frontend for the osmcha-django REST API
dataquality openstreetmap osm osmcha qa
Last synced: 04 Apr 2025
https://github.com/autoviml/pandas_dq
Find data quality issues and clean your data in a single line of code with a Scikit-Learn compatible Transformer.
data data-science dataquality dataqualitycheck machine-learning pandas python scikit-learn
Last synced: 07 Apr 2025
https://github.com/DataKitchen/data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake
Last synced: 05 May 2025
https://github.com/datakitchen/data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake
Last synced: 02 Apr 2026
https://github.com/datakitchen/dataops-testgen
DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution by data profiling, new dataset hygiene review, AI generation of data quality validation tests, ongoing testing of data refreshes, & continuous anomaly monitoring
data data-engineering data-observability data-quality data-science data-testing datachecker dataops dataprofiling dataquality datavalidation mssql postgresql python redshift self-hosted snowflake
Last synced: 25 Feb 2026
https://github.com/infinitelambda/dq-tools
Makes it simple to store test results and visualise them in a BI dashboard
Last synced: 01 Mar 2026
https://github.com/mundipagg/amora-data-build-tool
Amora Data Build Tool enables analysts and engineers to transform data on the data warehouse (BigQuery) by writing Amora Models that describe the data schema using Python's "PEP484 - Type Hints" and select statements with SQLAlchemy. Amora is able to transform Python code into SQL data transformation jobs that run inside the warehouse.
analytics analytics-dashboard analytics-engineering bigquery business-intelligence data-engineering data-modeling datacleaning dataquality elt machine-learning python transformation
Last synced: 08 Sep 2025
https://github.com/AltimateAI/datapilot-cli
Datapilot-cli is the command line interface for accessing the AI teammate for engineers to ensure best practices in their SQL and dbt projects.
Last synced: 05 May 2025
https://github.com/open-metadata/openmetadata-site
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
automation bigdata bigdataanalytics data-catalog data-discovery data-observability data-profiling data-quality-monitoring data-science datadiscovery dataengineering dataquality datascience dbt governance hacktoberfest hacktoberfest2022 metadata metadata-api metadata-management
Last synced: 14 Apr 2025
https://github.com/grillazz/fastapi-greatexpectations
Run greatexpectations.io on ANY SQL engine using a REST API. Built on FastAPI, Pydantic and SQLAlchemy as a data quality tool
dataquality dataqualitycheck fastapi great-expectations pydantic python python3 sql sqlalchemy
Last synced: 30 Aug 2025
https://github.com/bikash/dataquality
Tutorial and examples of Data Quality in Big Data System
big-data data-quality dataquality
Last synced: 04 Mar 2026
https://github.com/opt-nc/setup-duckdb-action
🦆 Blazing fast and highly customizable GitHub Action to set up a DuckDB runtime
action actions analytics csv data-science database databases dataquality dataqualitycheck duckdb embedded-database github-actions olap sql
Last synced: 16 Mar 2026
https://github.com/huemulsolutions/huemul-bigdatagovernance
Huemul BigDataGovernance is a framework that runs on Spark, Hive, and HDFS. It enables a corporate single-source-of-truth data strategy based on data governance best practices. It lets you implement tables with primary key and foreign key checks when inserting and updating data through the library, along with validation of nulls, text lengths, minimum/maximum values for numbers and dates, unique values, and default values. It also lets you classify fields by the applicability of ARCO rights to ease compliance with GDPR-style data protection laws, and identify security levels and whether any kind of encryption is applied. Additionally, more complex validation rules can be added on the same table.
bigdata chile cloudera data data-engineer data-engineering data-governance data-warehouse datamart dataquality gdpr hadoop hive hortonworks huemul huemul-bigdatagovernance parquet spark spark-sql trabaja-sobre-spark
Last synced: 26 Apr 2025
https://github.com/rodrigobaron/qafs
Quality Aware Feature Store
dataquality feature-engineering feature-store
Last synced: 17 Aug 2025
https://github.com/koddachad/dq_tester
A lightweight simple data quality testing tool.
data database dataengineering dataquality dataqualitycheck
Last synced: 08 Oct 2025
https://github.com/jabardigitalservice/datasae
Data Quality Framework provided by Jabar Digital Service
Last synced: 04 Apr 2026
https://github.com/josephmachado/data-quality-w-greatexpectations
Code for data quality with greatexpectations blog
dataengineering dataquality greatexpectations python
Last synced: 15 Apr 2025
https://github.com/dima-ischenko/xoverrr
Data quality library for Python
clickhouse comparison dataquality greenplum oracle postgresql python
Last synced: 02 Feb 2026
https://github.com/lapetitesouris/kuronososhiki
Data Stream Quality Control with Apache Kafka
dataquality faust kafka kafka-streams
Last synced: 03 Apr 2025
https://github.com/cintia0528/data_cleaning_and_analytics-python
Evaluate if aggressive discounting benefits Eniac long-term, considering differing views on customer acquisition and brand positioning. Focus on data cleaning for informed decision-making.
colab-notebook data data-analysis datacleaning dataquality jupyter-notebook matplotlib pandas python seaborn
Last synced: 08 Jan 2026
https://github.com/santiviquez/feedsanity
Minimal, educational RSS reader with built-in data quality validation using Soda Core.
Last synced: 24 Feb 2026
https://github.com/kevinndungu-source/amazon_redshift_s3_data_pipeline
This repository contains code and documentation for the Amazon Redshift project, showcasing a data pipeline using Amazon S3. Explore how I manage and analyze large datasets in the cloud!
amazon-redshift-data-pipeline-aws-cloud-computing amazon-s3 datamanagement dataquality redshift-database sql-query vpc-creation vpc-endpoint
Last synced: 06 Mar 2025
https://github.com/rgzafra11/excel_sales_analytics
Excel_Sales_Analytics 📊: This repository contains a comprehensive business intelligence report for AtliQ Hardware, focusing on sales performance and strategic insights. 🚀 Explore data-driven analytics to enhance product offerings and optimize sales strategies for improved profitability.
businessanalytics businessinsights data-visualization dataquality eda excel excel-dashboard jupyter-notebook kpi pivot-table powerquery recommandation sales-insights seaborn
Last synced: 01 Jul 2025
https://github.com/interzoid/companynamesimkey-go
Generates a similarity key for a company/organization name for matching inconsistent names within a dataset(s)
ai api companydata companynames data-quality-assessment dataquality go go-package golang standardization
Last synced: 12 Jan 2026
https://github.com/amsh4/driftsiren
DriftSiren is a production-grade platform for real-time data drift and quality monitoring. Built with Next.js, FastAPI, and Docker, it tracks feature drift, provides live alerts, and visualizes metrics on a sleek dashboard. Includes agent, APIs, Celery workers, and Kubernetes-ready setup.
celery cicd datadrift dataquality docker fastapi k8s machine-learning nextjs observability postgresql real-time redis tailwindcss typescript websocket
Last synced: 30 Dec 2025
https://github.com/jotstolu/azure-data-engineering-end--to-end-project
An end-to-end Netflix data engineering pipeline built on Microsoft Azure. This project ingests raw Netflix data, applies PySpark transformations, enforces data quality with Delta Live Tables, and orchestrates workflows via Azure Data Factory and Databricks.
adf adlsgen2 azuredatabricks azuredatafactory cloudcomputing dataengineering datapipeline dataquality deltalake deltalivetables medallionarchitecture pyspark
Last synced: 31 Jan 2026
https://github.com/interzoid/fullnamesimkey-go
Generates a similarity key for an individual name for matching inconsistent names within a dataset(s)
ai dataquality go golang golang-package matching
Last synced: 11 Jan 2026
https://github.com/kevinndungu-source/data_analytics_projects_python_powerbi
Explore comprehensive retail analysis and insights focusing on Marley International Store, powered by data analytics techniques and visualizations using Python, PowerBI and Jupyter Notebooks.
analytics datacleaning datamanagement datamanipulation dataquality dax-query juypter-notebook matplotlib on-premises-data-gateway powerbi-desktop powerbi-service powerquery python python-script sql-query validity-check
Last synced: 26 Jul 2025
https://github.com/interzoid/streetaddresssimkey-go
Generates a similarity key for a street address for matching inconsistent street addresses within a dataset(s)
addresses ai dataquality go go-package golang
Last synced: 12 Jan 2026
https://github.com/projects-developer/data-duplication-removal-using-machine-learning
This project uses machine learning algorithms to detect and remove duplicate data entries from a dataset. The project includes source code, a PPT, a synopsis, a report, documents, the base research paper, and video tutorials.
btechprojects computerscienceprojects dataanalytics datacleaning dataduplicationremoval datamanagement datamatching dataquality duplicatedetection machinelearning mtechprojects
Last synced: 23 Feb 2025