An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-quality

A curated list of projects in awesome lists tagged with data-quality.

https://github.com/open-metadata/openmetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake

Last synced: 22 Feb 2026

https://github.com/evidentlyai/evidently

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.

data-drift data-quality data-science data-validation generative-ai hacktoberfest html-report jupyter-notebook llm llmops machine-learning mlops model-monitoring pandas-dataframe

Last synced: 13 May 2025

https://github.com/open-metadata/OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake

Last synced: 15 Mar 2025

https://github.com/gokumohandas/mlops-course

Learn how to design, develop, deploy and iterate on production-grade ML applications.

data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray

Last synced: 15 May 2025

https://github.com/GokuMohandas/mlops-course

Learn how to design, develop, deploy and iterate on production-grade ML applications.

data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray

Last synced: 27 Mar 2025

https://github.com/whylabs/whylogs

An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

ai-pipelines analytics approximate-statistics calculate-statistics constraints data-constraints data-pipeline data-quality data-science dataops dataset logging machine-learning ml-pipelines mlops model-performance python statistical-properties

Last synced: 13 May 2025

https://github.com/featureform/featureform

The Virtual Feature Store. Turn your existing data infrastructure into a feature store.

data-quality data-science embeddings embeddings-similarity feature-engineering feature-store hacktoberfest machine-learning ml mlops python vector-database

Last synced: 14 Dec 2025

https://github.com/WeBankFinTech/Qualitis

Qualitis is a one-stop data quality management platform that supports quality verification, notification, and management for various data sources. It is used to solve data quality problems caused by data processing.

compare data-quality data-quality-model datashperestudio dss linkis quality quality-check quality-improvement workflow

Last synced: 04 Apr 2025

https://github.com/posit-dev/pointblank

Data validation toolkit for assessing and monitoring data quality.

data-quality data-testing data-validation easy-to-understand tabular-data

Last synced: 01 Apr 2026

https://github.com/databrickslabs/dqx

Databricks framework to validate the data quality of PySpark DataFrames and tables

data-profiling data-quality data-quality-monitoring databricks lakeflow spark spark-streaming unity-catalog

Last synced: 01 Apr 2026

https://github.com/alibaba/feathub

FeatHub - A stream-batch unified feature store for real-time machine learning

apache-flink data data-engineering data-quality data-science feature-engineering feature-store machine-learning mlops streaming

Last synced: 14 Oct 2025

https://github.com/rocky-data/rocky

The typed graph between your code and whichever warehouse, table format, or query engine you've chosen: typed compiler, branches, replay, column-level lineage, compile-time contracts, per-model cost. Adapters: Databricks, Snowflake, BigQuery, DuckDB. Single static Rust binary. Apache 2.0.

column-lineage dagster data-contracts data-engineering data-lineage data-pipeline data-platform data-quality dbt-alternative rust schema-drift sql

Last synced: 06 Jun 2026

https://github.com/ubisoft/mobydq

🐳 Tool to automate data quality checks on data pipelines

big-data data-pipeline data-quality data-quality-checks data-quality-monitoring data-warehouse

Last synced: 28 Jul 2025

https://github.com/adidas/lakehouse-engine

The Lakehouse Engine is a configuration-driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.

big-data configuration-driven data-engineering data-quality databricks delta-lake framework great-expectations lakehouse spark

Last synced: 12 Apr 2025

https://posit-dev.github.io/pointblank/

Data validation made beautiful and powerful

data-quality data-testing data-validation easy-to-understand tabular-data

Last synced: 22 Jun 2025

https://github.com/atrocore/atrocore

AtroCore is an enterprise-ready, highly configurable, and scalable open-source Data Management and System Integration Platform. It can be used for Master Data Management (MDM), Product Information Management (PIM), Business Process Management (BPM), and much more.

api-first application-development b2b business-process-management dam data-governance data-management-system data-quality digital-asset-management file-management headless integration-platform ipaas master-data-management mdm php reference-data software-integration svelte system-integration

Last synced: 08 Mar 2026

https://github.com/gair-nlp/prox

Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"

continual continual-pre-training data-centric-ai data-quality llama llm mistral neural-symbolic pre-training

Last synced: 05 Apr 2025

https://github.com/ohdsi/dataqualitydashboard

A tool to help improve data quality standards in observational data science.

data-quality

Last synced: 18 May 2026

https://github.com/OHDSI/DataQualityDashboard

A tool to help improve data quality standards in observational data science.

data-quality

Last synced: 20 Jul 2025

https://github.com/dqops/dqo

Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, let DQOps run the data quality checks daily to detect data quality issues.

data-observability data-ops data-profiling data-quality data-quality-checks data-quality-measurement data-quality-monitoring data-quality-report monitoring

Last synced: 13 Dec 2025

https://github.com/DataKitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake

Last synced: 05 May 2025

https://github.com/datakitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake

Last synced: 06 Apr 2026

https://github.com/Seddryck/NBi

NBi is a testing framework (add-on to NUnit) for Business Intelligence and Data Access. The main goal of this framework is to let users create tests with a declarative approach based on an XML syntax. With NBi, you don't need to develop C# or Java code to specify your tests, nor do you need Visual Studio or Eclipse to compile your test suite. Just create an XML file and let the framework interpret it and run your tests. The framework is designed as an add-on to NUnit, but it can be ported easily to other testing frameworks.

business-intelligence cube data-quality data-quality-checks database etl nunit test-automation test-framework

Last synced: 04 May 2025

https://github.com/seddryck/nbi

NBi is a testing framework (add-on to NUnit) for Business Intelligence and Data Access. The main goal of this framework is to let users create tests with a declarative approach based on an XML syntax. With NBi, you don't need to develop C# or Java code to specify your tests, nor do you need Visual Studio or Eclipse to compile your test suite. Just create an XML file and let the framework interpret it and run your tests. The framework is designed as an add-on to NUnit, but it can be ported easily to other testing frameworks.

business-intelligence cube data-quality data-quality-checks database etl nunit test-automation test-framework

Last synced: 15 May 2025

https://github.com/re-data/dbt-re-data

re_data - fix data issues before your users & CEO discover them 😊

data-monitoring data-observability data-quality data-testing dbt dbt-packages sql

Last synced: 07 Apr 2025

https://github.com/aai-institute/pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation

banzhaf-index data-centric-ai data-cleaning data-pruning data-quality data-valuation game-theory influence-functions least-core machine-learning robust-machine-learning shapley-value transferlab

Last synced: 11 May 2025

https://github.com/evidentlyai/ml_observability_course

Free Open-source ML observability course for data scientists and ML engineers. Learn how to monitor and debug your ML models in production.

data-drift data-quality data-quality-checks llmops machine-learning-operations ml-monitoring ml-observability ml-pipelines mlops model-monitoring model-performance production-machine-learning

Last synced: 29 Jul 2025

https://github.com/aws-samples/amazon-deequ-glue

Automated data quality suggestions and analysis with Deequ on AWS Glue

aws aws-glue data-quality deequ

Last synced: 30 Jul 2025

https://github.com/great-expectations/great_expectations_action

A GitHub Action that makes it easy to use Great Expectations to validate your data pipelines in your CI workflows.

actions continuous-integration data-integrity data-quality data-science mlops

Last synced: 07 Apr 2025

https://github.com/monte-carlo-data/mc-agent-toolkit

Official Monte Carlo toolkit for AI coding agents. Skills and plugins that bring data and agent observability (monitoring, triaging, troubleshooting, health checks) into Claude Code, Cursor, and more.

agent-observability agent-skills ai-agents claude-code codex-skills cursor data-observability data-quality mcp monte-carlo opencode skill-md skillsmp vscode

Last synced: 20 Apr 2026

https://github.com/Impetus/jumbune

Jumbune, an open source BigData APM & Data Quality Management Platform for Data Clouds. Enterprise feature offering is available at http://jumbune.com. More details of open source offering are at,

aiops apm cluster-monitoring data-analysis data-quality developer-tools devops-tools hadoop hadoop-cluster hadoop-monitor hadoop-monitoring monitoring-tool optimization-framework yarn yarn-hadoop-cluster

Last synced: 15 Jun 2026

https://github.com/impetus/jumbune

Jumbune, an open source BigData APM & Data Quality Management Platform for Data Clouds. Enterprise feature offering is available at http://jumbune.com. More details of open source offering are at,

aiops apm cluster-monitoring data-analysis data-quality developer-tools devops-tools hadoop hadoop-cluster hadoop-monitor hadoop-monitoring monitoring-tool optimization-framework yarn yarn-hadoop-cluster

Last synced: 17 Dec 2025

https://github.com/sodadata/soda-spark

Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

data-engineering data-observability data-quality data-testing pyspark python soda-sql spark

Last synced: 26 Jul 2025

https://github.com/ucd-dnp/leila

Library for data quality assessment and interaction with the datos.gov.co data portal

data-quality data-science eda espanol exploratory-data-analysis python report-generator ucd

Last synced: 05 Apr 2026

https://github.com/datakitchen/dataops-testgen

DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution by data profiling, new dataset hygiene review, AI generation of data quality validation tests, ongoing testing of data refreshes, & continuous anomaly monitoring

data data-engineering data-observability data-quality data-science data-testing datachecker dataops dataprofiling dataquality datavalidation mssql postgresql python redshift self-hosted snowflake

Last synced: 25 Feb 2026

https://github.com/vertti/daffy

Lightweight DataFrame validation decorators for Pandas, Polars, Modin, and PyArrow. No custom types required.

data-quality data-validation dataframe dataframe-schema dataframe-validation decorator modin narwhals pandas polars pyarrow pydantic python python-decorator runtime-validation validation

Last synced: 21 Feb 2026

https://github.com/sparkdq-community/sparkdq

A declarative PySpark framework for row- and aggregate-level data quality validation.

data-check data-engineering data-quality data-validation data-verification dq-framework pyspark pyspark-validation spark-data-quality

Last synced: 02 Jul 2025

https://github.com/davidberenstein1957/dataset-viber

Dataset Viber is your chill repo for data collection, annotation and vibe checks.

data-collection data-quality evaluation human-feedback

Last synced: 06 Mar 2025

https://github.com/realdatadriven/etlx

ETL / ELT / Reverse ETL Framework powered by DuckDB, designed to seamlessly integrate and process data from diverse sources. It leverages Markdown as a configuration medium, where YAML blocks define metadata for each data source, and embedded SQL blocks specify the extraction, transformation, and loading logic.

data-engineering data-lake data-lakehouse data-quality data-quality-checks data-quality-monitoring data-science duckdb elt elt-pipeline etl etl-elt-pipelines etl-pipeline object-storage relational-databases report report-automation s3 s3-storage

Last synced: 30 Apr 2026

https://github.com/ammsa/dtcleaner

DTCleaner: data cleaning using multi-target decision trees.

data-cleaning data-mining data-preprocessing data-quality data-science data-wrangling

Last synced: 21 Mar 2025

https://github.com/emilyriederer/convo

R package based on "Column Names as Contracts" blog post (https://emilyriederer.netlify.app/post/column-name-contracts/)

controlled-vocabulary data-quality data-validation r-package schema-design variable-names variable-naming

Last synced: 26 Oct 2025

https://github.com/mfcabrera/hooqu

hooqu is a library built on top of Pandas-like Dataframes for defining "unit tests for data". This is a spiritual port of Apache Deequ to Python

data-quality data-quality-checks data-science

Last synced: 14 Jan 2026

https://github.com/benzsevern/goldenmatch

Entity resolution toolkit: deduplicate, match, and create golden records. 27 MCP tools on Smithery. Zero-config. 97.2% F1.

a2a agent data-engineering data-quality dbt deduplication entity-resolution fellegi-sunter fuzzy-matching golden-record golden-suite llm mcp-server polars pprl privacy-preserving python record-linkage record-matching remote-mcp

Last synced: 13 May 2026

https://github.com/bolcom/hive_compared_bq

hive_compared_bq compares and validates two SQL-like tables, and graphically shows the rows and columns that differ.

bigquery data-quality hive python validation

Last synced: 13 Aug 2025

https://github.com/semyonsinchenko/tsumugi-spark

SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.

data-quality deequ pyspark spark

Last synced: 24 Oct 2025

https://github.com/mrpowers-io/tsumugi-spark

SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.

data-quality deequ pyspark spark

Last synced: 28 Jun 2025

https://github.com/timgent/data-flare

Data quality control tool built on Spark and Deequ

big-data data-quality spark

Last synced: 15 Apr 2025

https://github.com/scienxlab/redflag

Safety net for machine learning pipelines. Plays nice with sklearn and pandas.

data-quality data-quality-checks data-science machine-learning numpy pandas python

Last synced: 06 Oct 2025

https://github.com/dp6/penguin-datalayer-collect

A data layer quality monitoring and validation module. This solution is part of the Raft Suite ecosystem.

adobe-launch data-quality data-quality-monitoring datalayer dp6 dtm gtm gtm-server-side hacktoberfest marketing-automation monitoring penguin-datalayer raft-suite tealium

Last synced: 30 Jul 2025

https://github.com/kiwicom/contessa

Easy way to define, execute and store quality rules for your data.

data data-engineering data-quality framework mysql postgres python quality-assurance sqlite3

Last synced: 29 Jul 2025