Projects in Awesome Lists tagged with data-quality
A curated list of projects in awesome lists tagged with data-quality.
https://github.com/gokumohandas/made-with-ml
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 05 Mar 2026
https://github.com/GokuMohandas/MadeWithML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 03 Mar 2025
https://github.com/GokuMohandas/Made-With-ML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 15 Mar 2025
https://github.com/eugeneyan/applied-ml
Papers & tech blogs by companies sharing their work on data science & machine learning in production.
applied-data-science applied-machine-learning computer-vision data-discovery data-engineering data-quality data-science deep-learning machine-learning natural-language-processing production recsys reinforcement-learning search
Last synced: 17 Mar 2025
https://github.com/data-centric-ai-community/fg-data-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
big-data-analytics data-analysis data-exploration data-profiling data-quality data-science deep-learning eda exploration exploratory-data-analysis hacktoberfest html-report jupyter jupyter-notebook machine-learning pandas pandas-dataframe pandas-profiling python statistics
Last synced: 08 May 2026
https://github.com/Data-Centric-AI-Community/ydata-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
big-data-analytics data-analysis data-exploration data-profiling data-quality data-science deep-learning eda exploration exploratory-data-analysis hacktoberfest html-report jupyter jupyter-notebook machine-learning pandas pandas-dataframe pandas-profiling python statistics
Last synced: 09 Mar 2026
https://github.com/ydataai/ydata-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
big-data-analytics data-analysis data-exploration data-profiling data-quality data-science deep-learning eda exploration exploratory-data-analysis hacktoberfest html-report jupyter jupyter-notebook machine-learning pandas pandas-dataframe pandas-profiling python statistics
Last synced: 16 Jan 2026
https://github.com/cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
active-learning annotation data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation dataops dataquality datasets exploratory-data-analysis labeling llms noisy-labels out-of-distribution-detection outlier-detection weak-supervision
Last synced: 08 Jan 2026
https://github.com/great-expectations/great_expectations
Always know what to expect from your data.
cleandata data-engineering data-profilers data-profiling data-quality data-science data-unit-tests datacleaner datacleaning dataquality dataunittest eda exploratory-analysis exploratory-data-analysis exploratorydataanalysis mlops pipeline pipeline-debt pipeline-testing pipeline-tests
Last synced: 16 Jan 2026
https://github.com/voxel51/fiftyone
Refine high-quality datasets and visual AI models
active-learning artificial-intelligence computer-vision data-centric-ai data-cleaning data-curation data-quality data-science deep-learning developer-tools image-classification machine-learning object-detection python unstructured-data vector-search visualization
Last synced: 19 Feb 2026
https://github.com/open-metadata/openmetadata
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake
Last synced: 22 Feb 2026
https://github.com/feast-dev/feast
The Open Source Feature Store for AI/ML
big-data data-engineering data-quality data-science feature-store features machine-learning ml mlops python
Last synced: 04 May 2026
https://github.com/evidentlyai/evidently
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
data-drift data-quality data-science data-validation generative-ai hacktoberfest html-report jupyter-notebook llm llmops machine-learning mlops model-monitoring pandas-dataframe
Last synced: 13 May 2025
https://github.com/open-metadata/OpenMetadata
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake
Last synced: 15 Mar 2025
https://github.com/treeverse/lakefs
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 18 Feb 2026
https://github.com/treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 20 Mar 2025
https://github.com/gokumohandas/mlops-course
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 15 May 2025
https://github.com/GokuMohandas/mlops-course
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 27 Mar 2025
https://github.com/datafold/data-diff
Compare tables within or across databases
data data-diffing data-engineering data-quality data-quality-monitoring data-science database databricks-sql dataengineering dataquality dbt mysql oracle-database postgres postgresql python rdbms snowflake sql trino
Last synced: 24 Mar 2025
https://github.com/whylabs/whylogs
An open-source data logging library for machine learning models and data pipelines. Provides visibility into data quality & model performance over time. Supports privacy-preserving data collection, ensuring safety & robustness.
ai-pipelines analytics approximate-statistics calculate-statistics constraints data-constraints data-pipeline data-quality data-science dataops dataset logging machine-learning ml-pipelines mlops model-performance python statistical-properties
Last synced: 13 May 2025
https://github.com/sodadata/soda-core
Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
data-contracts data-engineering data-governance data-monitoring data-observability data-profiling data-quality data-quality-checks data-quality-monitoring data-quality-testing data-reliability data-testing data-unit-tests data-validation dataquality datatesting dbt pipeline-testing python snowflake
Last synced: 14 May 2025
https://github.com/feathr-ai/feathr
Feathr - A scalable, unified data and AI engineering platform for enterprise
apache-spark artificial-intelligence azure data-engineering data-quality data-science feature-engineering feature-governance feature-management feature-marketplace feature-metadata feature-platform feature-store machine-learning mlops
Last synced: 09 Jan 2026
https://github.com/featureform/featureform
The Virtual Feature Store. Turn your existing data infrastructure into a feature store.
data-quality data-science embeddings embeddings-similarity feature-engineering feature-store hacktoberfest machine-learning ml mlops python vector-database
Last synced: 14 Dec 2025
https://github.com/re-data/re-data
re_data - fix data issues before your users & CEO would discover them
data-analysis data-monitoring data-observability data-quality data-quality-checks data-quality-monitoring data-reliability data-testing dataquality dbt dbt-packages open-source-tooling
Last synced: 14 May 2025
https://github.com/opendatadiscovery/odd-platform
First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
alerting bigdata data-catalog data-discovery data-engineering data-exploration data-governance data-lineage data-observability data-pipelines data-platform data-profiling data-quality data-science datacatalog lineage metadata metadata-management observability oss
Last synced: 02 Apr 2026
https://github.com/daochenzha/data-centric-ai
A curated, but incomplete, list of data-centric AI resources.
ai artificial-intelligence data-centric data-centric-ai data-centric-machine-learning data-curation data-engineering data-quality data-science machine-learning
Last synced: 05 Feb 2026
https://github.com/cleanlab/cleanvision
Automatically find issues in image datasets and practice data-centric computer vision.
computer-vision data-centric-ai data-exploration data-profiling data-quality data-science data-validation deep-learning exploratory-data-analysis image-analysis image-classification image-generation image-quality image-segmentation
Last synced: 06 Jan 2026
https://github.com/daochenzha/data-centric-AI
A curated, but incomplete, list of data-centric AI resources.
ai artificial-intelligence data-centric data-centric-ai data-centric-machine-learning data-curation data-engineering data-quality data-science machine-learning
Last synced: 26 Mar 2025
https://github.com/NVIDIA/NeMo-Curator
Scalable data pre-processing and curation toolkit for LLMs
data data-curation data-prep data-preparation data-processing data-processing-pipelines data-quality datacuration datarecipes deduplication fast-data-processing fine-tuning large-language-models large-scale-data-processing llm llm-data-quality llmapps python semantic-deduplication
Last synced: 29 Jul 2025
https://github.com/rstudio/pointblank
Data quality assessment and metadata reporting for data frames and database tables
data-assertions data-checker data-dictionaries data-frames data-inference data-management data-profiler data-quality data-validation data-verification database-tables easy-to-understand reporting-tool schema-validation testing-tools yaml-configuration
Last synced: 14 May 2025
https://github.com/kennethleungty/failed-ml
Compilation of high-profile real-world examples of failed machine learning projects
ai artificial-intelligence classification computer-vision data-engineering data-quality data-science deep-learning failed-data-science failed-machine-learning failed-ml fml forecasting machine-learning ml natural-language-processing production recsys regression
Last synced: 18 Feb 2026
https://github.com/WeBankFinTech/Qualitis
Qualitis is a one-stop data quality management platform that supports quality verification, notification, and management for various data sources. It is used to solve various data quality problems caused by data processing. https://github.com/WeBankFinTech/Qualitis
compare data-quality data-quality-model datashperestudio dss linkis quality quality-check quality-improvement workflow
Last synced: 04 Apr 2025
https://github.com/kennethleungty/Failed-ML
Compilation of high-profile real-world examples of failed machine learning projects
ai artificial-intelligence classification computer-vision data-engineering data-quality data-science deep-learning failed-data-science failed-machine-learning failed-ml fml forecasting machine-learning ml natural-language-processing production recsys regression
Last synced: 04 Apr 2025
https://github.com/bitol-io/open-data-contract-standard
Home of the Open Data Contract Standard (ODCS).
data data-contract data-contracts data-engineering data-mesh data-quality standard
Last synced: 10 Mar 2026
https://github.com/NVIDIA-NeMo/Curator
Scalable data pre-processing and curation toolkit for LLMs
data data-curation data-prep data-preparation data-processing data-processing-pipelines data-quality datacuration datarecipes deduplication fast-data-processing fine-tuning large-language-models large-scale-data-processing llm llm-data-quality llmapps python semantic-deduplication
Last synced: 20 Jul 2025
https://github.com/polyaxon/datatile
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
dask data-exploration data-profiling data-quality data-quality-checks data-science data-visualization dataframes dataops explainable-ai matplotlib mlops pandas pandas-summary plotly pytorch spark statistics tensorflow tracking
Last synced: 17 Aug 2025
https://github.com/polyaxon/traceml
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
dask data-exploration data-profiling data-quality data-quality-checks data-science data-visualization dataframes dataops explainable-ai matplotlib mlops pandas pandas-summary plotly pytorch spark statistics tensorflow tracking
Last synced: 12 Dec 2025
https://github.com/infuseai/piperider
Code review for data in dbt
code-review continuous-integration data-exploration data-observability data-pipeline data-profiler data-profiling data-quality data-reliability data-science data-testing data-visualization dbt dbt-metrics eda exploratory-data-analysis pull-requests python reporting
Last synced: 10 Apr 2025
https://github.com/InfuseAI/piperider
Code review for data in dbt
code-review continuous-integration data-exploration data-observability data-pipeline data-profiler data-profiling data-quality data-reliability data-science data-testing data-visualization dbt dbt-metrics eda exploratory-data-analysis pull-requests python reporting
Last synced: 18 Apr 2025
https://github.com/posit-dev/pointblank
Data validation toolkit for assessing and monitoring data quality.
data-quality data-testing data-validation easy-to-understand tabular-data
Last synced: 01 Apr 2026
https://github.com/databrickslabs/dqx
Databricks framework to validate Data Quality of pySpark DataFrames and Tables
data-profiling data-quality data-quality-monitoring databricks lakeflow spark spark-streaming unity-catalog
Last synced: 01 Apr 2026
https://github.com/MigoXLab/dingo
Dingo: A Comprehensive AI Data Quality Evaluation Tool
common-crawl data-evaluation data-quality data-quality-assessment data-quality-report data-science data-validation dataquality datascience deepseek gpt hallucination hallucination-detection llm openai opencompass qwen spark vlm
Last synced: 29 Aug 2025
https://github.com/duoan/mega-data-factory
Mega Scale Multimodal DataPipeline for SOTA Foundation Models
data-centric-ai data-curation data-quality datapipeline datapipelines deeplearning foundation-models image-editing image-generation llm machine-learning mllm multimodal ray rust video-generation vlm
Last synced: 18 Jun 2026
https://github.com/alibaba/feathub
FeatHub - A stream-batch unified feature store for real-time machine learning
apache-flink data data-engineering data-quality data-science feature-engineering feature-store machine-learning mlops streaming
Last synced: 14 Oct 2025
https://github.com/data-drift/data-drift
Metrics Observability & Troubleshooting
analytics bigquery context data-diffing data-governance data-lineage data-monitoring data-observability data-quality data-reliability data-version-control dbt dbt-metrics dbt-packages drill-down metrics reconciliation redshift semantic-layer snowflake
Last synced: 08 Oct 2025
https://github.com/rocky-data/rocky
The typed graph between your code and whichever warehouse, table format, or query engine you've chosen: typed compiler, branches, replay, column-level lineage, compile-time contracts, per-model cost. Adapters: Databricks, Snowflake, BigQuery, DuckDB. Single static Rust binary. Apache 2.0.
column-lineage dagster data-contracts data-engineering data-lineage data-pipeline data-platform data-quality dbt-alternative rust schema-drift sql
Last synced: 06 Jun 2026
https://github.com/ubisoft/mobydq
Tool to automate data quality checks on data pipelines
big-data data-pipeline data-quality data-quality-checks data-quality-monitoring data-warehouse
Last synced: 28 Jul 2025
https://github.com/adidas/lakehouse-engine
The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.
big-data configuration-driven data-engineering data-quality databricks delta-lake framework great-expectations lakehouse spark
Last synced: 12 Apr 2025
https://posit-dev.github.io/pointblank/
Data validation made beautiful and powerful
data-quality data-testing data-validation easy-to-understand tabular-data
Last synced: 22 Jun 2025
https://github.com/atrocore/atrocore
AtroCore is an enterprise-ready, highly configurable, and scalable open-source Data Management and System Integration Platform. It can be used for Master Data Management (MDM), Product Information Management (PIM), Business Process Management (BPM), and much more.
api-first application-development b2b business-process-management dam data-governance data-management-system data-quality digital-asset-management file-management headless integration-platform ipaas master-data-management mdm php reference-data software-integration svelte system-integration
Last synced: 08 Mar 2026
https://github.com/whylabs/whylogs-java
Profile and monitor your ML data pipeline end-to-end
ai-pipelines aiops apache-spark approximate-statistics calculate-statistics data-quality dataset java mlops spark statistical-properties statistics whylogs
Last synced: 03 Oct 2025
https://github.com/gair-nlp/prox
Official repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"
continual continual-pre-training data-centric-ai data-quality llama llm mistral neural-symbolic pre-training
Last synced: 05 Apr 2025
https://github.com/ohdsi/dataqualitydashboard
A tool to help improve data quality standards in observational data science.
Last synced: 18 May 2026
https://github.com/astronomer/airflow-provider-great-expectations
Great Expectations Airflow operator
airflow airflow-operators airflow-providers data-quality data-science data-testing
Last synced: 16 May 2025
https://github.com/OHDSI/DataQualityDashboard
A tool to help improve data quality standards in observational data science.
Last synced: 20 Jul 2025
https://github.com/AKSW/RDFUnit
An RDF Unit Testing Suite
data-quality data-quality-checks data-validation rdf schema schema-validation shacl unit-testing validation web-ontology-language
Last synced: 03 Apr 2025
https://github.com/dqops/dqo
Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, let DQOps run the data quality checks daily to detect data quality issues.
data-observability data-ops data-profiling data-quality data-quality-checks data-quality-measurement data-quality-monitoring data-quality-report monitoring
Last synced: 13 Dec 2025
https://github.com/DataKitchen/data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake
Last synced: 05 May 2025
https://github.com/datakitchen/data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake
Last synced: 06 Apr 2026
https://github.com/Seddryck/NBi
NBi is a testing framework (add-on to NUnit) for Business Intelligence and Data Access. The main goal of this framework is to let users create tests with a declarative approach based on an XML syntax. With NBi, you don't need to develop C# or Java code to specify your tests! Nor do you need Visual Studio or Eclipse to compile your test suite. Just create an XML file and let the framework interpret it and play your tests. The framework is designed as an add-on to NUnit, but it can be ported easily to other testing frameworks.
business-intelligence cube data-quality data-quality-checks database etl nunit test-automation test-framework
Last synced: 04 May 2025
https://github.com/seddryck/nbi
NBi is a testing framework (add-on to NUnit) for Business Intelligence and Data Access. The main goal of this framework is to let users create tests with a declarative approach based on an XML syntax. With NBi, you don't need to develop C# or Java code to specify your tests! Nor do you need Visual Studio or Eclipse to compile your test suite. Just create an XML file and let the framework interpret it and play your tests. The framework is designed as an add-on to NUnit, but it can be ported easily to other testing frameworks.
business-intelligence cube data-quality data-quality-checks database etl nunit test-automation test-framework
Last synced: 15 May 2025
https://github.com/re-data/dbt-re-data
re_data - fix data issues before your users & CEO would discover them
data-monitoring data-observability data-quality data-testing dbt dbt-packages sql
Last synced: 07 Apr 2025
https://github.com/aai-institute/pyDVL
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
banzhaf-index data-centric-ai data-cleaning data-pruning data-quality data-valuation game-theory influence-functions least-core machine-learning robust-machine-learning shapley-value transferlab
Last synced: 11 May 2025
https://github.com/evidentlyai/ml_observability_course
Free Open-source ML observability course for data scientists and ML engineers. Learn how to monitor and debug your ML models in production.
data-drift data-quality data-quality-checks llmops machine-learning-operations ml-monitoring ml-observability ml-pipelines mlops model-monitoring model-performance production-machine-learning
Last synced: 29 Jul 2025
https://github.com/aws-samples/amazon-deequ-glue
Automated data quality suggestions and analysis with Deequ on AWS Glue
aws aws-glue data-quality deequ
Last synced: 30 Jul 2025
https://github.com/gclunies/reflekt
Define, govern, and model event data for warehouse-first product analytics.
avo customer-data-platform data-modeling data-quality data-warehouse dbt dbt-package events governance product-analytics schema-registry segment segment-protocols
Last synced: 12 Apr 2025
https://github.com/GClunies/Reflekt
Define, govern, and model event data for warehouse-first product analytics.
avo customer-data-platform data-modeling data-quality data-warehouse dbt dbt-package events governance product-analytics schema-registry segment segment-protocols
Last synced: 02 Sep 2025
https://github.com/great-expectations/great_expectations_action
A GitHub Action that makes it easy to use Great Expectations to validate your data pipelines in your CI workflows.
actions continuous-integration data-integrity data-quality data-science mlops
Last synced: 07 Apr 2025
https://github.com/monte-carlo-data/mc-agent-toolkit
Official Monte Carlo toolkit for AI coding agents. Skills and plugins that bring data and agent observability (monitoring, triaging, troubleshooting, health checks) into Claude Code, Cursor, and more.
agent-observability agent-skills ai-agents claude-code codex-skills cursor data-observability data-quality mcp monte-carlo opencode skill-md skillsmp vscode
Last synced: 20 Apr 2026
https://github.com/Impetus/jumbune
Jumbune, an open source BigData APM & Data Quality Management Platform for Data Clouds. Enterprise feature offering is available at http://jumbune.com. More details of open source offering are at,
aiops apm cluster-monitoring data-analysis data-quality developer-tools devops-tools hadoop hadoop-cluster hadoop-monitor hadoop-monitoring monitoring-tool optimization-framework yarn yarn-hadoop-cluster
Last synced: 15 Jun 2026
https://github.com/kevinadhiguna/dqlab-career-track
A collection of scripts written to complete DQLab Data Analyst Career Track
career-track data-analysis data-analyst data-manipulation data-quality data-visualization dqlab dqlab-career-track exploratory-data-analysis machine-learning python sql
Last synced: 21 Mar 2025
https://github.com/impetus/jumbune
Jumbune, an open source BigData APM & Data Quality Management Platform for Data Clouds. Enterprise feature offering is available at http://jumbune.com. More details of open source offering are at,
aiops apm cluster-monitoring data-analysis data-quality developer-tools devops-tools hadoop hadoop-cluster hadoop-monitor hadoop-monitoring monitoring-tool optimization-framework yarn yarn-hadoop-cluster
Last synced: 17 Dec 2025
https://github.com/sodadata/soda-spark
Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes
data-engineering data-observability data-quality data-testing pyspark python soda-sql spark
Last synced: 26 Jul 2025
https://github.com/ucd-dnp/leila
Library for data quality assessment and interaction with the datos.gov.co portal
data-quality data-science eda espanol exploratory-data-analysis python report-generator ucd
Last synced: 05 Apr 2026
https://github.com/provectus/data-quality-gate
Data Quality Gate based on AWS
athena aws aws-lambda data-governance data-quality great-expectations redshift s3 terraform
Last synced: 20 Aug 2025
https://github.com/anerv/bikedna
BikeDNA: Bicycle Infrastructure Data & Network Assessment
bicycle-infrastructure bicycle-network data-quality geospatial-data openstreetmap sustainable-mobility urban-planning volunteered-geographic-information
Last synced: 23 Mar 2025
https://github.com/datakitchen/dataops-testgen
DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution by data profiling, new dataset hygiene review, AI generation of data quality validation tests, ongoing testing of data refreshes, & continuous anomaly monitoring
data data-engineering data-observability data-quality data-science data-testing datachecker dataops dataprofiling dataquality datavalidation mssql postgresql python redshift self-hosted snowflake
Last synced: 25 Feb 2026
https://github.com/vertti/daffy
Lightweight DataFrame validation decorators for Pandas, Polars, Modin, and PyArrow. No custom types required.
data-quality data-validation dataframe dataframe-schema dataframe-validation decorator modin narwhals pandas polars pyarrow pydantic python python-decorator runtime-validation validation
Last synced: 21 Feb 2026
https://github.com/kenthsu/udacity-data-engineering-nanodgree
Udacity Data Engineering Nanodegree Program
apache-airflow apache-cassandra apache-spark aws-redshift aws-s3 data-engineering data-lake data-pipelines data-quality data-warehouses postgresql
Last synced: 10 Apr 2025
https://github.com/sparkdq-community/sparkdq
A declarative PySpark framework for row- and aggregate-level data quality validation.
data-check data-engineering data-quality data-validation data-verification dq-framework pyspark pyspark-validation spark-data-quality
Last synced: 02 Jul 2025
https://github.com/davidberenstein1957/dataset-viber
Dataset Viber is your chill repo for data collection, annotation and vibe checks.
data-collection data-quality evaluation human-feedback
Last synced: 06 Mar 2025
https://github.com/bitol-io/open-data-product-standard
Home of the Open Data Product Standard (ODPS).
data data-engineering data-mesh data-product data-products data-quality standard
Last synced: 10 Mar 2026
https://github.com/realdatadriven/etlx
ETL / ELT / Reverse ETL Framework powered by DuckDB, designed to seamlessly integrate and process data from diverse sources. It leverages Markdown as a configuration medium, where YAML blocks define metadata for each data source, and embedded SQL blocks specify the extraction, transformation, and loading logic.
data-engineering data-lake data-lakehouse data-quality data-quality-checks data-quality-monitoring data-science duckdb elt elt-pipeline etl etl-elt-pipelines etl-pipeline object-storage relational-databases report report-automation s3 s3-storage
Last synced: 30 Apr 2026
https://github.com/ropensci/daiquiri
Data quality reporting for temporal datasets.
data-quality initial-data-analysis r r-package reproducible-research rstats temporal-data time-series
Last synced: 22 Feb 2026
https://github.com/ammsa/dtcleaner
DTCleaner: data cleaning using multi-target decision trees.
data-cleaning data-mining data-preprocessing data-quality data-science data-wrangling
Last synced: 21 Mar 2025
https://github.com/giscience/ohsome-quality-api
Data quality estimations for OpenStreetMap
accuracy completeness data-quality heigit indicators ohsome openstreetmap openstreetmap-data osm osm-data reports
Last synced: 02 May 2025
https://github.com/emilyriederer/convo
R package based on "Column Names as Contracts" blog post (https://emilyriederer.netlify.app/post/column-name-contracts/)
controlled-vocabulary data-quality data-validation r-package schema-design variable-names variable-naming
Last synced: 26 Oct 2025
https://github.com/cleanlab/cleanlab-studio
Client interface to Cleanlab Studio and the Trustworthy Language Model
annotations automl computer-vision data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation image-classification llm machine-learning model-deployment natural-language-processing noisy-labels outlier-detection structured-data text-classification
Last synced: 13 Apr 2025
https://github.com/mfcabrera/hooqu
hooqu is a library built on top of Pandas-like Dataframes for defining "unit tests for data". This is a spiritual port of Apache Deequ to Python
data-quality data-quality-checks data-science
Last synced: 14 Jan 2026
https://github.com/benzsevern/goldenmatch
Entity resolution toolkit: deduplicate, match, and create golden records. 27 MCP tools on Smithery. Zero-config. 97.2% F1.
a2a agent data-engineering data-quality dbt deduplication entity-resolution fellegi-sunter fuzzy-matching golden-record golden-suite llm mcp-server polars pprl privacy-preserving python record-linkage record-matching remote-mcp
Last synced: 13 May 2026
https://github.com/bolcom/hive_compared_bq
hive_compared_bq compares/validates 2 (SQL like) tables, and graphically shows the rows/columns that are different.
bigquery data-quality hive python validation
Last synced: 13 Aug 2025
https://github.com/semyonsinchenko/tsumugi-spark
SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.
data-quality deequ pyspark spark
Last synced: 24 Oct 2025
https://github.com/mrpowers-io/tsumugi-spark
SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.
data-quality deequ pyspark spark
Last synced: 28 Jun 2025
https://github.com/timgent/data-flare
Data quality control tool built on spark and deequ
Last synced: 15 Apr 2025
https://github.com/aws-samples/monitoring-apache-iceberg-table-metadata-layer
Sample code to collect Apache Iceberg metrics for table monitoring
apache-iceberg apache-spark aws aws-cloudwatch aws-glue aws-lambda data-quality monitoring pyiceberg sam-cli
Last synced: 29 Oct 2025
https://github.com/scienxlab/redflag
Safety net for machine learning pipelines. Plays nice with sklearn and pandas.
data-quality data-quality-checks data-science machine-learning numpy pandas python
Last synced: 06 Oct 2025
https://github.com/dp6/penguin-datalayer-collect
A data layer quality monitoring and validation module, this solution is part of the Raft Suite ecosystem.
adobe-launch data-quality data-quality-monitoring datalayer dp6 dtm gtm gtm-server-side hacktoberfest marketing-automation monitoring penguin-datalayer raft-suite tealium
Last synced: 30 Jul 2025
https://github.com/kiwicom/contessa
Easy way to define, execute and store quality rules for your data.
data data-engineering data-quality framework mysql postgres python quality-assurance sqlite3
Last synced: 29 Jul 2025
https://github.com/hms-dbmi/EHRtemporalVariability
R package for delineating temporal dataset shifts in Electronic Health Records
biomedical-data-science biomedical-informatics data-quality data-quality-monitoring dataset-shifts electronic-health-records time variability visualization
Last synced: 15 May 2025
https://github.com/piotr-kalanski/data-quality-monitoring
Data Quality Monitoring Tool
data-quality monitoring scala spark
Last synced: 03 Aug 2025