An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with dataquality

A curated list of projects in awesome lists tagged with dataquality.

https://github.com/open-metadata/openmetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake

Last synced: 22 Feb 2026

https://github.com/open-metadata/OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake

Last synced: 15 Mar 2025

https://github.com/awslabs/deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

dataquality scala spark unit-testing

Last synced: 13 May 2025

https://github.com/datavane/datavines

Know your data better! Datavines is a next-gen data observability platform that supports metadata management and data quality.

dataobservability dataprofile dataquality datascience doris metadata spark

Last synced: 09 Apr 2025

https://github.com/osmcha/osmcha-frontend

Frontend for the osmcha-django REST API

dataquality openstreetmap osm osmcha qa

Last synced: 04 Apr 2025

https://github.com/autoviml/pandas_dq

Find data quality issues and clean your data in a single line of code with a Scikit-Learn compatible Transformer.

data data-science dataquality dataqualitycheck machine-learning pandas python scikit-learn

Last synced: 07 Apr 2025

https://github.com/DataKitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake

Last synced: 05 May 2025

https://github.com/datakitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake

Last synced: 02 Apr 2026

https://github.com/datakitchen/dataops-testgen

DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution through data profiling, new dataset hygiene review, AI generation of data quality validation tests, ongoing testing of data refreshes, and continuous anomaly monitoring.

data data-engineering data-observability data-quality data-science data-testing datachecker dataops dataprofiling dataquality datavalidation mssql postgresql python redshift self-hosted snowflake

Last synced: 25 Feb 2026

https://github.com/infinitelambda/dq-tools

Makes it simple to store test results and visualise them in a BI dashboard.

dataquality dbt package

Last synced: 01 Mar 2026

https://github.com/mundipagg/amora-data-build-tool

Amora Data Build Tool enables analysts and engineers to transform data on the data warehouse (BigQuery) by writing Amora Models that describe the data schema using Python's "PEP484 - Type Hints" and select statements with SQLAlchemy. Amora is able to transform Python code into SQL data transformation jobs that run inside the warehouse.

analytics analytics-dashboard analytics-engineering bigquery business-intelligence data-engineering data-modeling datacleaning dataquality elt machine-learning python transformation

Last synced: 08 Sep 2025

https://github.com/AltimateAI/datapilot-cli

Datapilot-cli is the command line interface for accessing the AI teammate for engineers to ensure best practices in their SQL and dbt projects.

dataquality dbt dbt-core

Last synced: 05 May 2025

https://github.com/grillazz/fastapi-greatexpectations

Run greatexpectations.io on ANY SQL engine using a REST API. Built with FastAPI, Pydantic, and SQLAlchemy as a data quality tool.

dataquality dataqualitycheck fastapi great-expectations pydantic python python3 sql sqlalchemy

Last synced: 30 Aug 2025

https://github.com/bikash/dataquality

Tutorial and examples of data quality in big data systems

big-data data-quality dataquality

Last synced: 04 Mar 2026

https://github.com/opt-nc/setup-duckdb-action

🦆 Blazing fast and highly customizable GitHub Action to set up a DuckDB runtime

action actions analytics csv data-science database databases dataquality dataqualitycheck duckdb embedded-database github-actions olap sql

Last synced: 16 Mar 2026

https://github.com/huemulsolutions/huemul-bigdatagovernance

Huemul BigDataGovernance is a framework that runs on Spark, Hive, and HDFS. It enables a corporate single-source-of-data strategy based on data governance best practices. It lets you implement tables with primary key and foreign key enforcement when inserting and updating data through the library, along with validation of nulls, text lengths, minimum/maximum values for numbers and dates, unique values, and default values. It also lets you classify fields by the applicability of ARCO rights to ease the implementation of GDPR-style data protection laws, and identify security levels and whether any type of encryption is being applied. Additionally, it lets you add more complex validation rules on the same table.

bigdata chile cloudera data data-engineer data-engineering data-governance data-warehouse datamart dataquality gdpr hadoop hive hortonworks huemul huemul-bigdatagovernance parquet spark spark-sql trabaja-sobre-spark

Last synced: 26 Apr 2025

https://github.com/rodrigobaron/qafs

Quality Aware Feature Store

dataquality feature-engineering feature-store

Last synced: 17 Aug 2025

https://github.com/koddachad/dq_tester

A lightweight, simple data quality testing tool.

data database dataengineering dataquality dataqualitycheck

Last synced: 08 Oct 2025

https://github.com/jabardigitalservice/datasae

Data Quality Framework provided by Jabar Digital Service

dataquality python

Last synced: 04 Apr 2026

https://github.com/josephmachado/data-quality-w-greatexpectations

Code for data quality with greatexpectations blog

dataengineering dataquality greatexpectations python

Last synced: 15 Apr 2025

https://github.com/lapetitesouris/kuronososhiki

Data Stream Quality Control with Apache Kafka

dataquality faust kafka kafka-streams

Last synced: 03 Apr 2025

https://github.com/cintia0528/data_cleaning_and_analytics-python

Evaluate if aggressive discounting benefits Eniac long-term, considering differing views on customer acquisition and brand positioning. Focus on data cleaning for informed decision-making.

colab-notebook data data-analysis datacleaning dataquality jupyter-notebook matplotlib pandas python seaborn

Last synced: 08 Jan 2026

https://github.com/santiviquez/feedsanity

Minimal, educational RSS reader with built-in data quality validation using Soda Core.

dataquality rss soda

Last synced: 24 Feb 2026

https://github.com/kevinndungu-source/amazon_redshift_s3_data_pipeline

This repository contains code and documentation for the Amazon Redshift project, showcasing a data pipeline using Amazon S3. Explore how I manage and analyze large datasets in the cloud!

amazon-redshift-data-pipeline-aws-cloud-computing amazon-s3 datamanagement dataquality redshift-database sql-query vpc-creation vpc-endpoint

Last synced: 06 Mar 2025

https://github.com/rgzafra11/excel_sales_analytics

Excel_Sales_Analytics 📊 This repository contains a comprehensive business intelligence report for AtliQ Hardware, focusing on sales performance and strategic insights. 🚀 Explore data-driven analytics to enhance product offerings and optimize sales strategies for improved profitability.

businessanalytics businessinsights data-visualization dataquality eda excel excel-dashboard jupyter-notebook kpi pivot-table powerquery recommandation sales-insights seaborn

Last synced: 01 Jul 2025

https://github.com/interzoid/companynamesimkey-go

Generates a similarity key for a company/organization name for matching inconsistent names within one or more datasets

ai api companydata companynames data-quality-assessment dataquality go go-package golang standardization

Last synced: 12 Jan 2026

https://github.com/amsh4/driftsiren

DriftSiren is a production-grade platform for real-time data drift and quality monitoring. Built with Next.js, FastAPI, and Docker, it tracks feature drift, provides live alerts, and visualizes metrics on a sleek dashboard. Includes agent, APIs, Celery workers, and Kubernetes-ready setup.

celery cicd datadrift dataquality docker fastapi k8s machine-learning nextjs observability postgresql real-time redis tailwindcss typescript websocket

Last synced: 30 Dec 2025

https://github.com/jotstolu/azure-data-engineering-end--to-end-project

An end-to-end Netflix data engineering pipeline built on Microsoft Azure. This project ingests raw Netflix data, applies PySpark transformations, enforces data quality with Delta Live Tables, and orchestrates workflows via Azure Data Factory and Databricks.

adf adlsgen2 azuredatabricks azuredatafactory cloudcomputing dataengineering datapipeline dataquality deltalake deltalivetables medallionarchitecture pyspark

Last synced: 31 Jan 2026

https://github.com/interzoid/fullnamesimkey-go

Generates a similarity key for an individual name for matching inconsistent names within one or more datasets

ai dataquality go golang golang-package matching

Last synced: 11 Jan 2026

https://github.com/kevinndungu-source/data_analytics_projects_python_powerbi

Explore comprehensive retail analysis and insights focusing on Marley International Store, powered by data analytics techniques and visualizations using Python, PowerBI and Jupyter Notebooks.

analytics datacleaning datamanagement datamanipulation dataquality dax-query juypter-notebook matplotlib on-premises-data-gateway powerbi-desktop powerbi-service powerquery python python-script sql-query validity-check

Last synced: 26 Jul 2025

https://github.com/interzoid/streetaddresssimkey-go

Generates a similarity key for a street address for matching inconsistent street addresses within one or more datasets

addresses ai dataquality go go-package golang

Last synced: 12 Jan 2026

https://github.com/projects-developer/data-duplication-removal-using-machine-learning

This project utilizes machine learning algorithms to detect and remove duplicate data entries from a dataset. Project Includes Source Code, PPT, Synopsis, Report, Documents, Base Research Paper & Video tutorials

btechprojects computerscienceprojects dataanalytics datacleaning dataduplicationremoval datamanagement datamatching dataquality duplicatedetection machinelearning mtechprojects

Last synced: 23 Feb 2025