Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-quality

A curated list of projects in awesome lists tagged with data-quality.

https://github.com/open-metadata/openmetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake

Last synced: 30 Dec 2024

https://github.com/evidentlyai/evidently

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.

data-drift data-quality data-science data-validation generative-ai hacktoberfest html-report jupyter-notebook llm llmops machine-learning mlops model-monitoring pandas-dataframe

Last synced: 30 Dec 2024

https://github.com/gokumohandas/mlops-course

Learn how to design, develop, deploy and iterate on production-grade ML applications.

data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray

Last synced: 01 Jan 2025

https://github.com/GokuMohandas/mlops-course

Learn how to design, develop, deploy and iterate on production-grade ML applications.

data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray

Last synced: 30 Oct 2024

https://github.com/whylabs/whylogs

An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

ai-pipelines analytics approximate-statistics calculate-statistics constraints data-constraints data-pipeline data-quality data-science dataops dataset logging machine-learning ml-pipelines mlops model-performance python statistical-properties

Last synced: 31 Dec 2024

https://github.com/featureform/featureform

The Virtual Feature Store. Turn your existing data infrastructure into a feature store.

data-quality data-science embeddings embeddings-similarity feature-engineering feature-store hacktoberfest machine-learning ml mlops python vector-database

Last synced: 01 Jan 2025

https://github.com/WeBankFinTech/Qualitis

Qualitis is a one-stop data quality management platform that supports quality verification, notification, and management for various data sources. It is used to solve various data quality problems caused by data processing. https://github.com/WeBankFinTech/Qualitis

compare data-quality data-quality-model datashperestudio dss linkis quality quality-check quality-improvement workflow

Last synced: 05 Nov 2024

https://github.com/alibaba/feathub

FeatHub - A stream-batch unified feature store for real-time machine learning

apache-flink data data-engineering data-quality data-science feature-engineering feature-store machine-learning mlops streaming

Last synced: 05 Nov 2024

https://github.com/ubisoft/mobydq

🐳 Tool to automate data quality checks on data pipelines

big-data data-pipeline data-quality data-quality-checks data-quality-monitoring data-warehouse

Last synced: 11 Nov 2024

https://github.com/adidas/lakehouse-engine

The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.

big-data configuration-driven data-engineering data-quality databricks delta-lake framework great-expectations lakehouse spark

Last synced: 03 Jan 2025

https://github.com/gair-nlp/prox

Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"

continual continual-pre-training data-centric-ai data-quality llama llm mistral neural-symbolic pre-training

Last synced: 04 Jan 2025

https://github.com/ohdsi/dataqualitydashboard

A tool to help improve data quality standards in observational data science.

data-quality

Last synced: 01 Jan 2025

https://github.com/OHDSI/DataQualityDashboard

A tool to help improve data quality standards in observational data science.

data-quality

Last synced: 27 Nov 2024

https://github.com/dqops/dqo

Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, let DQOps run the data quality checks daily to detect data quality issues.

data-observability data-ops data-profiling data-quality data-quality-checks data-quality-measurement data-quality-monitoring data-quality-report monitoring

Last synced: 19 Nov 2024

https://github.com/Seddryck/NBi

NBi is a testing framework (add-on to NUnit) for Business Intelligence and Data Access. The main goal of this framework is to let users create tests with a declarative approach based on an XML syntax. With NBi, you don't need to write C# or Java code to specify your tests, nor do you need Visual Studio or Eclipse to compile your test suite. Just create an XML file and let the framework interpret it and run your tests. The framework is designed as an add-on to NUnit, with the possibility of porting it easily to other testing frameworks.

business-intelligence cube data-quality data-quality-checks database etl nunit test-automation test-framework

Last synced: 13 Nov 2024

https://github.com/seddryck/nbi

NBi is a testing framework (add-on to NUnit) for Business Intelligence and Data Access. The main goal of this framework is to let users create tests with a declarative approach based on an XML syntax. With NBi, you don't need to write C# or Java code to specify your tests, nor do you need Visual Studio or Eclipse to compile your test suite. Just create an XML file and let the framework interpret it and run your tests. The framework is designed as an add-on to NUnit, with the possibility of porting it easily to other testing frameworks.

business-intelligence cube data-quality data-quality-checks database etl nunit test-automation test-framework

Last synced: 04 Jan 2025

https://github.com/datakitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake

Last synced: 04 Jan 2025

https://github.com/re-data/dbt-re-data

re_data - fix data issues before your users & CEO would discover them 😊

data-monitoring data-observability data-quality data-testing dbt dbt-packages sql

Last synced: 06 Nov 2024

https://github.com/aai-institute/pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation

banzhaf-index data-centric-ai data-cleaning data-pruning data-quality data-valuation game-theory influence-functions least-core machine-learning robust-machine-learning shapley-value transferlab

Last synced: 17 Nov 2024

https://github.com/DataKitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake

Last synced: 13 Nov 2024

https://github.com/aws-samples/amazon-deequ-glue

Automated data quality suggestions and analysis with Deequ on AWS Glue

aws aws-glue data-quality deequ

Last synced: 04 Dec 2024

https://github.com/great-expectations/great_expectations_action

A GitHub Action that makes it easy to use Great Expectations to validate your data pipelines in your CI workflows.

actions continuous-integration data-integrity data-quality data-science mlops

Last synced: 06 Nov 2024

https://github.com/impetus/jumbune

Jumbune, an open source BigData APM & Data Quality Management Platform for Data Clouds. Enterprise feature offering is available at http://jumbune.com. More details of open source offering are at,

aiops apm cluster-monitoring data-analysis data-quality developer-tools devops-tools hadoop hadoop-cluster hadoop-monitor hadoop-monitoring monitoring-tool optimization-framework yarn yarn-hadoop-cluster

Last synced: 14 Nov 2024

https://github.com/datakitchen/dataops-testgen

DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution through data profiling, new dataset hygiene review, AI generation of data quality validation tests, ongoing testing of data refreshes, and continuous anomaly monitoring.

data data-engineering data-observability data-quality data-science data-testing datachecker dataops dataprofiling dataquality datavalidation mssql postgresql python redshift self-hosted snowflake

Last synced: 01 Jan 2025

https://github.com/davidberenstein1957/dataset-viber

Dataset Viber is your chill repo for data collection, annotation and vibe checks.

data-collection data-quality evaluation human-feedback

Last synced: 01 Jan 2025

https://github.com/ammsa/dtcleaner

DTCleaner: data cleaning using multi-target decision trees.

data-cleaning data-mining data-preprocessing data-quality data-science data-wrangling

Last synced: 28 Oct 2024

https://github.com/emilyriederer/convo

R package based on "Column Names as Contracts" blog post (https://emilyriederer.netlify.app/post/column-name-contracts/)

controlled-vocabulary data-quality data-validation r-package schema-design variable-names variable-naming

Last synced: 04 Dec 2024

https://github.com/bolcom/hive_compared_bq

hive_compared_bq compares/validates 2 (SQL like) tables, and graphically shows the rows/columns that are different.

bigquery data-quality hive python validation

Last synced: 15 Dec 2024

https://github.com/timgent/data-flare

Data quality control tool built on Spark and Deequ

big-data data-quality spark

Last synced: 16 Nov 2024

https://github.com/semyonsinchenko/tsumugi-spark

SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.

data-quality deequ pyspark spark

Last synced: 10 Oct 2024

https://github.com/dp6/penguin-datalayer-collect

A data layer quality monitoring and validation module. This solution is part of the Raft Suite ecosystem.

adobe-launch data-quality data-quality-monitoring datalayer dp6 dtm gtm gtm-server-side hacktoberfest marketing-automation monitoring penguin-datalayer raft-suite tealium

Last synced: 04 Dec 2024

https://github.com/kiwicom/contessa

Easy way to define, execute and store quality rules for your data.

data data-engineering data-quality framework mysql postgres python quality-assurance sqlite3

Last synced: 04 Dec 2024

https://github.com/ahmadassaf/roomba

A Node.js tool to examine the correctness of Open Data Metadata and build custom dataset profiles

ckan ckan-api data-profiling data-quality dataset dataset-catalog dataset-metadata node portal

Last synced: 13 Oct 2024

https://github.com/christianbors/OpenRefineQualityMetrics

MetricDoc is an interactive visual exploration environment for assessing data quality

data-profiling data-quality data-quality-checks data-wrangling interactive-visualizations quality-metrics visual-analytics

Last synced: 05 Nov 2024

https://github.com/dp6/penguin-datalayer

Assisted crawler for validating objects sent to the data layer (Data Layer)

data-quality data-quality-checks datalayer dp6 gtm hacktoberfest json-schema nodejs raft-suite

Last synced: 04 Dec 2024

https://github.com/datarootsio/notion-dbs-data-quality

Using Great Expectations and Notion's API, this repo aims to provide data quality for our databases in Notion.

data-engineering-pipeline data-quality great-expectations notion notion-api notion-database

Last synced: 14 Nov 2024

https://github.com/adidas/lakehouse-engine-docs

The goal of this project is to provide documentation for the Lakehouse Engine framework.

big-data data-engineering data-quality databricks delta-lake framework great-expectations lakehouse lakehouse-engine spark

Last synced: 12 Oct 2024

https://github.com/open-risk/dataqualitytoolkit

Python toolkit for evaluating and visualizing the data quality of Excel spreadsheets

data-quality data-quality-measurement data-science excel spreadsheet

Last synced: 13 Oct 2024

https://github.com/dp6/raft-suite-hub

The Hub is the solution responsible for centralizing data consolidation in BigQuery, the tool chosen to serve as the data warehouse for raft-suite.

bigquery data data-quality google-cloud google-cloud-functions hacktoberfest

Last synced: 04 Dec 2024

https://github.com/dp6/penguin-document-formatter

A document reader to extract Google Analytics planned events to use on the Raft Suite Data Quality

analytics data-quality google-cloud hacktoberfest monitoring pdf-converter

Last synced: 04 Dec 2024

https://github.com/giscience/ohsome-dashboard

Web Client for easy access to OSM History and Quality Analyses

data-quality openstreetmap openstreetmap-data openstreetmap-history osm osm-data

Last synced: 12 Nov 2024

https://github.com/nationalparkservice/qckit

QCkit provides useful functions for data quality control and manipulation including updating data to DarwinCore standards, unit conversions, and data flagging.

darwin-core data-quality data-science npsdataverse quality-control r r-package rstats

Last synced: 08 Nov 2024

https://github.com/harpin-ai/toolkit-examples

Examples for trying out the harpin AI identity resolution and data quality toolkit

data-engineering data-quality dedupe deduplication entity-resolution identity identity-resolution spark

Last synced: 01 Nov 2024

https://github.com/absaoss/spark-data-standardization

A library for Spark that helps to standardize any input data (DataFrame) to adhere to the provided schema.

data-quality data-structures scala schema spark

Last synced: 07 Nov 2024

https://github.com/dev-ev/isobaric-inspection-jupyter

Inspecting the quality of isobaric labeling proteomic data in a Jupyter notebook. Data output from Proteome Discoverer.

data-quality data-visualization data-wrangling isobaric-labeling jupyterlab mass-spectrometry proteome-discoverer proteomics proteomics-data-analysis python quantitative-proteomics

Last synced: 19 Nov 2024

https://github.com/data-drift/dbt-snapshot-analytics

Get insight from a dbt snapshot on your metric quality

analytics data-quality dbt monitoring snapshot

Last synced: 30 Dec 2024

https://github.com/maastrichtu-ids/dqa-pipeline

Large-scale RDF-based Data Quality Assessment Pipeline

data-quality docker fair-data rdf sparql

Last synced: 21 Dec 2024

https://github.com/maastrichtu-ids/fairsharing-metrics

📊 Fairsharing metrics implementation

bioinformatics data-quality docker python rdf rdfunit

Last synced: 21 Dec 2024

https://github.com/opendatadiscovery/odd-great-expectations

Integration for collecting metadata from Great Expectations

data-governance data-quality

Last synced: 14 Nov 2024