Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with data-quality
A curated list of projects in awesome lists tagged with data-quality.
https://github.com/gokumohandas/made-with-ml
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 29 Sep 2024
https://github.com/GokuMohandas/Made-With-ML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 31 Jul 2024
https://github.com/eugeneyan/applied-ml
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
applied-data-science applied-machine-learning computer-vision data-discovery data-engineering data-quality data-science deep-learning machine-learning natural-language-processing production recsys reinforcement-learning search
Last synced: 29 Sep 2024
https://github.com/pandas-profiling/pandas-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
big-data-analytics data-analysis data-exploration data-profiling data-quality data-science deep-learning eda exploration exploratory-data-analysis hacktoberfest html-report jupyter jupyter-notebook machine-learning pandas pandas-dataframe pandas-profiling python statistics
Last synced: 03 Aug 2024
https://github.com/ydataai/ydata-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
big-data-analytics data-analysis data-exploration data-profiling data-quality data-science deep-learning eda exploration exploratory-data-analysis hacktoberfest html-report jupyter jupyter-notebook machine-learning pandas pandas-dataframe pandas-profiling python statistics
Last synced: 29 Sep 2024
https://github.com/great-expectations/great_expectations
Always know what to expect from your data.
cleandata data-engineering data-profilers data-profiling data-quality data-science data-unit-tests datacleaner datacleaning dataquality dataunittest eda exploratory-analysis exploratory-data-analysis exploratorydataanalysis mlops pipeline pipeline-debt pipeline-testing pipeline-tests
Last synced: 29 Sep 2024
https://github.com/cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
active-learning annotation data-analysis data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation dataops dataquality datasets labeling llms noisy-labels out-of-distribution-detection outlier-detection weak-supervision
Last synced: 31 Jul 2024
https://github.com/kestra-io/kestra
Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
data data-engineering data-integration data-orchestration data-orchestrator data-pipeline data-quality elt etl low-code orchestration pipeline reverse-etl scheduler workflow workflow-engine
Last synced: 29 Sep 2024
https://github.com/voxel51/fiftyone
The open-source tool for building high-quality datasets and computer vision models
active-learning artificial-intelligence computer-vision data-centric-ai data-cleaning data-curation data-quality data-science deep-learning developer-tools image-classification machine-learning object-detection python unstructured-data vector-search visualization
Last synced: 31 Jul 2024
https://github.com/feast-dev/feast
The Open Source Feature Store for Machine Learning
big-data data-engineering data-quality data-science feature-store features machine-learning ml mlops python
Last synced: 29 Sep 2024
https://github.com/open-metadata/openmetadata
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datacatalog datadiscovery dataengineering dataquality dbt metadata metadata-management snowflake
Last synced: 29 Sep 2024
https://github.com/open-metadata/OpenMetadata
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datacatalog datadiscovery dataengineering dataquality dbt metadata metadata-management snowflake
Last synced: 31 Jul 2024
https://github.com/treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 31 Jul 2024
https://github.com/treeverse/lakefs
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 28 Sep 2024
https://github.com/gokumohandas/mlops-course
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 30 Sep 2024
https://github.com/datafold/data-diff
Compare tables within or across databases
data data-diffing data-engineering data-quality data-quality-monitoring data-science database databricks-sql dataengineering dataquality dbt mysql oracle-database postgres postgresql python rdbms snowflake sql trino
Last synced: 27 Sep 2024
https://github.com/GokuMohandas/mlops-course
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 31 Jul 2024
https://github.com/whylabs/whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
ai-pipelines analytics approximate-statistics calculate-statistics constraints data-constraints data-pipeline data-quality data-science dataops dataset logging machine-learning ml-pipelines mlops model-performance python statistical-properties
Last synced: 30 Sep 2024
https://github.com/feathr-ai/feathr
Feathr – A scalable, unified data and AI engineering platform for enterprise
apache-spark artificial-intelligence azure data-engineering data-quality data-science feature-engineering feature-governance feature-management feature-marketplace feature-metadata feature-platform feature-store machine-learning mlops
Last synced: 28 Sep 2024
https://github.com/sodadata/soda-core
⚡ Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
data-contracts data-engineering data-governance data-monitoring data-observability data-profiling data-quality data-quality-checks data-quality-monitoring data-quality-testing data-reliability data-testing data-unit-tests data-validation dataquality datatesting dbt pipeline-testing python snowflake
Last synced: 30 Sep 2024
https://github.com/featureform/featureform
The Virtual Feature Store. Turn your existing data infrastructure into a feature store.
data-quality data-science embeddings embeddings-similarity feature-engineering feature-store hacktoberfest machine-learning ml mlops python vector-database
Last synced: 30 Sep 2024
https://github.com/featureform/embeddinghub
The Virtual Feature Store. Turn your existing data infrastructure into a feature store.
data-quality data-science embeddings embeddings-similarity feature-engineering feature-store hacktoberfest machine-learning ml mlops python vector-database
Last synced: 31 Jul 2024
https://github.com/re-data/re-data
re_data - fix data issues before your users & CEO would discover them 😊
data-analysis data-monitoring data-observability data-quality data-quality-checks data-quality-monitoring data-reliability data-testing dataquality dbt dbt-packages open-source-tooling
Last synced: 30 Sep 2024
https://github.com/opendatadiscovery/odd-platform
First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
alerting bigdata data-catalog data-discovery data-engineering data-exploration data-governance data-lineage data-observability data-pipelines data-platform data-profiling data-quality data-science datacatalog lineage metadata metadata-management observability oss
Last synced: 30 Sep 2024
https://github.com/daochenzha/data-centric-ai
A curated, but incomplete, list of data-centric AI resources.
ai artificial-intelligence data-centric data-centric-ai data-centric-machine-learning data-curation data-engineering data-quality data-science machine-learning
Last synced: 30 Sep 2024
https://github.com/daochenzha/data-centric-AI
A curated, but incomplete, list of data-centric AI resources.
ai artificial-intelligence data-centric data-centric-ai data-centric-machine-learning data-curation data-engineering data-quality data-science machine-learning
Last synced: 31 Jul 2024
https://github.com/cleanlab/cleanvision
Automatically find issues in image datasets and practice data-centric computer vision.
computer-vision data-centric-ai data-exploration data-profiling data-quality data-science data-validation deep-learning exploratory-data-analysis image-analysis image-classification image-generation image-quality image-segmentation
Last synced: 30 Jul 2024
https://github.com/rstudio/pointblank
Data quality assessment and metadata reporting for data frames and database tables
data-assertions data-checker data-dictionaries data-frames data-inference data-management data-profiler data-quality data-validation data-verification database-tables easy-to-understand reporting-tool schema-validation testing-tools yaml-configuration
Last synced: 30 Jul 2024
https://github.com/kennethleungty/Failed-ML
Compilation of high-profile real-world examples of failed machine learning projects
ai artificial-intelligence classification computer-vision data-engineering data-quality data-science deep-learning failed-data-science failed-machine-learning failed-ml fml forecasting machine-learning ml natural-language-processing production recsys regression
Last synced: 01 Aug 2024
https://github.com/WeBankFinTech/Qualitis
Qualitis is a one-stop data quality management platform that supports quality verification, notification, and management for various data sources. It is used to solve data quality problems caused by data processing.
compare data-quality data-quality-model datashperestudio dss linkis quality quality-check quality-improvement workflow
Last synced: 01 Aug 2024
https://github.com/polyaxon/traceml
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
dask data-exploration data-profiling data-quality data-quality-checks data-science data-visualization dataframes dataops explainable-ai matplotlib mlops pandas pandas-summary plotly pytorch spark statistics tensorflow tracking
Last synced: 27 Sep 2024
https://github.com/InfuseAI/piperider
Code review for data in dbt
code-review continuous-integration data-exploration data-observability data-pipeline data-profiler data-profiling data-quality data-reliability data-science data-testing data-visualization dbt dbt-metrics eda exploratory-data-analysis pull-requests python reporting
Last synced: 01 Aug 2024
https://github.com/alibaba/feathub
FeatHub - A stream-batch unified feature store for real-time machine learning
apache-flink data data-engineering data-quality data-science feature-engineering feature-store machine-learning mlops streaming
Last synced: 01 Aug 2024
https://github.com/data-drift/data-drift
Metrics Observability & Troubleshooting
analytics bigquery context data-diffing data-governance data-lineage data-monitoring data-observability data-quality data-reliability data-version-control dbt dbt-metrics dbt-packages drill-down metrics reconciliation redshift semantic-layer snowflake
Last synced: 29 Sep 2024
https://github.com/ubisoft/mobydq
🐳 Tool to automate data quality checks on data pipelines
big-data data-pipeline data-quality data-quality-checks data-quality-monitoring data-warehouse
Last synced: 02 Aug 2024
https://github.com/bitol-io/open-data-contract-standard
Home of the Open Data Contract Standard (ODCS).
data data-contract data-contracts data-engineering data-mesh data-quality
Last synced: 06 Aug 2024
https://github.com/whylabs/whylogs-java
Profile and monitor your ML data pipeline end-to-end
ai-pipelines aiops apache-spark approximate-statistics calculate-statistics data-quality dataset java mlops spark statistical-properties statistics whylogs
Last synced: 28 Sep 2024
https://github.com/AKSW/RDFUnit
An RDF Unit Testing Suite
data-quality data-quality-checks data-validation rdf schema schema-validation shacl unit-testing validation web-ontology-language
Last synced: 01 Aug 2024
https://github.com/OHDSI/DataQualityDashboard
A tool to help improve data quality standards in observational data science.
Last synced: 08 Aug 2024
https://github.com/aai-institute/pyDVL
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
banzhaf-index data-centric-ai data-cleaning data-pruning data-quality data-valuation game-theory influence-functions least-core machine-learning robust-machine-learning shapley-value transferlab
Last synced: 03 Aug 2024
https://github.com/aws-samples/amazon-deequ-glue
Automated data quality suggestions and analysis with Deequ on AWS Glue
aws aws-glue data-quality deequ
Last synced: 13 Aug 2024
https://github.com/gclunies/reflekt
Define, govern, and model event data for warehouse-first product analytics.
avo customer-data-platform data-modeling data-quality data-warehouse dbt dbt-package events governance product-analytics schema-registry segment segment-protocols
Last synced: 26 Sep 2024
https://github.com/GClunies/Reflekt
Define, govern, and model event data for warehouse-first product analytics.
avo customer-data-platform data-modeling data-quality data-warehouse dbt dbt-package events governance product-analytics schema-registry segment segment-protocols
Last synced: 05 Sep 2024
https://github.com/great-expectations/great_expectations_action
A GitHub Action that makes it easy to use Great Expectations to validate your data pipelines in your CI workflows.
actions continuous-integration data-integrity data-quality data-science mlops
Last synced: 01 Aug 2024
https://github.com/datakitchen/data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake
Last synced: 28 Sep 2024
https://github.com/dqops/dqo
Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, let DQOps run the data quality checks daily to detect data quality issues.
data-observability data-ops data-profiling data-quality data-quality-checks data-quality-measurement data-quality-monitoring data-quality-report monitoring
Last synced: 04 Aug 2024
https://github.com/DataKitchen/data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake
Last synced: 02 Aug 2024
https://github.com/kenthsu/udacity-data-engineering-nanodgree
Udacity Data Engineering Nanodegree Program
apache-airflow apache-cassandra apache-spark aws-redshift aws-s3 data-engineering data-lake data-pipelines data-quality data-warehouses postgresql
Last synced: 29 Sep 2024
https://github.com/ropensci/daiquiri
Data quality reporting for temporal datasets.
data-quality initial-data-analysis r r-package reproducible-research rstats temporal-data time-series
Last synced: 13 Aug 2024
https://github.com/emilyriederer/convo
R package based on "Column Names as Contracts" blog post (https://emilyriederer.netlify.app/post/column-name-contracts/)
controlled-vocabulary data-quality data-validation r-package schema-design variable-names variable-naming
Last synced: 13 Aug 2024
https://github.com/semyonsinchenko/tsumugi-spark
SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.
data-quality deequ pyspark spark
Last synced: 26 Sep 2024
https://github.com/kiwicom/contessa
Easy way to define, execute and store quality rules for your data.
data data-engineering data-quality framework mysql postgres python quality-assurance sqlite3
Last synced: 13 Aug 2024
https://github.com/piotr-kalanski/data-quality-monitoring
Data Quality Monitoring Tool
data-quality monitoring scala spark
Last synced: 02 Aug 2024
https://github.com/hms-dbmi/EHRtemporalVariability
R package for delineating temporal dataset shifts in Electronic Health Records
biomedical-data-science biomedical-informatics data-quality data-quality-monitoring dataset-shifts electronic-health-records time variability visualization
Last synced: 04 Aug 2024
https://github.com/aws-samples/monitoring-apache-iceberg-table-metadata-layer
Sample code to collect Apache Iceberg metrics for table monitoring
apache-iceberg apache-spark aws aws-cloudwatch aws-glue aws-lambda data-quality monitoring pyiceberg sam-cli
Last synced: 28 Sep 2024
https://github.com/data-catering/data-caterer
Data generation and validation tool for any data source
data-generation data-quality data-test data-testing data-validation java scala testing-automation ui yaml
Last synced: 05 Sep 2024
https://github.com/christianbors/OpenRefineQualityMetrics
MetricDoc is an interactive visual exploration environment for assessing data quality
data-profiling data-quality data-quality-checks data-wrangling interactive-visualizations quality-metrics visual-analytics
Last synced: 01 Aug 2024
https://github.com/adidas/lakehouse-engine-docs
The goal of this project is to provide documentation for the Lakehouse Engine framework.
big-data data-engineering data-quality databricks delta-lake framework great-expectations lakehouse lakehouse-engine spark
Last synced: 28 Sep 2024
https://github.com/byteplant/phone-validator-net
NodeJS wrapper for the phone-validator.net API
byteplant cleaning cleaning-data data-quality data-validation javascript node-js node-module phone phone-marketing phone-number phone-number-verification phone-validation phonenumber typescript validation
Last synced: 28 Sep 2024
https://github.com/BetweenTwoTests/between_dbs
DDL & test data for different databases for ETL data quality checks / data loading tests
Last synced: 13 Aug 2024
https://github.com/bharathsudharsan/tiny-impute
On-device Hybrid Anomaly Detection and Data Imputation
anamoly-detection arduino data-quality edge-computing esp32 expectation-maximization imputation-algorithm iot knn laplacian micro-python mkr1000 moving-average raspberry-pi simple-linear-regression tinyml
Last synced: 27 Sep 2024
https://github.com/jadelhelm/autoprep
Automated Preprocessing Pipeline - DataFrame
anomalies anomaly anomaly-detection automated automated-machine-learning automation data-cleaning data-cleaning-and-preprocessing data-quality machine-learning machinelearning machinelearning-python preprocessing preprocessing-data preprocessing-pipeline python python3 sklearn standardization tabular-data
Last synced: 26 Sep 2024