An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-profiling

A curated list of projects in awesome lists tagged with data-profiling.

https://github.com/open-metadata/openmetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake

Last synced: 04 Feb 2026

https://github.com/desbordante/desbordante-core

Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It can also run data cleaning scenarios built on these algorithms. Desbordante has a console version and an easy-to-use web application.

anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data

Last synced: 22 Nov 2025

https://github.com/databrickslabs/dqx

Databricks framework for validating the data quality of PySpark DataFrames and tables

data-profiling data-quality data-quality-monitoring databricks lakeflow spark spark-streaming unity-catalog

Last synced: 09 Feb 2026

https://github.com/dqops/dqo

Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, and let DQOps run them daily to detect data quality issues.

data-observability data-ops data-profiling data-quality data-quality-checks data-quality-measurement data-quality-monitoring data-quality-report monitoring

Last synced: 13 Dec 2025

https://github.com/hi-primus/bumblebee

🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)

bumblebee cudf dask dask-cudf data-cleaning data-preparation data-profiling datasets gpu gui optimus prepare-data python

Last synced: 02 May 2025

https://github.com/DataKitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake

Last synced: 05 May 2025

https://github.com/vida-nyu/auctus

Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index

crawling data-profiling dataset dataset-search index search search-engine

Last synced: 10 Apr 2025

https://github.com/tsegall/fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country, ...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.

data-discovery data-profiler data-profiling date java metadata semantic-type-detection semantic-typechecking semantic-types

Last synced: 06 Feb 2026

https://github.com/ahmadassaf/roomba

A Node.js tool to examine the correctness of Open Data Metadata and build custom dataset profiles

ckan ckan-api data-profiling data-quality dataset dataset-catalog dataset-metadata node portal

Last synced: 19 Jun 2025

https://github.com/christianbors/OpenRefineQualityMetrics

MetricDoc is an interactive visual exploration environment for assessing data quality

data-profiling data-quality data-quality-checks data-wrangling interactive-visualizations quality-metrics visual-analytics

Last synced: 06 Apr 2025

https://github.com/darenasc/auto-fes

Automated exploration of files in a folder structure to extract metadata and identify potential uses of the information.

data-exploration data-profiling data-science eda plain-text python

Last synced: 16 Mar 2025

https://github.com/hpcc-systems/datapatterns

HPCC Systems ECL bundle that provides some basic data profiling and research tools to an ECL programmer

data-profiling ecl-bundle hpcc-platform hpcc-systems

Last synced: 05 Feb 2026

https://github.com/statsim/profile

Profile. Generate data profiles in the browser (work in progress)

data-profile data-profiling online-algorithms statistics streaming-algorithms

Last synced: 24 Feb 2025

https://github.com/amr-yasser226/datagovernanceworkflow

Comprehensive data governance pipeline for SSH honeypot logs—covering data profiling, cleansing, quality assurance, encryption, classification, and GDPR/CCPA/HIPAA compliance. Built with Pandas, Pandera, YData Profiling, and cryptography, with simulated Caesar cipher attacks to demonstrate practical data-security techniques.

caesar-cipher ccpa cryptography cybersecurity data-cleaning data-encryption data-governance data-profiling data-quality data-validation data-visualization gdpr hipaa honeypot-analysis open-source pandas privacy-compliance python ssh-logs

Last synced: 05 Feb 2026

https://github.com/mzj14/function-dependency-exploration

Homework for exploring functional dependencies in data sets

data-profiling function-dependency python3 tane

Last synced: 17 Mar 2025

https://github.com/hadarsharon/compars

DataFrame comparison done right, powered by Rust with polars (AKA the bear-agnostic 🐻 🐼 🐨 🐻‍❄️ DataFrame comparison library)

data-engineering data-profiling data-quality dataframe dataframes koalas pandas polars pyspark python rust spark

Last synced: 28 Jan 2026

https://github.com/clarelgibson/inspectr

Collection of notebooks documenting best practices for data profiling and QA in R.

data-profiling r-language

Last synced: 31 Aug 2025

https://github.com/analyst-amitbisht/ydata-profiling

This repository showcases my learning process of automating EDA using 'ydata-profiling'

data-analytics data-profiling eda pandas python3 ydata-profiling

Last synced: 09 Oct 2025