Projects in Awesome Lists tagged with data-profiling
A curated list of projects in awesome lists tagged with data-profiling .
https://github.com/ydataai/ydata-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
big-data-analytics data-analysis data-exploration data-profiling data-quality data-science deep-learning eda exploration exploratory-data-analysis hacktoberfest html-report jupyter jupyter-notebook machine-learning pandas pandas-dataframe pandas-profiling python statistics
Last synced: 16 Jan 2026
https://github.com/cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
active-learning annotation data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation dataops dataquality datasets exploratory-data-analysis labeling llms noisy-labels out-of-distribution-detection outlier-detection weak-supervision
Last synced: 08 Jan 2026
https://github.com/great-expectations/great_expectations
Always know what to expect from your data.
cleandata data-engineering data-profilers data-profiling data-quality data-science data-unit-tests datacleaner datacleaning dataquality dataunittest eda exploratory-analysis exploratory-data-analysis exploratorydataanalysis mlops pipeline pipeline-debt pipeline-testing pipeline-tests
Last synced: 16 Jan 2026
https://github.com/open-metadata/openmetadata
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake
Last synced: 04 Feb 2026
https://github.com/open-metadata/OpenMetadata
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
data-catalog data-collaboration data-contracts data-discovery data-governance data-lineage data-observability data-profiling data-quality data-quality-checks data-science data-validation datadiscovery dataengineering dataquality dbt hacktoberfest metadata metadata-management snowflake
Last synced: 15 Mar 2025
https://github.com/fbdesignpro/sweetviz
Visualize and compare datasets, target values and associations, with one line of code.
data-analysis data-exploration data-profiling data-science data-visualization eda exploration exploratory-data-analysis machine-learning pandas pandas-dataframe python statistics
Last synced: 14 May 2025
https://github.com/sodadata/soda-core
:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
data-contracts data-engineering data-governance data-monitoring data-observability data-profiling data-quality data-quality-checks data-quality-monitoring data-quality-testing data-reliability data-testing data-unit-tests data-validation dataquality datatesting dbt pipeline-testing python snowflake
Last synced: 14 May 2025
https://github.com/hi-primus/optimus
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Last synced: 14 May 2025
https://github.com/opendatadiscovery/odd-platform
First open-source data discovery and observability platform. We make a life for data practitioners easy so you can focus on your business.
alerting bigdata data-catalog data-discovery data-engineering data-exploration data-governance data-lineage data-observability data-pipelines data-platform data-profiling data-quality data-science datacatalog lineage metadata metadata-management observability oss
Last synced: 15 May 2025
https://github.com/cleanlab/cleanvision
Automatically find issues in image datasets and practice data-centric computer vision.
computer-vision data-centric-ai data-exploration data-profiling data-quality data-science data-validation deep-learning exploratory-data-analysis image-analysis image-classification image-generation image-quality image-segmentation
Last synced: 06 Jan 2026
https://github.com/polyaxon/datatile
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
dask data-exploration data-profiling data-quality data-quality-checks data-science data-visualization dataframes dataops explainable-ai matplotlib mlops pandas pandas-summary plotly pytorch spark statistics tensorflow tracking
Last synced: 17 Aug 2025
https://github.com/polyaxon/traceml
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
dask data-exploration data-profiling data-quality data-quality-checks data-science data-visualization dataframes dataops explainable-ai matplotlib mlops pandas pandas-summary plotly pytorch spark statistics tensorflow tracking
Last synced: 12 Dec 2025
https://github.com/ing-bank/popmon
Monitor the stability of a Pandas or Spark dataframe ⚙︎
covariate-shift data-analysis data-distributions data-profiling data-science dataset-shifts drift-detection hacktoberfest ing-bank ipython jupyter mlops monitoring pandas population-monitoring python spark statistical-process-control statistical-tests statistics
Last synced: 13 Apr 2025
https://github.com/infuseai/piperider
Code review for data in dbt
code-review continuous-integration data-exploration data-observability data-pipeline data-profiler data-profiling data-quality data-reliability data-science data-testing data-visualization dbt dbt-metrics eda exploratory-data-analysis pull-requests python reporting
Last synced: 10 Apr 2025
https://github.com/InfuseAI/piperider
Code review for data in dbt
code-review continuous-integration data-exploration data-observability data-pipeline data-profiler data-profiling data-quality data-reliability data-science data-testing data-visualization dbt dbt-metrics eda exploratory-data-analysis pull-requests python reporting
Last synced: 18 Apr 2025
https://github.com/polyaxon/haupt
Lineage metadata API, artifacts streams, sandbox, API, and spaces for Polyaxon
bokeh data-processing data-profiling data-science data-visualization deep-learning jupyter lineage machine-learning matplotlib mlops models plotly python pytorch serving tensorflow tracking ui visualization
Last synced: 14 May 2025
https://github.com/desbordante/desbordante-core
Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows to run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data
Last synced: 22 Nov 2025
https://github.com/Desbordante/desbordante-core
Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows to run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data
Last synced: 03 Apr 2025
https://github.com/databrickslabs/dqx
Databricks framework to validate Data Quality of pySpark DataFrames and Tables
data-profiling data-quality data-quality-monitoring databricks lakeflow spark spark-streaming unity-catalog
Last synced: 09 Feb 2026
https://github.com/dqops/dqo
Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, let DQOps run the data quality checks daily to detect data quality issues.
data-observability data-ops data-profiling data-quality data-quality-checks data-quality-measurement data-quality-monitoring data-quality-report monitoring
Last synced: 13 Dec 2025
https://github.com/hi-primus/bumblebee
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
bumblebee cudf dask dask-cudf data-cleaning data-preparation data-profiling datasets gpu gui optimus prepare-data python
Last synced: 02 May 2025
https://github.com/DataKitchen/data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake
Last synced: 05 May 2025
https://github.com/datakitchen/data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake
Last synced: 04 Apr 2025
https://github.com/vida-nyu/auctus
Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index
crawling data-profiling dataset dataset-search index search search-engine
Last synced: 10 Apr 2025
https://github.com/tsegall/fta
Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
data-discovery data-profiler data-profiling date java metadata semantic-type-detection semantic-typechecking semantic-types
Last synced: 06 Feb 2026
https://github.com/cleanlab/cleanlab-studio
Client interface to Cleanlab Studio and the Trustworthy Language Model
annotations automl computer-vision data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation image-classification llm machine-learning model-deployment natural-language-processing noisy-labels outlier-detection structured-data text-classification
Last synced: 13 Apr 2025
https://github.com/open-metadata/openmetadata-site
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
automation bigdata bigdataanalytics data-catalog data-discovery data-observability data-profiling data-quality-monitoring data-science datadiscovery dataengineering dataquality datascience dbt governance hacktoberfest hacktoberfest2022 metadata metadata-api metadata-management
Last synced: 14 Apr 2025
https://github.com/ahmadassaf/roomba
A Node.js tool to examine the correctness of Open Data Metadata and build custom dataset profiles
ckan ckan-api data-profiling data-quality dataset dataset-catalog dataset-metadata node portal
Last synced: 19 Jun 2025
https://github.com/SebastianSchmidl/distod
DISTOD algorithm: Distributed discovery of bidirectional order dependencies
akka akka-cluster bidirectional-order-dependencies data-profiling distributed-systems elastic java order-dependencies order-dependency-discovery scala
Last synced: 21 Mar 2025
https://github.com/christianbors/OpenRefineQualityMetrics
MetricDoc is an interactive visual exploration environment for assessing data quality
data-profiling data-quality data-quality-checks data-wrangling interactive-visualizations quality-metrics visual-analytics
Last synced: 06 Apr 2025
https://github.com/darenasc/auto-fes
Automated exploration of files in a folder structure to extract metadata and potential usage of information.
data-exploration data-profiling data-science eda plain-text python
Last synced: 16 Mar 2025
https://github.com/hpcc-systems/datapatterns
HPCC Systems ECL bundle that provides some basic data profiling and research tools to an ECL programmer
data-profiling ecl-bundle hpcc-platform hpcc-systems
Last synced: 05 Feb 2026
https://github.com/statsim/profile
Profile. Generate data profiles in the browser (work in progress)
data-profile data-profiling online-algorithms statistics streaming-algorithms
Last synced: 24 Feb 2025
https://github.com/amr-yasser226/datagovernanceworkflow
Comprehensive data governance pipeline for SSH honeypot logs—covering data profiling, cleansing, quality assurance, encryption, classification, and GDPR/CCPA/HIPAA compliance. Built with Pandas, Pandera, YData Profiling, and cryptography, with simulated Caesar cipher attacks to demonstrate practical data-security techniques.
caesar-cipher ccpa cryptography cybersecurity data-cleaning data-encryption data-governance data-profiling data-quality data-validation data-visualization gdpr hipaa honeypot-analysis open-source pandas privacy-compliance python ssh-logs
Last synced: 05 Feb 2026
https://github.com/mzj14/function-dependency-exploration
Homework for exploring function dependencies in data sets
data-profiling function-dependency python3 tane
Last synced: 17 Mar 2025
https://github.com/hadarsharon/compars
DataFrame comparison done right, powered by Rust with polars (AKA the bear-agnostic 🐻 🐼 🐨 🐻❄️ DataFrame comparison library)
data-engineering data-profiling data-quality dataframe dataframes koalas pandas polars pyspark python rust spark
Last synced: 28 Jan 2026
https://github.com/clarelgibson/inspectr
Collection of notebooks documenting best practices for data profiling and QA in R.
Last synced: 31 Aug 2025
https://github.com/analyst-amitbisht/ydata-profiling
This repository showcases my learning process of automating EDA using 'ydata-profiling'
data-analytics data-profiling eda pandas python3 ydata-profiling
Last synced: 09 Oct 2025