Projects in Awesome Lists tagged with data-cleansing
A curated list of projects in awesome lists tagged with data-cleansing.
https://github.com/hi-primus/optimus
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Last synced: 14 May 2025
https://github.com/data-forge/data-forge-ts
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
csv data data-analysis data-cleaning data-cleansing data-forge data-management data-manipulation data-munging data-visualization data-wrangling javascript json linq nodejs pandas visualization
Last synced: 13 May 2025
https://github.com/desbordante/desbordante-core
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It can also run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data
Last synced: 22 Nov 2025
https://github.com/probcomp/pclean
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
bayesian-inference data-cleaning data-cleansing probabilistic-graphical-models probabilistic-programming
Last synced: 08 May 2025
https://github.com/data-integrations/wrangler
Wrangler Transform: A DMD system for transforming Big Data
avro big-data cdap cdap-plugin data-cleansing data-prep data-science data-transform data-transformation manipulate-data parsing preparation project transform transform-data wrangle
Last synced: 25 Oct 2025
https://github.com/bakdata/dedupe
Java DSL for (online) deduplication
data-cleaning data-cleansing deduplication duplicate-detection duplicate-removal
Last synced: 10 Apr 2025
https://github.com/brunocampos01/porto-seguro-safe-driver-prediction
Predict if a driver will file an insurance claim next year. (Kaggle Competition)
challenge data-cleansing data-engineering data-science dataset insurance-claims kaggle kaggle-competition machine-learning porto-seguro python random-forest xgboost
Last synced: 05 Sep 2025
https://github.com/data-forge/data-forge-fs
This library contains the file system extensions to Data-Forge that allow it to directly read and write CSV and JSON files in Node.js
csv data data-analysis data-cleaning data-cleansing data-forge data-management data-manipulation data-munging data-visualization data-wrangling javascript json linq nodejs pandas visualization
Last synced: 04 Sep 2025
https://github.com/hypertextassassin0273/excel_data_organizer_and_cleaner-ds_project
Data Structures project in C++11 that uses custom Vector and String structures with move semantics (Rule of Five)
cpp11 data-cleaning data-cleansing data-structure-projects data-structures data-structures-project data-wrangling ds-projects easy-project excel-operations move-semantics object-oriented-programming oop open-source open-source-code open-source-project rule-of-five string university-project vector
Last synced: 30 Jun 2025
https://github.com/datapreprocessing/datacleaning
Data Cleaning is a Python package for data preprocessing. It cleans a CSV file and returns the cleaned data frame, handling imputation, duplicate removal, special-character replacement, and more.
data data-cleaning data-cleansing data-preprocessing data-wrangling imputation python threshold
Last synced: 14 Dec 2025
https://github.com/softwaresalt/csv-managed
csv-managed is a Rust command-line utility for high‑performance exploration and transformation of CSV data at scale, emphasizing streaming, typed operations, and reproducible workflows via schema and index files.
big-data cli-app data-cleansing data-engineering data-standardization data-transformation data-wrangling high-performance ml-engineering
Last synced: 12 Dec 2025
https://github.com/jcp/datafilter
Quickly find flags (words, phrases, etc) within your data. :male_detective:
csv data-clean data-cleansing hate-speech-detection parser python swear-filter text textfile
Last synced: 14 Jan 2026
https://github.com/agungbudiwirawan/data_science_in_telco-data_cleansing
Data cleansing using Python: handling missing values, outliers, and value standardization.
data-analysis-python data-cleansing data-science pandas python
Last synced: 08 May 2026
https://github.com/vaxdata22/water-quality-dw-on-oracle-database
This is an Oracle DB data warehouse and ETL implementation on a specially formatted water quality dataset from DEFRA, UK
advanced-sql data-cleansing data-transformation data-warehouse database-schema dimension-tables etl extract-transform-load fact-table jupyter-notebook oracle-21c oracle-database oracle-sql-developer pandas-dataframe pl-sql pl-sql-cursors pyodbc python staging-area
Last synced: 30 Apr 2026
https://github.com/samhollings/nhs_data_cleansing
A repo of reusable functions for cleansing data
cleansing data data-cleaning data-cleansing preprocessing pyspark python python3
Last synced: 05 Oct 2025
https://github.com/itrauco/vtt-to-csv-python-script
Python3 script to convert transcribed video VTT to CSV for import into Google Sheets
captions closed-captions data-cleansing data-wrangling python script transcri vtt vtt-to-csv
Last synced: 19 May 2026
https://github.com/saya304/data-cleaning-and-exploratory-data-analysis
Data Cleaning and Exploratory Data Analysis in Snowflake
data-cleansing exploratory-data-analysis snowflake sql
Last synced: 16 Mar 2026
https://github.com/miozilla/dataprep-alteryx
dataprep-alteryx :eight_spoked_asterisk: : Political & Election # DataPrep # Alteryx # Trifacta # Wrangle # Recipe
alteryx-designer data-analytics data-cleansing data-wrangling dataprep recipe trifacta
Last synced: 29 Aug 2025
https://github.com/vbhvsingh0/cdc_immunization
This project explores the relationships between different vaccines and sex, age, and other basic features in the data.
data-cleansing data-manipulation-with-pandas data-science numpy pandas-python python3
Last synced: 05 May 2026