An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-cleaning

A curated list of projects in awesome lists tagged with data-cleaning.

https://github.com/justmarkham/pandas-videos

Jupyter notebook and datasets from the pandas video series

data-analysis data-cleaning data-science jupyter-notebook pandas python tutorial

Last synced: 15 May 2025

https://github.com/ECNU-ICALK/EduChat

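An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue large language model. (General-purpose base model, GPU deployment, data cleaning.) Tribute to: LLaMA, MOSS, BELLE, Ziya, vLLM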

belle chinese-nlp data-cleaning education llama llm moss open-models

Last synced: 01 Apr 2025

https://github.com/schema-inspector/schema-inspector

Schema-Inspector is a simple JavaScript object sanitization and validation module.

data-cleaning javascript sanitization validation

Last synced: 14 May 2025

https://github.com/akanz1/klib

Easy to use Python library of customized functions for cleaning and analyzing data.

data-analysis data-cleaning data-preprocessing data-science data-visualization feature-selection klib python

Last synced: 21 Oct 2025

https://github.com/desbordante/desbordante-core

Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data

Last synced: 22 Nov 2025

https://github.com/data-cleaning/validate

Professional data validation for the R environment

data-cleaning r validation

Last synced: 21 Oct 2025

https://github.com/Desbordante/desbordante-core

Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data

Last synced: 03 Apr 2025

https://github.com/msamogh/nonechucks

Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!

data-cleaning data-pipeline data-preprocessing data-processing machine-learning preprocessing pytorch torch

Last synced: 07 May 2025

https://github.com/cambioml/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering

LLM-based text extraction from unstructured data such as PDFs, Word documents, and HTML pages. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!

data-cleaning generative-ai llm

Last synced: 11 Oct 2025

https://github.com/probcomp/pclean

A domain-specific probabilistic programming language for scalable Bayesian data cleaning

bayesian-inference data-cleaning data-cleansing probabilistic-graphical-models probabilistic-programming

Last synced: 08 May 2025

https://github.com/genomoncology/FuzzTypes

Pydantic extension for annotating autocorrecting fields.

data-cleaning fuzzy-string-matching named-entity-linking pydantic

Last synced: 11 May 2025

https://github.com/probcomp/PClean

A domain-specific probabilistic programming language for scalable Bayesian data cleaning

bayesian-inference data-cleaning data-cleansing probabilistic-graphical-models probabilistic-programming

Last synced: 04 May 2025

https://github.com/ekstroem/datamaid

An R package for data screening

data-cleaning data-screening reproducible-research

Last synced: 09 Apr 2025

https://github.com/ekstroem/dataMaid

An R package for data screening

data-cleaning data-screening reproducible-research

Last synced: 06 May 2025

https://github.com/hi-primus/bumblebee

🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)

bumblebee cudf dask dask-cudf data-cleaning data-preparation data-profiling datasets gpu gui optimus prepare-data python

Last synced: 02 May 2025

https://github.com/iam-mhaseeb/skytrax-data-warehouse

A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.

airflow data-analysis data-analytics data-cleaning data-engineering data-orchestration data-processing data-visualization data-warehouse data-warehousing database docker metabase python python3 redshift s3 s3-bucket sql

Last synced: 12 Aug 2025

https://github.com/ChrisMuir/refinr

Cluster and merge similar string values: an R implementation of Open Refine clustering algorithms

approximate-string-matching clustering cran data-cleaning data-clustering fuzzy-matching ngram openrefine r rstats

Last synced: 15 Mar 2025

https://github.com/chrismuir/refinr

Cluster and merge similar string values: an R implementation of Open Refine clustering algorithms

approximate-string-matching clustering cran data-cleaning data-clustering fuzzy-matching ngram openrefine r rstats

Last synced: 08 Sep 2025

https://github.com/aai-institute/pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation

banzhaf-index data-centric-ai data-cleaning data-pruning data-quality data-valuation game-theory influence-functions least-core machine-learning robust-machine-learning shapley-value transferlab

Last synced: 11 May 2025

https://github.com/sail-sg/sailcraft

🚢 Data Toolkit for Sailor Language Models

data-cleaning data-deduplication

Last synced: 05 Oct 2025

https://github.com/lolei/redditcleaner

Cleans Reddit Text Data 📜 🧹

data-cleaning hacktoberfest nlp praw psaw pushshift python reddit text-data

Last synced: 22 Jul 2025

https://github.com/renumics/sliceguard

A library for detecting problematic data segments in structured and unstructured data with few lines of code.

data-analysis data-cleaning data-curation data-exploration data-science data-visualization deep-learning eda exploratory-data-analysis machine-learning python visualization

Last synced: 16 Mar 2025

https://github.com/laureberti/learn2clean

Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning

automated data-cleaning data-cleaning-pipeline data-curation data-preprocessing reinforcement-learning

Last synced: 11 Sep 2025

https://github.com/ropensci/taxa

Taxonomic classes for R

data-cleaning r r-package rstats taxon taxonomy

Last synced: 21 Oct 2025

https://github.com/msberends/clean

Fast and Easy Data Cleaning (in R)

data-cleaning r

Last synced: 22 Apr 2025

https://github.com/elysian01/data-purifier

A Python library for automated exploratory data analysis, data cleaning, and data preprocessing for machine learning and natural language processing applications.

data-analysis data-cleaning data-cleaning-pipeline data-preprocessing data-science data-visualization datapurifier eda exploratory-data-analysis jupyter python-lib python-library python3

Last synced: 04 Oct 2025

https://github.com/ammsa/dtcleaner

DTCleaner: data cleaning using multi-target decision trees.

data-cleaning data-mining data-preprocessing data-quality data-science data-wrangling

Last synced: 21 Mar 2025

https://github.com/theronione/cleaner.jl

A toolbox of simple solutions for common data cleaning problems.

data data-cleaning julia

Last synced: 24 Oct 2025

https://github.com/jmcastagnetto/covid-19-data-cleanup

Scripts to clean up data from https://github.com/CSSEGISandData/COVID-19

covid-19 covid-19-data data-cleaning data-visualization datasets r

Last synced: 17 Apr 2025

https://github.com/facultyai/boltzmannclean

Fill missing values in Pandas DataFrames using Restricted Boltzmann Machines

data-cleaning data-science dataframe pandas restricted-boltzmann-machine

Last synced: 27 Jun 2025

https://github.com/data-cleaning/errorlocate

Find and replace erroneous fields in data using validation rules

data-cleaning errors invalidation r

Last synced: 22 Oct 2025

https://github.com/jkminder/data2neo

Data2Neo is a library that simplifies the conversion of data in relational format to a graph knowledge database.

data-cleaning data-conversion data-engineering data2neo database-migrations graphs neo4j relational-databases remodeling

Last synced: 12 Apr 2025

https://github.com/rubydamodar/the-ultimate-pandas-bootcamp

Welcome to the Pandas for Data Science repository! This course is designed to take you from beginner to proficient in using Pandas, the powerful data manipulation library in Python. Whether you're just starting your data science journey or looking to sharpen your skills, this repository contains all the resources

beginner-friendly csv-data data-analysis data-cleaning data-manipulation data-science data-visualization dataframe exploratory-data-analysis jupyter-notebook machine-learning matplotlib numpy pandas python python-pandas series statistical-analysis time-series titanic-dataset

Last synced: 19 Apr 2025

https://github.com/amine-smahi/r-learning-journey

Some of the projects I made when starting to learn R for Data Science at the university

afc cpa data-cleaning data-integration data-science datascience r r-language

Last synced: 18 Mar 2025

https://github.com/catalyst/moodle-local_datacleaner

Reduce, filter, and anonymize Moodle data for non-prod environments

anonymize data-cleaning datacleaner moodle php plugin

Last synced: 25 Jul 2025

https://github.com/aifred-health/vulcanai

A high-level deep learning framework for quickly prototyping networks with added tools in data visualisation, model interpretability and performance metrics

data-analysis data-cleaning data-science data-visualization deep-learning deep-neural-networks feature-engineering mental-health python3 pytorch scikit-learn

Last synced: 01 Aug 2025

https://github.com/bbva/mercury-dataschema

Utility package that, given a Pandas DataFrame, uses the DataSchema class to auto-infer feature types and automatically calculate different statistics depending on those types.

analytics data data-cleaning data-processing data-science feature-engineering

Last synced: 21 Jun 2025

https://github.com/facultyai/ipydataclean

Interactive cleaning for Pandas DataFrames

data-cleaning data-science dataframe jupyter-notebook pandas

Last synced: 26 Aug 2025

https://github.com/firaskahlaoui/heart-disease-prediction

The Heart Disease Prediction project aims to predict the likelihood of heart disease using machine learning techniques.

data-cleaning data-visualization flask jupyter-notebook kaggle-dataset model-building python3

Last synced: 14 Apr 2025

https://github.com/chinmayrane16/titanic-survival-in-depth-analysis

Used the Pandas, Matplotlib, and Seaborn libraries to analyze, visualize, and explore the data of people travelling on the Titanic, and used Scikit-learn modelling algorithms to predict their probability of survival.

classification-model data-cleaning data-visualization feature-engineering matplotlib numpy pandas seaborn

Last synced: 11 Oct 2025

https://github.com/kemingy/plane

A text processing tool that includes tag (HTML, URL, email) extraction and removal, punctuation normalization, simple segmentation, and so on.

chinese-nlp data-cleaning nlp preprocess regex tokenization tokenizer

Last synced: 17 Mar 2025

https://github.com/incubated-geek-cc/text-manipulation

A browser-based text-manipulation toolkit. No server required. Re-designed version of https://textmechanic.com/

css data-cleaning html javascript productivity text-editor

Last synced: 14 Apr 2025

https://github.com/vida-nyu/openclean-core

Data Cleaning and Data Profiling Library for Python

data-cleaning data-curation hacktoberfest

Last synced: 10 Apr 2025

https://github.com/waynejz/comp9321-19t1

COMP9321 Data Services Engineering 2019T1

backend data-cleaning data-services data-visualization

Last synced: 18 Aug 2025

https://github.com/data-forge/data-forge-fs

This library contains the file system extensions to Data-Forge that allow it to directly read and write CSV and JSON files in Node.js

csv data data-analysis data-cleaning data-cleansing data-forge data-management data-manipulation data-munging data-visualization data-wrangling javascript json linq nodejs pandas visualization

Last synced: 04 Sep 2025

https://github.com/jchehe/xcel

[Project has been migrated to the team's GitHub] This repository will therefore only sync the latest README.md; to watch, star, or fork, please go to the team's GitHub. Thank you.

data-cleaning electron vue

Last synced: 17 Jul 2025

https://github.com/ddayto21/nba-time-series-forecasts

This repo contains machine learning applications that use time-series forecasts to predict the probability of certain players winning the MVP award in the National Basketball Association

beautifulsoup4 data-cleaning machine-learning nba nba-mvp-prediction python requests-library-python

Last synced: 30 Apr 2025

https://github.com/jay0lee/cmdc

Chrome Managed Data Cleanup - https://chrome.google.com/webstore/detail/chrome-managed-data-clean/anfhmiaflneaeffhlmbcedfjakdlpleg

cache cookies data-cleaning g-suite google-chrome google-chrome-extension javascript

Last synced: 12 May 2025

https://github.com/saisurajmatta/bike-sales-excel-dashboard-project

Bike Sales Excel Dashboard Project: Analyzed and visualized sales data, cleaned datasets, and created interactive dashboards in Excel.

data-analysis data-analytics data-cleaning data-visualization excel excel-dashboard excel-data-analytics pivot-tables

Last synced: 11 Aug 2025

https://github.com/jacobmarks/image-deduplication-plugin

Remove exact and approximate duplicates from your dataset in FiftyOne!

computer-vision data-cleaning deduplication fiftyone image-processing plugin python similarity

Last synced: 31 Oct 2025

https://github.com/abhifuturetech/eda-rollercoaster

This repository contains an exploratory data analysis (EDA) project focused on roller coasters. The project involved organizing, cleaning, and visualizing the data to gain insights into roller coasters' characteristics and performance.

data-cleaning data-visualization mathplotlib mysql-database numpy python seaborn

Last synced: 10 Aug 2025

https://github.com/datapreprocessing/datacleaning

Data Cleaning is a Python package for data preprocessing. It cleans a CSV file and returns the cleaned data frame, handling imputation, duplicate removal, special-character replacement, and more.

data data-cleaning data-cleansing data-preprocessing data-wrangling imputation python threshold

Last synced: 14 Dec 2025

https://github.com/epiverse-trace/cleanepi

R package to clean and standardize epidemiological data

data-cleaning epidemiology epiverse r r-package

Last synced: 26 Jul 2025

https://github.com/yaph/james-bond-actors

Script to grab Freebase data about James Bond actors and generate a GEXF data file.

data-cleaning data-processing data-retrieval freebase james-bond-actors network-graph

Last synced: 08 Sep 2025

https://github.com/marksweiss/sofine

Lightweight framework for creating data-collecting plugins and chaining calls to them from CLI, REST or Python to return unified data sets.

cross-language data-cleaning data-processing data-retrieval json python

Last synced: 29 Jul 2025

https://github.com/sayakpaul/analytics-vidhya-game-of-deep-learning-hackathon

Contains my experiments for the Game of Deep Learning Hackathon conducted by Analytics Vidhya

active-learning analytics-vidhya computer-vision data-cleaning deep-learning fastai label-noise

Last synced: 28 Jul 2025

https://github.com/siddeshsambasivam/ntuoss-datascraping-and-datacleaning-workshop

This repository contains the reference scripts and the content presented in the NTU OSS Data scraping and Data cleaning workshop.

data-cleaning data-crawling data-scraping

Last synced: 12 May 2025

https://github.com/memgonzales/pisa-2018-analysis

Jupyter notebook presenting the process of data preparation, research question formulation, data analysis, and data modeling with the goal of extracting insights from the 2018 PISA Dataset

data-cleaning data-modeling data-science data-visualization exploratory-data-analysis jupyter-notebook matplotlib numpy oecd-data pandas pisa scipy statistical-inference

Last synced: 13 Jun 2025

https://github.com/erictleung/tutorial-tidyverse

🌌 Presentation on the tidyverse in R to clean and manipulate data

data-cleaning data-manipulation data-science manipulate-data presentation programming r tidyverse tutorial

Last synced: 25 Mar 2025

https://github.com/erictleung/data-science

💻 Repository for teaching materials and notes on machine learning and data science for freeCodeCamp

data-cleaning data-engineering data-science data-visualization freecodecamp learning machine-learning mathematics notes python statistics

Last synced: 25 Mar 2025

https://github.com/chaitanyac22/lending-club-project---data-analysis-for-a-consumer-finance-company

Lending Club is a consumer finance company that specializes in lending various types of loans to urban customers. When the company receives a loan application, the company has to make a decision for loan approval based on the applicant’s profile. The project work aims to help the company in understanding the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilize this knowledge for its portfolio and risk assessment.

banking business-intelligence data-analysis data-cleaning data-manipulation data-visualization exploratory-data-analysis feature-engineering finance portfolio-management python3 risk-assessment statistics

Last synced: 23 Aug 2025

https://github.com/hrolive/from-data-to-insights-with-google-cloud-platform

A four-course accelerated online specialization that teaches participants how to derive insights through data analysis and visualization using the Google Cloud Platform

data-analysis data-cleaning data-preparation data-visualization sql

Last synced: 12 May 2025

https://github.com/vishnu-t-r/data-analytics-portfolio-projects

This repository contains data analyst portfolio projects developed using various data analytics tools, including SQL, Python, Tableau, and Looker.

data data-analysis data-cleaning data-modeling data-visualization looker looker-studio python sql ssms tableau

Last synced: 23 Apr 2025

https://github.com/data-cleaning/validatesuggest

Generate validation rules from data

data-cleaning r validation

Last synced: 22 Oct 2025

https://github.com/kwokhing/network-analysis-on-mrt-station

Demo of applying network analysis to a network of connected railway stations, attempting to identify the important stations (nodes) in this network. Web scraping techniques using the rvest package are also briefly discussed.

betweenness-centrality closeness-centrality data-cleaning degree-centrality eigenvector-centrality gephi graph-analysis igraph r rvest social-network-analysis social-networks web-scraping xpath

Last synced: 13 Oct 2025