Projects in Awesome Lists tagged with data-cleaning
A curated list of projects in awesome lists tagged with data-cleaning.
https://github.com/cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
active-learning annotation data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation dataops dataquality datasets exploratory-data-analysis labeling llms noisy-labels out-of-distribution-detection outlier-detection weak-supervision
Last synced: 08 Jan 2026
https://github.com/voxel51/fiftyone
Refine high-quality datasets and visual AI models
active-learning artificial-intelligence computer-vision data-centric-ai data-cleaning data-curation data-quality data-science deep-learning developer-tools image-classification machine-learning object-detection python unstructured-data vector-search visualization
Last synced: 19 Feb 2026
https://github.com/johnkerl/miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
command-line command-line-tools csv csv-format data-cleaning data-processing data-reduction data-regression devops devops-tools json json-data miller statistical-analysis statistics streaming-algorithms streaming-data tabular-data tsv unix-toolkit
Last synced: 21 Feb 2026
https://github.com/unionai-oss/pandera
A light-weight, flexible, and expressive statistical data testing library
assertions data-assertions data-check data-cleaning data-processing data-validation data-verification dataframe-schema dataframes hypothesis-testing pandas pandas-dataframe pandas-validation pandas-validator schema testing testing-tools validation
Last synced: 14 Apr 2026
https://github.com/justmarkham/pandas-videos
Jupyter notebook and datasets from the pandas video series
data-analysis data-cleaning data-science jupyter-notebook pandas python tutorial
Last synced: 15 May 2025
https://github.com/justmarkham/dat8
General Assembly's 2015 Data Science course in Washington, DC
clustering course data-analysis data-cleaning data-science data-visualization decision-trees ensemble-learning jupyter-notebook linear-regression logistic-regression machine-learning model-evaluation naive-bayes natural-language-processing pandas python regular-expressions scikit-learn web-scraping
Last synced: 15 May 2025
https://github.com/hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Last synced: 14 May 2025
https://github.com/sfirke/janitor
simple tools for data cleaning in R
data-analysis data-cleaning data-science dirty-data excel pivot-tables r spss tabulations tidyverse
Last synced: 13 May 2025
https://github.com/skrub-data/skrub
Machine learning with dataframes
data data-analysis data-cleaning data-preparation data-preprocessing data-science data-wrangling dataframe dataframes dirty-data machine-learning
Last synced: 06 Jan 2026
https://github.com/data-forge/data-forge-ts
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
csv data data-analysis data-cleaning data-cleansing data-forge data-management data-manipulation data-munging data-visualization data-wrangling javascript json linq nodejs pandas visualization
Last synced: 13 May 2025
https://github.com/ECNU-ICALK/EduChat
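An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue large language model (general-purpose base model, GPU deployment, data cleaning). Acknowledgements: LLaMA, MOSS, BELLE, Ziya, vLLM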
belle chinese-nlp data-cleaning education llama llm moss open-models
Last synced: 01 Apr 2025
https://github.com/akanz1/klib
Easy to use Python library of customized functions for cleaning and analyzing data.
data-analysis data-cleaning data-preprocessing data-science data-visualization feature-selection klib python
Last synced: 01 Feb 2026
https://github.com/schema-inspector/schema-inspector
Schema-Inspector is a simple JavaScript object sanitization and validation module.
data-cleaning javascript sanitization validation
Last synced: 15 Jan 2026
https://github.com/data-cleaning/validate
Professional data validation for the R environment
Last synced: 18 Feb 2026
https://github.com/desbordante/desbordante-core
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also lets you run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data
Last synced: 22 Nov 2025
https://github.com/jim-schwoebel/voicebook
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).
data data-cleaning encryption-decryption featurization generation machine-learning python3 security server transcription visualization voice voice-activity-detection voice-assistant voice-computing voice-control voice-recognition voice-recording wake-word-detection
Last synced: 06 Apr 2025
https://github.com/msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
data-cleaning data-pipeline data-preprocessing data-processing machine-learning preprocessing pytorch torch
Last synced: 14 Jan 2026
https://github.com/rasgointelligence/feature-engineering-tutorials
Data Science Feature Engineering and Selection Tutorials
data-cleaning data-science exploratory-data-analysis feature-engineering feature-selection features jupyter machine-learning notebook pandas pandas-profiling pyrasgo python scikit-learn sweetviz tutorial tutorials xgboost
Last synced: 14 Jun 2025
https://github.com/cambioml/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering
LLM-based text extraction from unstructured data such as PDFs, Word documents, and HTML pages. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!
data-cleaning generative-ai llm
Last synced: 11 Oct 2025
https://github.com/probcomp/pclean
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
bayesian-inference data-cleaning data-cleansing probabilistic-graphical-models probabilistic-programming
Last synced: 08 May 2025
https://github.com/genomoncology/FuzzTypes
Pydantic extension for annotating autocorrecting fields.
data-cleaning fuzzy-string-matching named-entity-linking pydantic
Last synced: 11 May 2025
https://github.com/charlesdedampierre/BunkaTopics
🗺️ Data Cleaning and Textual Data Visualization 🗺️
cartography data-cleaning explainability fine-tuning llms machine-learning natural-language-processing nlp summarization topic-modeling
Last synced: 30 Aug 2025
https://github.com/ekstroem/datamaid
An R package for data screening
data-cleaning data-screening reproducible-research
Last synced: 09 Apr 2025
https://github.com/hi-primus/bumblebee
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
bumblebee cudf dask dask-cudf data-cleaning data-preparation data-profiling datasets gpu gui optimus prepare-data python
Last synced: 02 May 2025
https://github.com/jim-schwoebel/allie
🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.
autokeras automl autopytorch data-augmentation data-cleaning data-cleaning-pipeline data-transformation data-visualization datasets deep-learning ludwig machine-learning machine-learning-api machine-learning-library machine-learning-models model-compression model-deployment tpot voice-computing
Last synced: 21 Aug 2025
https://github.com/iam-mhaseeb/skytrax-data-warehouse
A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.
airflow data-analysis data-analytics data-cleaning data-engineering data-orchestration data-processing data-visualization data-warehouse data-warehousing database docker metabase python python3 redshift s3 s3-bucket sql
Last synced: 12 Aug 2025
https://github.com/datawithbaraa/sql-data-warehouse-project
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
data-analysis data-analytics data-cleaning data-engineering data-lakehouse data-science data-warehouse data-warehousing datalake datascience datawarehouse datawarehousing etl etl-job etl-pipeline medallion-architecture sql sql-query sql-server sqlserver
Last synced: 06 Apr 2025
https://github.com/ChrisMuir/refinr
Cluster and merge similar string values: an R implementation of Open Refine clustering algorithms
approximate-string-matching clustering cran data-cleaning data-clustering fuzzy-matching ngram openrefine r rstats
Last synced: 15 Mar 2025
https://github.com/aai-institute/pyDVL
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
banzhaf-index data-centric-ai data-cleaning data-pruning data-quality data-valuation game-theory influence-functions least-core machine-learning robust-machine-learning shapley-value transferlab
Last synced: 11 May 2025
https://github.com/sail-sg/sailcraft
🚢 Data Toolkit for Sailor Language Models
data-cleaning data-deduplication
Last synced: 05 Oct 2025
https://github.com/lolei/redditcleaner
Cleans Reddit Text Data 📜 🧹
data-cleaning hacktoberfest nlp praw psaw pushshift python reddit text-data
Last synced: 22 Jul 2025
https://github.com/cosbidev/pytrack
A Map-Matching-based Python Toolbox for Vehicle Trajectory Reconstruction
computer-vision data-cleaning gps-tracker graph intelligent-transportation-systems map-match map-matching maps network-graph networkx openstreetmap python snapping street-view topology tracking trajectory-analysis visualization
Last synced: 13 Oct 2025
https://github.com/renumics/sliceguard
A library for detecting problematic data segments in structured and unstructured data with few lines of code.
data-analysis data-cleaning data-curation data-exploration data-science data-visualization deep-learning eda exploratory-data-analysis machine-learning python visualization
Last synced: 16 Mar 2025
https://github.com/akvo/akvo-lumen
Make sense of your data
agplv3 akvo akvo-lumen clojure d3 data-cleaning data-visualization react
Last synced: 29 Aug 2025
https://github.com/rvanasa/pandas-gpt
Power up your data science workflow with ChatGPT.
chatgpt claude-ai data-cleaning data-engineering data-science data-visualization gemini generative-ai gpt4 jupyter-notebook litellm low-code matplotlib numpy o1 openai pandas productivity scipy seaborn
Last synced: 09 May 2025
https://github.com/hplt-project/opuscleaner
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
data-cleaning machine-translation
Last synced: 14 Jan 2026
https://github.com/laureberti/learn2clean
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
automated data-cleaning data-cleaning-pipeline data-curation data-preprocessing reinforcement-learning
Last synced: 11 Sep 2025
https://github.com/ropensci/taxa
taxonomic classes for R
data-cleaning r r-package rstats taxon taxonomy
Last synced: 21 Oct 2025
https://github.com/elysian01/data-purifier
A Python library for Automated Exploratory Data Analysis, Automated Data Cleaning, and Automated Data Preprocessing For Machine Learning and Natural Language Processing Applications in Python.
data-analysis data-cleaning data-cleaning-pipeline data-preprocessing data-science data-visualization datapurifier eda exploratory-data-analysis jupyter python-lib python-library python3
Last synced: 04 Oct 2025
https://github.com/dssg/pgdedupe
A simple command line interface to the datamade/dedupe library.
data-cleaning database dedupe deduplication postgresql python record-linkage
Last synced: 21 Jan 2026
https://github.com/mramshaw/data-cleaning
Data Cleaning with Python
data-cleaning data-munging data-wrangling numpy pandas python python3
Last synced: 21 Aug 2025
https://github.com/ammsa/dtcleaner
DTCleaner: data cleaning using multi-target decision trees.
data-cleaning data-mining data-preprocessing data-quality data-science data-wrangling
Last synced: 21 Mar 2025
https://github.com/mrankitgupta/sales-insights-data-analysis-using-tableau-and-sql
Sales insights for an India-based hardware company: a data analysis project performed with Tableau and SQL
66daysofdata analysis analytics ankitgupta data-analysis data-cleaning data-science data-visualization excel mrankitgupta mysql powerbi rdbms sql sql-server statistics tableau tableau-dashboards tableau-desktop tableau-public
Last synced: 13 Aug 2025
https://github.com/theronione/cleaner.jl
A toolbox of simple solutions for common data cleaning problems.
Last synced: 24 Oct 2025
https://github.com/datacarpentry/stata-economics
Economics Lesson with Stata
carpentries data-carpentry data-cleaning data-wrangling economics english lesson pre-alpha stata
Last synced: 11 Mar 2026
https://github.com/cleanlab/cleanlab-studio
Client interface to Cleanlab Studio and the Trustworthy Language Model
annotations automl computer-vision data-centric-ai data-cleaning data-curation data-labeling data-profiling data-quality data-science data-validation image-classification llm machine-learning model-deployment natural-language-processing noisy-labels outlier-detection structured-data text-classification
Last synced: 13 Apr 2025
https://github.com/irsol/udacity-bertelsmann-data-science-challenge-scholarship-2018
This is a repo for my Bertelsmann Data Science Scholarship Challenge: notes, exercises, quizzes.
aggregation bertelsmann challenge control-flow data-cleaning data-science data-visualization python scholarship sql statistics udacity udacity-course udacity-scholarship-course udacity2018 variability
Last synced: 23 Mar 2025
https://github.com/jmcastagnetto/covid-19-data-cleanup
Scripts to cleanup data from https://github.com/CSSEGISandData/COVID-19
covid-19 covid-19-data data-cleaning data-visualization datasets r
Last synced: 17 Apr 2025
https://github.com/datacarpentry/openrefine-socialsci
OpenRefine for Social Science Data
carpentries data-carpentry data-cleaning data-management english hacktoberfest lesson open-educational-resources openrefine social-sciences stable
Last synced: 11 Mar 2026
https://github.com/facultyai/boltzmannclean
Fill missing values in Pandas DataFrames using Restricted Boltzmann Machines
data-cleaning data-science dataframe pandas restricted-boltzmann-machine
Last synced: 27 Jun 2025
https://github.com/the-hull/datacleanr
Interactive and Reproducible Data Cleaning
annotation-tool data-cleaning outlier-detection outlier-removal reproducibility
Last synced: 22 Oct 2025
https://github.com/data-cleaning/errorlocate
Find and replace erroneous fields in data using validation rules
data-cleaning errors invalidation r
Last synced: 22 Feb 2026
https://github.com/jkminder/data2neo
Data2Neo is a library that simplifies the conversion of data in relational format to a graph knowledge database.
data-cleaning data-conversion data-engineering data2neo database-migrations graphs neo4j relational-databases remodeling
Last synced: 12 Apr 2025
https://github.com/amine-smahi/r-learning-journey
Some of the projects I made when starting to learn R for Data Science at university
afc cpa data-cleaning data-integration data-science datascience r r-language
Last synced: 18 Mar 2025
https://github.com/rubydamodar/the-ultimate-pandas-bootcamp
Welcome to the Pandas for Data Science repository! This course is designed to take you from beginner to proficient in using Pandas, the powerful data manipulation library in Python. Whether you're just starting your data science journey or looking to sharpen your skills, this repository contains all the resources
beginner-friendly csv-data data-analysis data-cleaning data-manipulation data-science data-visualization dataframe exploratory-data-analysis jupyter-notebook machine-learning matplotlib numpy pandas python python-pandas series statistical-analysis time-series titanic-dataset
Last synced: 19 Apr 2025
https://github.com/bakdata/dedupe
Java DSL for (online) deduplication
data-cleaning data-cleansing deduplication duplicate-detection duplicate-removal
Last synced: 10 Apr 2025
https://github.com/LimaRAF/plantR
An R Package for Managing Species Records from Biological Collections
biodiversity biological-data data-cleaning data-downloader data-mining gbif herbarium r r-package
Last synced: 27 May 2026
https://github.com/catalyst/moodle-local_datacleaner
Reduce, filter, and anonymize moodle data for non-prod environments
anonymize data-cleaning datacleaner moodle php plugin
Last synced: 25 Jul 2025
https://github.com/santoshlite/quantclean
🧹 Quantclean is a program that reformats financial datasets to US Equity TradeBar (QuantConnect format)
algo-trading algorithmic-trading data-cleaning finance financial-data futures lean-engine ohlcv options quandl quant quantconnect quantitative-finance quantitative-trading stock-data stock-market stocks trading-algorithms trading-bot trading-strategies
Last synced: 14 Dec 2025
https://github.com/aifred-health/vulcanai
A high level deep learning framework for quickly prototyping networks with added tools in data visualisation, model interpretability and performance metrics
data-analysis data-cleaning data-science data-visualization deep-learning deep-neural-networks feature-engineering mental-health python3 pytorch scikit-learn
Last synced: 01 Aug 2025
https://github.com/bbva/mercury-dataschema
Utility package that, given a Pandas DataFrame, uses the DataSchema class to auto-infer feature types and automatically calculate different statistics depending on those types.
analytics data data-cleaning data-processing data-science feature-engineering
Last synced: 21 Jun 2025
https://github.com/data-cleaning/validatetools
data-cleaning r rules validation
Last synced: 22 Oct 2025
https://github.com/autonlab/aqua
AQuA: A Benchmarking Tool for Label Quality Assessment
data-centric-ai data-cleaning data-science label-errors machine-learning robust-machine-learning
Last synced: 16 Jan 2026
https://github.com/facultyai/ipydataclean
Interactive cleaning for Pandas DataFrames
data-cleaning data-science dataframe jupyter-notebook pandas
Last synced: 26 Aug 2025
https://github.com/ccb-hms/ontology-mapper
A tool for mapping free-text descriptions of entities to ontology terms
data-cleaning fair-data fair-principles metadata-cleaning ontology ontology-mapping ontology-search ontology-services owl owlready2
Last synced: 19 Feb 2026
https://github.com/firaskahlaoui/heart-disease-prediction
The Heart Disease Prediction project aims to predict the likelihood of heart disease using machine learning techniques.
data-cleaning data-visualization flask jupyter-notebook kaggle-dataset model-building python3
Last synced: 14 Apr 2025
https://github.com/chinmayrane16/titanic-survival-in-depth-analysis
Used the Pandas, Matplotlib, and Seaborn libraries to analyze, visualize, and explore data on people travelling on the Titanic, and used Scikit-learn modelling algorithms to predict their probability of survival.
classification-model data-cleaning data-visualization feature-engineering matplotlib numpy pandas seaborn
Last synced: 11 Oct 2025
https://github.com/ketgo/marshmallow-pyspark
Marshmallow serializer integration with pyspark
data-cleaning data-engineering data-engineering-pipeline data-pipelines data-schemas marshmallow pyspark schema spark
Last synced: 02 Feb 2026
https://github.com/caerbannogwhite/preludio
Preludio is a data wrangling language based on PRQL and written in Go. 🎭
csv data data-analysis data-cleaning data-engineering dplyr dsl go golang language manipulation pipeline programming-language prql sql stack-oriented wrangling
Last synced: 17 Jan 2026
https://github.com/kemingy/plane
A text processing tool that includes tag (HTML, URL, email) extraction and removal, punctuation normalization, simple segmentation, and so on.
chinese-nlp data-cleaning nlp preprocess regex tokenization tokenizer
Last synced: 17 Mar 2025
https://github.com/lukashedegaard/datasetops
Fluent dataset operations, compatible with your favorite libraries
data-cleaning data-munging data-processing data-science data-wrangling dataset dataset-combinations deep-learning multiple-datasets pytorch tensorflow
Last synced: 23 Apr 2025
https://github.com/aicorsair/dataquest-data-science-analysis-projects
A repository dedicated to storing guided projects completed while learning data science concepts with Dataquest.
classification-models cluster-analysis data-analysis data-analytics data-cleaning data-preparation data-preprocessing data-science data-visualization deep-learning excel feature-engineering machine-learning pandas-dataframe power-bi python-3 regression-models scikit-learn sql web-scraping
Last synced: 27 Oct 2025
https://github.com/vida-nyu/openclean-core
Data Cleaning and Data Profiling Library for Python
data-cleaning data-curation hacktoberfest
Last synced: 10 Apr 2025
https://github.com/ddayto21/nba-time-series-forecasts
This repo contains machine learning applications that use time-series forecasts to predict the probability of certain players winning the MVP award in the National Basketball Association
beautifulsoup4 data-cleaning machine-learning nba nba-mvp-prediction python requests-library-python
Last synced: 30 Apr 2025
https://github.com/waynejz/comp9321-19t1
COMP9321 Data Services Engineering 2019T1
backend data-cleaning data-services data-visualization
Last synced: 18 Aug 2025
https://github.com/jchehe/xcel
[Project has moved to the team's GitHub] This repository will therefore only sync the latest README.md; to watch, star, or fork, please go to the team's GitHub. Thank you.
Last synced: 17 Jul 2025
https://github.com/data-forge/data-forge-fs
This library contains the file system extensions to Data-Forge that allow it to directly read and write CSV and JSON files in Node.js
csv data data-analysis data-cleaning data-cleansing data-forge data-management data-manipulation data-munging data-visualization data-wrangling javascript json linq nodejs pandas visualization
Last synced: 04 Sep 2025
https://github.com/incubated-geek-cc/text-manipulation
A browser-based text-manipulation toolkit. No server required. Re-designed version of https://textmechanic.com/
css data-cleaning html javascript productivity text-editor
Last synced: 14 Apr 2025
https://github.com/codepawl/loclean
An AI Data Cleaning Library
automated-cleaning data data-cleaning data-engineering data-preprocessing data-science data-wrangling etl llm normalization open-source polars privacy-preserving python semantic-analysis slm structured-data
Last synced: 04 Apr 2026
https://github.com/jay0lee/cmdc
Chrome Managed Data Cleanup - https://chrome.google.com/webstore/detail/chrome-managed-data-clean/anfhmiaflneaeffhlmbcedfjakdlpleg
cache cookies data-cleaning g-suite google-chrome google-chrome-extension javascript
Last synced: 12 May 2025
https://github.com/benedekrozemberczki/av_ultimate_student_hunt
Solution for the Ultimate Student Hunt Challenge (1st place).
analytics-vidhya-competition competition data-cleaning data-engineering data-engineering-pipeline distributed-machine-learning driven-data extreme-gradient-boosting forecasting gradient-boosting kaggle machine-learning r student-hunt supervised-learning weather-forecast winning-entry xgboost
Last synced: 20 Jun 2025
https://github.com/saisurajmatta/bike-sales-excel-dashboard-project
Bike Sales Excel Dashboard Project: Analyzed and visualized sales data, cleaned datasets, and created interactive dashboards in Excel.
data-analysis data-analytics data-cleaning data-visualization excel excel-dashboard excel-data-analytics pivot-tables
Last synced: 11 Feb 2026
https://github.com/hypertextassassin0273/excel_data_organizer_and_cleaner-ds_project
Data Structures project in C++11 that uses custom Vector and String structures with move semantics (Rule of Five)
cpp11 data-cleaning data-cleansing data-structure-projects data-structures data-structures-project data-wrangling ds-projects easy-project excel-operations move-semantics object-oriented-programming oop open-source open-source-code open-source-project rule-of-five string university-project vector
Last synced: 30 Jun 2025
https://github.com/datapreprocessing/datacleaning
Data Cleaning is a Python package for data preprocessing. It cleans a CSV file and returns the cleaned data frame, handling imputation, duplicate removal, special-character replacement, and more.
data data-cleaning data-cleansing data-preprocessing data-wrangling imputation python threshold
Last synced: 14 Dec 2025
https://github.com/jacobmarks/image-deduplication-plugin
Remove exact and approximate duplicates from your dataset in FiftyOne!
computer-vision data-cleaning deduplication fiftyone image-processing plugin python similarity
Last synced: 31 Oct 2025
https://github.com/brunocampos01/allstate-claims-severity
Udacity Machine Learning Engineer Nanodegree capstone proposal.
allstate capstone-proposal challenge data-analyst-nanodegree data-cleaning data-engineering data-science data-visualization dataset deep-learning kaggle machine-learning pca-analysis pt-br python udacity-machine-learning-nanodegree
Last synced: 15 Apr 2025
https://github.com/abhifuturetech/eda-rollercoaster
This repository contains an exploratory data analysis (EDA) project focused on roller coasters. The project involved organizing, cleaning, and visualizing the data to gain insights into roller coasters' characteristics and performance.
data-cleaning data-visualization mathplotlib mysql-database numpy python seaborn
Last synced: 10 Aug 2025
https://github.com/epiverse-trace/cleanepi
R package to clean and standardize epidemiological data
data-cleaning epidemiology epiverse r r-package
Last synced: 26 Jul 2025
https://github.com/chaitanyac22/car-price-prediction-model-for-an-automobile-consulting-company
The goal of this project is to build multiple linear regression models for the prediction of car prices.
business-analytics data-analytics data-cleaning data-manipulation data-visualization exploratory-data-analysis feature-engineering machine-learning model-building model-evaluation prediction-model python3 residual-analysis statistics
Last synced: 13 Apr 2025
https://github.com/amey-thakur/kaggle
Kaggle Courses - All Exercises of the respective courses.
amey ameythakur courses data-cleaning data-manipulation data-science data-visualization deep-learning feature-engineering intro-to-ml kaggle machine-learning machine-learning-explainability python
Last synced: 25 Aug 2025
https://github.com/yaph/james-bond-actors
Script to grab Freebase data about James Bond actors and generate gexf data file.
data-cleaning data-processing data-retrieval freebase james-bond-actors network-graph
Last synced: 08 Sep 2025
https://github.com/marksweiss/sofine
Lightweight framework for creating data-collecting plugins and chaining calls to them from CLI, REST or Python to return unified data sets.
cross-language data-cleaning data-processing data-retrieval json python
Last synced: 18 Feb 2026
https://github.com/spider-rs/readability
The readability library for LLMs
clean-data data-cleaning llm-training readability rust safari-reader
Last synced: 05 Apr 2025