An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with datacleaning

A curated list of projects in awesome lists tagged with datacleaning.

https://github.com/openrefine/openrefine

OpenRefine is a free, open source power tool for working with messy data and improving it

data-analysis data-science data-wrangling datacleaning datacleansing datajournalism datamining java journalism opendata reconciliation wikidata

Last synced: 13 May 2025

https://github.com/sfu-db/dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.

apis apiwrapper cleaning connector data-exploration data-science datacleaning dataconnector dataprep datapreparation eda exploratory-data-analysis webconnector

Last synced: 14 May 2025

https://github.com/yobulkdev/yobulkdev

🔥 🔥 🔥 Open Source & AI-driven Data Onboarding Platform: Free flatfile.com alternative

csv-import csv-parser csv-reader data-engineering datacleaning embeddable javascript languagemodel mongodb nextjs nodejs open-source react stream streaming

Last synced: 21 Apr 2025

https://github.com/DataKitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake

Last synced: 05 May 2025

https://github.com/benchopt/benchmark_bilevel

Benchmark for bi-level optimization solvers

bilevel-optimization datacleaning hyperparameter-optimization

Last synced: 01 Aug 2025

https://github.com/mundipagg/amora-data-build-tool

Amora Data Build Tool enables analysts and engineers to transform data on the data warehouse (BigQuery) by writing Amora Models that describe the data schema using Python's "PEP484 - Type Hints" and select statements with SQLAlchemy. Amora is able to transform Python code into SQL data transformation jobs that run inside the warehouse.

analytics analytics-dashboard analytics-engineering bigquery business-intelligence data-engineering data-modeling datacleaning dataquality elt machine-learning python transformation

Last synced: 08 Sep 2025

https://github.com/data-cleaning/validatedb

Validate on a table in a DB, using dbplyr

database datacleaning validation

Last synced: 22 Oct 2025

https://github.com/nirala96/bangalore-house-prediction-app

Predicts home prices of Bangalore. Used Flutter, Flask and Jupyter Notebook.

data-science datacleaning exploratory-data-analysis flask-api flutter jupyter-notebook linear-regression python

Last synced: 23 Mar 2025

https://github.com/ropensci/excluder

Checks for Exclusion Criteria in Online Data

datacleaning exclusion mturk qualtrics r r-package rstats

Last synced: 22 Oct 2025

https://github.com/salaah01/pandas-data-cleaner

A package to aid with data cleaning using pandas.

datacleaning pandas python

Last synced: 11 Aug 2025

https://github.com/NhanAZ/DataCleaner

Clean up unnecessary data inside plugin_data folder

dataclean datacleaning php plugin pmmp pocketmine pocketmine-mp

Last synced: 09 Jul 2025

https://github.com/vijishmadhavan/parse-clip

A simple CLIP based project for combining images from multiple datasets.

clip data datacleaning dataexploration dataset fastai image python

Last synced: 09 Oct 2025

https://github.com/divithraju/divith-raju-immigration-data-engineering

A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)

apachespark bigdata bigdataprocessing bigdataproject capstone-project datacleaning dataengineering datalake datamodeling datapipeline dataprocessing dataschema dataset datawherehouse pandas sql

Last synced: 20 Feb 2025

https://github.com/mchenryspagg/google-play-store-apps-analysis-visualization

An analysis and visualization of scraped Google Play Store app data for the period 2010-2018. This project aims at cleaning the dataset, analyzing it, and mining quality insights from it. It also involves visualizing the data to make trends and categories easier to understand.

dataanalysis datacleaning datavisualization documentation mysql powerbi preprocessing python sql

Last synced: 20 Feb 2025

https://github.com/vishrut-b/clustering-analysis-of-online-retail-data

This project leverages machine learning techniques to analyze online retail data through customer segmentation. It uses KMeans clustering to identify key customer groups and proposes tailored business strategies based on their purchasing behaviors.

clustering datacleaning exploratory-data-analysis feature-engineering kmeans-clustering machine-learning numpy online-retail pandas sciki seaborn

Last synced: 12 May 2025

https://github.com/ngambip/top-uk-youtubers-2024.githu.io

This project involves a comprehensive analysis to determine the top YouTubers in the UK for 2024, using Excel, SQL, and Power BI.

analysis dashboards datacleaning dataqualitycheck dax excel kpi mockup powerbi recommendations testing tsql

Last synced: 11 Oct 2025

https://github.com/girish119628/girish119628

Data Enthusiast | Predictive Modeler | Turning Insights into Strategies

cross-validation data-visualization datacleaning feature-engineering modeling preprocessing

Last synced: 17 Jul 2025

https://github.com/arzan101/ev--car-data-analysis

This Power BI dashboard provides an interactive and data-driven overview of the electric vehicle (EV) landscape. It visualizes key insights across various dimensions including sales trends, model performance, manufacturer comparisons, and market growth. The purpose of the dashboard is to enable stakeholders to explore and analyze development

data-analysis data-science data-visualization database datacleaning excel powerbi

Last synced: 17 Jun 2025

https://github.com/sonu275981/big-mart-sales-prediction

Uses machine learning regression algorithms to predict sales patterns, with data analysis and data visualizations to support the predictions.

bigmart-sales-prediction data-science database datacleaning feature-engineering machine-learning pandas python sales xgboost-algorithm

Last synced: 06 Aug 2025

https://github.com/pavankethavath/car_dekho_car_price_prediction

A Streamlit web app utilizing Python, scikit-learn, and pandas for used car price prediction. Features data preprocessing (scaling, encoding), Random Forest model optimization with GridSearchCV, and interactive user input handling. Achieves high accuracy (R² score: 0.9028), showcasing skills in machine learning, data engineering, and deployment.

dataanalysis datacleaning datapreprocessing eda encoding feature-extraction feature-selection featureimportance fine-tuning machine-learning minmaxscaling normalization pandas pickle prediction-model python random-forest randomsearch-cv regression streamlit

Last synced: 23 Apr 2025

https://github.com/cintia0528/data_science-ab_testing

Conduct a 5-way AB Test on Montana State University Library's website, comparing the original "Interact" button with new versions ("Learn," "Help," "Connect," "Services") to boost user engagement.

abtesting bonferroni chisquare-test data data-science datacleaning datavisualization hypothesis-testing mde statistics

Last synced: 31 Mar 2025

https://github.com/r-mahesh45/hr---resume-text-classification

Text Classification for Resumes: Conducted Exploratory Data Analysis (EDA) on a vast collection of resumes. Organized the data using Bag of Words (BoW) and TF-IDF techniques. Built and evaluated multiple models, with Logistic Regression delivering standout performance. Created Word Clouds and Histograms.

data datacleaning extract-transform-load feature-extraction nlp nltk-tokenizer text-mining text-processing

Last synced: 12 Sep 2025

https://github.com/jenderal92/data-cleaning-tools

This tool is simple and effective for cleaning datasets in CSV format. With its features, you can improve data quality automatically.

data-cleaing-tools datacleaning python python27 remove-duplicates remove-empty-rows

Last synced: 25 Mar 2025

https://github.com/mariaegbuna/road-accidents

Analyzing a road accidents dataset using Python.

data-visualization datacleaning jupyter-notebook pandas-dataframe python

Last synced: 10 Oct 2025

https://github.com/shivam1808/data-cleaning-project

We take raw housing data and transform it in SQL Server to make it more usable for analysis.

analysis data datacleaning sql sqlserver

Last synced: 06 Mar 2025

https://github.com/abhijit2505/coupon-redemption-prediction

A machine learning test case to predict the redemption of the coupon.

data-science datacleaning decision-trees logistic-regression machine-learning-algorithms python3

Last synced: 17 Mar 2025

https://github.com/sksubhadeep/nashville-housing-data-cleaning-project-using-sql

SQL Data Cleaning Project on Nashville Housing Dataset

data datacleaning sql

Last synced: 20 Feb 2025

https://github.com/kiranmayi5/data-warehouse-development-and-analysis

This project highlights my ability to design a comprehensive data warehouse and leverage SQL to generate actionable insights for strategic decision-making.

datacleaning datawarehousing etl sql

Last synced: 26 Feb 2025

https://github.com/shuklayash02/excel_complete_vrindastore_dataanalysis

Complete analysis: data cleaning, processing, and data analysis with an interactive dashboard

analysis data data-visualization datacleaning excel excel-vba

Last synced: 12 Jun 2025

https://github.com/gaurav-van/toxic-comment-web_app

Data Science Project to classify a comment into several toxicity categories. This Repository is used for deployment of the project.

classification data-science datacleaning exploratory-data-analysis machine-learning nlp nlp-machine-learning python streamlit

Last synced: 28 Mar 2025

https://github.com/shuklayash02/sales_dashboard_powerbi

Created an interactive dashboard to track and analyze online sales data. Used complex parameters to drill down in the worksheet, with customization using filters and slicers.

data-visualization datacleaning excel powerbi

Last synced: 09 Apr 2025

https://github.com/rosanafss/alteryx-journey

Practice for the Udacity Data Track: data analysis I carried out based on Udacity's free course "Creating an Analytical Dataset".

aggregating alteryx cross-tabbing dataanalysis datacleaning transposing webscraping

Last synced: 19 Nov 2025

https://github.com/rizz1406/customer-churn-analysis

Telco Customer Churn Analysis - Data analysis and visualization to identify churn patterns in telecom customers. Includes EDA, feature engineering, and optional machine learning modeling to predict churn and provide business insights.

churn-analysis dataanalysis dataanalysisusingpython datacleaning jupyter-notebook python visualization

Last synced: 09 Mar 2025

https://github.com/sreejabethu/sales-data-analysis-forecasting

Welcome to the Sales Data Analysis & Forecasting project! 🚀 This repository showcases my data analysis skills through exploratory data analysis (EDA), data cleaning, and visualization of sales and customer feedback data. The goal is to extract actionable insights to drive business decisions.

analysis barchart data-visualization datacleaning exploratory-data-analysis forecasting histogram matplotlib-pyplot numpy-library pandas-library pycharm-ide sales-analysis salesdata salesdataanalysis seaborn-plots transformation

Last synced: 31 Aug 2025

https://github.com/cintia0528/data_science-unsupervised_machine_learning

I aim to automate playlist creation for Moosic, a startup known for manual curation, using Machine Learning, while addressing skepticism about the ability of audio features to capture playlist "mood."

data data-preprocessing data-scaling data-science data-visualization datacleaning elbow-method kclustering machine-learning pandas python silhouette-score unsupervised-machine-learning

Last synced: 31 Mar 2025

https://github.com/makepath/medaprep

medaprep is a data preparation and feature engineering toolkit for geospatial applications.

data data-science datacleaning eda exploratory-data-analysis xarray

Last synced: 29 Jun 2025

https://github.com/wakolivotes/data-processing-and-preparation

In this tutorial, we use the Titanic Data (obtained from Kaggle) to illustrate key aspects of Data Processing and Preparation by relying on useful Python Libraries

data-science datacleaning jupyter-notebook python

Last synced: 22 Mar 2025

https://github.com/simran2911/sales-analysis-dashboard

This GitHub repository contains a comprehensive Sales Analysis Dashboard. The objective of this Tableau project is to create an interactive and insightful dashboard that provides a comprehensive analysis of sales data.

analysis datacleaning excel tableau

Last synced: 26 Feb 2025

https://github.com/shuklayash02/complete_data_analysis_project

A full data analysis project that takes sales data through the ask, prepare, process, analyze, share, and act phases of the data analysis process.

data data-visualization dataanalysis database datacleaning powerbi sql

Last synced: 16 Jul 2025

https://github.com/cintia0528/data_cleaning_and_analytics-python

Evaluate if aggressive discounting benefits Eniac long-term, considering differing views on customer acquisition and brand positioning. Focus on data cleaning for informed decision-making.

colab-notebook data data-analysis datacleaning dataquality jupyter-notebook matplotlib pandas python seaborn

Last synced: 31 Mar 2025

https://github.com/datarohit/data-cleaning-exercise-2

In this exercise, partial cleaning and the reordering of column headings were done in Excel, and the rest of the cleaning was done in Python.

data-cleaning datacleaning pandas

Last synced: 15 Aug 2025

https://github.com/rakumar99/power-bi-projects

This repository contains various Power BI projects and dashboards for Human Resources and Financial Analysis, built using Power BI Desktop.

dashboards data-analysis data-visualization databases datacleaning datamodeling etl powerbi powerquery reports

Last synced: 26 Feb 2025

https://github.com/joyalshaji135/product-sale-report-using-power-bi

In Power BI, load the Sales and Category tables, create a relationship between them using CategoryID, and define measures like Total Sales. Build a report with visuals (e.g., bar charts, tables) to display sales data by category, format the visuals, and add slicers for dynamic filtering by category and date.

charts datacleaning powerbi

Last synced: 05 Jan 2026

https://github.com/rizz1406/superstore-sales-analysis

Power BI dashboard analyzing superstore sales trends and forecasting future sales

datacleaning datamodeling datavizualization microsoft-excel powerbi powerbidashboard salesanalysis

Last synced: 02 Mar 2025

https://github.com/edochiari/layoffs-data_cleaning

SQL script for cleaning a dataset related to work layoffs. It removes duplicates, standardizes data, handles null values, and eliminates irrelevant columns and rows, ensuring data integrity

datacleaning layoffdata sql

Last synced: 29 Mar 2025

https://github.com/edochiari/tiktok-project

This project builds a predictive model to help TikTok classify user-reported content claims, improving moderation efficiency by identifying and prioritizing content that may need review. Insights from this model enable TikTok to manage reports more effectively, ensuring a safer and more engaging platform.

content-claims dataanalysis datacleaning hypothesis-testing jupyter-notebook regression tiktok

Last synced: 29 Mar 2025

https://github.com/edochiari/coffee_sales-data_analysis

This project involves creating a dynamic Coffee Sales Performance Dashboard in Excel, offering actionable insights into sales across various dimensions. Users can filter and explore data interactively, focusing on total sales, sales by country, and top customers, helping stakeholders identify trends and make informed decisions.

coffee dataanalysis datacleaning datavisualization excel sales

Last synced: 29 Mar 2025

https://github.com/edochiari/automatidata-project

This project uses taxi trip data to identify key factors that influence tipping, providing insights to help drivers maximize tips through optimized service.

dataanalysis datacleaning hypothesis-testing jupyter-notebook machine-learning regression taxi tipping

Last synced: 29 Mar 2025

https://github.com/bipinoli/complex-sentence-splitter-to-simple-sentences

A package to split a complex text into simple sentences.

datacleaning nlp-library nlp-parsing python

Last synced: 15 Jul 2025

https://github.com/shekharkram/project

A collection of data analytics projects showcasing skills in data cleaning, exploration, visualization, and basic SQL queries. Designed to demonstrate entry-level data analyst competencies using real-world datasets and tools.

datacleaning excel jupyter-notebook mysql numpy pandas postgresql python sql

Last synced: 24 Dec 2025

https://github.com/tejaswirupa/early-prediction-of-diabetes-risk-using-machine-learning

Built a predictive model using CDC health data to identify individuals at risk of developing diabetes. Achieved 90.6% F1-score using Logistic Regression and revealed key health indicators like BMI and blood pressure as top predictors.

data-science datacleaning exploratory-data-analysis modelevaluation preprocessing-data python scikit-learn supervised-machine-learning

Last synced: 15 Jul 2025

https://github.com/udhaya2823/dataspark-illuminating-insights-for-global-electronics

✨DataSpark✨ is a powerful analytics project transforming raw retail data into actionable insights for Global Electronics. By leveraging Python, SQL, and interactive visualizations, it uncovers trends in customer behavior, sales performance, and product popularity, driving smarter business decisions and boosting growth.

data-science data-visualization database-management datacleaning exploratory-data-analysis matplotlib numpy pandas powerbi python seaborn sql version-control

Last synced: 17 Jul 2025

https://github.com/huseinhaji/projects

This repository is a collection of projects I have worked on, showcasing my skills in data analysis, data science, and machine learning.

businessanalytics dataanalysis datacleaning datavisualization machinelearning matplotlib python sklearn

Last synced: 19 Jun 2025

https://github.com/edochiari/customer_clustering-project

This project applies K-Means clustering to segment customers based on RFM metrics, helping identify key customer groups for targeted marketing and loyalty strategies.

dataanalysis datacleaning jupyter-notebook kmeans-clustering

Last synced: 12 Mar 2025

https://github.com/yadavkaushal/datascience-e-commerce-shopping-details

This project analyzes customer purchase data, including details such as location, company, credit card usage, browser info, job roles, and purchase price. It explores patterns in payment methods, spending behavior, and online transactions. Using Pandas, Matplotlib, and Seaborn, we clean, analyze, and visualize key trends to derive actionable insights.

data datacleaning dataframe datapreprocessing dataset libraries matplotlib numpy pandas plots visulaization

Last synced: 24 Dec 2025

https://github.com/kimatudo3/atliq-hardware-dashboard

The AtliQ Hardware BI 360 Dashboard is a comprehensive business intelligence tool crafted to empower AtliQ Hardware with data-driven insights across various departments.

atliq dashboard data-engineering data-visualization database database-management datacleaning dax-query m mysql powerbi-desktop powerquery sql-server visualization

Last synced: 21 Mar 2025

https://github.com/aadityasikder/Object-Detection-with-raspberry-pi-implementing-TinyML-models

Repository for Raspberry Pi-based object detection with TinyML models like TensorFlow Lite, PyTorch Nano, including data gathering, mAP evaluation, and image data preparation in Jupyter notebooks.

data-gathering datacleaning dataprocessing image-preparation object-detection pytorch-nano raspberry-pi-4 tensorflow-lite tinyml

Last synced: 16 Dec 2025

https://github.com/lazakiro/finally-postgres

Ready-to-use PostgreSQL development environment with Docker. Simple setup, smart defaults, and comprehensive management commands for local development and testing. Zero configuration needed to start, fully configurable when needed.

containerization database datacleaning development development-environment devops docker-environment lambda-functions local-development nextjs postgres-docker postgresql quickdbd react

Last synced: 04 Mar 2025

https://github.com/vincenzopalazzo/visualsars2chart

Visual analytics of COVID-19 (SARS 2) data with Python and Tableau

covd-19 covid-2019 covid19 data-visualization datacleaning dataset python3

Last synced: 28 Mar 2025

https://github.com/ahmad-ali-rafique/random-forest-regressor-modeling

Detailed exploration of random forest regressors, including data cleaning, model building, and performance evaluation on various datasets.

data dataanalytics datacleaning evaluation-metrics modeling random-forest random-forest-regression regression regression-analysis

Last synced: 05 Mar 2025

https://github.com/kuldeepsharma-dataanalyst/college_database_system_sql_project

SQL project demonstrating database design, queries, and analysis for a college management system.

columns datacleaning datamanagement dbms dbmsproject mysql-database pgadmin postgresql queries rows sql sqlqueries tables

Last synced: 05 Nov 2025

https://github.com/ashish-kr-srivastava/olympic-games-eda---python

Exploratory Data Analysis of a historical Olympic Games dataset, including all the Games from Athens 1896 to Rio 2016.

data-visualization datacleaning eda matpotlib numpy pandas python seaborn seaborn-python

Last synced: 24 Oct 2025

https://github.com/abhijeet107/task-1

Data Cleaning and Preprocessing

datacleaning excel pandas python

Last synced: 13 Apr 2025

https://github.com/sathyanarayanan2002/ml_project

A house price prediction website built with Django that allows users to input property details and receive real-time price estimates from a machine learning model. The site uses Django for backend functionality and serves machine learning predictions based on user input.

algorithms css3 datacleaning django html5 linear-regression python

Last synced: 29 Dec 2025

https://github.com/thebaldanalyst/projects

A collection of various data analytics projects showcasing skills in EDA, data cleaning, data visualization, and data scraping

dashboard datacleaning datavisualization eda excel powerbi python smss sql tableau

Last synced: 09 Apr 2025

https://github.com/mastermindromii/data-cleaning-using-power-query

A Simple Real-Time Data Cleaning Using Power Query in Power BI

datacleaning netflixdata powerbi powerquery rawdata-converter

Last synced: 25 Feb 2025

https://github.com/priyapuranik/diwali_sales_dashboard

A Power BI dashboard that analyzes Diwali sales data, providing insights into revenue, orders, and customer demographics across various categories and regions.

charts dashboard datacleaning dax-query powerbi

Last synced: 15 Jul 2025

https://github.com/ahmad-ali-rafique/electricity-consumption-analysis-household-dataset

This repository contains analysis and predictive modeling of household electricity consumption using Python. It includes data cleaning, exploratory data analysis (EDA), time series forecasting (ARIMA, SARIMA, LSTM), and model evaluation to optimize energy usage.

arima-forecasting artificial-intelligence artificial-neural-networks data data-science dataanalytics datacleaning evaluation-metrics exploratory-data-analysis long-short-term-memory lstmmodel modeling time-series timeseries-forecasting

Last synced: 23 Jun 2025