An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-preparation

A curated list of projects in awesome lists tagged with data-preparation.

https://github.com/hi-primus/bumblebee

🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)

bumblebee cudf dask dask-cudf data-cleaning data-preparation data-profiling datasets gpu gui optimus prepare-data python

Last synced: 02 May 2025

https://github.com/asavinov/prosto

Prosto is a data processing toolkit that radically changes how data is processed by relying heavily on functions and operations with functions, as an alternative to map-reduce and join-groupby

business-intelligence data-preparation data-preprocessing data-processing data-science data-wrangling feature-engineering map-reduce olap pandas python spark workflow

Last synced: 11 Apr 2025

https://github.com/sbcgua/mockup_loader

ABAP unit testing framework: prepare in Excel, reuse in ABAP code

abap data-preparation hacktoberfest mockup-loader sap test-automation testing-tools unit-testing

Last synced: 28 Oct 2025

https://github.com/soumyadip007/data-science-using-python-university-course-module

“Data science” is just about as broad of a term as they come. It may be easiest to describe what it is by listing its more concrete components: Data exploration & analysis. Included here: Pandas; NumPy; SciPy; a helping hand from Python's Standard Library.

data-preparation data-preprocessing data-processing data-science data-visualization jupyter-notebook knn numpy panda plotting python

Last synced: 23 Jun 2025

https://github.com/neuro-ml/reskit

A library for creating and curating reproducible pipelines for scientific and industrial machine learning

data-preparation grid-search pipeline prepare-data python reproducible-experiments reproducible-research scikit-learn

Last synced: 19 Jul 2025

https://github.com/salehjg/Shapenet2_Preparation

A Python script to convert and down-sample mesh data into point clouds using the farthest point sampling (FPS) algorithm.

data-preparation dataset farthest-point-sampling hdf5 python shapenet-dataset shapenetcore

Last synced: 20 Mar 2025

https://github.com/dustin-decker/featuremill

General-purpose, fast, stateless, and deterministic feature extractor written in Go for use in machine learning

data-preparation feature-engineering feature-extraction go golang machine-learning vectorization

Last synced: 26 Jul 2025

https://github.com/18520339/finding-similar-images

Finding similar images from image URLs using ImageHash

data-preparation google-sheets-api gspread imagehash similar-images

Last synced: 15 Apr 2025

https://github.com/hrolive/from-data-to-insights-with-google-cloud-platform

A four-course accelerated online specialization that teaches participants how to derive insights through data analysis and visualization using the Google Cloud Platform

data-analysis data-cleaning data-preparation data-visualization sql

Last synced: 12 May 2025

https://github.com/kozodoi/dptools

Python package with utilities for data processing, aggregation, feature engineering and data versioning

aggregation data-preparation data-preprocessing data-science feature-engineering python

Last synced: 08 May 2025

https://github.com/kwokhing/exploratory-data-analysis-on-smrt-tweets

Demo of exploratory data analysis (EDA) on train service disruptions, based on scraped (user-generated content) tweets from the train operator's (SMRT) Twitter account

data-analysis data-cleaning data-collection data-preparation exploratory-data-analysis exploratory-data-visualizations folium geospatial-data leaflet-map python python3 regex scraping selenium selenium-python social-media text-processing user-generated-content web-scraping webscraping

Last synced: 27 Jul 2025

https://github.com/nragland37/event-optimization-tool

R-based Shiny application that maps availability and identifies optimal engagement times to enhance participation within an organization

data-analysis data-cleaning data-preparation heatmap r shiny shiny-app tidyverse

Last synced: 14 Apr 2025

https://github.com/nisheethjaiswal/Data-Annotator-for-SpaCy

🚀 SpAnnor, an easy-to-use annotator for Named Entity Recognition. The annotator allows users to quickly assign custom labels to one or more entities in the text. Easy to set up for preparing training data for spaCy 🔥.

data-annotation data-annotation-tools data-labeling data-preparation named-entity-recognition nlp spacy-nlp text-labeling

Last synced: 06 Aug 2025

https://github.com/muneeb1030/finetune-tiny-llama

Fine-tuning the Tiny Llama model to mimic my professor's writing style using the Llama Factory. The project involves data collection, preprocessing, preparation, fine-tuning, and evaluation.

data data-preparation data-preprocessing finetuning llama-factory llm pymupdf selenium-python spacy tinyllama webscraping

Last synced: 28 Dec 2025

https://github.com/mzguntalan/vegetable

Vegetable contains a design/definition of a vector graphic that can easily be rendered as an equally spaced point cloud/sequence. Building on this, Vegetable offers a way to read .ttf font files and render their glyphs into point clouds/sequences.

data-preparation data-science point-cloud python svg ttf vector-graphics

Last synced: 09 Aug 2025

https://github.com/furk4neg3/sales-forecasting

Created AI models to forecast Walmart's sales. Used several models, including dense, LSTM, GRU, and a naive model, with different window and horizon sizes. Compared the models visually at the end.

artificial-intelligence data-preparation data-visualization deep-learning forecasting gated-recurrent-unit lstm tensorflow tensorflow2

Last synced: 13 May 2025

https://github.com/hitesh22rana/sourcecollector

A simple tool to consolidate multiple files into a single .txt file. Perfect for feeding your files to AI tools without any fuss.

ai-tools data-preparation file-processing text-processing utility

Last synced: 04 Nov 2025

https://github.com/gracysapra/r-in-data-science

This repository contains essential guides for data analysis using R, covering topics like data preparation, data reshaping, and data visualization. Each file focuses on fundamental techniques to manipulate, clean, and visualize data effectively using R programming.

data-analysis data-preparation data-reshaping data-science data-visualization data-visualizations ggplot r r-for-data-science

Last synced: 03 Aug 2025

https://github.com/samuelbarbosadev/roof_imoveis_data_analysis

The company hired you because they want to know what would be the 5 properties they should invest in and why, and which 5 you would not recommend investing in at all.

data-preparation data-understanding data-visualization pandas python

Last synced: 13 Aug 2025

https://github.com/piebro/simple-image-classification-labeling-website

A simple website to label images for classification locally.

data-preparation image-classification image-labeling no-backend web-app

Last synced: 11 Mar 2025

https://github.com/satyacoder29/superstore-sales-analysis-

Analyzed Superstore Sales Data to uncover trends, optimize sales, and improve profitability. Explored customer segments, regional performance, and product categories using Python and Power BI. Delivered actionable insights to enhance revenue, streamline inventory, and refine marketing strategies, driving data-informed decision-making.

data-cleaning data-preparation data-preprocessing dataanalysis powerbi powerquerym relationships tableau tableau-server tableau-visualizations unions visualisation

Last synced: 20 Aug 2025

https://github.com/samuelbarbosadev/walrmart_data_analysis

You have been hired by Walmart to survey the revenue of their stores in the USA and point out which store would be best to expand its size. It is necessary to analyze the weekly sales of each store, calculate some important information that will be asked, and at the end of it all, indicate which store should be invested in.

data-preparation data-understanding data-visualization pandas python

Last synced: 22 Mar 2025

https://github.com/leftcoastnerdgirl/excel_crowdfunding_analysis

This project demonstrates the use of MS Excel for data cleansing & formatting to prepare for data analysis and visualization.

bar-charts conditional-formatting data-analysis data-analytics data-analytics-excel data-preparation data-preprocessing data-visualization excel line-graph

Last synced: 27 Jul 2025

https://github.com/chaitanya1436/student_performance_analysis

A project focused on analyzing college student performance using data on department, assessment scores, and performance labels. Implemented in Google Colab, the analysis includes data preprocessing, feature scaling, and exploratory data analysis to uncover insights and prepare the data for further analysis or modeling.

ata-preprocessing data-preparation exploratory-data-analysis feature-scaling google-colab numpy pandas scikit-learn

Last synced: 02 Aug 2025

https://github.com/antbit96/dataform_poc

Template for basic data preparation

bigquery bigquery-dataform data-preparation

Last synced: 12 Aug 2025

https://github.com/samuelsoaress/challenge-neural-networks-capgemini

This repository contains the code developed for the whale breed recognition challenge

cnn-classification data-augmentation data-preparation tensorflow-models

Last synced: 17 Aug 2025

https://github.com/amitreddy14/2019-election-analysis-and-swing-prediction-model

This project analyzes voter behavior in India's 2019 general election, identifying patterns across demographics, economic conditions, and social factors using statistical methods and machine learning. By assessing regional disparities and government policies, we aim to elucidate India's democratic process and improve election outcome forecasting.

data-preparation feature-engineering linear-regression multilayer-perceptron support-vector-machines

Last synced: 07 Apr 2025

https://github.com/ksm26/pretraining-llms

Master the essential steps of pretraining large language models (LLMs). Learn to create high-quality datasets, configure model architectures, execute training runs, and assess model performance for efficient and effective LLM pretraining.

ai-training cost-effective-pretraining data-preparation depth-upscaling developer-advocacy high-quality-datasets hugging-face large-language-models llm-evaluation machine-learning meta-llama model-configuration model-initialization performance-assessment pretraining-llms text-generation training-runs

Last synced: 28 Mar 2025

https://github.com/ndomah/1.-the-basics

1. The Basics from The Data Engineering Academy

data-cleaning data-engineering data-preparation docker python sql

Last synced: 01 Jul 2025

https://github.com/tigureis/house-rent-analysis

House Rent Data Cleaning and Preparation: Clean and preprocess house rent data for further analysis.

data-cleaning data-preparation pandas seaborn

Last synced: 14 Jun 2025

https://github.com/pierrekieffer/datapreprocessing

Custom data preprocessing library made for machine learning

data-preparation data-preprocessing machine-learning preprocessing scikit-learn

Last synced: 31 Mar 2025

https://github.com/nadahamdy217/movies-data-etl-using-python-gcp

Developed a comprehensive ETL pipeline for movie data using Python, Docker, and a GCP Pub/Sub emulator. Successfully processed and published the data in a local Docker environment, showcasing advanced data engineering skills.

analytics data data-engineering data-ingestion data-preparation data-preprocessing data-processing data-project docker etl etl-pipeline gcp matplotlib matplotlib-pyplot numpy pandas pubsub python scipy seaborn

Last synced: 06 Jan 2026

https://github.com/notthestallion/data_preparation_4_ml_algorithm

This project focuses on data preparation and follows these steps: data cleaning, handling text and categorical attributes, and feature scaling.

data-cleaning data-preparation data-preprocessing data-science feature-scaling ml onehot-encoder onehot-encoding

Last synced: 28 Nov 2025

https://github.com/tynoee/nashville-housing-data-cleaning

This repository contains SQL scripts used to clean and prepare the Nashville Housing dataset for analysis.

cte data-analytics data-cleaning data-engineering data-preparation data-processing database etl real-estate-data sql sql-server

Last synced: 12 Jun 2025

https://github.com/ndomah1/data-cleaning-in-mysql

This project cleans and standardizes a global dataset of tech layoffs using MySQL, transforming raw data into an analysis-ready format.

data-cleaning data-preparation layoffs mysql sql

Last synced: 25 Mar 2025

https://github.com/terilios/automated_data_scientist

Automated Data Scientist: An intelligent, adaptive data analysis tool that leverages AI-driven automation to dynamically plan, execute, and refine data science workflows. Automatically handles data preparation, analysis planning, code generation, and result interpretation using advanced language models.

adaptive-analytics ai-driven-analytics ai-powered-data-tools api-integration automated-data-science automation data-insights data-preparation data-science-workflow data-visualization dynamic-analysis-planning exploratory-data-analysis intelligent-data-processing language-models machine-learning ml-ops openai-gpt python scalable-data-analysis

Last synced: 23 Jun 2025

https://github.com/archettialberto/federated_survival_datasets

Build realistic heterogeneous datasets for federated survival analysis in a reproducible way.

data-preparation dataset datasets federated-learning heterogeneity survival-analysis time-to-event

Last synced: 08 Oct 2025

https://github.com/mohawk2/data-prepare

Module to prepare CSV (etc.) data for automatic processing

data-cleaning data-preparation data-science perl

Last synced: 12 Oct 2025

https://github.com/ishmal793/lists-tuples-dictionaries-json-sets

Beginner-friendly Python practice covering core collection types like lists, tuples, dictionaries, sets, and JSON with real-world problems.

beginner-projects data-preparation data-structures dictionaries json lists python python-collections python-practice sets text-processing tuples

Last synced: 13 Oct 2025

https://github.com/chahelgupta/dep-videogames-dataset

The data extraction and processing involved thorough exploration, preprocessing, and visualization of the "Video Game Sales with Ratings" dataset.

data-analysis data-exploration data-extraction data-preparation data-preprocessing data-processing data-science data-visualization

Last synced: 15 Oct 2025

https://github.com/jackmnob/python-tableau-eda-stockdash

Data cleaning, preparation, and manipulation (EDA) for an interactive stock market dashboard with Tableau - using pandas (Python) via JupyterLab

cleaning-data dashboard data-analysis data-preparation eda jupyter-notebook jupyterlab python tableau-public

Last synced: 15 Oct 2025