Projects in Awesome Lists tagged with data-preparation
A curated list of projects in awesome lists tagged with data-preparation.
https://github.com/hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Last synced: 14 May 2025
https://github.com/skrub-data/skrub
Machine learning with dataframes
data data-analysis data-cleaning data-preparation data-preprocessing data-science data-wrangling dataframe dataframes dirty-data machine-learning
Last synced: 06 Jan 2026
https://github.com/NVIDIA/NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
data data-curation data-prep data-preparation data-processing data-processing-pipelines data-quality datacuration datarecipes deduplication fast-data-processing fine-tuning large-language-models large-scale-data-processing llm llm-data-quality llmapps python semantic-deduplication
Last synced: 29 Jul 2025
https://github.com/data-prep-kit/data-prep-kit
Open source project for data preparation for GenAI applications
code-quality data data-prep data-preparation data-preprocessing data-preprocessing-pipelines datacuration datarecipes deduplication finetuning large-language-models large-scale-data-processing llm llmapps malware python ray spark
Last synced: 15 Dec 2025
https://github.com/NVIDIA-NeMo/Curator
Scalable data preprocessing and curation toolkit for LLMs
data data-curation data-prep data-preparation data-processing data-processing-pipelines data-quality datacuration datarecipes deduplication fast-data-processing fine-tuning large-language-models large-scale-data-processing llm llm-data-quality llmapps python semantic-deduplication
Last synced: 20 Jul 2025
https://github.com/developmentseed/label-maker
Data Preparation for Satellite Machine Learning
computer-vision data-preparation deep-learning keras remote-sensing satellite-imagery
Last synced: 16 May 2025
https://github.com/packtworkshops/the-data-science-workshop
A New, Interactive Approach to Learning Data Science
binaryclassification clusteranalysis data-preparation datascience dimensionality-reduction ensemble-learning- feature-engineering hyperparameter-tuning- machine-learning machine-learning-pipelines python random-forest regression
Last synced: 05 Apr 2025
https://github.com/hi-primus/bumblebee
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
bumblebee cudf dask dask-cudf data-cleaning data-preparation data-profiling datasets gpu gui optimus prepare-data python
Last synced: 02 May 2025
https://github.com/asavinov/prosto
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
business-intelligence data-preparation data-preprocessing data-processing data-science data-wrangling feature-engineering map-reduce olap pandas python spark workflow
Last synced: 11 Apr 2025
https://github.com/sbcgua/mockup_loader
ABAP unit testing framework: prepare test data in Excel, reuse it in ABAP code
abap data-preparation hacktoberfest mockup-loader sap test-automation testing-tools unit-testing
Last synced: 28 Oct 2025
https://github.com/soumyadip007/data-science-using-python-university-course-module
“Data science” is just about as broad of a term as they come. It may be easiest to describe what it is by listing its more concrete components: Data exploration & analysis. Included here: Pandas; NumPy; SciPy; a helping hand from Python's Standard Library.
data-preparation data-preprocessing data-processing data-science data-visualization jupyter-notebook knn numpy panda plotting python
Last synced: 23 Jun 2025
https://github.com/kukuster/sumstatsrehab
GWAS summary statistics files QC tool
bioinformatics bioinformatics-tool compbio computational-biology data-prep data-preparation data-preprocessing gwas gwas-pipeline gwas-summary-statistics summary-statistics sumstats
Last synced: 09 Apr 2025
https://github.com/ELToulemonde/dataPreparation
Data preparation for data science projects.
data-preparation data-preprocessing data-science date-conversion r speed variable-elimination variable-selection
Last synced: 30 Jul 2025
https://github.com/neuro-ml/reskit
A library for creating and curating reproducible pipelines for scientific and industrial machine learning
data-preparation grid-search pipeline prepare-data python reproducible-experiments reproducible-research scikit-learn
Last synced: 19 Jul 2025
https://github.com/salehjg/Shapenet2_Preparation
A Python script to convert and down-sample mesh data into point clouds using the farthest point sampling (FPS) algorithm.
data-preparation dataset farthest-point-sampling hdf5 python shapenet-dataset shapenetcore
Last synced: 20 Mar 2025
https://github.com/dustin-decker/featuremill
General-purpose, fast, stateless, and deterministic feature extractor written in Go for use in machine learning
data-preparation feature-engineering feature-extraction go golang machine-learning vectorization
Last synced: 26 Jul 2025
https://github.com/aicorsair/dataquest-data-science-analysis-projects
A repository dedicated to storing guided projects completed while learning data science concepts with Dataquest.
classification-models cluster-analysis data-analysis data-analytics data-cleaning data-preparation data-preprocessing data-science data-visualization deep-learning excel feature-engineering machine-learning pandas-dataframe power-bi python-3 regression-models scikit-learn sql web-scraping
Last synced: 27 Oct 2025
https://github.com/rashadgarayev/image-classificationnn
Image classification using an SVM with a simple neural network.
data-preparation deep-neural-networks feature-extraction neuralnetwork opencv-python region-proposal selective-search svm
Last synced: 08 Oct 2025
https://github.com/18520339/finding-similar-images
Finding similar images from image URLs using ImageHash
data-preparation google-sheets-api gspread imagehash similar-images
Last synced: 15 Apr 2025
https://github.com/labrijisaad/prediction-du-cours-de-bourse
Forecast Apple stock prices using Python, machine learning, and time series analysis. Compare performance of four models for comprehensive analysis and prediction.
apple-inc-aapl autoregressive-integrated-moving-average-arima data-preparation data-visualization exploratory-data-analysis linear-regression long-short-term-memory-lstm machine-learning model-development model-performance-comparison pandas-ta python stock-price-forecasting support-vector-machines-svm time-series-analysis
Last synced: 08 Apr 2025
https://github.com/csfelix/data-science-mental-maps
🐍 Mental Maps Related to Contents in Data Science 🐍
computer-vision cross-validation data-preparation data-science data-transformation deep-learning encoder feature-engineering imputation machine-learning normalization one-hot-encoder ordinal-encoder pickle pipelines python scale shap-values standardization xgboost
Last synced: 28 Apr 2025
https://github.com/hrolive/from-data-to-insights-with-google-cloud-platform
A four-course accelerated online specialization that teaches participants how to derive insights through data analysis and visualization using Google Cloud Platform
data-analysis data-cleaning data-preparation data-visualization sql
Last synced: 12 May 2025
https://github.com/kozodoi/dptools
Python package with utilities for data processing, aggregation, feature engineering and data versioning
aggregation data-preparation data-preprocessing data-science feature-engineering python
Last synced: 08 May 2025
https://github.com/kwokhing/exploratory-data-analysis-on-smrt-tweets
Demo of exploratory data analysis (EDA) on train service disruptions, based on scraped (user-generated content) tweets from the train operator's (SMRT) Twitter account
data-analysis data-cleaning data-collection data-preparation exploratory-data-analysis exploratory-data-visualizations folium geospatial-data leaflet-map python python3 regex scraping selenium selenium-python social-media text-processing user-generated-content web-scraping webscraping
Last synced: 27 Jul 2025
https://github.com/nragland37/event-optimization-tool
R-based Shiny application that maps availability and identifies optimal engagement times to enhance participation within an organization
data-analysis data-cleaning data-preparation heatmap r shiny shiny-app tidyverse
Last synced: 14 Apr 2025
https://github.com/nisheethjaiswal/Data-Annotator-for-SpaCy
🚀 SpAnnor: an easy-to-use annotation tool for Named Entity Recognition. The annotator allows users to quickly assign custom labels to one or more entities in the text. Easy to set up for preparing training data for SpaCy 🔥.
data-annotation data-annotation-tools data-labeling data-preparation named-entity-recognition nlp spacy-nlp text-labeling
Last synced: 06 Aug 2025
https://github.com/muneeb1030/finetune-tiny-llama
Fine-tuning the Tiny Llama model to mimic my professor's writing style using the Llama Factory. The project involves data collection, preprocessing, preparation, fine-tuning, and evaluation.
data data-preparation data-preprocessing finetuning llama-factory llm pymupdf selenium-python spacy tinyllama webscraping
Last synced: 28 Dec 2025
https://github.com/mzguntalan/vegetable
Vegetable contains a design/definition of a vector graphic that can easily be rendered as an equally spaced point cloud/sequence. From this, Vegetable offers a way to read .ttf font files and render their glyphs into point clouds/sequences.
data-preparation data-science point-cloud python svg ttf vector-graphics
Last synced: 09 Aug 2025
https://github.com/furk4neg3/sales-forecasting
Created AI models to forecast Walmart's sales. Used different models, such as dense, LSTM, GRU, and a naive model, with different window and horizon sizes. Compared the models visually at the end.
artificial-intelligence data-preparation data-visualization deep-learning forecasting gated-recurrent-unit lstm tensorflow tensorflow2
Last synced: 13 May 2025
https://github.com/carpentries-incubator/rna-seq-data-for-ml
RNA-Seq: Data Readiness for Machine Learning Applications
carpentries-incubator data-preparation data-preprocessing data-readiness english lesson machine-learning pre-alpha rna-seq
Last synced: 02 Sep 2025
https://github.com/vineet416/travel_data_analysis
Travel Data Analysis Internship Project at iNeuron.
data-analysis-python data-cleaning data-exploration data-preparation data-visualization exploratory-data-analysis jupyter-notebook matplotlib-pyplot pandas python seaborn statistical-analysis
Last synced: 23 Mar 2025
https://github.com/hitesh22rana/sourcecollector
A simple tool to consolidate multiple files into a single .txt file. Perfect for feeding your files to AI tools without any fuss.
ai-tools data-preparation file-processing text-processing utility
Last synced: 04 Nov 2025
https://github.com/serhatderya/medical_examination_research
This repository contains research on medical examinations (such as body measurements, results from various blood tests, and lifestyle choices).
catplot data-analysis data-analytics data-cleaning data-preparation data-preprocessing data-science data-visualization eda exploratory-data-analysis exploratory-data-visualizations heatmap jupyter-notebook medical preprocessing python research seaborn
Last synced: 22 Feb 2025
https://github.com/belajarqywok/thesis_forecasting_autoprep
[Thesis Part] Automatic data preparation
auto-scraper automation data-engineering data-preparation github-actions yahoo-finance
Last synced: 23 Jul 2025
https://github.com/gracysapra/r-in-data-science
This repository contains essential guides for data analysis using R, covering topics like data preparation, data reshaping, and data visualization. Each file focuses on fundamental techniques to manipulate, clean, and visualize data effectively using R programming.
data-analysis data-preparation data-reshaping data-science data-visualization data-visualizations ggplot r r-for-data-science
Last synced: 03 Aug 2025
https://github.com/samuelbarbosadev/roof_imoveis_data_analysis
The company hired you because they want to know what would be the 5 properties they should invest in and why, and which 5 you would not recommend investing in at all.
data-preparation data-understanding data-visualization pandas python
Last synced: 13 Aug 2025
https://github.com/piebro/simple-image-classification-labeling-website
A simple website to label images for classification locally.
data-preparation image-classification image-labeling no-backend web-app
Last synced: 11 Mar 2025
https://github.com/satyacoder29/superstore-sales-analysis-
Analyzed Superstore Sales Data to uncover trends, optimize sales, and improve profitability. Explored customer segments, regional performance, and product categories using Python and Power BI. Delivered actionable insights to enhance revenue, streamline inventory, and refine marketing strategies, driving data-informed decision-making.
data-cleaning data-preparation data-preprocessing dataanalysis powerbi powerquerym relationships tableau tableau-server tableau-visualizations unions visualisation
Last synced: 20 Aug 2025
https://github.com/mituskillologies/data-science-sep24
Programs of Data Science batch @ MITU Skillologies, September 2024
clustering data-analytics data-preparation data-preprocessing data-science data-structures data-visualization machine-learning mysql powerbi python-programming sql supervised-learning unsupervised-learning
Last synced: 03 Mar 2025
https://github.com/wlodpawlowski/machine-learning-basic-datasets
Repository consisting of code snippets and projects from my personal lessons, recorded at the University of San Francisco while learning machine learning.
building-data-products data-preparation data-products gradient-descent logistic-regression machine-learning machine-learning-a-to-z machine-learning-algorithms machine-learning-library machine-learning-mathematics machine-learning-mo machine-learning-model machine-learning-models machine-learning-practice ml model-validation nlp-machine-learning random-forest regularization rf
Last synced: 11 Jun 2025
https://github.com/chaitanyac22/investment-analysis-for-an-asset-management-company
Data analysis to identify the best sectors, countries, and a suitable investment type for making investments.
business-analytics business-intelligence data-analysis data-cleaning data-insights data-manipulation data-preparation data-visualization decision-making finance python3 risk-management statistics
Last synced: 27 Mar 2025
https://github.com/rosanafss/data-visualization-nanodegree
Data Visualization Nanodegree
dashboard data-analysis data-preparation data-visualization design interactive sketch storytelling tableau wireframe
Last synced: 19 Nov 2025
https://github.com/samuelbarbosadev/walrmart_data_analysis
You have been hired by Walmart to survey the revenue of their stores in the USA and point out which store would be best to expand its size. It is necessary to analyze the weekly sales of each store, calculate some important information that will be asked, and at the end of it all, indicate which store should be invested in.
data-preparation data-understanding data-visualization pandas python
Last synced: 22 Mar 2025
https://github.com/vineet416/eda-travel
EDA Travel data by PW Skills Data Analytics Course.
data-encoding data-preparation data-preprocessing data-visualization exploratory-data-analysis jupyter-notebook matplotlib numpy pandas plotly python sklearn-library train-test-split train-test-using-sklearn
Last synced: 07 Nov 2025
https://github.com/leftcoastnerdgirl/excel_crowdfunding_analysis
This project demonstrates the use of MS Excel for data cleansing & formatting to prepare for data analysis and visualization.
bar-charts conditional-formatting data-analysis data-analytics data-analytics-excel data-preparation data-preprocessing data-visualization excel line-graph
Last synced: 27 Jul 2025
https://github.com/chaitanya1436/student_performance_analysis
A project focused on analyzing college student performance using data on department, assessment scores, and performance labels. Implemented in Google Colab, the analysis includes data preprocessing, feature scaling, and exploratory data analysis to uncover insights and prepare the data for further analysis or modeling.
ata-preprocessing data-preparation exploratory-data-analysis feature-scaling google-colab numpy pandas scikit-learn
Last synced: 02 Aug 2025
https://github.com/antbit96/dataform_poc
Template for basic data preparation
bigquery bigquery-dataform data-preparation
Last synced: 12 Aug 2025
https://github.com/samuelsoaress/challenge-neural-networks-capgemini
This repository contains the code used to develop the whale breed recognition challenge.
cnn-classification data-augmentation data-preparation tensorflow-models
Last synced: 17 Aug 2025
https://github.com/melvinjwallace/melvinjw.github.io
A portfolio of projects completed using Python and SQL.
data data-analysis data-cleaning data-loading data-mining data-preparation data-processing data-science data-transformation data-visualization dataset matplotlib microsoft-sql-server pandas-python seaborn
Last synced: 26 Dec 2025
https://github.com/amitreddy14/2019-election-analysis-and-swing-prediction-model
This project analyzes voter behavior in India's 2019 general election, identifying patterns across demographics, economic conditions, and social factors using statistical methods and machine learning. By assessing regional disparities and government policies, we aim to elucidate India's democratic process and improve election outcome forecasting.
data-preparation feature-engineering linear-regression multilayer-perceptron support-vector-machines
Last synced: 07 Apr 2025
https://github.com/ksm26/pretraining-llms
Master the essential steps of pretraining large language models (LLMs). Learn to create high-quality datasets, configure model architectures, execute training runs, and assess model performance for efficient and effective LLM pretraining.
ai-training cost-effective-pretraining data-preparation depth-upscaling developer-advocacy high-quality-datasets hugging-face large-language-models llm-evaluation machine-learning meta-llama model-configuration model-initialization performance-assessment pretraining-llms text-generation training-runs
Last synced: 28 Mar 2025
https://github.com/ndomah/1.-the-basics
1. The Basics from The Data Engineering Academy
data-cleaning data-engineering data-preparation docker python sql
Last synced: 01 Jul 2025
https://github.com/tigureis/house-rent-analysis
House Rent Data Cleaning and Preparation: Clean and preprocess house rent data for further analysis.
data-cleaning data-preparation pandas seaborn
Last synced: 14 Jun 2025
https://github.com/pierrekieffer/datapreprocessing
Custom data preprocessing library made for machine learning
data-preparation data-preprocessing machine-learning preprocessing scikit-learn
Last synced: 31 Mar 2025
https://github.com/officialyapper/project-credit-risk-analysis
German Credit Data - 1994
aws-ec2 aws-emr-clusters classification-algorithm cloudera-manager conjoint-analysis cost-sensitive-learning data-preparation exploratory-data-analysis finance github-config learning sklearn-library spark sqlite
Last synced: 03 Jul 2025
https://github.com/nadahamdy217/movies-data-etl-using-python-gcp
Developed a comprehensive ETL pipeline for movie data using Python, Docker, and a GCP Pub/Sub emulator. Successfully processed and published the data in a local Docker environment, showcasing advanced data engineering skills.
analytics data data-engineering data-ingestion data-preparation data-preprocessing data-processing data-project docker etl etl-pipeline gcp matplotlib matplotlib-pyplot numpy pandas pubsub python scipy seaborn
Last synced: 06 Jan 2026
https://github.com/jackieocham/rest-metrics-data-analysis
Data analysis on sleep and health tracking data collected over many years
data-analysis data-cleaning data-manipulation data-preparation data-project exploratory-data-analysis initial-data-analysis mysql mysql-database sql
Last synced: 01 Apr 2025
https://github.com/notthestallion/data_preparation_4_ml_algorithm
This project focuses on data preparation and follows these steps: data cleaning, handling text and categorical attributes, and feature scaling.
data-cleaning data-preparation data-preprocessing data-science feature-scaling ml onehot-encoder onehot-encoding
Last synced: 28 Nov 2025
https://github.com/tynoee/nashville-housing-data-cleaning
This repository contains SQL scripts used to clean and prepare the Nashville Housing dataset for analysis.
cte data-analytics data-cleaning data-engineering data-preparation data-processing database etl real-estate-data sql sql-server
Last synced: 12 Jun 2025
https://github.com/ndomah1/data-cleaning-in-mysql
This project cleans and standardizes a global dataset of tech layoffs using MySQL, transforming raw data into an analysis-ready format.
data-cleaning data-preparation layoffs mysql sql
Last synced: 25 Mar 2025
https://github.com/lkethridge/sda_project
A Statistical Data Analysis project from TripleTen
binomial-distribution continuous-variables data-aggregation data-manipulation data-preparation distribution frequency-histogram hypothesis-tests law-of-large-numbers normal-approximation normal-distribution one-tail-test paired-samples probability-theory random-sampling skewed-data standard-deviation statistical-data-analysis summary-statistics two-tail-test
Last synced: 04 Jul 2025
https://github.com/terilios/automated_data_scientist
Automated Data Scientist: An intelligent, adaptive data analysis tool that leverages AI-driven automation to dynamically plan, execute, and refine data science workflows. Automatically handles data preparation, analysis planning, code generation, and result interpretation using advanced language models.
adaptive-analytics ai-driven-analytics ai-powered-data-tools api-integration automated-data-science automation data-insights data-preparation data-science-workflow data-visualization dynamic-analysis-planning exploratory-data-analysis intelligent-data-processing language-models machine-learning ml-ops openai-gpt python scalable-data-analysis
Last synced: 23 Jun 2025
https://github.com/ndomah/the-data-engineering-academy
Materials from The Data Engineering Academy
apache-airflow apache-kafka apache-spark data-cleaning data-engineering data-preparation databricks dbt dimensional-data-modeling docker elasticsearch fastapi mongodb pipeline platform python relational-data-modeling snowflake sql
Last synced: 19 Mar 2025
https://github.com/archettialberto/federated_survival_datasets
Build realistic heterogeneous datasets for federated survival analysis in a reproducible way.
data-preparation dataset datasets federated-learning heterogeneity survival-analysis time-to-event
Last synced: 08 Oct 2025
https://github.com/mohawk2/data-prepare
Module to prepare CSV (etc.) data for automatic processing
data-cleaning data-preparation data-science perl
Last synced: 12 Oct 2025
https://github.com/ishmal793/lists-tuples-dictionaries-json-sets
Beginner-friendly Python practice covering core collection types like lists, tuples, dictionaries, sets, and JSON with real-world problems.
beginner-projects data-preparation data-structures dictionaries json lists python python-collections python-practice sets text-processing tuples
Last synced: 13 Oct 2025
https://github.com/chahelgupta/dep-videogames-dataset
The data extraction and processing involved thorough exploration, preprocessing, and visualization of the "Video Game Sales with Ratings" dataset.
data-analysis data-exploration data-extraction data-preparation data-preprocessing data-processing data-science data-visualization
Last synced: 15 Oct 2025
https://github.com/jackmnob/python-tableau-eda-stockdash
Data cleaning, preparation, and manipulation (EDA) for an interactive stock market dashboard with Tableau - using pandas (Python) via JupyterLab
cleaning-data dashboard data-analysis data-preparation eda jupyter-notebook jupyterlab python tableau-public
Last synced: 15 Oct 2025