Projects in Awesome Lists tagged with dataprocessing
A curated list of projects in awesome lists tagged with dataprocessing.
https://github.com/prakhar21/50-Days-of-ML
A day-to-day plan for this challenge (50 Days of Machine Learning). Covers both theoretical and practical aspects.
100daysofcode 100daysofmlcode dataprocessing deep-learning deep-neural-networks machine-learning pandas python siraj-raval tutorial
Last synced: 14 Apr 2025
https://github.com/Jean-njoroge/Breast-cancer-risk-prediction
Classification of Breast Cancer diagnosis Using Support Vector Machines
breast-cancer-prediction breast-cancer-tumor breastcancer-classification classification data-analysis dataprocessing exploratory-data-analysis notebook pipelines prediction-model python supervised-learning svm
Last synced: 09 Jul 2025
https://github.com/wp-labs/warp-parse
Focusing on building industry-leading ETL engines.
data-collector dataprocessing etl etl-framework etl-pipeline etl-process events logging metrics observability oml pipeline siem streaming streaming-processing telemetry warp-parse wp-parse wpl
Last synced: 26 Apr 2026
https://github.com/csbiology/biofsharp
Open source bioinformatics and computational biology toolbox written in F#.
amino-acids biocontainers bioinformatics bioinformatics-containers biology biostatistics dataprocessing datascience docker fsharp nucleotides sequence-analysis
Last synced: 05 Apr 2025
https://github.com/pedrokehl/caminho
Tool for creating efficient data pipelines in a JavaScript environment
backpressure concurency data dataprocessing functional javascript parallel pipeline reactive typescript
Last synced: 06 Apr 2025
https://github.com/nkaz001/data-tardis
Process tardis.dev cryptocurrency data, reconstructing the market depth and computing imbalance.
cryptocurrency dataprocessing orderbook orderbook-tick-data tardis trading-strategies
Last synced: 14 Apr 2025
https://github.com/plainerman/tictactube
A versatile pipelining library created with media organization in mind.
album dataprocessing genius media metadata metadata-management pipeline pipeline-framework songs soundcloud telegram telegram-bot youtube-dl
Last synced: 10 Apr 2025
https://github.com/fern-flower-lab/sqlg-clj
The SQL Graph with Tinkerpop3 and Clojure
clj clojure clojure-library database dataprocessing graph gremlin h2 hsql maria mssql mysql postgresql rdbms-to-graph sql tinkerpop tinkerpop-graphs tinkerpop3
Last synced: 28 Feb 2026
https://github.com/divithraju/divith-raju-immigration-data-engineering
A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)
apachespark bigdata bigdataprocessing bigdataproject capstone-project datacleaning dataengineering datalake datamodeling datapipeline dataprocessing dataschema dataset datawherehouse pandas sql
Last synced: 29 Apr 2026
https://github.com/zeeshanahmad4/webscrapesummarizer
WebScrapeSummarizer 🌐✍️: A web tool that fetches and summarizes content from any domain, offering insights in a compact CSV format.
contentsummarization csv dataprocessing naturallanguageprocessing nlp openai php tools webdevelopment webscraping webtool
Last synced: 25 Oct 2025
https://github.com/gojibjib/voice-grabber
Collection of scripts to gather training (meta) data for the ML model
dataprocessing dataset europeana golang
Last synced: 10 May 2026
https://github.com/huseyincenik/data_science
Data Science materials
data data-science data-structures data-visualization dataanalysis dataengineering datapreparation dataprocessing datascience dataset time-series time-series-analysis timeline timeseries timeseries-analysis timeseriesforecasting
Last synced: 25 Jul 2025
https://github.com/rafat-decodis/robust-asr-for-low-resource-languages
Exploring Benchmark Gaps and Real-World Speech Generalization for Low-Resource Languages
artificial-intelligence automatic-speech-recognition data-analysis dataprocessing whisper
Last synced: 23 Jun 2025
https://github.com/yoannpa/computational_epigenomics
This repository contains data and functions related to computational epigenomics data analysis.
450k beta-values bioinformatics computational-epigenomics coverage dataprocessing epigenomics human-methylation-450k methylation wgbs
Last synced: 23 Feb 2026
https://github.com/hq969/credit-card-fraud-detection
The Credit Card Fraud Detection project is a machine learning-based system designed to identify fraudulent transactions in real-time. Using historical transaction data, the model classifies transactions as either fraudulent or legitimate, helping financial institutions reduce financial losses and improve security.
anomaly-detection dataprocessing machine-learning-algorithms security-audit
Last synced: 17 Feb 2026
https://github.com/garugaru/gearpump-swarm
Deploy gearpump on a docker-swarm cluster
bigdata dataprocessing docker real-time swarm
Last synced: 29 Apr 2026
https://github.com/blocknotes-4515/multiple-disease-detection-system-all_in_one-
🩺 Heart Disease Detection System 💓 This AI tool predicts heart disease risk by analyzing key health metrics like age, cholesterol, and blood pressure. 🧠🔍 It provides quick, accurate results to help prevent serious conditions and support early treatment. Perfect for both healthcare pros and patients! 🌟
data-visualization dataprocessing linear-regression standard-library streamlit svm-model test-automation training
Last synced: 25 Feb 2026
https://github.com/asolimando/xqueryprojector
XQuery query processing optimization based on XML projection
dataprocessing query-optimization type-system xml xquery
Last synced: 31 Jan 2026
https://github.com/sayanmondal2098/easytoken
Tokenizer is an independent open-source natural language processing Python library that implements a tokenizer to create tokens from both sentences and paragraphs.
data-science datapreprocessing dataprocessing natural-language natural-language-processing nlp nlp-library nlp-machine-learning python-library python3 text-processing text-summarization token tokenizer
Last synced: 14 Dec 2025
https://github.com/vinayakdon/machine-learning-project-sentimental-classifier-
A sentiment classification tool using machine learning in Python to analyze and predict the sentiment of text data. Features preprocessing, model training, hyperparameter tuning, and evaluation for accurate sentiment analysis.
dataanalytics dataprocessing datascience python training-data
Last synced: 17 May 2026
https://github.com/fgonzalesc/transcripcion_ai
Audio transcription with Azure Speech and insight extraction with OpenAI
ai azure dataprocessing diarization openai-api python speechtotext
Last synced: 18 May 2026
https://github.com/nouranhaitham/ml_waterquality
A notebook aimed at predicting and improving water safety by analyzing contaminants and pollution levels in water sources, enhancing public health and ensuring access to clean drinking water.
classification-models cleansing-data dataprocessing dataset decision-trees gridsearchcv hyperparameter-tuning logistic-regression machine-learning prediction python randomforestclassifier regression-models water-quality
Last synced: 05 Jan 2026
https://github.com/rushikeshbihade/django_bsased_dataanalyzer_webapp
Data Analyzer is a Django web application that enables users to upload CSV files, perform data analysis using pandas and numpy, and view results and visualizations on an interactive web interface. It simplifies data analysis by offering a user-friendly platform for data upload, processing, and visualization.
backend-development bootstrap css3 dataanalysis dataprocessing django-application django-framework djangotemplates html5 javascript numpy pandas plotly-express python3 seaborn-plots webapp
Last synced: 02 Mar 2026
https://github.com/bhavik-jikadara/house-price-prediction
House Price Prediction
data-science dataprocessing eda jupyter-notebook machine-learning matplotlib model numpy pandas python seaborn test-train-dataset
Last synced: 09 Apr 2026
https://github.com/redayzarra/nlp_yelpreviews
This project covers natural language processing (NLP) for classifying user-generated text and determining its intent. The goal is to build a model that can classify 10,000 Yelp reviews as either one-star or five-star reviews. The project showcases a step-by-step implementation of the model as well as in-depth notes.
datapreprocessing dataprocessing machine-learning multinomial-naive-bayes naive-bayes naive-bayes-algorithm naive-bayes-classifier natural-language-processing nlp sentiment-analysis text-classification
Last synced: 22 Aug 2025
https://github.com/soumyagautam/sign-sense
Sign Sense is a desktop app, based on deep learning and neural networks, that converts sign language to speech: it detects hand signs in a frame and converts them to speech according to their meaning. Conversely, it can also recognise your voice and convert it to sign language.
ai cv2 dataprocessing deep-learning keras machine-learning mediapipe moviepy-library neural-network openai-whisper scikit-learn tensorflow tkinter-python
Last synced: 10 Apr 2026
https://github.com/happydream9032/tkinter_demo
This is a simple automation project built with Python and the Tkinter framework.
automation csv dataprocessing pyinstaller python tkinter
Last synced: 28 May 2026
https://github.com/vidhi1290/zomato-data-analysis
Zomato Data Analysis - Explore the world of Zomato restaurant data through Python and data analysis. Uncover trends and insights using Pandas for data manipulation and Matplotlib for visualization. Join us in this journey to reveal the hidden stories within the data!
data-analysis data-analysis-python data-science data-visualization dataprocessing machine-learning machine-learning-algorithms matplotlib numpy pandas python scikit-learn zomato-data-analysis
Last synced: 11 Apr 2026
https://github.com/lynk4/kaggle
This repository houses Python notebooks and scripts that contain solutions to Kaggle competitions.
aimodel data-science dataprocessing datasets datavisualization kaggle kaggle-competition kaggle-dataset kaggle-house-prices kaggle-scripts kaggle-solution machine-learning python3 titanic-dataset titanic-kaggle titanic-survival titanic-survival-prediction
Last synced: 27 Jan 2026
https://github.com/waikato-datamining/multiway-algorithms
Java library of multi-way algorithms.
dataprocessing java multiway-algorithms parafac
Last synced: 17 Oct 2025
https://github.com/hediyeorhan/logisticregressionwitharduino
accuracy activation-functions arduino-uno artificial-intelligence backpropagation dataprocessing datasets f1-score logistic-regression loss machine-learning mse performance-metrics precision recall sigmoid-function timer timer-interrupt
Last synced: 10 Jun 2026
https://github.com/mikeroyal/apache-storm-guide
Apache Storm Guide
batch-processing data-science dataprocessing hadoop real-time storm storm-topology
Last synced: 08 Jan 2026
https://github.com/addingama/sid_waterpoints
Advance IT Test for Summit Institute of Development
Last synced: 15 Mar 2025
https://github.com/silent0wings/ta-management-system
The TA Management System is a C++ project designed to manage records of Teaching Assistants (TAs) within a department. The system ensures that only eligible TAs—those who are currently registered students—are maintained in the records. The project involves filtering out records of TAs who have graduated and updating the TA file accordingly.
clientmanagement clientprofiles cplusplus dataprocessing datavalidation education faculty-management javainterfaces objectorientedprogramming projectreport scheduling softwaredevelopment stringparsing student-management teaching-assistant-management university-tool
Last synced: 01 Mar 2025
https://github.com/sravanigodavarthi/automated-elt-pipeline-aws
An Apache Airflow data pipeline is designed to perform ELT operations, utilizing Amazon S3 and Amazon Redshift Serverless.
airflow aws datamodeling datapipeline dataprocessing dataqualitycheck docker elt-pipeline parquet python redshift-serverless s3-buckets sql
Last synced: 08 May 2026
https://github.com/vigneshkanna18/foodhunter-revenue-drop-analysis
A BI solution developed for FoodHunter to investigate a significant drop in revenue over a four-month period. This analysis helps uncover actionable insights through data exploration, visualization, and hypothesis-driven analysis to support informed decision-making.
analysis dashboarding database dataengineering datamining dataprocessing datavisualization etl-pipeline ipynb mysql powerbi sql streamlit visualization-pipeline
Last synced: 24 Jun 2025
https://github.com/kevinndungu-source/amazon_emr_project_resources
Explore and replicate Amazon EMR (Elastic MapReduce) setup and utilization for big data processing and analytics tasks, featuring comprehensive demonstrations from VPC creation to Spark job execution.
aws-ec2 bigdata bigdatainfrastructure datamanagement dataprocessing emr-cluster juypter-notebook pyspark python
Last synced: 19 May 2026
https://github.com/bobergot/large-scale-data-processing-design-patterns
Explore essential MapReduce design patterns for big data processing! This repository includes practical implementations of patterns from the "MapReduce Design Patterns" book, complete with examples across summarization, filtering, organization, joins, and more.
bigdata bigdataanalytics cloudcomputing dataengineering dataprocessing datascience designpatterns distributedcomputing hadoop java mapreduce
Last synced: 16 Mar 2025
https://github.com/mannasoumya/imputerapi
Data Imputer API in Python
api data-cleaning data-science datapreprocessing dataprocessing imputer machine-learning machine-learning-algorithms matrix python3
Last synced: 25 Mar 2025
https://github.com/analyticalnahid/data-preprocessing
Analyze your data by applying pre-processing techniques
dataanalysis datapreprocessing dataprocessing
Last synced: 05 Sep 2025
https://github.com/neelimabonangi/real-time-weather-data-processing
Processes and analyzes near real-time weather data using the Kappa architecture, Apache Kafka, Spark, Cassandra, Docker, AWS EC2, and a Spring Boot API.
aws cassandra data-visualization dataanalysis dataprocessing docker ec2 json kafka kappa-architecture machine-learning restapi spark springboot-api xml
Last synced: 13 Apr 2026
https://github.com/flaviuvadan/pipe-flow
A data processing pipeline library with a common vocabulary API
dataprocessing golang pipeline
Last synced: 01 Jun 2026
https://github.com/dev-rke/liveprocess
Simple tool to process data on the fly with JavaScript
dataprocessing instant onthefly
Last synced: 08 Jan 2026
https://github.com/1401dev/customer-lifetime-value-prediction
A data science project leveraging Python and Scikit-Learn to build predictive models that estimate customer lifetime value (CLV). Includes data cleaning, feature engineering, and model selection to identify key drivers of CLV, supporting strategic decision-making in customer retention and marketing.
clv clv-analysis customer-retention data-analysis dataprocessing feature-engineering machine-learning marketing-analytics predictive-modeling python regression-analysis scikit-learn
Last synced: 06 May 2026
https://github.com/elijah-1994/pre-process-e-commerce-dataset
Importing, Cleaning, and Pre-Processing E-Commerce Data for Analysis Using MySQL.
analytics data dataanalytics datacleaning dataprocessing mysql mysql-database sql
Last synced: 11 Mar 2025
https://github.com/ngangawairimu/regression-model-for-predicting-house-prices
This project focuses on applying statistical modeling techniques to predict house prices in Melbourne using the Melbourne House Price dataset. It involves data cleaning, exploratory data analysis (EDA), feature selection, and fitting a regression model to predict the target variable, which is the house price.
datacleaning dataprocessing explanatory-data-analysis modelevaluation modelinterpretability regression-analysis
Last synced: 28 Mar 2025
https://github.com/nicolay-r/sentinerel-attitude-extraction
This repository presents studies on sentiment attitude extraction for sentiment relations (RuSentNE) in the SentiNEREL dataset.
bert cnn dataprocessing lstm machine-learning nlp relationextraction russian-language sentiment-analysis
Last synced: 05 Apr 2025
https://github.com/lazycatcoder/waterheatmap
This application generates heatmaps based on temperature data. The application is developed using Node.js.
canvas chai dataanalysis dataprocessing expressjs heatmapping html2canvas javascript mocha mocha-chai nodejs nodejsapp temperature-map temperaturevisualization testing webdevelopment
Last synced: 08 Apr 2026
https://github.com/ssahas/implementing-gpt-from-scratch
Building a decoder-only (GPT-style) LLM from scratch using PyTorch and training it for text generation.
datacleaning dataprocessing large-language-models llm llm-inference llm-training python
Last synced: 14 Oct 2025
https://github.com/devpablooliveira/matrixplore
Web app for processing, uploading, and downloading matrices using FastAPI. Users can upload CSV files, manually input data, and download pre-set matrices. Includes analysis of matrix properties like functionality, injectivity, and surjectivity, with support for matrix combinations and transpose calculations. Built with FastAPI and Jinja2.
academictools algorithms backenddevelopment csv dataprocessing fastapi jinja2 jinja2-templates manipulation mathematics matrixoperations python templates webapplication
Last synced: 09 May 2026
https://github.com/kevinndungu-source/amazon_emr_serverless_demonstration
Explore the capabilities of Amazon EMR Serverless by processing semi-structured review data with Apache Spark, showcasing efficient big data analysis without managing clusters.
apache-spark bigdatacloud bigdatainfrastructure dataprocessing emrserverless python sql-query
Last synced: 19 Jan 2026
https://github.com/aadityasikder/Object-Detection-with-raspberry-pi-implementing-TinyML-models
Repository for Raspberry Pi-based object detection with TinyML models like TensorFlow Lite, PyTorch Nano, including data gathering, mAP evaluation, and image data preparation in Jupyter notebooks.
data-gathering datacleaning dataprocessing image-preparation object-detection pytorch-nano raspberry-pi-4 tensorflow-lite tinyml
Last synced: 16 Dec 2025
https://github.com/srimantapal205/dataengineerwireframedesigns
Data Engineer Wireframe Designs are essential for planning and visualizing data pipelines, architecture, and workflows before implementation.
data-analysis data-engineering dataflow dataflow-programming datapipeline dataprocessing development visualization
Last synced: 29 Jan 2026
https://github.com/annakhsengiv/foodhunter_revenue_drop_analysis
A BI solution developed for FoodHunter to investigate a significant drop in revenue over a four-month period. This analysis helps uncover actionable insights through data exploration, visualization, and hypothesis-driven analysis to support informed decision-making.
analysis dashboarding database dataengineering datamining dataprocessing datavisualization etl-pipeline ipynb mysql powerbi sql streamlit visualization-pipeline
Last synced: 07 Jul 2025
https://github.com/dawidolko/datafusion-app-python
Project as part of the Data Warehousing subject.
academic-project data dataprocessing extraction gui loading project pysimplegui python transformation
Last synced: 15 Feb 2026
https://github.com/kaustubholpadkar/r-fundamentals
This repository comprises the solutions to various problems on R Fundamentals.
advanced-database data-science datamining dataprocessing jupyter-notebook r r-packages r-programming statistical-programming
Last synced: 28 Apr 2026
https://github.com/tanzim-prog/sentiment_analysis_bing_lexicon
The aim of this project is to assess customer satisfaction at some residential hotels in Dhaka.
dataanalysis dataprocessing datavisualization lexical-analysis sentiment-analysis webscraping
Last synced: 06 Jun 2026
https://github.com/jadesrochers/streams
Stream wrapper to allow creation of streams with just a function passed to define its operation.
Last synced: 17 Mar 2025
https://github.com/ngupta23/data_prep_helper
A helper package for preparing and combining data from a variety of sources
data data-science dataprep datapreparation dataprocessing helpers python
Last synced: 03 Apr 2025
https://github.com/trident09/net-sec-ai-mp
This project predicts network traffic patterns using a machine learning model trained on the CICIDS dataset. It includes a Streamlit app for real-time predictions, displaying predicted labels and probabilities for uploaded CSV data. The project is structured into three parts: dataset, model training, and frontend (Streamlit app).
cybersecurity dataprocessing ml network-traffic-analysis random-forest
Last synced: 29 Apr 2026
https://github.com/cagandemirmr/airbnb_available_houses
In this repo, I create a dashboard using Tableau. In the process, I use SQL and Python.
dashboard data-visualization dataprocessing python sql tableau
Last synced: 30 Apr 2026
https://github.com/aadityasikder/object-detection-with-raspberry-pi-implementing-tinyml-models
Repository for Raspberry Pi-based object detection with TinyML models like TensorFlow Lite, PyTorch Nano, including data gathering, mAP evaluation, and image data preparation in Jupyter notebooks.
data-gathering datacleaning dataprocessing image-preparation object-detection pytorch-nano raspberry-pi-4 tensorflow-lite tinyml
Last synced: 18 Feb 2026
https://github.com/kaushik-puttaswamy/dynamic-movie-booking-insights-platform-using-snowflake
The Dynamic Movie Booking Insights Platform processes real-time booking data using Snowflake’s Dynamic Tables, Streams, and Tasks to deliver actionable insights. It features an interactive Streamlit dashboard for visualizing revenue, sales trends, and booking metrics.
businessintelligence changedatacapture dataprocessing datavisualization dynamictables moviebooking python realtimeanalytics revenueinsights snowflake sql streamlit
Last synced: 20 May 2026
https://github.com/nivasharmaa/friskwatch
A Java program for analyzing stop-and-frisk data from the NYPD. Features data import, organization, and statistical analysis to compare occurrences during and after policy implementation.
data-analysis data-visualization dataprocessing datascience file-io java java-oop nypd-data
Last synced: 19 May 2026
https://github.com/qrailibs/dataflow
✨ Data processing in Node.js made multithreaded and type-safe.
data dataprocessing multithread node
Last synced: 04 May 2026
https://github.com/matancohen1205/sparklebot-iot-project
IoT final course project
broker button dataprocessing dht iot mqtt-broker publisher pyqt5 relay sqlite subscriber topics
Last synced: 04 May 2026
https://github.com/ponycool/tcga
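An open-source script that simplifies the workflow of acquiring and parsing TCGA (The Cancer Genome Atlas) data and building gene expression matrices. R scripts fully automate the process from data download to matrix assembly, helping researchers quickly obtain high-quality expression data.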
bioinformatics cancer-research dataprocessing geneexpression r tcga
Last synced: 30 Aug 2025
https://github.com/msamij/zig-flow
Data Engineering pipeline.
apache-spark dataprocessing distributed-computing
Last synced: 07 May 2026
https://github.com/jigyasag18/financial-risk-analysis-project
The Credit Card Financial Risk Analysis Dashboard is a real-time Power BI tool designed to provide insights into credit card transactions and customer demographics. It features interactive visualizations, efficient data processing, and actionable insights to support decision-making. Utilizing data from a SQL database, the dashboard tracks key metrics
data dataanalysis database datacleaning datapreprocessing dataprocessing datavisualization financial-analysis financialriskanalysis mysql powerbi sql statistical-analysis
Last synced: 06 Mar 2026
https://github.com/x3o8/material-dataset-analysis
Analysis of Material Datasets to find trends based on composition
analysis data-science dataprocessing datasets feature-engineering machine-learning materials-science sklearn tensorflow
Last synced: 07 May 2026