Projects in Awesome Lists tagged with big-data-processing
A curated list of projects in awesome lists tagged with big-data-processing.
https://github.com/souvik-databricks/dlt-with-debug
A lightweight helper utility that lets developers build pipelines interactively by keeping a single source code base for both DLT runs and non-DLT interactive notebook runs.
big-data big-data-processing databricks delta-live-tables dlt etl etl-pipeline python3 spark
Last synced: 10 Sep 2025
https://github.com/airscholar/flinkcommerce
This repository contains an Apache Flink application for real-time sales analytics. It uses Docker Compose to orchestrate the necessary infrastructure components, including Apache Flink, Elasticsearch, and Postgres.
apache-flink big-data big-data-processing python realtime-streaming
Last synced: 08 Oct 2025
https://github.com/hope-data-science/r4bd
R for Big Data (Chinese Version)
big-data big-data-analytics-techniques big-data-processing r
Last synced: 25 Jul 2025
https://github.com/akardapolov/dimension-db
Hybrid time-series and block-column storage database engine written in Java
big-data-processing column-store dbms java sql time-series
Last synced: 23 Feb 2026
https://github.com/mtumilowicz/big-data-scala-spark-batch-workshop
Introduction to Spark Batch processing.
batch-processing big-data big-data-processing spark spark-sql workshop workshop-materials
Last synced: 15 Apr 2025
https://github.com/anirban166/big-data-ft.-genomics
Analysis, organization and querying of large genomic datasets using C++, Monsoon and various data structures.
big-data-processing bioinformatics data-structures-and-algorithms genomic-sequences
Last synced: 18 Mar 2025
https://github.com/mileristovski/dataengineer-sparkstreaming
Track a Boat is a real-time maritime tracking system that uses Kafka, Spark Structured Streaming, and WebSockets. It lets you visualize vessel positions, analyze their trajectories, and predict their destinations on an interactive map.
big-data-processing distributed-computing docker docker-compose kafka kafka-topics maritime-data pipeline python scala spark-structured-streaming websocket
Last synced: 08 Oct 2025
https://github.com/adnanrahin/nfl-big-data-bowl-2022
The 2022 Big Data Bowl data contains Next Gen Stats player tracking, play, game, player, and PFF scouting data for all 2018-2020 Special Teams plays. Here, you'll find a summary of each data set in the 2022 Big Data Bowl, a list of key variables to join on, and a description of each variable.
big-data big-data-processing rdd scala spark spark-sql
Last synced: 30 Oct 2025
https://github.com/sayamalt/steel-energy-consumption-prediction-using-pyspark
A machine learning model built with PySpark that predicts the energy consumption of the steel industry, reaching an R² score of approximately 99.5%.
apache-spark big-data-analytics big-data-processing cross-validation data-visualization exploratory-data-analysis hyperparameter-tuning machine-learning model-training-and-evaluation python regression spark sql
Last synced: 06 Dec 2025
https://github.com/bilgeswe/bigdatamanagement
Building a Data Pipeline with Lakehouse Architecture on the Microsoft Azure Platform
azure azure-pipelines azure-service azure-storage big-data big-data-analytics big-data-processing data-visualization datalake-ingestion dataset kaggle sql uml-diagram
Last synced: 22 Jan 2026
https://github.com/turnipdo/docker-spark-setup
Setting up a Spark cluster in a Docker environment for improved repeatability and reliability. This project includes a simple transformation on a dataset containing approximately 31 million rows.
big-data-processing docker-container setup spark
Last synced: 07 Feb 2026
https://github.com/srking501/csc8101_coursework
A summative coursework for CSC8101 Engineering for AI
apache-parquet apache-spark azure-databricks big-data big-data-analytics big-data-processing data-science databri databricks-notebooks delta-file nyc-taxi-dataset parquet-files pyspark
Last synced: 12 Feb 2026
https://github.com/lefteris-souflas/redis-mongodb-assignment
Analyzing classified ads data from the used motorcycles market. Tasks involve utilizing Redis Bitmaps for analytics on seller actions and MongoDB for analyzing bike listings. Includes data installation, cleaning, and analysis.
big-data-processing bitmap json mongo-database r redis redis-vs-rdbms-comparison
Last synced: 02 Mar 2025
https://github.com/khanovico/energy-data-analysis
A cloud-based model that analyzes a real-world dataset with BigQuery and other big-data analysis tools. Includes a Docker image for running the app in cross-platform environments.
big-data-processing bigquery docker google-app-engine jupyter-notebook mlflow python scikit-learn seaborn xgboost
Last synced: 17 Feb 2026
https://github.com/rociobenitez/happiness-index-data-processing
Repository for Big Data Processing - Contains Jupyter Notebooks and Datasets for data analysis and processing tasks related to Big Data.
big-data big-data-processing data-analysis data-processing happiness-index happiness-report jupyter-notebook matplotlib pandas seaborn
Last synced: 26 Jun 2025
https://github.com/leonardogemin/bigdatacomputing_unipd
Collection of homework (mostly Spark-based) from the course "Big Data Computing" - University of Padua.
big-data-processing java spark
Last synced: 05 Mar 2025
https://github.com/adi3042/data_science
📊🚀 Explore the Data Science Universe! Unlock insights and master data skills with hands-on assignments spanning machine learning, visualization, and more. Your journey to becoming a data expert starts here! 🎯💡 DataScienceJourney
anomaly-detection big-data-processing classification clustering computer-vision data-cleaning-and-preprocessing data-visualization deep-learning dimensionality-reduction ensemble-learning exploratory-data-analysis feature-engineering machine-learning model-deployment model-selection-and-evaluation natural-language-processing regression-analysis statistical-analysis time-series-analysis-and-forecasting
Last synced: 17 Jan 2026
https://github.com/tashi-2004/apache-hadoop-spark-hive-cyberanalytics
This project utilizes Apache Hadoop, Hive, and PySpark to process and analyze the UNSW-NB15 dataset, enabling advanced query analysis, machine learning modeling, and visualization. The project demonstrates efficient data ingestion, processing, and predictive analytics for network security insights.
ai apache-hadoop apache-hive big-data-analytics big-data-processing data-analysis data-engineering data-science data-security data-visualization hdfs machine-learning network-analysis network-security pyspark python3 threat-detection unsw-nb15-dataset
Last synced: 05 Apr 2025