Projects in Awesome Lists tagged with big-data-processing

https://github.com/souvik-databricks/dlt-with-debug

A lightweight helper utility that lets developers build pipelines interactively by sharing one source file between DLT runs and non-DLT interactive notebook runs.

big-data big-data-processing databricks delta-live-tables dlt etl etl-pipeline python3 spark

Last synced: 10 Sep 2025

https://github.com/airscholar/flinkcommerce

This repository contains an Apache Flink application for real-time sales analytics. It uses Docker Compose to orchestrate the necessary infrastructure components, including Apache Flink, Elasticsearch, and Postgres.

apache-flink big-data big-data-processing python realtime-streaming

Last synced: 08 Oct 2025

https://github.com/akardapolov/dimension-db

Hybrid time-series and block-column storage database engine written in Java

big-data-processing column-store dbms java sql time-series

Last synced: 24 May 2026

https://github.com/hope-data-science/r4bd

R for Big Data (Chinese Version)

big-data big-data-analytics-techniques big-data-processing r

Last synced: 25 Jul 2025

https://github.com/mtumilowicz/big-data-scala-spark-batch-workshop

Introduction to Spark Batch processing.

batch-processing big-data big-data-processing spark spark-sql workshop workshop-materials

Last synced: 15 Apr 2025

https://github.com/anirban166/big-data-ft.-genomics

Analysis, organization and querying of large genomic datasets using C++, Monsoon and various data structures.

big-data-processing bioinformatics data-structures-and-algorithms genomic-sequences

Last synced: 18 Mar 2025

https://github.com/adnanrahin/nfl-big-data-bowl-2022

The 2022 Big Data Bowl data contains Next Gen Stats player tracking, play, game, player, and PFF scouting data for all 2018-2020 Special Teams plays. Here, you'll find a summary of each data set in the 2022 Data Bowl, a list of key variables to join on, and a description of each variable.

big-data big-data-processing rdd scala spark spark-sql

Last synced: 12 Apr 2026

https://github.com/mileristovski/dataengineer-sparkstreaming

Track a Boat is a real-time maritime tracking system using Kafka, Spark Structured Streaming, and WebSockets. It lets you visualize ship positions, analyze their trajectories, and predict their destinations on an interactive map.

big-data-processing distributed-computing docker docker-compose kafka kafka-topics maritime-data pipeline python scala spark-structured-streaming websocket

Last synced: 11 Apr 2026

https://github.com/adi3042/data_science

📊🚀 Explore the Data Science Universe! Unlock insights and master data skills with hands-on assignments spanning machine learning, visualization, and more. Your journey to becoming a data expert starts here! 🎯💡 DataScienceJourney

anomaly-detection big-data-processing classification clustering computer-vision data-cleaning-and-preprocessing data-visualization deep-learning dimensionality-reduction ensemble-learning exploratory-data-analysis feature-engineering machine-learning model-deployment model-selection-and-evaluation natural-language-processing regression-analysis statistical-analysis time-series-analysis-and-forecasting

Last synced: 17 Jan 2026

https://github.com/sayamalt/steel-energy-consumption-prediction-using-pyspark

A machine learning model built with PySpark that predicts the energy consumption of the steel industry, reaching an R² score of approximately 99.5%.

apache-spark big-data-analytics big-data-processing cross-validation data-visualization exploratory-data-analysis hyperparameter-tuning machine-learning model-training-and-evaluation python regression spark sql

Last synced: 10 Mar 2026

https://github.com/rifat392000/bigdataanalytics

big-data-analytics big-data-processing cloudera-hadoop clustering eclipse google-colab-notebook hadoop-filesystem hadoop-mapreduce hue java-mapreduce pyspark-notebook python3 rdbms sql virtual-machine visualization

Last synced: 03 Feb 2026

https://github.com/latiefdatavisionary/big-data-for-data-science-college-task

big-data big-data-analysis big-data-analytics big-data-for-data-science big-data-management big-data-processing

Last synced: 07 Apr 2026

https://github.com/leonardogemin/bigdatacomputing_unipd

Collection of homework (mostly Spark-based) from the course "Big Data Computing" - University of Padua.

big-data-processing java spark

Last synced: 29 Apr 2026

https://github.com/bilgeswe/bigdatamanagement

Building a Data Pipeline with Lakehouse Architecture on Microsoft Azure Platform

azure azure-pipelines azure-service azure-storage big-data big-data-analytics big-data-processing data-visualization datalake-ingestion dataset kaggle sql uml-diagram

Last synced: 22 Jan 2026

https://github.com/turnipdo/docker-spark-setup

Setting up a Spark cluster in a Docker environment for improved repeatability and reliability. This project includes a simple transformation on a dataset containing approximately 31 million rows.

big-data-processing docker-container setup spark

Last synced: 07 Feb 2026

https://github.com/superminority/jsv

A compact way to represent a stream of similar JSON objects

big-data big-data-processing csv json python python3

Last synced: 17 Mar 2026

https://github.com/srking501/csc8101_coursework

A summative coursework for CSC8101 Engineering for AI

apache-parquet apache-spark azure-databricks big-data big-data-analytics big-data-processing data-science databri databricks-notebooks delta-file nyc-taxi-dataset parquet-files pyspark

Last synced: 12 Feb 2026

https://github.com/rociobenitez/happiness-index-data-processing

Repository for Big Data Processing - Contains Jupyter Notebooks and Datasets for data analysis and processing tasks related to Big Data.

big-data big-data-processing data-analysis data-processing happiness-index happiness-report jupyter-notebook matplotlib pandas seaborn

Last synced: 15 May 2026

https://github.com/lefteris-souflas/redis-mongodb-assignment

Analyzing classified ads data from the used motorcycles market. Tasks involve utilizing Redis Bitmaps for analytics on seller actions and MongoDB for analyzing bike listings. Includes data installation, cleaning, and analysis.

big-data-processing bitmap json mongo-database r redis redis-vs-rdbms-comparison

Last synced: 08 May 2026

https://github.com/khanovico/energy-data-analysis

This is a cloud model that analyzes a real-world dataset with BigQuery and other big-data analysis tools. I implemented a Docker image so the app runs in cross-platform environments.

big-data-processing bigquery docker google-app-engine jupyter-notebook mlflow python scikit-learn seaborn xgboost

Last synced: 17 Feb 2026

https://github.com/tashi-2004/apache-hadoop-spark-hive-cyberanalytics

This project utilizes Apache Hadoop, Hive, and PySpark to process and analyze the UNSW-NB15 dataset, enabling advanced query analysis, machine learning modeling, and visualization. The project demonstrates efficient data ingestion, processing, and predictive analytics for network security insights.

ai apache-hadoop apache-hive big-data-analytics big-data-processing data-analysis data-engineering data-science data-security data-visualization hdfs machine-learning network-analysis network-security pyspark python3 threat-detection unsw-nb15-dataset

Last synced: 02 May 2026
