An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with big-data-processing

A curated list of projects in awesome lists tagged with big-data-processing.

https://github.com/souvik-databricks/dlt-with-debug

A lightweight helper utility that lets developers build pipelines interactively by keeping a single source for both DLT runs and non-DLT interactive notebook runs.

big-data big-data-processing databricks delta-live-tables dlt etl etl-pipeline python3 spark

Last synced: 10 Sep 2025

https://github.com/airscholar/flinkcommerce

This repository contains an Apache Flink application for real-time sales analytics. It uses Docker Compose to orchestrate the necessary infrastructure components, including Apache Flink, Elasticsearch, and Postgres.

apache-flink big-data big-data-processing python realtime-streaming

Last synced: 08 Oct 2025

https://github.com/akardapolov/dimension-db

Hybrid time-series and block-column storage database engine written in Java

big-data-processing column-store dbms java sql time-series

Last synced: 23 Feb 2026

https://github.com/anirban166/big-data-ft.-genomics

Analysis, organization and querying of large genomic datasets using C++, Monsoon and various data structures.

big-data-processing bioinformatics data-structures-and-algorithms genomic-sequences

Last synced: 18 Mar 2025

https://github.com/mileristovski/dataengineer-sparkstreaming

Track a Boat is a real-time maritime tracking system built on Kafka, Spark Structured Streaming, and WebSockets. It lets you visualize vessel positions, analyze their trajectories, and predict their destinations on an interactive map.

big-data-processing distributed-computing docker docker-compose kafka kafka-topics maritime-data pipeline python scala spark-structured-streaming websocket

Last synced: 08 Oct 2025

https://github.com/adnanrahin/nfl-big-data-bowl-2022

The 2022 Big Data Bowl data contains Next Gen Stats player tracking, play, game, player, and PFF scouting data for all 2018-2020 Special Teams plays. Here, you'll find a summary of each data set in the 2022 Big Data Bowl, a list of key variables to join on, and a description of each variable.

big-data big-data-processing rdd scala spark spark-sql

Last synced: 30 Oct 2025

https://github.com/sayamalt/steel-energy-consumption-prediction-using-pyspark

A machine learning model built with PySpark that predicts the energy consumption of the steel industry, reaching an R² score of approximately 99.5%.

apache-spark big-data-analytics big-data-processing cross-validation data-visualization exploratory-data-analysis hyperparameter-tuning machine-learning model-training-and-evaluation python regression spark sql

Last synced: 06 Dec 2025

https://github.com/turnipdo/docker-spark-setup

Setting up a Spark cluster in a Docker environment for improved repeatability and reliability. This project includes a simple transformation on a dataset containing approximately 31 million rows.

big-data-processing docker-container setup spark

Last synced: 07 Feb 2026

https://github.com/lefteris-souflas/redis-mongodb-assignment

Analyzing classified ads data from the used motorcycles market. Tasks involve utilizing Redis Bitmaps for analytics on seller actions and MongoDB for analyzing bike listings. Includes data installation, cleaning, and analysis.

big-data-processing bitmap json mongo-database r redis redis-vs-rdbms-comparison

Last synced: 02 Mar 2025

https://github.com/khanovico/energy-data-analysis

A cloud-based model that analyzes a real-world dataset with BigQuery and other big-data analysis tools. I implemented a Docker image for running this app in cross-platform environments.

big-data-processing bigquery docker google-app-engine jupyter-notebook mlflow python scikit-learn seaborn xgboost

Last synced: 17 Feb 2026

https://github.com/rociobenitez/happiness-index-data-processing

Repository for Big Data Processing - Contains Jupyter Notebooks and Datasets for data analysis and processing tasks related to Big Data.

big-data big-data-processing data-analysis data-processing happiness-index happiness-report jupyter-notebook matplotlib pandas seaborn

Last synced: 26 Jun 2025

https://github.com/leonardogemin/bigdatacomputing_unipd

Collection of homework (mostly Spark-based) from the course "Big Data Computing" - University of Padua.

big-data-processing java spark

Last synced: 05 Mar 2025

https://github.com/adi3042/data_science

📊🚀 Explore the Data Science Universe! Unlock insights and master data skills with hands-on assignments spanning machine learning, visualization, and more. Your journey to becoming a data expert starts here! 🎯💡 DataScienceJourney

anomaly-detection big-data-processing classification clustering computer-vision data-cleaning-and-preprocessing data-visualization deep-learning dimensionality-reduction ensemble-learning exploratory-data-analysis feature-engineering machine-learning model-deployment model-selection-and-evaluation natural-language-processing regression-analysis statistical-analysis time-series-analysis-and-forecasting

Last synced: 17 Jan 2026

https://github.com/tashi-2004/apache-hadoop-spark-hive-cyberanalytics

This project utilizes Apache Hadoop, Hive, and PySpark to process and analyze the UNSW-NB15 dataset, enabling advanced query analysis, machine learning modeling, and visualization. The project demonstrates efficient data ingestion, processing, and predictive analytics for network security insights.

ai apache-hadoop apache-hive big-data-analytics big-data-processing data-analysis data-engineering data-science data-security data-visualization hdfs machine-learning network-analysis network-security pyspark python3 threat-detection unsw-nb15-dataset

Last synced: 05 Apr 2025