https://github.com/duaa-a/big-data

A hands-on journey through the Big Data training by NTI. Includes labs, notebooks, and notes on tools like HDFS, Spark, Kafka, Flink, Hive, HBase, and more.

big-data elasticsearch flink-sql flume-ng hadoop-cluster hadoop-hdfs hdfs hivebase kafka spark zookeeper





Big Data Training — NTI


This repository contains lab work, Jupyter notebooks, and concise notes produced during the Big Data Summer Training. It focuses on practical commands, examples, and reusable snippets.


What you'll find here



  • Jupyter Notebooks & lab exercises (organized by topic folders)

  • Technical notes and key takeaways

  • Practice examples, datasets, and use-case simulations

  • Commands, configuration snippets, and environment setup


Topics covered



  • Big Data Era & Kunpeng Architecture

  • HDFS + ZooKeeper — distributed storage and cluster coordination

  • HBase + Hive — NoSQL and distributed data warehousing (SQL-like)

  • ClickHouse — OLAP database for fast analytics

  • MapReduce + YARN — distributed processing and resource manager

  • Spark + Flink — batch and stream processing

  • Flume + Kafka — data ingestion and real-time messaging pipelines

  • Elasticsearch — search and analytics
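As a rough illustration of the MapReduce model listed above, the sketch below runs the map, shuffle, and reduce phases of a word count inside a single Python process. It is not taken from the training labs: the function names are hypothetical, and a real job would be distributed across a Hadoop cluster under YARN rather than run in memory.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Chain the three phases (hypothetical helper, in-memory only)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data big ideas", "data pipelines"]
    print(word_count(sample))
```

Keeping the three phases as separate functions mirrors how the framework splits the work: only the map and reduce logic is user code, while the grouping step is handled by the platform.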


Tools & technologies




| Tool / Tech | Use case |
|---|---|
| Linux, SQL, Python | Foundations for scripting and querying |
| HDFS | Distributed data storage |
| Hive | SQL-style querying on big data |
| HBase | NoSQL for large-scale datasets |
| Kafka | Real-time messaging |
| Spark & Flink | Data processing engines (batch & stream) |
| ClickHouse | High-performance analytics |
| Flume, Sqoop | Data ingestion from logs and DBs |
| Elasticsearch | Search and analytics |
| ZooKeeper | Cluster coordination |


Example commands


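# HDFS (pseudo-distributed)
# Format the NameNode on first setup only; reformatting erases existing HDFS metadata.
hdfs namenode -format
start-dfs.sh    # starts the NameNode, DataNode, and SecondaryNameNode
start-yarn.sh   # starts the ResourceManager and NodeManager

# Kafka (local, ZooKeeper mode; run from the Kafka installation directory)
# ZooKeeper must be running before the broker starts.
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties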
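Before running the HDFS and YARN commands above, it can help to confirm that the Hadoop executables are actually on your `PATH`. The following is a minimal sketch using only the Python standard library. The helper name `missing_tools` and the `REQUIRED_TOOLS` tuple are assumptions made for illustration, not part of the training material.

```python
import shutil

# Executables used by the HDFS/YARN commands above; adjust for your setup.
REQUIRED_TOOLS = ("hdfs", "start-dfs.sh", "start-yarn.sh")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All Hadoop commands found.")
```

If anything is reported as missing, the usual cause is that the Hadoop `bin` and `sbin` directories have not been added to `PATH`.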


Repository structure (suggested)


/README.html  ← this file (HTML README)
/notebooks/   ← Jupyter notebooks organized by topic
/data/        ← sample datasets (small, non-sensitive)
/scripts/     ← helper scripts and setup commands
/notes/       ← short markdown notes and key takeaways
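If you want to scaffold the suggested layout in a fresh clone, a short script along these lines would do it. This is only a sketch: `create_layout` and `SUGGESTED_DIRS` are hypothetical names, and the demo writes to a temporary directory rather than to a real repository.

```python
import tempfile
from pathlib import Path

# Folder names taken from the suggested structure above.
SUGGESTED_DIRS = ("notebooks", "data", "scripts", "notes")


def create_layout(root):
    """Create the suggested folders under `root` and return their paths."""
    root = Path(root)
    created = []
    for name in SUGGESTED_DIRS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing is written to a real repo.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_layout(tmp):
            print(folder.name)
```

To use it on an actual checkout, pass the repository root instead of the temporary directory, for example `create_layout(".")`.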

Goal of this repo



  • Personal reference and step-by-step notes

  • Complete recap of the training with runnable examples

  • Practical showcase of Big Data skills for projects, interviews, or collaborations


Let's connect


If you'd like to collaborate or discuss Big Data topics, reach out on LinkedIn or open an issue in this repo.


Connect on LinkedIn: [Duaa Abd-Elati](https://www.linkedin.com/in/duaa-abdelati-abdelazeem)

Made during the NTI Big Data Summer Training — you may reuse or adapt this README.