https://github.com/duaa-a/big-data
A hands-on journey through the Big Data training by NTI. It includes labs, notebooks, and notes on tools such as HDFS, Spark, Kafka, Flink, Hive, HBase, and more.
big-data elasticsearch flink-sql flume-ng hadoop-cluster hadoop-hdfs hdfs hivebase kafka spark zookeeper
Last synced: 8 months ago
- Host: GitHub
- URL: https://github.com/duaa-a/big-data
- Owner: DuaA-A
- Created: 2025-07-20T20:13:21.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-08-10T23:47:57.000Z (8 months ago)
- Last Synced: 2025-08-11T01:15:10.518Z (8 months ago)
- Topics: big-data, elasticsearch, flink-sql, flume-ng, hadoop-cluster, hadoop-hdfs, hdfs, hivebase, kafka, spark, zookeeper
- Homepage:
- Size: 32.9 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Big Data Training — NTI
This repository contains lab work, Jupyter notebooks, and concise notes produced during the Big Data Summer Training. It focuses on practical commands, examples, and reusable snippets.
What you'll find here
- Jupyter Notebooks & lab exercises (organized by topic folders)
- Technical notes and key takeaways
- Practice examples, datasets, and use-case simulations
- Commands, configuration snippets, and environment setup
Topics covered
- Big Data Era & Kunpeng Architecture
- HDFS + ZooKeeper — distributed storage and cluster coordination
- HBase + Hive — NoSQL and distributed data warehousing (SQL-like)
- ClickHouse — OLAP database for fast analytics
- MapReduce + YARN — distributed processing and resource manager
- Spark + Flink — batch and stream processing
- Flume + Kafka — data ingestion and real-time messaging pipelines
- Elasticsearch — search and analytics
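To make the MapReduce entry in the list above concrete, here is a minimal single-machine sketch of the classic word count using standard Unix tools. It is only an analogy for the map, shuffle, and reduce phases, not Hadoop code, and the function name `wordcount` is a hypothetical choice for illustration.

```shell
#!/usr/bin/env bash
# Single-machine analogy of MapReduce word count (not Hadoop itself).
# `wordcount` is a hypothetical helper name used only for this sketch.

wordcount() {
  # map: split the input into one lowercase word per line
  tr -cs '[:alnum:]' '\n' |
    tr '[:upper:]' '[:lower:]' |
    # shuffle: sorting brings identical keys next to each other
    sort |
    # reduce: count the size of each group of identical keys
    uniq -c |
    # format as "word<TAB>count" and drop empty tokens
    awk 'NF == 2 { print $2 "\t" $1 }'
}
```

It reads text on standard input, for example `wordcount < notes/some-file.md`, where the file path is a placeholder.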
Tools & technologies
| Tool / Tech | Use case |
| --- | --- |
| Linux, SQL, Python | Foundations for scripting and querying |
| HDFS | Distributed data storage |
| Hive | SQL-style querying on big data |
| HBase | NoSQL for large-scale datasets |
| Kafka | Real-time messaging |
| Spark & Flink | Data processing engines (batch & stream) |
| ClickHouse | High-performance analytics |
| Flume, Sqoop | Data ingestion from logs and DBs |
| Elasticsearch | Search and analytics |
| ZooKeeper | Cluster coordination |
Example commands
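# Hadoop: HDFS + YARN (pseudo-distributed)
hdfs namenode -format   # first-time setup only: formatting wipes existing NameNode metadata
start-dfs.sh            # start NameNode, DataNode(s), and SecondaryNameNode
start-yarn.sh           # start ResourceManager and NodeManager(s)
# Kafka (local, ZooKeeper mode; run from the Kafka install directory)
bin/zookeeper-server-start.sh config/zookeeper.properties &   # ZooKeeper in the background
bin/kafka-server-start.sh config/server.properties            # broker in the foreground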
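Once the services above are running, a quick smoke test can confirm they respond. The sketch below is one possible way to do that, not part of the training material. It assumes `hdfs` is on the PATH, that the Kafka commands are run from the Kafka install directory, and that the broker listens on `localhost:9092`. The helper names (`run`, `hdfs_smoke_test`, `kafka_smoke_test`), the `DRY_RUN` switch, and the default path and topic are all hypothetical. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Smoke-test sketch for a local HDFS and Kafka setup.
# All helper names, DRY_RUN, and the default path/topic are hypothetical.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

hdfs_smoke_test() {
  local dir="${1:-/tmp/smoke-test}"
  local sample
  sample="$(mktemp)"
  echo "hello hdfs" > "$sample"
  run hdfs dfs -mkdir -p "$dir"          # create a scratch directory
  run hdfs dfs -put -f "$sample" "$dir/" # upload a small local file
  run hdfs dfs -ls "$dir"                # list it back
  run hdfs dfs -rm -r -skipTrash "$dir"  # clean up
  rm -f "$sample"
}

kafka_smoke_test() {
  local topic="${1:-smoke-test}"
  # assumes a single local broker on localhost:9092
  run bin/kafka-topics.sh --create --if-not-exists --topic "$topic" \
    --partitions 1 --replication-factor 1 \
    --bootstrap-server localhost:9092
  run bin/kafka-topics.sh --describe --topic "$topic" \
    --bootstrap-server localhost:9092
}
```

Calling `hdfs_smoke_test` prints the planned commands; `DRY_RUN=0 hdfs_smoke_test` executes them against the running cluster.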
Repository structure (suggested)
/README.html ← this file (HTML README)
/notebooks/ ← Jupyter notebooks organized by topic
/data/ ← sample datasets (small, non-sensitive)
/scripts/ ← helper scripts and setup commands
/notes/ ← short markdown notes and key takeaways
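The suggested layout above can be scaffolded with a few lines of shell. This is a minimal sketch: the function name `scaffold_repo` is hypothetical, and the `.gitkeep` files are a common community convention for tracking empty directories, not a Git feature.

```shell
#!/usr/bin/env bash
# Create the suggested folder layout under a given root (default: current dir).
# `scaffold_repo` is a hypothetical helper name used only for this sketch.

scaffold_repo() {
  root="${1:-.}"
  for d in notebooks data scripts notes; do
    mkdir -p "$root/$d"
    # placeholder so Git tracks the otherwise empty directory
    touch "$root/$d/.gitkeep"
  done
}
```

Running `scaffold_repo .` from the repository root creates any missing folders and leaves existing ones untouched.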
Goal of this repo
- Personal reference and step-by-step notes
- Complete recap of the training with runnable examples
- Practical showcase of Big Data skills for projects, interviews, or collaborations
Let's connect
If you'd like to collaborate or discuss Big Data topics, reach out on LinkedIn or open an issue in this repo.
LinkedIn: [Duaa Abd-Elati](https://www.linkedin.com/in/duaa-abdelati-abdelazeem)
Made during the NTI Big Data Summer Training — you may reuse or adapt this README.