https://github.com/duaa-a/big-data

A hands-on journey through the Big Data training by NTI. Includes labs, notebooks, and notes on tools like HDFS, Spark, Kafka, Flink, Hive, HBase, and more.

Host: GitHub
URL: https://github.com/duaa-a/big-data
Owner: DuaA-A
Created: 2025-07-20T20:13:21.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-08-10T23:47:57.000Z (10 months ago)
Last Synced: 2025-08-11T01:15:10.518Z (10 months ago)
Topics: big-data, elasticsearch, flink-sql, flume-ng, hadoop-cluster, hadoop-hdfs, hdfs, hivebase, kafka, spark, zookeeper
Homepage:
Size: 32.9 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

Big Data Training — NTI

This repository contains lab work, Jupyter notebooks, and concise notes produced during the Big Data Summer Training. It focuses on practical commands, examples, and reusable snippets.

What you'll find here

Jupyter Notebooks & lab exercises (organized by topic folders)

Technical notes and key takeaways

Practice examples, datasets, and use-case simulations

Commands, configuration snippets, and environment setup

Topics covered

Big Data Era & Kunpeng Architecture

HDFS + ZooKeeper — distributed storage and cluster coordination

HBase + Hive — NoSQL and distributed data warehousing (SQL-like)

ClickHouse — OLAP database for fast analytics

MapReduce + YARN — distributed processing and resource manager

Spark + Flink — batch and stream processing

Flume + Kafka — data ingestion and real-time messaging pipelines

Elasticsearch — search and analytics
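To make the MapReduce entry in the list above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, written as a word count in plain Python. It is not Hadoop code and is not taken from the training material; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, the step a framework performs between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases in order."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, not a dataset from this repository.
    sample = ["spark and flink", "kafka and flume"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions run in parallel on different nodes and the shuffle moves data across the network; here all three phases run in one process so the data flow is easy to follow.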

Tools & technologies

| Tool / Tech | Use case |
| --- | --- |
| Linux, SQL, Python | Foundations for scripting and querying |
| HDFS | Distributed data storage |
| Hive | SQL-style querying on big data |
| HBase | NoSQL for large-scale datasets |
| Kafka | Real-time messaging |
| Spark & Flink | Data processing engines (batch & stream) |
| ClickHouse | High-performance analytics |
| Flume, Sqoop | Data ingestion from logs and DBs |
| Elasticsearch | Search and analytics |
| ZooKeeper | Cluster coordination |

Example commands

    # HDFS (pseudo-distributed)
    hdfs namenode -format
    start-dfs.sh
    start-yarn.sh

    # Kafka (local)
    bin/zookeeper-server-start.sh config/zookeeper.properties &
    bin/kafka-server-start.sh config/server.properties
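The HDFS startup sequence above can also be driven from Python, which is handy inside a notebook. The following is a minimal sketch, not a script from this repository: `HDFS_STARTUP` and `run_steps` are hypothetical names, and the code assumes the Hadoop `bin` and `sbin` directories are on `PATH`. It defaults to a dry run that only returns the parsed commands.

```python
import shlex
import subprocess
from typing import List, Sequence

# Commands copied from the HDFS example above.
# Assumption: Hadoop's bin/ and sbin/ directories are on PATH.
HDFS_STARTUP = [
    "hdfs namenode -format",
    "start-dfs.sh",
    "start-yarn.sh",
]


def run_steps(commands: Sequence[str], dry_run: bool = True) -> List[List[str]]:
    """Parse each command into an argument list and optionally run it.

    With dry_run=True nothing is executed; the parsed plan is only returned.
    With dry_run=False the commands run in order and stop at the first failure.
    """
    plan = [shlex.split(command) for command in commands]
    if not dry_run:
        for argv in plan:
            subprocess.run(argv, check=True)
    return plan


if __name__ == "__main__":
    for argv in run_steps(HDFS_STARTUP):
        print(" ".join(argv))
```

Formatting the NameNode is normally a first-time-only step, so on an existing installation you would likely pass only the last two commands. The Kafka commands are left out on purpose: both start long-running servers, which a blocking `subprocess.run` call does not handle well.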

Repository structure (suggested)

/README.html        ← this file (HTML README)

/notebooks/          ← Jupyter notebooks organized by topic

/data/               ← sample datasets (small, non-sensitive)

/scripts/            ← helper scripts and setup commands

/notes/              ← short markdown notes and key takeaways
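If you want to reproduce the suggested layout above in a fresh clone, a few lines of Python are enough. This is a minimal sketch under the assumption that only the four folders listed above are needed; `SUGGESTED_DIRS` and `scaffold` are hypothetical names, not part of this repository.

```python
from pathlib import Path
from typing import List

# Folder names taken from the suggested structure above.
SUGGESTED_DIRS = ["notebooks", "data", "scripts", "notes"]


def scaffold(root: Path) -> List[Path]:
    """Create the suggested folders under root and return their paths.

    Existing folders are left untouched (exist_ok=True).
    """
    created = []
    for name in SUGGESTED_DIRS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `scaffold(Path("."))` from the repository root creates any folders that are missing and leaves existing ones as they are.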

Goal of this repo

Personal reference and step-by-step notes

Complete recap of the training with runnable examples

Practical showcase of Big Data skills for projects, interviews, or collaborations

Let's connect

If you'd like to collaborate or discuss Big Data topics, reach out on LinkedIn or open an issue in this repo.

LinkedIn: [Duaa Abd-Elati](https://www.linkedin.com/in/duaa-abdelati-abdelazeem)

Made during the NTI Big Data Summer Training — you may reuse or adapt this README.
