{"id":30151756,"url":"https://github.com/duaa-a/big-data","last_synced_at":"2025-08-11T11:09:25.599Z","repository":{"id":309278893,"uuid":"1023237442","full_name":"DuaA-A/Big-Data","owner":"DuaA-A","description":"hands-on journey through the Big Data training by NTI. Includes labs, notebooks, and notes on tools like HDFS, Spark, Kafka, Flink, Hive, HBase and more.","archived":false,"fork":false,"pushed_at":"2025-08-10T23:47:57.000Z","size":34471,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-11T01:15:10.518Z","etag":null,"topics":["big-data","elasticsearch","flink-sql","flume-ng","hadoop-cluster","hadoop-hdfs","hdfs","hivebase","kafka","spark","zookeeper"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DuaA-A.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-20T20:13:21.000Z","updated_at":"2025-08-10T23:52:42.000Z","dependencies_parsed_at":"2025-08-11T01:25:17.727Z","dependency_job_id":null,"html_url":"https://github.com/DuaA-A/Big-Data","commit_stats":null,"previous_names":["duaa-a/big-data"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/DuaA-A/Big-Data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DuaA-A%2FBig-Data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DuaA-A%2FBig-Data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DuaA-A%2FBig-Data/releases","ma
nifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DuaA-A%2FBig-Data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DuaA-A","download_url":"https://codeload.github.com/DuaA-A/Big-Data/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DuaA-A%2FBig-Data/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269873158,"owners_count":24488993,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-11T02:00:10.019Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","elasticsearch","flink-sql","flume-ng","hadoop-cluster","hadoop-hdfs","hdfs","hivebase","kafka","spark","zookeeper"],"created_at":"2025-08-11T11:02:28.528Z","updated_at":"2025-08-11T11:09:25.586Z","avatar_url":"https://github.com/DuaA-A.png","language":null,"readme":"\u003chtml lang=\"en\"\u003e\n\u003chead\u003e\n  \u003cmeta charset=\"utf-8\"\u003e\n  \u003cmeta name=\"viewport\" content=\"width=device-width,initial-scale=1\"\u003e\n  \u003cmeta name=\"description\" content=\"README for the Big Data Summer Training — NTI \u0026amp; ITIDA.\" /\u003e\n\u003c/head\u003e\n\u003cbody\u003e\n  \u003cdiv class=\"container\"\u003e\n    \u003ch1\u003eBig Data Training — NTI\u003c/h1\u003e\n    \u003cp\u003eThis repository contains lab work, Jupyter notebooks, and concise notes produced during 
the Big Data Summer Training. It focuses on practical commands, examples, and reusable snippets.\u003c/p\u003e\n    \u003ch2\u003eWhat you'll find here\u003c/h2\u003e\n    \u003cul\u003e\n      \u003cli\u003eJupyter Notebooks \u0026amp; lab exercises (organized by topic folders)\u003c/li\u003e\n      \u003cli\u003eTechnical notes and key takeaways\u003c/li\u003e\n      \u003cli\u003ePractice examples, datasets, and use-case simulations\u003c/li\u003e\n      \u003cli\u003eCommands, configuration snippets, and environment setup\u003c/li\u003e\n    \u003c/ul\u003e\n    \u003ch2\u003eTopics covered\u003c/h2\u003e\n    \u003cul\u003e\n      \u003cli\u003eBig Data Era \u0026amp; Kunpeng Architecture\u003c/li\u003e\n      \u003cli\u003eHDFS + ZooKeeper — distributed storage and cluster coordination\u003c/li\u003e\n      \u003cli\u003eHBase + Hive — NoSQL and distributed data warehousing (SQL-like)\u003c/li\u003e\n      \u003cli\u003eClickHouse — OLAP database for fast analytics\u003c/li\u003e\n      \u003cli\u003eMapReduce + YARN — distributed processing and resource manager\u003c/li\u003e\n      \u003cli\u003eSpark + Flink — batch and stream processing\u003c/li\u003e\n      \u003cli\u003eFlume + Kafka — data ingestion and real-time messaging pipelines\u003c/li\u003e\n      \u003cli\u003eElasticsearch — search and analytics\u003c/li\u003e\n    \u003c/ul\u003e\n    \u003ch2\u003eTools \u0026amp; technologies\u003c/h2\u003e\n    \u003ctable\u003e\n      \u003cthead\u003e\n        \u003ctr\u003e\u003cth\u003eTool / Tech\u003c/th\u003e\u003cth\u003eUse case\u003c/th\u003e\u003c/tr\u003e\n      \u003c/thead\u003e\n      \u003ctbody\u003e\n        \u003ctr\u003e\u003ctd\u003eLinux, SQL, Python\u003c/td\u003e\u003ctd\u003eFoundations for scripting and querying\u003c/td\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003eHDFS\u003c/td\u003e\u003ctd\u003eDistributed data storage\u003c/td\u003e\u003c/tr\u003e\n        
\u003ctr\u003e\u003ctd\u003eHive\u003c/td\u003e\u003ctd\u003eSQL-style querying on big data\u003c/td\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003eHBase\u003c/td\u003e\u003ctd\u003eNoSQL for large-scale datasets\u003c/td\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003eKafka\u003c/td\u003e\u003ctd\u003eReal-time messaging\u003c/td\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003eSpark \u0026amp; Flink\u003c/td\u003e\u003ctd\u003eData processing engines (batch \u0026amp; stream)\u003c/td\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003eClickHouse\u003c/td\u003e\u003ctd\u003eHigh-performance analytics\u003c/td\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003eFlume, Sqoop\u003c/td\u003e\u003ctd\u003eData ingestion from logs and DBs\u003c/td\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003eElasticsearch\u003c/td\u003e\u003ctd\u003eSearch and analytics\u003c/td\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd\u003eZooKeeper\u003c/td\u003e\u003ctd\u003eCluster coordination\u003c/td\u003e\u003c/tr\u003e\n      \u003c/tbody\u003e\n    \u003c/table\u003e\n    \u003ch2\u003eExample commands\u003c/h2\u003e\n    \u003cpre\u003e\u003ccode\u003e# HDFS (pseudo-distributed)\nhdfs namenode -format\nstart-dfs.sh\nstart-yarn.sh\n\n# Kafka (local)\nbin/zookeeper-server-start.sh config/zookeeper.properties \u0026\nbin/kafka-server-start.sh config/server.properties\u003c/code\u003e\u003c/pre\u003e\n    \u003ch2\u003eRepository structure (suggested)\u003c/h2\u003e\n    \u003cpre\u003e\u003ccode\u003e/README.html        ← this file (HTML README)\n/notebooks/          ← Jupyter notebooks organized by topic\n/data/               ← sample datasets (small, non-sensitive)\n/scripts/            ← helper scripts and setup commands\n/notes/              ← short markdown notes and key takeaways\u003c/code\u003e\u003c/pre\u003e\n    \u003ch2\u003eGoal of this repo\u003c/h2\u003e\n    \u003cul\u003e\n      \u003cli\u003ePersonal reference 
and step-by-step notes\u003c/li\u003e\n      \u003cli\u003eComplete recap of the training with runnable examples\u003c/li\u003e\n      \u003cli\u003ePractical showcase of Big Data skills for projects, interviews, or collaborations\u003c/li\u003e\n    \u003c/ul\u003e\n    \u003ch2\u003eLet's connect\u003c/h2\u003e\n    \u003cp\u003eIf you'd like to collaborate or discuss Big Data topics, reach out on LinkedIn or open an issue in this repo.\u003c/p\u003e\n    \u003cp\u003e\u003ca href=\"https://www.linkedin.com/in/duaa-abdelati-abdelazeem\"\u003eConnect with Duaa Abd-Elati on LinkedIn\u003c/a\u003e\u003c/p\u003e\n    \u003cfooter\u003e\n      \u003csmall\u003eMade during the NTI Big Data Summer Training — you may reuse or adapt this README.\u003c/small\u003e\n    \u003c/footer\u003e\n  \u003c/div\u003e\n\u003c/body\u003e\n\u003c/html\u003e\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fduaa-a%2Fbig-data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fduaa-a%2Fbig-data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fduaa-a%2Fbig-data/lists"}