https://github.com/kmohamedalie/big-data-hadoop-spark-lab
Big Data🛢️ with Hadoop🐘 and Spark⭐ lab🧪🥼
- Host: GitHub
- URL: https://github.com/kmohamedalie/big-data-hadoop-spark-lab
- Owner: Kmohamedalie
- License: mit
- Created: 2024-03-09T19:04:42.000Z (10 months ago)
- Default Branch: master
- Last Pushed: 2024-07-30T01:50:07.000Z (5 months ago)
- Last Synced: 2024-11-09T01:13:26.751Z (2 months ago)
- Topics: big-data, coursera, data-engineering, docker, hadoop, ibm, kubernetes, spark
- Language: Jupyter Notebook
- Homepage:
- Size: 43.1 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# **Data Engineering with Spark⭐ and Hadoop🐘**
### **Big Data🛢️ with Hadoop🐘 and Spark⭐, part of the [IBM Data Engineering Professional Certificate](https://www.coursera.org/learn/introduction-to-big-data-with-spark-hadoop/home/module/1)**
### **[Apache Hadoop 🐘](https://hadoop.apache.org/)** :
![image](https://github.com/Kmohamedalie/Big-Data-Hadoop-Spark-lab/assets/63104472/b991829a-56b0-466a-85aa-5ab6b9280a6e)
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
[Hands-on Lab: Getting Started with Hive](https://github.com/Kmohamedalie/Data-Engineering/tree/master/Hands-on%20Lab%3A%20Getting%20Started%20with%20Hive): Hive is data warehouse software within the Hadoop ecosystem, designed for reading, writing, and managing large tabular datasets and for analyzing them.
[Hands-on lab on Hadoop Map-Reduce](https://github.com/Kmohamedalie/IBM-Hadoop-Spark-lab/tree/master/Hands-on%20Lab%3A%20Hadoop%20MapReduce): MapReduce is a programming pattern that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, it is the heart of Apache Hadoop. MapReduce is both a processing technique and a programming model for distributed computing, and Hadoop's implementation of it is written in Java. A distributed computing system has multiple components located on different machines; each component has its own job, but the components communicate with each other so that they appear as one system to the end user. The MapReduce algorithm consists of two main tasks, Map and Reduce. Many MapReduce programs are written in Java, but they can also be coded in C++, Python, Ruby, R, and other languages.
[Hands-on lab on Hadoop Cluster](https://github.com/Kmohamedalie/IBM-Hadoop-Spark-lab/tree/master/Hands-on%20lab%20on%20Hadoop%20Cluster): A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets. The NameNode is the master node of the Hadoop Distributed File System (HDFS). It keeps the metadata of the files in RAM for quick access.
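
To make the Map and Reduce tasks concrete, here is a minimal single-process sketch of a word count in plain Python. It is not the lab's code and it does not use Hadoop; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are hypothetical. In a real cluster the framework runs the map and reduce steps in parallel on many nodes and performs the shuffle for you.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input; each string stands in for one line of a file.
    sample = ["big data with hadoop", "big data with spark"]
    print(word_count(sample))
```

Because each `map_phase` call looks at one line only and each `reduce_phase` call looks at one key only, both steps can be spread across machines, which is where the scalability described above comes from.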
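
As a rough illustration of the kind of metadata the master node keeps in memory, the toy model below maps each file path to its blocks and to the nodes holding each block's replicas. Everything here is an assumption for illustration: `ToyNameNode`, the round-robin placement, and the node names are hypothetical, and real HDFS uses rack-aware placement and far richer metadata. The 128 MB block size and replication factor of 3 are the usual HDFS defaults.

```python
import math

BLOCK_SIZE_MB = 128  # common HDFS default block size
REPLICATION = 3      # common HDFS default replication factor


class ToyNameNode:
    """Hypothetical in-memory map: file path -> [(block_id, [holder nodes])]."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_map = {}
        self._next_block = 0

    def add_file(self, path, size_mb):
        """Split a file into blocks and record which nodes hold each replica."""
        n_blocks = max(1, math.ceil(size_mb / BLOCK_SIZE_MB))
        replicas = min(REPLICATION, len(self.datanodes))
        blocks = []
        for _ in range(n_blocks):
            # Simplified round-robin placement (an assumption, not HDFS policy).
            start = self._next_block % len(self.datanodes)
            holders = [
                self.datanodes[(start + i) % len(self.datanodes)]
                for i in range(replicas)
            ]
            blocks.append((self._next_block, holders))
            self._next_block += 1
        self.block_map[path] = blocks
        return blocks

    def locate(self, path):
        """Answer a client lookup purely from memory, with no disk access."""
        return self.block_map[path]


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    namenode.add_file("/data/logs.csv", size_mb=300)
    for block_id, holders in namenode.locate("/data/logs.csv"):
        print(block_id, holders)
```

The lookup never touches the data itself: clients ask the master node where blocks live, then read the blocks from the worker nodes directly.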
### **[Apache Spark⭐](https://spark.apache.org/)** :
![image](https://github.com/Kmohamedalie/Big-Data-Hadoop-Spark-lab/assets/63104472/331e54b4-b021-47cd-9a0e-8641cc256e53)
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. Its versatility, which spans distributed data processing, real-time streaming, and in-memory processing, makes it a powerful choice for a wide range of data processing tasks.
[Hands-on Lab: Getting Started with Spark using Python](https://github.com/Kmohamedalie/Big-Data-Hadoop-Spark-lab/tree/master/Hands-on%20Lab%3A%20Getting%20Started%20with%20Spark%20using%20Python): Introduction to Spark.
[Hands-on Lab: Introduction to DataFrames](https://github.com/Kmohamedalie/Big-Data-Hadoop-Spark-lab/blob/master/Hands-on%20Lab%3A%20Introduction%20to%20DataFrames/DataFrames.ipynb): A DataFrame is a two-dimensional, table-like structure whose columns can have different data types. DataFrames accept many data inputs, including Series and other DataFrames. In pandas, you can also pass an index (row labels) and columns (column labels); index values can be numbers, dates, or strings/tuples.
[Hands-On Lab: Introduction to SparkSQL](https://github.com/Kmohamedalie/Big-Data-Hadoop-Spark-lab/blob/master/Hands-On%20Lab%3A%20Introduction%20to%20SparkSQL/SparkSQL.ipynb): Spark SQL is a Spark module for structured data processing. It lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API.
[Submit Apache Spark Applications Lab](https://github.com/Kmohamedalie/Big-Data-Hadoop-Spark-lab/blob/master/Hands-on%20Lab%3A%20Submit%20Apache%20Spark%20Applications/Spark%20Application.pdf): In this lab, you will learn how to submit Apache Spark applications from a Python script. Docker Compose makes this exercise straightforward.
[Apache Spark Monitoring and Debugging](https://github.com/Kmohamedalie/Big-Data-Hadoop-Spark-lab/tree/master/Apache%20Spark%20Monitoring%20and%20Debugging): Practice monitoring and debugging a Spark application through the web UI.
### **[Practice Project Overview](https://github.com/Kmohamedalie/Big-Data-Hadoop-Spark-lab/blob/master/Practice%20Project/FinalAssignment.ipynb)**
This practice project focuses on data transformation and integration using PySpark. You will work with two datasets and perform various transformations, such as adding columns, renaming columns, dropping unnecessary columns, and joining DataFrames, before writing the results into a Hive warehouse and an HDFS file system.