https://github.com/groda/big_data

Tutorials on Big Data essentials: Hadoop, MapReduce, Spark.

apache-sedona apache-spark big-data bigdata bigtop docker gutenberg-ebooks hadoop hadoop-cluster hadoop-hdfs hadoop-mapreduce jupyter-notebook mapreduce mapreduce-bash mrjob pyspark spark spark-sql testdfsio

![big_data](https://socialify.git.ci/groda/big_data/image?description=1&font=Inter&language=1&name=1&owner=1&pattern=Diagonal%20Stripes&stargazers=1&forks=1&theme=Light)

# Big Data for beginners

Explore tutorials and interactive demonstrations of Big Data technologies such as Hadoop and Spark, presented mostly as Jupyter notebooks. Most notebooks are self-contained and include instructions for installing all required services. They can be run on Google Colab or in an Ubuntu virtual machine or container.

## Setting Up Hadoop: Single-Node Configuration
- **[Hadoop_Setting_up_a_Single_Node_Cluster.ipynb](Hadoop_Setting_up_a_Single_Node_Cluster.ipynb)** Set up a single-node Hadoop cluster on Google Colab and run some basic HDFS and MapReduce examples
- **[Hadoop_single_node_cluster_setup_Python.ipynb](Hadoop_single_node_cluster_setup_Python.ipynb)** Set up a single-node Hadoop cluster on Google Colab using Python ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[Hadoop_minicluster.ipynb](Hadoop_minicluster.ipynb)** Deploy a test Hadoop cluster with a single command and no configuration. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

## Running Apache Spark in Standalone Mode
- **[Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb](Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb)** Set up a single-node Spark server on Google Colab and estimate π with a Monte Carlo method
- **[Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb](Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb)** Set up a single-node Spark server on Google Colab using the Bigtop distribution and utilities, estimate π with a Monte Carlo method, and run another Java ML example. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[Run_Spark_on_Google_Colab.ipynb](Run_Spark_on_Google_Colab.ipynb)** Set up a single-node standalone Spark server on Google Colab, including the Web UI and History Server (compact version) ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[Spark_Standalone_Architecture_on_Google_Colab.ipynb](Spark_Standalone_Architecture_on_Google_Colab.ipynb)** Explore the Spark architecture through the immersive experience of deploying a standalone setup. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[PySpark_On_Google_Colab.ipynb](PySpark_On_Google_Colab.ipynb)** Explore the inner workings of PySpark on Google Colab ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
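
For readers new to the Monte Carlo estimate of π mentioned above, here is a minimal plain-Python sketch of the idea. It is not the notebooks' Spark code; the function name `estimate_pi` and its parameters are assumptions chosen for illustration. Points are drawn uniformly in the unit square, and the fraction landing inside the quarter circle approximates π/4.

```python
import random


def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi from the share of random points inside the quarter circle.

    Hypothetical helper for illustration only; the notebooks distribute
    the same sampling step across Spark executors.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # the point lies inside the quarter circle
            inside += 1
    # Quarter-circle area / unit-square area = pi / 4
    return 4.0 * inside / num_samples


print(estimate_pi(200_000))
```

The sampling loop is embarrassingly parallel, which is why this estimate is a common first example for a cluster: each worker can count its own hits and only the counts need to be summed.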

## MapReduce Tutorials
- **[MapReduce_Primer_HelloWorld.ipynb](MapReduce_Primer_HelloWorld.ipynb)** A MapReduce Primer with “Hello, World!” ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[MapReduce_Primer_HelloWorld_bash.ipynb](MapReduce_Primer_HelloWorld_bash.ipynb)** A MapReduce Primer with “Hello, World!” in Bash with just a few lines of code ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[mapreduce_with_bash.ipynb](mapreduce_with_bash.ipynb)** An introduction to MapReduce using MapReduce Streaming and bash to create mapper and reducer
- **[simplest_mapreduce_bash_wordcount.ipynb](simplest_mapreduce_bash_wordcount.ipynb)** A very basic MapReduce wordcount example
- **[mrjob_wordcount.ipynb](mrjob_wordcount.ipynb)** A simple MapReduce job with mrjob
- **[Hadoop_spilling.ipynb](Hadoop_spilling.ipynb)** Hadoop spilling explained
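
The word-count examples above all follow the same map, shuffle, reduce pattern. The sketch below simulates that pattern in a single Python process, assuming whitespace tokenization and lowercasing. The names `mapper`, `reducer`, and `word_count` are hypothetical and are not taken from the notebooks, which run the equivalent steps on Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce phase: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Simulate map, shuffle/sort, and reduce in a single process."""
    # Map: apply the mapper to every input line.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort: bring identical keys together, as the framework would.
    pairs.sort(key=itemgetter(0))
    # Reduce: one reducer call per distinct key.
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )


print(word_count(["hello world", "Hello Hadoop"]))
```

On a real cluster the sort and grouping step is performed by the framework between the two phases, so only the mapper and the reducer need to be written by hand.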

## PySpark Tutorials
- **[demoSparkSQLPython.ipynb](demoSparkSQLPython.ipynb)** Basic PySpark demo
- **[ngrams_with_pyspark.ipynb](ngrams_with_pyspark.ipynb)** Basic example of n-grams extraction with PySpark ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[generate_data_with_Faker.ipynb](generate_data_with_Faker.ipynb)** Data Generation and Aggregation with Python's Faker Library and PySpark ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[Encoding+dataframe+columns.ipynb](Encoding+dataframe+columns.ipynb)** DataFrame Column Encoding with PySpark and Parquet Format ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[Apache_Sedona_with_PySpark.ipynb](Apache_Sedona_with_PySpark.ipynb)** Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending the capabilities of Apache Spark for advanced geospatial analytics. Run a basic example with PySpark on Google Colab ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
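
As a rough illustration of what n-gram extraction means, the following plain-Python sketch slides a window of `n` tokens over a text and counts the results. It assumes whitespace tokenization and is not the notebook's PySpark implementation. The helper names `ngrams` and `top_ngrams` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(text, n, k):
    """Count the n-grams of a whitespace-tokenized text; return the k most common."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n)).most_common(k)


print(ngrams("to be or not to be".split(), 2))
print(top_ngrams("to be or not to be", 2, 1))
```

A text with fewer than `n` tokens simply yields an empty list, so no special case is needed for short inputs.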

## Miscellaneous Tutorials
- **[GutenbergBooks.ipynb](GutenbergBooks.ipynb)** Explore and download books from the Gutenberg books collection. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[TestDFSio.ipynb](TestDFSio.ipynb)** Demo of TestDFSIO for benchmarking Hadoop clusters
- **[Unicode.ipynb](Unicode.ipynb)** [![live on Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/groda/big_data/master?filepath=Unicode.ipynb) Exploring Unicode categories ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[polynomial_regression.ipynb](polynomial_regression.ipynb)** Worked-out example of polynomial regression with NumPy and Matplotlib ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
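
To give a flavour of the Unicode categories topic, here is a small sketch that uses only the standard library's `unicodedata` module. It is an assumption about how such an exploration might begin, not an excerpt from the notebook. The helper name `category_counts` is hypothetical.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Count characters by their two-letter Unicode general category."""
    return Counter(unicodedata.category(ch) for ch in text)


# 'A' is Lu (uppercase letter), 'a' is Ll (lowercase letter),
# '1' is Nd (decimal digit), ' ' is Zs (space separator).
print(category_counts("Aa1 "))
```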

## Virtualization and Cloud Automation
- **[docker_for_beginners.md](docker_for_beginners.md)** Docker for beginners: an introduction to the world of containers
- **[terraform_for_beginners.md](terraform_for_beginners.md)** Getting started with Terraform
- **[Terraform in 5 minutes](Terraform%20in%205%20minutes.md)** A short introduction to Terraform, the powerful and popular tool for infrastructure provisioning and management ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

## Big Data Learning Pathways
- **[online_resources.md](online_resources.md)** Online resources for learning Big Data

# About this repository

## Notebooks Testing and CI

Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through an automated GitHub workflow. Successful executions are logged in [action_log.txt](https://github.com/groda/big_data/blob/master/action_log.txt) (see also: [Google Colab vs. GitHub Ubuntu Runner](Google_Colab_vs_GitHub_ubuntu_runner.ipynb)).

Current status:
- [![Run Notebooks on Ubuntu](https://github.com/groda/big_data/actions/workflows/run-notebooks.yml/badge.svg)](https://github.com/groda/big_data/actions/workflows/run-notebooks.yml)
- [![Run One Notebook on Ubuntu](https://github.com/groda/big_data/actions/workflows/run-one-notebook.yml/badge.svg)](https://github.com/groda/big_data/actions/workflows/run-one-notebook.yml)

The GitHub workflow is a starting point for what is known as _Continuous Integration_ (CI) in DevOps/Platform Engineering circles.