https://github.com/groda/big_data

Tutorials on Big Data essentials: Hadoop, MapReduce, Spark. Explore a variety of tutorials and demonstrations on Big Data technologies, primarily in the form of Jupyter notebooks. Most notebooks are self-contained and live, ready to run with a click.



Host: GitHub
URL: https://github.com/groda/big_data
Owner: groda
License: mit
Created: 2019-08-27T20:29:46.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2025-01-02T13:58:01.000Z (6 months ago)
Last Synced: 2025-03-30T19:05:42.150Z (3 months ago)
Topics: apache-sedona, apache-spark, big-data, bigdata, bigtop, docker, gutenberg-ebooks, hadoop, hadoop-cluster, hadoop-hdfs, hadoop-mapreduce, jupyter-notebook, mapreduce, mapreduce-bash, mrjob, pyspark, spark, spark-sql, testdfsio
Language: Jupyter Notebook
Homepage:
Size: 51.9 MB
Stars: 73
Watchers: 2
Forks: 26
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

![big_data](https://socialify.git.ci/groda/big_data/image?description=1&font=Inter&language=1&name=1&owner=1&pattern=Diagonal%20Stripes&stargazers=1&forks=1&theme=Light)

# Big Data for beginners

Explore a variety of tutorials and interactive demonstrations focused on Big Data technologies like Hadoop, Spark, and more, primarily presented in the format of Jupyter notebooks. Most notebooks are self-contained, with instructions for installing all required services. They can be run on Google Colab or in a virtual Ubuntu machine/container.

## Setting Up Hadoop: Single-Node Configuration

  - **[Hadoop_Setting_up_a_Single_Node_Cluster.ipynb](Hadoop_Setting_up_a_Single_Node_Cluster.ipynb)** 

    Set up a single-node Hadoop cluster on Google Colab and run some basic HDFS and MapReduce examples 

  - **[Hadoop_single_node_cluster_setup_Python.ipynb](Hadoop_single_node_cluster_setup_Python.ipynb)**   Set up a single-node Hadoop cluster on Google Colab using Python ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

 - **[Hadoop_minicluster.ipynb](Hadoop_minicluster.ipynb)**   Deploy a test Hadoop Cluster with a single command and no need for configuration. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

   

## Running Apache Spark in Standalone Mode

  - **[Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb](Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb)**   Set up a single-node Spark server on Google Colab and estimate π with a Monte Carlo method

  - **[Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb](Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb)**   Set up a single-node Spark server on Google Colab using the Bigtop distribution and utilities, estimate π with a Monte Carlo method, and run another Java ML example. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

  - **[Run_Spark_on_Google_Colab.ipynb](Run_Spark_on_Google_Colab.ipynb)**   Set up a single-node standalone Spark server on Google Colab including Web UI and History Server - compact version ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

  - **[Spark_Standalone_Architecture_on_Google_Colab.ipynb](Spark_Standalone_Architecture_on_Google_Colab.ipynb)**   Explore the Spark architecture through the immersive experience of deploying a standalone setup. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
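
The Monte Carlo estimate of π mentioned in the notebooks above can be sketched in plain Python, without Spark. This is a minimal illustration and is not taken from the notebooks; the function name `estimate_pi` and its parameters are hypothetical. It draws random points in the unit square and counts the fraction that fall inside the quarter circle of radius 1, which approaches π/4.

```python
import random


def estimate_pi(num_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points uniformly in the unit square.

    Hypothetical helper for illustration only; a Spark version would
    distribute the sampling loop across executors.
    """
    rng = random.Random(seed)  # seeded generator for reproducible runs
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # The point lies inside the quarter circle if x^2 + y^2 <= 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle has area pi/4, so scale the hit ratio by 4.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print(estimate_pi(200_000))
```

The error shrinks roughly with the square root of the number of samples, which is why the computation benefits from being spread over many workers.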

## MapReduce Tutorials

- **[MapReduce_Primer_HelloWorld.ipynb](MapReduce_Primer_HelloWorld.ipynb)**   A MapReduce Primer with “Hello, World!” ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

- **[MapReduce_Primer_HelloWorld_bash.ipynb](MapReduce_Primer_HelloWorld_bash.ipynb)**   A MapReduce Primer with “Hello, World!” in Bash with just a few lines of code ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

- **[mapreduce_with_bash.ipynb](mapreduce_with_bash.ipynb)** An introduction to MapReduce using Hadoop Streaming, with the mapper and reducer written in Bash

- **[simplest_mapreduce_bash_wordcount.ipynb](simplest_mapreduce_bash_wordcount.ipynb)** A very basic MapReduce wordcount example

- **[mrjob_wordcount.ipynb](mrjob_wordcount.ipynb)** A simple MapReduce job with mrjob

- **[Hadoop_spilling.ipynb](Hadoop_spilling.ipynb)** Hadoop spilling explained
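
The word count example that several of these tutorials build on follows the map, shuffle, reduce pattern. The sketch below imitates the three phases in a single Python process. It is an illustration under that assumption, not the code used in the notebooks, and the names `mapper`, `shuffle`, `reducer`, and `word_count` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an iterable of text lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hello world", "Hello Hadoop"]))
```

In a real cluster the framework performs the shuffle between mapper and reducer processes; here a dictionary stands in for it.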

## PySpark Tutorials

- **[PySpark_On_Google_Colab.ipynb](PySpark_On_Google_Colab.ipynb)**   Explore the inner workings of PySpark on Google Colab ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

- **[PySpark_miscellanea.ipynb](PySpark_miscellanea.ipynb)**   Tips, tricks, and insights related to PySpark. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

- **[demoSparkSQLPython.ipynb](demoSparkSQLPython.ipynb)** Basic PySpark demo

- **[ngrams_with_pyspark.ipynb](ngrams_with_pyspark.ipynb)**   Basic example of n-grams extraction with PySpark ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

- **[generate_data_with_Faker.ipynb](generate_data_with_Faker.ipynb)**   Data Generation and Aggregation with Python's Faker Library and PySpark ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

- **[Encoding+dataframe+columns.ipynb](Encoding+dataframe+columns.ipynb)**    DataFrame Column Encoding with PySpark and Parquet Format ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

- **[Apache_Sedona_with_PySpark.ipynb](Apache_Sedona_with_PySpark.ipynb)**    Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending the capabilities of Apache Spark for advanced geospatial analytics. Run a basic example with PySpark on Google Colab ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
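
For readers new to n-grams, the idea behind the n-grams notebook listed above can be shown with a sliding window over a token list. The following is a plain-Python sketch that does not use PySpark and does not reproduce the notebook's code; the function name `ngrams` is hypothetical.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous runs of n tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L contains L - n + 1 windows of size n
    # (none when L < n, because range() of a non-positive number is empty).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(ngrams("to be or not to be".split(), 2))
```

A distributed version would apply the same windowing to each document or line and then aggregate the counts per n-gram.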

  

## Miscellaneous Tutorials

- **[GutenbergBooks.ipynb](GutenbergBooks.ipynb)**   Explore and download books from the Gutenberg books collection.  ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true) 

- **[TestDFSio.ipynb](TestDFSio.ipynb)** Demo of TestDFSIO for benchmarking Hadoop clusters

- **[Unicode.ipynb](Unicode.ipynb)**   [![live on Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/groda/big_data/master?filepath=Unicode.ipynb) Exploring Unicode categories ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true) 

- **[polynomial_regression.ipynb](polynomial_regression.ipynb)**    Worked example of polynomial regression with NumPy and Matplotlib ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

- **[downloadSpark.ipynb](downloadSpark.ipynb)**   How to download and verify the Spark distribution ![recently updated](https://github.com/groda/big_data/blob/master/new3.gif?raw=true)

## Virtualization and Cloud Automation 

  - **[docker_for_beginners.md](docker_for_beginners.md)** Docker for beginners: an introduction to the world of containers

  - **[terraform_for_beginners.md](terraform_for_beginners.md)** Getting started with Terraform

  - **[Terraform in 5 minutes](Terraform%20in%205%20minutes.md)** A short introduction to Terraform, the powerful and popular tool for infrastructure provisioning and management ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)

## Big Data Learning Pathways

- **[online_resources.md](online_resources.md)** Online resources for learning Big Data

# About this repository

## Notebooks Testing and CI

Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through an automated GitHub workflow. Successful executions are logged in [action_log.txt](https://github.com/groda/big_data/blob/master/action_log.txt) (see also: [Google Colab vs. GitHub Ubuntu Runner](Google_Colab_vs_GitHub_ubuntu_runner.ipynb)).

Current status: 

 - [![Run Notebooks on Ubuntu](https://github.com/groda/big_data/actions/workflows/run-notebooks.yml/badge.svg)](https://github.com/groda/big_data/actions/workflows/run-notebooks.yml)

 - [![Run One Notebook on Ubuntu](https://github.com/groda/big_data/actions/workflows/run-one-notebook.yml/badge.svg)](https://github.com/groda/big_data/actions/workflows/run-one-notebook.yml)

The GitHub workflow is a starting point for what is known as _Continuous Integration_ (CI) in DevOps/Platform Engineering circles.
