Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/groda/big_data
Tutorials on Big Data essentials: Hadoop, MapReduce, Spark.
- Host: GitHub
- URL: https://github.com/groda/big_data
- Owner: groda
- License: mit
- Created: 2019-08-27T20:29:46.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2025-01-02T13:58:01.000Z (about 2 months ago)
- Last Synced: 2025-02-06T20:18:10.977Z (12 days ago)
- Topics: apache-sedona, apache-spark, big-data, bigdata, bigtop, docker, gutenberg-ebooks, hadoop, hadoop-cluster, hadoop-hdfs, hadoop-mapreduce, jupyter-notebook, mapreduce, mapreduce-bash, mrjob, pyspark, spark, spark-sql, testdfsio
- Language: Jupyter Notebook
- Homepage:
- Size: 51.9 MB
- Stars: 70
- Watchers: 3
- Forks: 26
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README

# Big Data for beginners
Explore a variety of tutorials and interactive demonstrations focused on Big Data technologies like Hadoop, Spark, and more, primarily presented in the format of Jupyter notebooks. Most notebooks are self-contained, with instructions for installing all required services. They can be run on Google Colab or in a virtual Ubuntu machine/container.
## Setting Up Hadoop: Single-Node Configuration
- **[Hadoop_Setting_up_a_Single_Node_Cluster.ipynb](Hadoop_Setting_up_a_Single_Node_Cluster.ipynb)**
Set up a single-node Hadoop cluster on Google Colab and run some basic HDFS and MapReduce examples
- **[Hadoop_single_node_cluster_setup_Python.ipynb](Hadoop_single_node_cluster_setup_Python.ipynb)**
Set up a single-node Hadoop cluster on Google Colab using Python
- **[Hadoop_minicluster.ipynb](Hadoop_minicluster.ipynb)**
Deploy a test Hadoop cluster with a single command and no need for configuration.
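Under the hood, the single-node (pseudo-distributed) setups in these notebooks come down to a couple of small Hadoop configuration files. A minimal sketch in the spirit of the official Hadoop single-node guide; the port and file paths below are the customary defaults, not values taken from the notebooks:

```xml
<!-- etc/hadoop/core-site.xml: point clients at a local HDFS namenode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml: a single node can only hold one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
</configuration>
```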
## Running Apache Spark in Standalone Mode
- **[Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb](Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb)**
Set up a single-node Spark server on Google Colab and estimate π with a Monte Carlo method
- **[Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb](Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb)**
Set up a single-node Spark server on Google Colab using the Bigtop distribution and utilities, estimate π with a Monte Carlo method, and run a Java ML example.
- **[Run_Spark_on_Google_Colab.ipynb](Run_Spark_on_Google_Colab.ipynb)**
Set up a single-node standalone Spark server on Google Colab, including the Web UI and History Server (compact version)
- **[Spark_Standalone_Architecture_on_Google_Colab.ipynb](Spark_Standalone_Architecture_on_Google_Colab.ipynb)**
Explore the Spark architecture through the immersive experience of deploying a standalone setup.
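The π notebooks above distribute the estimate over Spark, but the underlying Monte Carlo idea is tiny and worth seeing on its own. A plain-Python sketch, with no Spark involved:

```python
import random

def estimate_pi(n_samples, seed=42):
    # Draw points uniformly in the unit square; the fraction landing inside
    # the quarter circle x^2 + y^2 <= 1 converges to pi/4.
    rng = random.Random(seed)
    inside = sum(
        1
        for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples

print(estimate_pi(100_000))  # roughly 3.14 (the exact value depends on the seed)
```

Spark versions of this parallelize the sampling loop across executors; the estimate improves as `n_samples` grows.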
## MapReduce Tutorials
- **[MapReduce_Primer_HelloWorld.ipynb](MapReduce_Primer_HelloWorld.ipynb)**
A MapReduce Primer with “Hello, World!”
- **[MapReduce_Primer_HelloWorld_bash.ipynb](MapReduce_Primer_HelloWorld_bash.ipynb)**
A MapReduce Primer with “Hello, World!” in Bash, in just a few lines of code
- **[mapreduce_with_bash.ipynb](mapreduce_with_bash.ipynb)** An introduction to MapReduce using Hadoop Streaming and bash to create the mapper and reducer
- **[simplest_mapreduce_bash_wordcount.ipynb](simplest_mapreduce_bash_wordcount.ipynb)** A very basic MapReduce wordcount example
- **[mrjob_wordcount.ipynb](mrjob_wordcount.ipynb)** A simple MapReduce job with mrjob
- **[Hadoop_spilling.ipynb](Hadoop_spilling.ipynb)** Hadoop spilling explained

## PySpark Tutorials
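A recurring example across the MapReduce and PySpark notebooks is word count, the “Hello, World!” of MapReduce: a mapper emits `(word, 1)` pairs, the framework sorts and groups them by key, and a reducer sums each group. A minimal pure-Python simulation of that flow (no Hadoop involved):

```python
import itertools

def mapper(line):
    # Emit (word, 1) for every word, like a Hadoop Streaming mapper.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Sum the counts for a single word, like a streaming reducer.
    return word, sum(counts)

def mapreduce_wordcount(lines):
    # Map phase.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle/sort phase: Hadoop sorts and groups by key between map and reduce.
    pairs.sort(key=lambda kv: kv[0])
    grouped = itertools.groupby(pairs, key=lambda kv: kv[0])
    # Reduce phase.
    return dict(reducer(w, (c for _, c in kvs)) for w, kvs in grouped)

print(mapreduce_wordcount(["Hello world", "hello MapReduce"]))
# {'hello': 2, 'mapreduce': 1, 'world': 1}
```

In a real cluster the map and reduce phases run on different machines and the shuffle moves data between them; the logic per record is exactly this small.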
- **[PySpark_On_Google_Colab.ipynb](PySpark_On_Google_Colab.ipynb)**
Explore the inner workings of PySpark on Google Colab
- **[PySpark_miscellanea.ipynb](PySpark_miscellanea.ipynb)**
Tips, tricks, and insights related to PySpark. 
- **[demoSparkSQLPython.ipynb](demoSparkSQLPython.ipynb)** Basic PySpark demo
- **[ngrams_with_pyspark.ipynb](ngrams_with_pyspark.ipynb)**
Basic example of n-grams extraction with PySpark
- **[generate_data_with_Faker.ipynb](generate_data_with_Faker.ipynb)**
Data Generation and Aggregation with Python's Faker Library and PySpark
- **[Encoding+dataframe+columns.ipynb](Encoding+dataframe+columns.ipynb)**
DataFrame Column Encoding with PySpark and Parquet Format
- **[Apache_Sedona_with_PySpark.ipynb](Apache_Sedona_with_PySpark.ipynb)**
Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending the capabilities of Apache Spark for advanced geospatial analytics. Run a basic example with PySpark on Google Colab 
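The n-grams notebook above extracts n-grams with PySpark; stripped of Spark, the core operation is just a sliding window over a token list. A plain-Python sketch for intuition:

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
print(Counter(bigrams).most_common(1))
# [(('to', 'be'), 2)]
```

In PySpark the same windowing runs per partition over a distributed corpus, and the counting becomes a `reduceByKey`-style aggregation.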
## Miscellaneous Tutorials
- **[GutenbergBooks.ipynb](GutenbergBooks.ipynb)**
Explore and download books from the Gutenberg books collection. 
- **[TestDFSio.ipynb](TestDFSio.ipynb)** Demo of TestDFSio for benchmarking Hadoop clusters
- **[Unicode.ipynb](Unicode.ipynb)**
Exploring Unicode categories ([run on Binder](https://mybinder.org/v2/gh/groda/big_data/master?filepath=Unicode.ipynb))
- **[polynomial_regression.ipynb](polynomial_regression.ipynb)**
Worked out example of polynomial regression with numpy and matplotlib
- **[downloadSpark.ipynb](downloadSpark.ipynb)**
How to download and verify the Spark distribution
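The Unicode notebook above explores Unicode character categories; Python's standard `unicodedata` module exposes the same database directly, so the core idea fits in a few lines:

```python
import unicodedata

# Every code point has a two-letter general category, e.g.
# 'Lu' = uppercase letter, 'Nd' = decimal digit, 'Po' = other punctuation.
for ch in "A7π!":
    print(ch, unicodedata.category(ch), unicodedata.name(ch))
# A Lu LATIN CAPITAL LETTER A
# 7 Nd DIGIT SEVEN
# π Ll GREEK SMALL LETTER PI
# ! Po EXCLAMATION MARK
```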
## Virtualization and Cloud Automation
- **[docker_for_beginners.md](docker_for_beginners.md)** Docker for beginners: an introduction to the world of containers
- **[Terraform for beginners.md](terraform_for_beginners.md)** Getting started with Terraform
- **[Terraform in 5 minutes](Terraform%20in%205%20minutes.md)** A short introduction to Terraform, the powerful and popular tool for infrastructure provisioning and management

## Big Data Learning Pathways
- **[online_resources.md](online_resources.md)** Online resources for learning Big Data

# About this repository
## Notebooks Testing and CI
Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through an automated GitHub workflow. The log file for successful executions is [action_log.txt](https://github.com/groda/big_data/blob/master/action_log.txt) (see also: [Google Colab vs. GitHub Ubuntu Runner](Google_Colab_vs_GitHub_ubuntu_runner.ipynb)).
Current status:
- [run-notebooks](https://github.com/groda/big_data/actions/workflows/run-notebooks.yml)
- [run-one-notebook](https://github.com/groda/big_data/actions/workflows/run-one-notebook.yml)

The GitHub workflow is a starting point for what is known as _Continuous Integration_ (CI) in DevOps/Platform Engineering circles.
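The repository's real workflow definitions live under `.github/workflows/` (linked above). A minimal sketch of what a notebook-executing workflow of this kind can look like; the file name and notebook name below are illustrative, not the repository's actual ones:

```yaml
# .github/workflows/run-notebooks.yml (illustrative sketch)
name: run-notebooks
on: [push]
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install jupyter nbconvert
      # Execute the notebook in place and fail the job on any cell error.
      - run: jupyter nbconvert --to notebook --execute example_notebook.ipynb
```

Executing notebooks on every push is what turns a tutorial collection into a continuously tested one: a broken cell surfaces as a red workflow run instead of a reader's bug report.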