https://github.com/groda/big_data
Tutorials on Big Data essentials: Hadoop, MapReduce, Spark.
- Host: GitHub
- URL: https://github.com/groda/big_data
- Owner: groda
- License: mit
- Created: 2019-08-27T20:29:46.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2025-01-02T13:58:01.000Z (20 days ago)
- Last Synced: 2025-01-08T00:11:25.940Z (14 days ago)
- Topics: apache-sedona, apache-spark, big-data, bigdata, bigtop, docker, gutenberg-ebooks, hadoop, hadoop-cluster, hadoop-hdfs, hadoop-mapreduce, jupyter-notebook, mapreduce, mapreduce-bash, mrjob, pyspark, spark, spark-sql, testdfsio
- Language: Jupyter Notebook
- Homepage:
- Size: 51.9 MB
- Stars: 67
- Watchers: 4
- Forks: 26
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
![big_data](https://socialify.git.ci/groda/big_data/image?description=1&font=Inter&language=1&name=1&owner=1&pattern=Diagonal%20Stripes&stargazers=1&forks=1&theme=Light)
# Big Data for beginners
Explore a variety of tutorials and interactive demonstrations focused on Big Data technologies like Hadoop, Spark, and more, primarily presented in the format of Jupyter notebooks. Most notebooks are self-contained, with instructions for installing all required services. They can be run on Google Colab or in a virtual Ubuntu machine/container.
## Setting Up Hadoop: Single-Node Configuration
- **[Hadoop_Setting_up_a_Single_Node_Cluster.ipynb](Hadoop_Setting_up_a_Single_Node_Cluster.ipynb)**
Set up a single-node Hadoop cluster on Google Colab and run some basic HDFS and MapReduce examples
- **[Hadoop_single_node_cluster_setup_Python.ipynb](Hadoop_single_node_cluster_setup_Python.ipynb)** Set up a single-node Hadoop cluster on Google Colab using Python ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[Hadoop_minicluster.ipynb](Hadoop_minicluster.ipynb)** Deploy a test Hadoop Cluster with a single command and no need for configuration. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
## Running Apache Spark in Standalone Mode
- **[Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb](Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb)** Set up a single-node Spark server on Google Colab and estimate π with a Monte Carlo method
- **[Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb](Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb)** Set up a single-node Spark server on Google Colab using the Bigtop distribution and utilities, estimate π with a Monte Carlo method, and run another Java ML example. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[Run_Spark_on_Google_Colab.ipynb](Run_Spark_on_Google_Colab.ipynb)** Set up a single-node standalone Spark server on Google Colab including Web UI and History Server - compact version ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[Spark_Standalone_Architecture_on_Google_Colab.ipynb](Spark_Standalone_Architecture_on_Google_Colab.ipynb)** Explore the Spark architecture through the immersive experience of deploying a standalone setup. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
## MapReduce Tutorials
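The MapReduce tutorials in this list share one pattern: a map step emits key-value pairs, the framework groups them by key, and a reduce step aggregates each group. The sketch below imitates that flow for word counting in plain Python, with no Hadoop involved. The function names (`mapper`, `shuffle`, `reducer`, `word_count`) are illustrative assumptions and are not taken from the notebooks.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: sum the counts collected for a single word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of text lines."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    print(word_count(["Hello World", "hello Hadoop"]))
```

On a real cluster the three steps run on different machines and the shuffle moves data over the network; here they are ordinary function calls so the data flow is easy to follow.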
- **[MapReduce_Primer_HelloWorld.ipynb](MapReduce_Primer_HelloWorld.ipynb)** A MapReduce Primer with “Hello, World!” ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[MapReduce_Primer_HelloWorld_bash.ipynb](MapReduce_Primer_HelloWorld_bash.ipynb)** A MapReduce Primer with “Hello, World!” in Bash with just a few lines of code ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[mapreduce_with_bash.ipynb](mapreduce_with_bash.ipynb)** An introduction to MapReduce using Hadoop Streaming, with the mapper and reducer written in Bash
- **[simplest_mapreduce_bash_wordcount.ipynb](simplest_mapreduce_bash_wordcount.ipynb)** A very basic MapReduce wordcount example
- **[mrjob_wordcount.ipynb](mrjob_wordcount.ipynb)** A simple MapReduce job with mrjob
- **[Hadoop_spilling.ipynb](Hadoop_spilling.ipynb)** Hadoop spilling explained
## PySpark Tutorials
- **[PySpark_On_Google_Colab.ipynb](PySpark_On_Google_Colab.ipynb)** Explore the inner workings of PySpark on Google Colab ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[PySpark_miscellanea.ipynb](PySpark_miscellanea.ipynb)** Tips, tricks, and insights related to PySpark. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[demoSparkSQLPython.ipynb](demoSparkSQLPython.ipynb)** A basic PySpark demo
- **[ngrams_with_pyspark.ipynb](ngrams_with_pyspark.ipynb)** Basic example of n-grams extraction with PySpark ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[generate_data_with_Faker.ipynb](generate_data_with_Faker.ipynb)** Data Generation and Aggregation with Python's Faker Library and PySpark ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[Encoding+dataframe+columns.ipynb](Encoding+dataframe+columns.ipynb)** DataFrame Column Encoding with PySpark and Parquet Format ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[Apache_Sedona_with_PySpark.ipynb](Apache_Sedona_with_PySpark.ipynb)** Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending the capabilities of Apache Spark for advanced geospatial analytics. Run a basic example with PySpark on Google Colab ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
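
As a small illustration of the n-gram extraction covered in `ngrams_with_pyspark.ipynb`, the sketch below slides a window of size `n` over a token list. It is plain Python rather than PySpark, so it runs without a Spark installation, and it is not the notebook's code; the helper name `ngrams` is an assumption made for this example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows of `tokens` as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows; none if L < n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    # Hypothetical sample sentence, for illustration only.
    words = "to be or not to be".split()
    for gram in ngrams(words, 2):
        print(" ".join(gram))
```

In a Spark job the same windowing would be applied to each row of a tokenized column, so the per-row logic stays this simple while Spark distributes the rows.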
## Miscellaneous Tutorials
- **[GutenbergBooks.ipynb](GutenbergBooks.ipynb)** Explore and download books from the Gutenberg books collection. ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[TestDFSio.ipynb](TestDFSio.ipynb)** Demo of TestDFSio for benchmarking Hadoop clusters
- **[Unicode.ipynb](Unicode.ipynb)** [![live on Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/groda/big_data/master?filepath=Unicode.ipynb) Exploring Unicode categories ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[polynomial_regression.ipynb](polynomial_regression.ipynb)** Worked-out example of polynomial regression with NumPy and Matplotlib ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
- **[downloadSpark.ipynb](downloadSpark.ipynb)** How to download and verify the Spark distribution ![recently updated](https://github.com/groda/big_data/blob/master/new3.gif?raw=true)
## Virtualization and Cloud Automation
- **[docker_for_beginners.md](docker_for_beginners.md)** Docker for beginners: an introduction to the world of containers
- **[Terraform for beginners.md](terraform_for_beginners.md)** Getting started with Terraform
- **[Terraform in 5 minutes](Terraform%20in%205%20minutes.md)** A short introduction to Terraform, the powerful and popular tool for infrastructure provisioning and management ![recently updated](https://github.com/groda/big_data/blob/master/updated.gif?raw=true)
## Big Data Learning Pathways
- **[online_resources.md](online_resources.md)** Online resources for learning Big Data
# About this repository
## Notebooks Testing and CI
Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through an automated GitHub workflow. Successful executions are logged in [action_log.txt](https://github.com/groda/big_data/blob/master/action_log.txt) (see also: [Google Colab vs. GitHub Ubuntu Runner](Google_Colab_vs_GitHub_ubuntu_runner.ipynb)).
Current status:
- [![Run Notebooks on Ubuntu](https://github.com/groda/big_data/actions/workflows/run-notebooks.yml/badge.svg)](https://github.com/groda/big_data/actions/workflows/run-notebooks.yml)
- [![Run One Notebook on Ubuntu](https://github.com/groda/big_data/actions/workflows/run-one-notebook.yml/badge.svg)](https://github.com/groda/big_data/actions/workflows/run-one-notebook.yml)

The GitHub workflow is a starting point for what is known as _Continuous Integration_ (CI) in DevOps/Platform Engineering circles.
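
A CI job that tests notebooks essentially executes each one and fails if any cell raises an error. The sketch below shows one way such a step could be prepared: it collects the notebooks in a directory and builds a `jupyter nbconvert --execute` command for each. It is an assumption about how this could be done, not a copy of the repository's workflow files, and the helper name `nbconvert_commands` is hypothetical.

```python
import shlex
from pathlib import Path


def nbconvert_commands(root="."):
    """Build one `jupyter nbconvert` command per notebook found directly under `root`."""
    notebooks = sorted(Path(root).glob("*.ipynb"))
    return [
        # --execute runs every cell; --inplace writes the outputs back to the same file.
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", str(nb)]
        for nb in notebooks
    ]


if __name__ == "__main__":
    # Only print the commands here. A CI step would pass each list to
    # subprocess.run(cmd, check=True) so that a failing notebook fails the job.
    for cmd in nbconvert_commands():
        print(shlex.join(cmd))
```

Building the commands as argument lists rather than shell strings keeps notebook names with spaces or `+` characters intact.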