https://github.com/dimajix/spark-training
Repository used for Spark Trainings
- Host: GitHub
- URL: https://github.com/dimajix/spark-training
- Owner: dimajix
- Created: 2015-12-28T15:08:55.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2023-04-21T20:46:45.000Z (about 3 years ago)
- Last Synced: 2025-04-01T12:49:24.621Z (about 1 year ago)
- Topics: hadoop, hadoop-training, hive, pyspark, python, scala, spark, spark-ml, spark-streaming, spark-training, sqoop
- Language: Jupyter Notebook
- Homepage: http://www.dimajix.de
- Size: 9 MB
- Stars: 53
- Watchers: 4
- Forks: 66
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
# Spark Training Repository
This repository contains examples, exercises and tutorials for the Spark and Hadoop trainings run by dimajix.
The latest version is always available on GitHub at
https://github.com/dimajix/spark-training
## Contents
The repository contains several types of documents:
* Source Code for Spark/Scala
* Jupyter Notebooks for PySpark
* Zeppelin Notebooks for Spark/Scala
* Hive SQL scripts
* Pig scripts
* ...and much more
## External Dependencies
Some notebooks require test data that dimajix provides on S3 at `s3://dimajix-training/data/`.
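
As a rough sketch of how a notebook might address this test data: outside Amazon EMR, Spark usually reads S3 through Hadoop's S3A connector, which expects the `s3a://` scheme. Only the bucket path below comes from this README. The helper names, the file name `example.csv`, and the assumption that a file is CSV with a header row are hypothetical and not taken from the repository.

```python
from urllib.parse import urlsplit

# Base location of the training data, as stated in this README.
TRAINING_DATA = "s3://dimajix-training/data/"


def to_s3a(uri: str) -> str:
    """Rewrite an s3:// URI to the s3a:// scheme used by Hadoop's S3A connector."""
    parts = urlsplit(uri)
    if parts.scheme not in ("s3", "s3a"):
        raise ValueError(f"not an S3 URI: {uri}")
    return f"s3a://{parts.netloc}{parts.path}"


def data_path(relative: str, base: str = TRAINING_DATA) -> str:
    """Join a relative name onto the training data prefix (hypothetical helper)."""
    return to_s3a(base.rstrip("/") + "/" + relative.lstrip("/"))


def load_csv(spark, relative: str):
    """Read a CSV file with an existing SparkSession.

    Assumes pyspark is installed, S3 access is configured, and the file
    has a header row; none of this is confirmed by the README.
    """
    return spark.read.option("header", True).csv(data_path(relative))
```

In a notebook with a running session, a call could look like `load_csv(spark, "example.csv")`, where `example.csv` stands in for whichever file the exercise actually names.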
## Building Executables
The source code can be built with Maven by running

    mvn install

from the root directory.
## Running Examples
Most code is provided either as interactive notebooks (Jupyter and/or Zeppelin) or as compilable programs. Programs
that build jar files always come with start scripts, which set the required environment variables and Spark
configuration properties.
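
The repository's start scripts are not reproduced here, so the following is only a sketch of what such a wrapper typically does: collect Spark configuration properties into `--conf key=value` arguments for `spark-submit` and prepare the environment for the child process. The jar path, class name, and property values are hypothetical placeholders.

```python
import os
import shlex


def build_spark_submit(jar, main_class, master="local[*]", conf=None, app_args=()):
    """Assemble a spark-submit argument list from configuration properties."""
    cmd = ["spark-submit", "--master", master, "--class", main_class]
    # Each Spark configuration property becomes one --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(jar)
    cmd.extend(app_args)
    return cmd


def build_env(extra=None):
    """Copy the current environment and overlay the variables a job needs."""
    env = dict(os.environ)
    env.update(extra or {})
    return env


# Hypothetical values for illustration only; not taken from the repository.
cmd = build_spark_submit(
    jar="target/example.jar",
    main_class="com.example.WordCount",
    conf={"spark.executor.memory": "2g", "spark.sql.shuffle.partitions": "8"},
    app_args=["input.txt"],
)
print(shlex.join(cmd))
```

To actually launch the job, the list can be passed to `subprocess.run(cmd, env=build_env({...}), check=True)`. Building the command as a list rather than a single string avoids shell quoting problems with values such as `local[*]`.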