https://github.com/hieuung/spark-learning

Repository for learning Pyspark for ETL

# Spark

## Local installation
- Download a [Spark](https://www.apache.org/dyn/closer.lua/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz) release (the commands below assume 3.5.0)
- Extract the archive and move it to `/opt/spark`
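```sh
export SPARK_TAR=spark-3.5.0-bin-hadoop3.tgz
export SPARK_FOLDER=spark-3.5.0-bin-hadoop3
# Unpack the archive in the current directory
tar xvf "$SPARK_TAR"
# Move the unpacked folder to /opt/spark (requires root)
sudo mv "$SPARK_FOLDER" /opt/spark
```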
- Add Spark to your `PATH` by editing `~/.bashrc`

```sh
nano ~/.bashrc
```

- Add the lines below to the file

```sh
export SPARK_HOME=/opt/spark
export PATH="$PATH:$SPARK_HOME/bin"
```

- Reload the shell configuration so the changes take effect

```sh
source ~/.bashrc
```

- Confirm the installation by starting the PySpark shell (exit with `Ctrl-D`)
```sh
pyspark
```
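
The `pyspark` command opens an interactive shell, which is awkward to use from a script. As a complement, here is a minimal sketch of a non-interactive check, assuming Python 3 is available. The function name `check_spark_install` is hypothetical and not part of this repository. It only inspects `SPARK_HOME` and `PATH` and does not start Spark.

```python
import os
import shutil


def check_spark_install(env=None):
    """Report what the Spark setup looks like from the given environment.

    Hypothetical helper: it only looks at SPARK_HOME and PATH.
    """
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    path = env.get("PATH", "")
    return {
        "spark_home": spark_home,
        # True only if SPARK_HOME is set and points at an existing directory
        "spark_home_exists": bool(spark_home) and os.path.isdir(spark_home),
        # Full path of each launcher if it is found on PATH, otherwise None
        "pyspark": shutil.which("pyspark", path=path),
        "spark_submit": shutil.which("spark-submit", path=path),
    }


if __name__ == "__main__":
    for key, value in check_spark_install().items():
        print(f"{key}: {value}")
```

If `pyspark` or `spark_submit` is reported as `None`, the `PATH` line in `~/.bashrc` has probably not been loaded in the current shell.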

- Submit a job
```sh
spark-submit ./apps/rdd.py
```
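
The contents of `./apps/rdd.py` are not shown here. For orientation, a job that `spark-submit` can run might look like the sketch below: a word count over a text file using the RDD API. The file name `word_count.py`, the function names, and the input-path argument are all hypothetical and not taken from this repository.

```python
import sys
from operator import add


def tokenize(line):
    """Split one line of text into lowercase words."""
    return line.lower().split()


def count_words(sc, path):
    """Count word occurrences in a text file using the RDD API."""
    return (
        sc.textFile(path)             # RDD of lines
        .flatMap(tokenize)            # RDD of words
        .map(lambda word: (word, 1))  # RDD of (word, 1) pairs
        .reduceByKey(add)             # sum the counts per word
        .collect()                    # bring the results back to the driver
    )


def main(path):
    # Imported here so the helpers above can be used without Spark
    # on the Python path (assumption: the job is launched by spark-submit).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
    try:
        for word, count in sorted(count_words(spark.sparkContext, path)):
            print(word, count)
    finally:
        spark.stop()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: spark-submit word_count.py <input.txt>")
```

Saved as `word_count.py`, it would be submitted the same way as above, with the input file as an extra argument: `spark-submit word_count.py ./some_text_file.txt`.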

## Add external jars to SPARK_HOME

``` sh
cp ./apps/resources/external_jars/* /opt/spark/jars
```
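
Copying jars into `/opt/spark/jars` makes them available to every job. As an alternative for a single job, `spark-submit` accepts a `--jars` option with a comma-separated list of jar files. The sketch below builds such a command from the jar directory used above. The helper name `submit_command` is hypothetical, and the script only prints the command rather than running it.

```python
import glob
import os
import shlex


def submit_command(app, jar_dir):
    """Build a spark-submit command that ships every .jar in jar_dir with the job.

    Hypothetical helper; returns the command as a list of arguments.
    """
    jars = sorted(glob.glob(os.path.join(jar_dir, "*.jar")))
    command = ["spark-submit"]
    if jars:
        # --jars expects one comma-separated value
        command += ["--jars", ",".join(jars)]
    return command + [app]


if __name__ == "__main__":
    # Paths as used elsewhere in this README
    print(shlex.join(submit_command("./apps/rdd.py", "./apps/resources/external_jars")))
```

The returned list can also be passed directly to `subprocess.run(command, check=True)` to launch the job from Python.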

## Application using Spark

> **_NOTE:_** Read the `readme` in the `/apps` directory.