Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hieuung/spark-learning
Repository for learning PySpark for ETL
- Host: GitHub
- URL: https://github.com/hieuung/spark-learning
- Owner: hieuung
- Created: 2024-02-29T09:07:21.000Z (10 months ago)
- Default Branch: master
- Last Pushed: 2024-03-13T04:09:39.000Z (10 months ago)
- Last Synced: 2024-03-13T05:25:40.283Z (10 months ago)
- Topics: data-pipeline, etl-pipeline, self-learning
- Language: Python
- Homepage:
- Size: 1010 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# Spark
## Local installation
- Download the latest [Spark release](https://www.apache.org/dyn/closer.lua/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz)
- Extract the archive
```sh
export SPARK_TAR=spark-3.5.0-bin-hadoop3.tgz
export SPARK_FOLDER=spark-3.5.0-bin-hadoop3
tar xvf $SPARK_TAR
sudo mv $SPARK_FOLDER /opt/spark
```
- Add Spark to your PATH
```sh
nano ~/.bashrc
```
- Add the lines below to the file
```sh
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
```
- Reload your shell configuration
```sh
source ~/.bashrc
```
- Confirm the installation
```sh
pyspark
```
- Submit a job
```sh
spark-submit ./apps/rdd.py
```

## Add external jars to SPARK_HOME
```sh
cp ./apps/resources/external_jars/* /opt/spark/jars
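# Copying into SPARK_HOME is one option; jars can also be supplied per job
# with spark-submit's --jars flag instead (the jar name below is illustrative):
#   spark-submit --jars ./apps/resources/external_jars/some_dependency.jar ./apps/rdd.py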
```

## Application using Spark
> **_NOTE:_** Read the `readme` in the `/apps` directory