https://github.com/bernhard-42/pyspark-atlas
PySpark for ETL jobs including lineage to Apache Atlas in one script via code inspection
- Host: GitHub
- URL: https://github.com/bernhard-42/pyspark-atlas
- Owner: bernhard-42
- Created: 2017-01-12T19:21:01.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-01-12T19:31:45.000Z (over 8 years ago)
- Last Synced: 2025-02-28T22:35:20.805Z (7 months ago)
- Language: Python
- Size: 415 KB
- Stars: 18
- Watchers: 3
- Forks: 5
- Open Issues: 0
- Metadata Files:
  - Readme: Readme.md
README
## Usage
1) Add Atlas URL and credentials
```python
#
# Atlas Config
#
cluster = "Sandbox"
atlasHost = "localhost:21000"
user = "holger_gov"
password = "holger_gov"
```
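These values are what the script uses to reach the Atlas REST API. As an optional pre-flight check (not part of the repository), the host and credentials can be verified before running the job; the `/api/atlas/admin/version` endpoint and HTTP basic auth used below are assumptions about a standard Atlas setup:

```python
# Hypothetical pre-flight check (not in spark-etl.py): confirm that Atlas is
# reachable with the configured host and credentials before running the job.
import requests

resp = requests.get("http://%s/api/atlas/admin/version" % atlasHost,
                    auth=(user, password))    # HTTP basic auth (assumed)
resp.raise_for_status()                       # fails fast on a wrong host or credentials
print(resp.json())                            # version info returned by Atlas
```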
2) Add some metadata describing the ETL job

```python
#
# (1) Metadata for the etl job
#
sourceDB = "employees"
sourceTables = ("employees", "departments", "dept_emp")
targetDB = "default"
targetTable = "emp_dept_flat"
targetColumns = ["dept_no:string", "emp_no:int", "full_name:string", "from_date:string", "to_date:string", "dept_name:string"]
usedSparkPackages = []
owner = "etl"
```
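Atlas typically identifies Hive tables by a qualified name of the form `db.table@cluster`; this is how the metadata above ties the job to the source and target `hive_table` entities. A small illustrative sketch, not taken from the repository:

```python
# Illustrative only: derive Atlas-style qualified names ("db.table@cluster")
# for the source and target tables from the metadata above.
sourceQualifiedNames = ["%s.%s@%s" % (sourceDB, t, cluster) for t in sourceTables]
# -> ['employees.employees@Sandbox', 'employees.departments@Sandbox', 'employees.dept_emp@Sandbox']
targetQualifiedName = "%s.%s@%s" % (targetDB, targetTable, cluster)
# -> 'default.emp_dept_flat@Sandbox'
```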
3) Create a function `etl` that holds all the ETL code that should be copied into Atlas as the description of the transformation. Use the metadata from step 2 to avoid inconsistencies.

```python
#
# (2) This is the actual ETL Spark function and its source will be copied into ATLAS Lineage.
#
# time, concat, lit and sqlContext are defined in the surrounding spark-etl.py
def etl():
    def saveToHiveAsORC(df, database, table):
        tempTable = "%s_tmp_%d" % (table, time.time())
        df.registerTempTable(tempTable)
        sqlContext.sql("create table %s.%s stored as ORC as select * from %s" % (database, table, tempTable))
        sqlContext.dropTempTable(tempTable)

    employees   = sqlContext.sql("select * from %s.%s" % (sourceDB, sourceTables[0]))    # -> employees.employees
    departments = sqlContext.sql("select * from %s.%s" % (sourceDB, sourceTables[1]))    # -> employees.departments
    dept_emp    = sqlContext.sql("select * from %s.%s" % (sourceDB, sourceTables[2]))    # -> employees.dept_emp

    emp_dept_flat = employees.withColumn("full_name", concat(employees["last_name"], lit(", "), employees["first_name"]))
    emp_dept_flat = emp_dept_flat.select("full_name", "emp_no").join(dept_emp, "emp_no").join(departments, "dept_no")
    saveToHiveAsORC(emp_dept_flat, targetDB, targetTable)
    return emp_dept_flat
```

For the rest of the script, see [spark-etl.py](spark-etl.py).
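The "code inspection" mentioned in the project description boils down to reading the source of `etl()` and attaching it to an Atlas process entity that links the source tables to the target table. A minimal sketch of the idea, assuming a generic `Process` entity; the actual payload built in spark-etl.py may differ:

```python
# Sketch of the "code inspection" step: capture the source of etl() and use it
# as the description of an Atlas process entity linking inputs and outputs.
import inspect

etlSource = inspect.getsource(etl)            # the function body shown above, as a string

processEntity = {
    "typeName": "Process",                    # assumed; spark-etl.py may register its own type
    "attributes": {
        "name": "%s.%s etl" % (targetDB, targetTable),
        "owner": owner,
        "description": etlSource,             # the lineage description is the ETL code itself
    },
}
# processEntity would then be posted to the Atlas REST API together with
# references to the source and target hive_table entities.
```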
## Result:
1) Lineage Graph
2) ETL Process properties
