https://github.com/bernhard-42/pyspark-atlas
PySpark for ETL jobs including lineage to Apache Atlas in one script via code inspection
- Host: GitHub
- URL: https://github.com/bernhard-42/pyspark-atlas
- Owner: bernhard-42
- Created: 2017-01-12T19:21:01.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-01-12T19:31:45.000Z (over 8 years ago)
- Last Synced: 2025-02-28T22:35:20.805Z (7 months ago)
- Language: Python
- Size: 415 KB
- Stars: 18
- Watchers: 3
- Forks: 5
- Open Issues: 0
- Metadata Files:
  - Readme: Readme.md
README
## Usage
1) Add Atlas URL and credentials
```python
#
# Atlas Config
#
cluster = "Sandbox"
atlasHost = "localhost:21000"
user = "holger_gov"
password = "holger_gov"
```
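These values are what the script uses to reach the Atlas REST API. As an optional pre-flight check (not part of the repository), the host and credentials can be verified before running the job; the `/api/atlas/admin/version` endpoint and HTTP basic auth used below are assumptions about a standard Atlas setup:

```python
# Hypothetical pre-flight check (not in spark-etl.py): confirm that Atlas is
# reachable with the configured host and credentials before running the job.
import requests

resp = requests.get("http://%s/api/atlas/admin/version" % atlasHost,
                    auth=(user, password))    # HTTP basic auth (assumed)
resp.raise_for_status()                       # fails fast on a wrong host or credentials
print(resp.json())                            # version info returned by Atlas
```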
2) Add some metadata describing the ETL job

```python
#
# (1) Metadata for the etl job
#
sourceDB = "employees"
sourceTables = ("employees", "departments", "dept_emp")
targetDB = "default"
targetTable = "emp_dept_flat"
targetColumns = ["dept_no:string", "emp_no:int", "full_name:string", "from_date:string", "to_date:string", "dept_name:string"]
usedSparkPackages = []
owner = "etl"
```
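Atlas typically identifies Hive tables by a qualified name of the form `db.table@cluster`; this is how the metadata above ties the job to the source and target `hive_table` entities. A small illustrative sketch, not taken from the repository:

```python
# Illustrative only: derive Atlas-style qualified names ("db.table@cluster")
# for the source and target tables from the metadata above.
sourceQualifiedNames = ["%s.%s@%s" % (sourceDB, t, cluster) for t in sourceTables]
# -> ['employees.employees@Sandbox', 'employees.departments@Sandbox', 'employees.dept_emp@Sandbox']
targetQualifiedName = "%s.%s@%s" % (targetDB, targetTable, cluster)
# -> 'default.emp_dept_flat@Sandbox'
```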
3) Create a function `etl` that holds all the ETL code that should be copied into Atlas as the description of the transformation. Use the metadata from step 2 to avoid inconsistencies.

```python
#
# (2) This is the actual ETL Spark function and its source will be copied into ATLAS Lineage.
#
# time, concat, lit and sqlContext are defined in the surrounding spark-etl.py
def etl():
    def saveToHiveAsORC(df, database, table):
        tempTable = "%s_tmp_%d" % (table, time.time())
        df.registerTempTable(tempTable)
        sqlContext.sql("create table %s.%s stored as ORC as select * from %s" % (database, table, tempTable))
        sqlContext.dropTempTable(tempTable)

    employees   = sqlContext.sql("select * from %s.%s" % (sourceDB, sourceTables[0]))    # -> employees.employees
    departments = sqlContext.sql("select * from %s.%s" % (sourceDB, sourceTables[1]))    # -> employees.departments
    dept_emp    = sqlContext.sql("select * from %s.%s" % (sourceDB, sourceTables[2]))    # -> employees.dept_emp

    emp_dept_flat = employees.withColumn("full_name", concat(employees["last_name"], lit(", "), employees["first_name"]))
    emp_dept_flat = emp_dept_flat.select("full_name", "emp_no").join(dept_emp, "emp_no").join(departments, "dept_no")
    saveToHiveAsORC(emp_dept_flat, targetDB, targetTable)
    return emp_dept_flat
```

For the rest of the script, see [spark-etl.py](spark-etl.py).
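The "code inspection" mentioned in the project description boils down to reading the source of `etl()` and attaching it to an Atlas process entity that links the source tables to the target table. A minimal sketch of the idea, assuming a generic `Process` entity; the actual payload built in spark-etl.py may differ:

```python
# Sketch of the "code inspection" step: capture the source of etl() and use it
# as the description of an Atlas process entity linking inputs and outputs.
import inspect

etlSource = inspect.getsource(etl)            # the function body shown above, as a string

processEntity = {
    "typeName": "Process",                    # assumed; spark-etl.py may register its own type
    "attributes": {
        "name": "%s.%s etl" % (targetDB, targetTable),
        "owner": owner,
        "description": etlSource,             # the lineage description is the ETL code itself
    },
}
# processEntity would then be posted to the Atlas REST API together with
# references to the source and target hive_table entities.
```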
## Result:
1) Lineage Graph
2) ETL Process properties
