https://github.com/bernhard-42/spark-etl-atlas
A small project to show how to add lineage to Atlas when using Spark as ETL tool
- Host: GitHub
- URL: https://github.com/bernhard-42/spark-etl-atlas
- Owner: bernhard-42
- Created: 2016-11-29T09:04:59.000Z
- Default Branch: master
- Last Pushed: 2016-11-29T09:26:15.000Z
- Language: Jupyter Notebook
## Spark as ETL
If we use Spark as an ETL tool, no lineage information will be written to Atlas. Assume we want to combine three Hive tables into one denormalized flat table. The Spark code could look like this (more details in [spark-etl.scala](spark-etl.scala)):
```scala
import org.apache.spark.sql.functions.{concat, lit}

// Temp table name: the value is an assumption, see spark-etl.scala for the actual one
val tempTable = "employees_flat_tmp"

val employees = sqlContext.sql("select * from employees.employees")
val departments = sqlContext.sql("select * from employees.departments")
val dept_emp = sqlContext.sql("select * from employees.dept_emp")

val flat = employees.withColumn("full_name", concat(employees("last_name"), lit(", "), employees("first_name"))).
  select("full_name", "emp_no").
  join(dept_emp, "emp_no").
  join(departments, "dept_no")

flat.registerTempTable(tempTable)
sqlContext.sql(s"create table default.employees_flat3 stored as ORC as select * from ${tempTable}")
```

## Adding Lineage
From a lineage perspective, "employees_flat3" is derived from the other three tables.
The Jupyter notebook [StoreSparkLineage.ipynb](StoreSparkLineage.ipynb) shows how to add this information to Atlas using the REST API.
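
To give a rough idea of what such a REST call carries, here is a minimal sketch in Python that only builds the JSON payload. It is not taken from the notebook and rests on several assumptions: the Atlas v2 entity format, the built-in `Process` type with `inputs` and `outputs`, the usual `db.table@cluster` form of a `hive_table` qualified name, and the hypothetical names `mycluster` and `spark_etl_employees_flat3`. The notebook may well use a different API version or a custom process type.

```python
import json

# Hypothetical cluster name; replace it with the one Atlas uses for your Hive tables.
CLUSTER = "mycluster"


def hive_table_ref(db, table, cluster=CLUSTER):
    """Reference an existing hive_table entity by its qualifiedName."""
    return {
        "typeName": "hive_table",
        "uniqueAttributes": {"qualifiedName": f"{db}.{table}@{cluster}"},
    }


def build_lineage_entity(name, inputs, outputs, cluster=CLUSTER):
    """Build a Process entity whose inputs and outputs link the tables."""
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "qualifiedName": f"{name}@{cluster}",
                "name": name,
                "inputs": inputs,
                "outputs": outputs,
            },
        }
    }


# The three source tables feed the flat target table.
payload = build_lineage_entity(
    name="spark_etl_employees_flat3",  # hypothetical process name
    inputs=[
        hive_table_ref("employees", t)
        for t in ("employees", "departments", "dept_emp")
    ],
    outputs=[hive_table_ref("default", "employees_flat3")],
)

print(json.dumps(payload, indent=2))
```

Under the same assumptions, the payload would be sent with an authenticated HTTP POST to the entity endpoint of the Atlas server (`/api/atlas/v2/entity` in the v2 API). The referenced Hive tables must already exist in Atlas for the references to resolve.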
### Lineage Graph

### Atlas view of the Spark Processor
