# spark-etl-atlas

A small project to show how to add lineage to Atlas when using Spark as an ETL tool.

https://github.com/bernhard-42/spark-etl-atlas

## Spark as ETL
If we use Spark as an ETL tool, no lineage information is written to Atlas.

Assume we want to combine three Hive tables into one denormalized flat table. The Spark code could look like this (more details in [spark-etl.scala](spark-etl.scala)):

```scala
import org.apache.spark.sql.functions.{concat, lit}

// Read the three source Hive tables
val employees = sqlContext.sql("select * from employees.employees")
val departments = sqlContext.sql("select * from employees.departments")
val dept_emp = sqlContext.sql("select * from employees.dept_emp")

// Denormalize: build a full_name column and join employees to their departments
val flat = employees.
  withColumn("full_name", concat(employees("last_name"), lit(", "), employees("first_name"))).
  select("full_name", "emp_no").
  join(dept_emp, "emp_no").
  join(departments, "dept_no")

// Materialize the result as an ORC table via a temporary table
val tempTable = "employees_flat3_tmp"   // illustrative name; see spark-etl.scala for the full script
flat.registerTempTable(tempTable)
sqlContext.sql(s"create table default.employees_flat3 stored as ORC as select * from ${tempTable}")
```

## Adding Lineage

From a lineage perspective, `employees_flat3` is derived from the other three tables.

The Jupyter notebook [StoreSparkLineage.ipynb](StoreSparkLineage.ipynb) shows how to add this information to Atlas using the REST API.
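
The core idea is to register the Spark job itself as an Atlas `Process` entity whose inputs are the three source tables and whose output is the new flat table. The sketch below illustrates this in Python against Atlas's v2 REST API; the host, credentials, cluster name and the use of the plain `Process` type are assumptions made here for illustration, and the actual notebook may target a different Atlas version or a custom process type.

```python
# Minimal sketch: register the Spark ETL job as an Atlas Process entity.
# Host, credentials, cluster name and the "Process" type are assumptions.
import requests

ATLAS_URL = "http://localhost:21000/api/atlas/v2"   # assumed Atlas host/port
AUTH = ("admin", "admin")                            # assumed credentials
CLUSTER = "Sandbox"                                  # assumed cluster name used in qualifiedNames

def hive_table_ref(db, table):
    """Reference an existing hive_table entity by its unique qualifiedName."""
    return {"typeName": "hive_table",
            "uniqueAttributes": {"qualifiedName": f"{db}.{table}@{CLUSTER}"}}

# Model the Spark job as a Process entity: inputs are the three source tables,
# the output is the flat table created by the job.
process = {
    "entity": {
        "typeName": "Process",                       # or a custom spark process type
        "attributes": {
            "name": "spark-etl employees_flat3",
            "qualifiedName": f"spark-etl.employees_flat3@{CLUSTER}",
            "inputs": [hive_table_ref("employees", t)
                       for t in ("employees", "departments", "dept_emp")],
            "outputs": [hive_table_ref("default", "employees_flat3")],
        },
    }
}

resp = requests.post(f"{ATLAS_URL}/entity", json=process, auth=AUTH)
resp.raise_for_status()
print(resp.json())
```

Once such a Process entity exists, Atlas connects it to the referenced `hive_table` entities via their qualifiedNames, which is what produces the lineage graph below.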

### Lineage Graph

![Lineage.png](Lineage.png)

### Atlas view of the Spark Process

![Spark Process in Atlas](Spark%20Process%20in%20Atlas.png)