https://github.com/dwarry/sactest
Simple test project to investigate the Spark Atlas Connector
- Host: GitHub
- URL: https://github.com/dwarry/sactest
- Owner: dwarry
- Created: 2020-01-22T15:53:25.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-02-19T18:01:18.000Z (over 6 years ago)
- Last Synced: 2025-01-24T11:28:58.214Z (over 1 year ago)
- Language: Shell
- Size: 18.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# sactest
Simple test project to investigate the Spark Atlas Connector. This is intended to be a Minimal Reproducible Example of using Spark in the way that our existing ETL scripts do, in order to validate that Atlas can collect the lineage metadata we are expecting.
## Environment
* The test was run on a new cluster with HDP 3.1.4 installed.
* The Spark Atlas Connector was enabled in Ambari and the services were restarted.
* Spark entities did not appear in the Atlas search criteria, so I added the `1100-spark_model.json` file to `/models/1000-Hadoop`, as described in the readme for https://github.com/hortonworks-spark/spark-atlas-connector
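
For anyone reproducing this without Ambari, the connector's readme describes registering its listeners through Spark configuration properties. The following is a minimal sketch of a `spark-submit` argument list that sets those properties explicitly. The helper name `build_submit_args` and the `yarn` master are assumptions made for illustration; this is not necessarily what `submit_load_test_data.sh` does, and the Hive Warehouse Connector options are left out.

```python
# Listener class names as given in the spark-atlas-connector readme.
SAC_TRACKER = "com.hortonworks.spark.atlas.SparkAtlasEventTracker"
SAC_STREAMING_TRACKER = (
    "com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker"
)


def build_submit_args(script, master="yarn"):
    """Return a spark-submit argv list with the SAC listeners registered.

    Hypothetical helper: the master value is an assumption, not taken
    from the repository's own submit script.
    """
    confs = {
        "spark.extraListeners": SAC_TRACKER,
        "spark.sql.queryExecutionListeners": SAC_TRACKER,
        "spark.sql.streaming.streamingQueryListeners": SAC_STREAMING_TRACKER,
    }
    args = ["spark-submit", "--master", master]
    for key, value in confs.items():
        # Each property becomes its own "--conf key=value" pair.
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


if __name__ == "__main__":
    print(" ".join(build_submit_args("load_test_data.py")))
```

Building the command as a list keeps each property a separate argument, so it can be passed to `subprocess.run` without shell quoting.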
## Files
* `Schema.hql`
  Hive DDL file that creates the test database and table.
* `load_test_data.py`
  PySpark job that loads data from a file in HDFS into a DataFrame and saves it to Hive using the Hive Warehouse Connector.
* `submit_load_test_data.sh`
  Shell script that submits the `load_test_data.py` file to run on Spark.
* `test_data.csv`
  A few rows of test data.
* `load_test_data.log`
  The output from running `load_test_data.py`.
## Results
* The output from Spark shows some entries related to Atlas, but no lineage information was captured:
  * No Spark Process entities were created in Atlas.
  * Lineage for the `sac_test` table showed only data being loaded from a temporary file by a `LOAD DATA INPATH` Hive statement.
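
The absence of Spark Process entities can also be checked outside the Atlas UI through the Atlas v2 basic search endpoint. The sketch below only builds the query URL and parses a response body, so it runs without a cluster. The base URL `http://atlas-host:21000` is a placeholder, the type name `spark_process` is assumed to be the one defined by the connector's model file, and both helper names are hypothetical.

```python
import json
from urllib.parse import urlencode


def basic_search_url(atlas_base, type_name, limit=25):
    """Build a basic-search URL for the given entity type."""
    query = urlencode({"typeName": type_name, "limit": limit})
    return f"{atlas_base.rstrip('/')}/api/atlas/v2/search/basic?{query}"


def entity_names(search_response):
    """Return the displayText of each entity in a search response body.

    Assumes the response is JSON text and that an empty result simply
    has no "entities" key.
    """
    payload = json.loads(search_response)
    return [e.get("displayText") for e in payload.get("entities", [])]


if __name__ == "__main__":
    # Placeholder host; substitute the real Atlas server address.
    print(basic_search_url("http://atlas-host:21000", "spark_process"))
    # Illustrative empty response, not output captured from the test cluster.
    print(entity_names('{"queryType": "BASIC"}'))
```

Fetching the URL with an authenticated HTTP client and passing the body to `entity_names` would return an empty list in the situation described above, where no Spark Process entities exist.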