{"id":15350559,"url":"https://github.com/maropu/spark-sql-flow-plugin","last_synced_at":"2025-06-25T21:35:59.123Z","repository":{"id":37603387,"uuid":"376813940","full_name":"maropu/spark-sql-flow-plugin","owner":"maropu","description":"Visualize column-level data lineage in Spark SQL","archived":false,"fork":false,"pushed_at":"2022-05-13T00:43:13.000Z","size":739382,"stargazers_count":91,"open_issues_count":3,"forks_count":17,"subscribers_count":5,"default_branch":"spark-3.2","last_synced_at":"2025-04-15T03:13:41.844Z","etag":null,"topics":["data-lineage","graph","graphviz","neo4j","python","scala","spark","sql","visualization"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maropu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-14T12:23:39.000Z","updated_at":"2025-04-09T03:52:36.000Z","dependencies_parsed_at":"2022-07-19T18:09:20.424Z","dependency_job_id":null,"html_url":"https://github.com/maropu/spark-sql-flow-plugin","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/maropu/spark-sql-flow-plugin","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maropu%2Fspark-sql-flow-plugin","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maropu%2Fspark-sql-flow-plugin/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maropu%2Fspark-sql-flow-plugin/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maropu%2Fspark-sql-flow-plugin/manifests","owner_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/owners/maropu","download_url":"https://codeload.github.com/maropu/spark-sql-flow-plugin/tar.gz/refs/heads/spark-3.2","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maropu%2Fspark-sql-flow-plugin/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261958180,"owners_count":23236418,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-lineage","graph","graphviz","neo4j","python","scala","spark","sql","visualization"],"created_at":"2024-10-01T11:58:41.846Z","updated_at":"2025-06-25T21:35:59.099Z","avatar_url":"https://github.com/maropu.png","language":"Scala","readme":"[![License](http://img.shields.io/:license-Apache_v2-blue.svg)](https://github.com/maropu/spark-sql-flow-plugin/blob/master/LICENSE)\n[![Build and test](https://github.com/maropu/spark-sql-flow-plugin/workflows/Build%20and%20test/badge.svg)](https://github.com/maropu/spark-sql-flow-plugin/actions?query=workflow%3A%22Build+and+test%22)\n\nThis tool enables you to easily visualize column-level reference relationships (so-called data lineage) between tables/views stored in Spark SQL.\nThis feature is useful for understanding your data transformation workflow in SQL/DataFrame and for deciding which tables/views should be\ncached and which should not, to optimize the performance/storage trade-off.\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"resources/spark-data-repair-plugin-graphviz.svg\" width=\"800px\"\u003e\u003c/p\u003e\n\nFor instance, the diagram above was generated by this tool and shows the transformation workflow of 
[spark-data-repair-plugin](https://github.com/maropu/spark-data-repair-plugin).\nThe light blue nodes represent cached tables/views/plans; a table/view may be worth caching if it is referenced by more than one plan\n(e.g., the `freq_attr_stats__5937180596123253` view slightly left of center is cached because four plans reference it).\nFor more visualization examples, please see the [tpcds-flow-tests](https://github.com/maropu/spark-sql-flow-plugin/tree/spark-3.2/src/test/resources/tpcds-flow-tests/results) folder,\nwhich includes data lineage for the [TPC-DS](http://www.tpc.org/tpcds/) queries.\n\n## How to Visualize Data Lineage for Your Tables/Views\n\nThe tool has interfaces for Scala and Python. If you use Python, you can generate data lineage as follows:\n\n```\n# You need to check out this repository first\n$ git clone https://github.com/maropu/spark-sql-flow-plugin.git\n$ cd spark-sql-flow-plugin\n\n# NOTE: the script 'bin/python' automatically creates a 'conda' virtual env to install the required\n# modules. 
If you install them by yourself (e.g., pip install -r bin/requirements.txt),\n# you need to define an env var 'CONDA_DISABLED' like 'CONDA_DISABLED=1 ./bin/python'\n$ ./bin/python\n\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /__ / .__/\\_,_/_/ /_/\\_\\   version 3.2.0\n      /_/\n\nUsing Python version 3.7.10 (default, Jun  4 2021 14:48:32)\n\n# Defines some views for this example\n\u003e\u003e\u003e sql(\"CREATE TABLE TestTable (key INT, value INT)\")\n\u003e\u003e\u003e sql(\"CREATE TEMPORARY VIEW TestView1 AS SELECT key, SUM(value) s FROM TestTable GROUP BY key\")\n\u003e\u003e\u003e sql(\"CACHE TABLE TestView1\")\n\u003e\u003e\u003e sql(\"CREATE TEMPORARY VIEW TestView2 AS SELECT t.key, t.value, v.s FROM TestTable t, TestView1 v WHERE t.key = v.key\")\n\n# Generates a Graphviz dot file to represent reference relationships between views\n\u003e\u003e\u003e from sqlflow import save_data_lineage\n\u003e\u003e\u003e save_data_lineage(output_dir_path=\"/tmp/sqlflow-output\")\n\n$ ls /tmp/sqlflow-output\nsqlflow.dot     sqlflow.svg\n```\n\n`sqlflow.dot` is a Graphviz dot file and you can use the `dot` command or [GraphvizOnline](https://dreampuf.github.io/GraphvizOnline)\nto convert it into an image format such as SVG or PNG.\nIf `dot` is already installed on your machine, an SVG-formatted image (`sqlflow.svg` in this example)\nis automatically generated by default:\n\n\u003cimg src=\"resources/graphviz_1.svg\" width=\"850px\"\u003e\n\nIf `contracted` is set to `True` in `save_data_lineage`, it generates a compact form\nof the data lineage diagram that omits plan nodes as follows:\n\n```\n\u003e\u003e\u003e save_data_lineage(output_dir_path=\"/tmp/sqlflow-output\", contracted=True)\n```\n\n\u003cimg src=\"resources/graphviz_2.svg\" width=\"450px\"\u003e\n\nIn Scala, you can use `SQLFlow.saveAsSQLFlow` instead to generate a dot file as follows:\n\n```\n$ ./bin/spark-shell\n\nWelcome to\n      ____      
        __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /__ / .__/\\_,_/_/ /_/\\_\\   version 3.2.0\n      /_/\n\nscala\u003e sql(\"CREATE TABLE TestTable (key INT, value INT)\")\nscala\u003e sql(\"CREATE TEMPORARY VIEW TestView1 AS SELECT key, SUM(value) s FROM TestTable GROUP BY key\")\nscala\u003e sql(\"CACHE TABLE TestView1\")\nscala\u003e sql(\"CREATE TEMPORARY VIEW TestView2 AS SELECT t.key, t.value, v.s FROM TestTable t, TestView1 v WHERE t.key = v.key\")\nscala\u003e import org.apache.spark.sql.flow.SQLFlow\nscala\u003e SQLFlow.saveAsSQLFlow(Map(\"outputDirPath\" -\u003e \"/tmp/sqlflow-output\"))\n```\n\n## Automatic Tracking with Python '@auto_tracking' Decorator\n\nIf you have a set of functions that take and return a `DataFrame` for data transformation,\nthe Python decorator `@auto_tracking` is useful for tracking data lineage automatically:\n\n```\n\u003e\u003e\u003e from pyspark.sql import functions as f\n\u003e\u003e\u003e from sqlflow import auto_tracking, save_data_lineage\n\n\u003e\u003e\u003e @auto_tracking\n... def transform_alpha(df):\n...     return df.selectExpr('id % 3 AS key', 'id % 5 AS value')\n...\n\u003e\u003e\u003e @auto_tracking\n... def transform_beta(df):\n...     return df.groupBy('key').agg(f.expr('collect_set(value)').alias('value'))\n...\n\u003e\u003e\u003e @auto_tracking\n... def transform_gamma(df):\n...     
return df.selectExpr('explode(value)')\n...\n\n# Applies a chain of transformation functions\n\u003e\u003e\u003e transform_gamma(transform_beta(transform_alpha(spark.range(10))))\nDataFrame[col: bigint]\n\n\u003e\u003e\u003e save_data_lineage(output_dir_path='/tmp/sqlflow-output', contracted=True)\n```\n\nThe automatically tracked data lineage is as follows:\n\n\u003cimg src=\"resources/graphviz_3.svg\" width=\"700px\"\u003e\n\n## Exporting Data Lineage into Other Systems\n\n### Adjacency List\n\nMost graph processing libraries (e.g., [Python NetworkX](https://networkx.org))\ncan load an adjacency list file in which each line contains the source node name of an edge\nand its destination node name. To generate one for exporting data lineage into these libraries,\na Python user can set `graph_sink='adjacency_list'` in `save_data_lineage`.\nNote that the output file only contains coarse-grained reference relationships between tables/views/plans\nbecause it is difficult to represent column-level references in an adjacency list.\n\n```\n# NOTE: Valid `graph_sink` values are `graphviz` and `adjacency_list` (`graphviz` by default)\n\u003e\u003e\u003e from sqlflow import save_data_lineage\n\u003e\u003e\u003e save_data_lineage(output_dir_path='/tmp/sqlflow-output', graph_sink='adjacency_list', contracted=False, options='sep=,')\n\n$ cat /tmp/sqlflow-output/sqlflow.lst\ndefault.testtable,Aggregate_4\nProject_3,testview2\nJoin_Inner_2,Project_3\nAggregate_4,testview1\ndefault.testtable,Filter_0\nFilter_1,Join_Inner_2\ntestview1,Filter_1\nFilter_0,Join_Inner_2\n```\n\nTo generate an adjacency list file of data lineage in Scala, you can specify `AdjacencyListSink`\nin `SQLFlow.saveAsSQLFlow` as follows:\n\n```\nscala\u003e import org.apache.spark.sql.flow.sink.AdjacencyListSink\nscala\u003e SQLFlow.saveAsSQLFlow(Map(\"outputDirPath\" -\u003e \"/tmp/sqlflow-output\"), graphSink=AdjacencyListSink(sep = \",\"))\n```\n\nSee the [resources/networkx_example.ipynb](resources/networkx_example.ipynb) 
example for how to load it into Python NetworkX.\n\n### Neo4j Aura (Experimental feature)\n\nTo share data lineage with others, it is useful to export it into [Neo4j Aura](https://neo4j.com/cloud/aura), a fully-managed graph database service:\n\n```\n\u003e\u003e\u003e from sqlflow import export_data_lineage_into\n\u003e\u003e\u003e neo4jaura_options = {'uri': 'neo4j+s://\u003cyour Neo4j database uri\u003e', 'user': '\u003cuser name\u003e', 'passwd': '\u003cpassword\u003e'}\n\u003e\u003e\u003e export_data_lineage_into(\"neo4jaura\", options=neo4jaura_options)\n```\n\nFor instance, the exported data lineage of `spark-data-repair-plugin` (the top-most example)\nis shown below in your Neo4j browser:\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"resources/spark-data-repair-plugin-neo4jaura.svg\" width=\"700px\"\u003e\u003c/p\u003e\n\nTo understand which views are referenced multiple times by plans, you can run the CYPHER query below to list them.\nThe resulting views are candidates for caching.\n\n```\n// Lists the `View` nodes that are referenced multiple times by `Plan` nodes\nMATCH (n:LeafPlan)-[t:transformInto]-\u003e(:Plan)\nWITH n, sum(size(t.dstNodeIds)) as refs\nWHERE refs \u003e= 2\nRETURN n.name\n```\n\nAnother useful example is tracking the most frequently referenced paths; a `transformInto` relationship has the `refCnt` property,\nwhich is the number of references by queries, so you can select frequently-referenced `transformInto` paths with the query below:\n\n```\n// Selects the relationships whose reference count is more than 1\nMATCH p=(s)-[:transformInto*1..]-\u003e(e)\nWHERE (s:LeafPlan OR s:Table OR s:View) AND ALL(r IN relationships(p) WHERE size(r.dstNodeIds) \u003e= 2)\nMATCH (e)-[:transformInto]-\u003e(q:Query)\nRETURN p, q\n```\n\nOther useful CYPHER queries can be found in [resources/README.md](resources/README.md#list-of-other-useful-cypher-queries).\n\n### Writing Your Custom Graph Formatter\n\n`SQLFlow` extracts the column-level 
references of tables/views/plans as a sequence\nof `SQLFlowGraphNode`s and `SQLFlowGraphEdge`s internally.\nTherefore, you can take the extracted references and transform them into a string in your custom format:\n\n```\nscala\u003e import org.apache.spark.sql.flow.{SQLFlow, SQLFlowGraphEdge, SQLFlowGraphNode}\nscala\u003e SQLFlow.printAsSQLFlow(contracted = true,\n     |   (nodes: Seq[SQLFlowGraphNode], edges: Seq[SQLFlowGraphEdge]) =\u003e {\n     |     s\"\"\"\n     |        |List of nodes:\n     |        |${nodes.map(n =\u003e s\" =\u003e $n\").mkString(\"\\n\")}\n     |        |\n     |        |List of edges:\n     |        |${edges.map(e =\u003e s\" =\u003e $e\").mkString(\"\\n\")}\n     |      \"\"\".stripMargin\n     |   })\n\nList of nodes:\n =\u003e name=`default.testtable`(`key`,`value`), type=table, cached=false\n =\u003e name=`testview2`(`key`,`value`,`s`), type=table, cached=false\n =\u003e name=`testview1`(`key`,`s`), type=table, cached=true\n\nList of edges:\n =\u003e from=`default.testtable`(idx=0), to=`testview2`(idx=0)\n =\u003e from=`default.testtable`(idx=1), to=`testview2`(idx=1)\n =\u003e from=`testview1`(idx=0), to=`testview2`(idx=0)\n =\u003e from=`testview1`(idx=1), to=`testview2`(idx=2)\n =\u003e from=`default.testtable`(idx=0), to=`testview1`(idx=0)\n =\u003e from=`default.testtable`(idx=1), to=`testview1`(idx=1)\n```\n\n## Auditing Queries to Generate Data Lineage (Experimental feature)\n\nThe `SQLFlowListener` below incrementally appends the column-level references of a SQL/DataFrame query\nto a specified sink every time a query succeeds:\n\n```\nscala\u003e import org.apache.spark.sql.flow.sink.Neo4jAuraSink\nscala\u003e val sink = Neo4jAuraSink(\"neo4j+s://\u003cyour Neo4j database uri\u003e\", \"\u003cuser name\u003e\", \"\u003cpassword\u003e\")\nscala\u003e spark.sqlContext.listenerManager.register(SQLFlowListener(sink))\n```\n\nTo load the listener when launching a Spark cluster, you can use Spark configurations:\n\n```\n./bin/spark-shell 
--conf spark.sql.queryExecutionListeners=org.apache.spark.sql.flow.Neo4jAuraSQLFlowListener \\\n  --conf spark.sql.flow.Neo4jAuraSink.uri=neo4j+s://\u003cyour Neo4j database uri\u003e \\\n  --conf spark.sql.flow.Neo4jAuraSink.user=\u003cuser name\u003e \\\n  --conf spark.sql.flow.Neo4jAuraSink.password=\u003cpassword\u003e\n```\n\n## Similar Technologies\n\n - Datafold, Column-level lineage, https://www.datafold.com/column-level-lineage\n - SQLFlow, Automated SQL data lineage analysis, https://www.gudusoft.com\n - Prophecy, Spark: Column Level Lineage, https://medium.com/prophecy-io/spark-column-level-lineage-478c1fe4701d\n - Collibra/SQLDep, Introducing Collibra Lineage - Automated Data Lineage, https://www.collibra.com/us/en/blog/introducing-collibra-lineage\n - Tokern, Automate data engineering tasks with column-level data lineage, https://tokern.io\n - SQLLineage, SQL Lineage Analysis Tool powered by Python, https://github.com/reata/sqllineage\n\n## References\n\n - Hui Miao et al., ProvDB: A System for Lifecycle Management of Collaborative Analysis Workflows, arXiv, arXiv:1610.04963, 2016.\n - Joseph M. Hellerstein et al., Ground: A Data Context Service, CIDR, 2017.\n - R. Castro Fernandez et al., Aurum: A Data Discovery System, Proceedings of ICDE, pp.1001-1012, 2018.\n - Alon Halevy et al., Goods: Organizing Google's Datasets, 
Proceedings of SIGMOD, pp.795–806, 2016.\n\n## TODO\n\n * Implements more graph exporters, e.g., for [Apache Atlas](https://atlas.apache.org/#/) and [DataHub](https://datahubproject.io/)/[Acryl Data](https://www.acryldata.io/) ([Issue#3](https://github.com/maropu/spark-sql-flow-plugin/issues/3))\n * Tracks data lineage between tables/views via `INSERT` queries ([Issue#5](https://github.com/maropu/spark-sql-flow-plugin/issues/5))\n * Adds more CYPHER query examples in [resources/README.md](resources/README.md), e.g.,\n   * A CYPHER query to group SQL/DataFrame queries by their similarities (an alternative approach to storing query logs is [spark-query-log-plugin](https://github.com/maropu/spark-query-log-plugin))\n * Supports global temp views\n\n## Bug Reports\n\nIf you hit bugs or have requests, please leave comments on [Issues](https://github.com/maropu/spark-sql-flow-plugin/issues)\nor Twitter ([@maropu](http://twitter.com/#!/maropu)).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaropu%2Fspark-sql-flow-plugin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaropu%2Fspark-sql-flow-plugin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaropu%2Fspark-sql-flow-plugin/lists"}