{"id":13941306,"url":"https://github.com/stonezhong/DataManager","last_synced_at":"2025-07-20T04:31:11.274Z","repository":{"id":44487617,"uuid":"305167041","full_name":"stonezhong/DataManager","owner":"stonezhong","description":"Better organize data in data lake and build ETL pipeline with Web UI tool.","archived":false,"fork":false,"pushed_at":"2021-04-26T11:32:40.000Z","size":2446,"stargazers_count":10,"open_issues_count":3,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-14T03:40:37.189Z","etag":null,"topics":["datalake","datawarehouse","etl","spark","sparksql"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stonezhong.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-10-18T18:15:51.000Z","updated_at":"2025-06-04T13:35:24.000Z","dependencies_parsed_at":"2022-09-16T14:28:11.332Z","dependency_job_id":null,"html_url":"https://github.com/stonezhong/DataManager","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/stonezhong/DataManager","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stonezhong%2FDataManager","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stonezhong%2FDataManager/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stonezhong%2FDataManager/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stonezhong%2FDataManager/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stonezhong","download_url":"https://codel
oad.github.com/stonezhong/DataManager/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stonezhong%2FDataManager/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266067257,"owners_count":23871324,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datalake","datawarehouse","etl","spark","sparksql"],"created_at":"2024-08-08T02:01:16.192Z","updated_at":"2025-07-20T04:31:08.659Z","avatar_url":"https://github.com/stonezhong.png","language":"JavaScript","funding_links":[],"categories":["JavaScript"],"sub_categories":[],"readme":"# Indexes\n* [Brief](#Brief)\n* [Overall Working Model](#overall-working-model)\n* [Supported Platforms](#supported-platforms)\n* [Features](#Features)\n    * [Cloud Provider Agnostic](#cloud-provider-agnostic)\n    * [Create Data Pipeline with UI](#create-data-pipeline-with-ui)\n    * [Schedule Pipelines with UI](#schedule-pipelines-with-ui)\n    * [Data Catalog](#data-catalog)\n    * [Data Lineage](#data-lineage)\n    * [Pipeline / Asset dependency](#pipeline-asset-dependency)\n    * [View Schema for Dataset](#view-schema-for-dataset)\n    * [Assets are immutable](#assets-are-immutable)\n    * [Decouple asset’s name from storage location](#decouple-assets-name-from-storage-location)\n\n# Brief\n\nData Manager is a one-stop shop for managing an Apache Spark based data lake. It covers the needs of a data team that uses Apache Spark for batch processing. However, data visualization and notebooks (such as JupyterHub) are not included in Data Manager. 
For more information, please read the [Data Manager wiki pages](https://github.com/stonezhong/DataManager/wiki).\n\n# Overall Working Model\n![Pipeline Execution Diagram](https://raw.githubusercontent.com/stonezhong/DataManager/master/docs/images/pipeline_diagram.png)\n\n# Supported Platforms\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\n            \u003cimg\n                src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Apache_Spark_logo.svg/1200px-Apache_Spark_logo.svg.png\"\n                width=\"120px\"\n            /\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eYou set up your own Apache Spark cluster.\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\n            \u003cimg src=\"https://miro.medium.com/max/700/1*qgkjkj6BLVS1uD4mw_sTEg.png\" width=\"120px\" /\u003e\n        \u003c/td\u003e\n        \u003ctd\u003e\n            Use the \u003ca href=\"https://pypi.org/project/pyspark/\"\u003ePySpark\u003c/a\u003e package, which is fully compatible with the other Spark platforms and lets you test your pipeline on a single computer.\n        \u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\n            \u003cimg src=\"https://databricks.com/wp-content/uploads/2019/02/databricks-generic-tile.png\" width=\"120px\"\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eYou host your Spark cluster in \u003ca href=\"https://databricks.com/\"\u003eDatabricks\u003c/a\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\n            \u003cimg\n                src=\"https://blog.ippon.tech/content/images/2019/06/emrlogogo.png\"\n                width=\"120px\"\n            /\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eYou host your Spark cluster in \u003ca href=\"https://aws.amazon.com/emr/\"\u003eAmazon AWS EMR\u003c/a\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\n            \u003cimg\n                
src=\"https://d15shllkswkct0.cloudfront.net/wp-content/blogs.dir/1/files/2020/07/100-768x402.jpeg\"\n                width=\"120px\"\n            /\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eYou host your Spark cluster in \u003ca href=\"https://cloud.google.com/dataproc\"\u003eGoogle Cloud\u003c/a\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\n            \u003cimg\n                src=\"https://apifriends.com/wp-content/uploads/2018/05/HDInsightsDetails.png\"\n                width=\"120px\"\n            /\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eYou host your Spark cluster in \u003ca href=\"https://azure.microsoft.com/en-us/services/hdinsight/\"\u003eMicrosoft Azure HDInsight\u003c/a\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\n            \u003cimg\n                src=\"https://cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-e4aebca50425/d3598759-8045-4b7f-9619-0fed901a9e0b/File/a35b11e3f02caf5d5080e48167cf320c/1_xtt86qweroeeldhjroaaaq.png\"\n                width=\"120px\"\n            /\u003e\n        \u003c/td\u003e\n        \u003ctd\u003e\n            You host your Spark cluster in \u003ca href=\"https://www.oracle.com/big-data/data-flow/\"\u003eOracle Cloud Infrastructure, Data Flow Service\u003c/a\u003e\n        \u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\n            \u003cimg\n                src=\"https://upload.wikimedia.org/wikipedia/commons/2/24/IBM_Cloud_logo.png\"\n                width=\"120px\"\n            /\u003e\n        \u003c/td\u003e\n        \u003ctd\u003eYou host your Spark cluster in \u003ca href=\"https://www.ibm.com/products/big-data-and-analytics\"\u003eIBM Cloud\u003c/a\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n# Features\n\n## Cloud Provider Agnostic\nData Manager supports many Apache Spark providers, including AWS EMR, Microsoft Azure HDInsight, etc. 
For the complete list, see [readme.md](https://github.com/stonezhong/DataManager).\n\n* Pipelines and Data Applications sit on top of an abstraction layer that hides the cloud provider details.\n* Moving your Data Manager based data lake from one cloud provider to another requires only a configuration change; no code change is required.\n\nWe achieve this by introducing a “job deployer” and a “job submitter” via the Python package [spark-etl](https://github.com/stonezhong/spark_etl).\n* By using “spark_etl.vendors.oracle.DataflowDeployer” and “spark_etl.vendors.oracle.DataflowJobSubmitter”, you can run Data Manager in OCI Data Flow. Here is an [example](https://github.com/stonezhong/spark_etl/blob/master/examples/config_oci.json).\n* By using “spark_etl.vendors.local.LocalDeployer” and “spark_etl.vendors.local.PySparkJobSubmitter”, you can run Data Manager with PySpark on your laptop. Here is an [example](https://github.com/stonezhong/spark_etl/blob/master/examples/config_local.json).\n* By using \"spark_etl.deployers.HDFSDeployer\" and \"spark_etl.job_submitters.livy_job_submitter.LivyJobSubmitter\", you can run Data Manager with your on-premises Spark cluster. Here is an [example](https://github.com/stonezhong/spark_etl/blob/master/examples/config_hdfs.json).\n\nEven if you use a public cloud for your production data lake, you can test your pipeline and data application using PySpark on your laptop, with downscaled sample data and exactly the same code.\n\n## Create Data Pipeline with UI\nNon-programmers can create a pipeline via the UI. A pipeline can have multiple “tasks”; each task either executes a series of Spark-SQL statements or launches a Data Application.\n\n![alt text](http://data-manager-docs.s3-website-us-west-2.amazonaws.com/pipeline-ui.png \"Pipeline UI\")\n\n## Schedule Pipelines with UI\nUsers can schedule the execution of pipelines with the UI. 
A scheduler executes all the pipelines related to its category. When these pipelines are invoked, they use the shared context defined in the scheduler.\n\n![alt text](http://data-manager-docs.s3-website-us-west-2.amazonaws.com/scheduler-ui.png \"Scheduler UI\")\n\n## Data Catalog\nData Manager provides a data catalog that tracks all datasets and assets. The data catalog has a REST API for clients to query and register datasets and assets.\n\n## Data Lineage\n\nData Manager keeps track of data lineage information. When you view an asset, you will see:\n* The actual SQL statement that produces this asset, if the asset is produced by a Spark-SQL task\n* The application name and arguments, if the asset is produced by a data application\n* Upstream assets: the list of assets used when producing this asset\n* Downstream assets: the list of assets that require this asset when they are produced\n\nWith this information, you understand where data comes from, where it goes, and how it is produced.\n\n![alt text](http://data-manager-docs.s3-website-us-west-2.amazonaws.com/data-lineage-ui.png \"Data Lineage UI\")\n\n## Pipeline / Asset dependency\n\nYou can specify a list of assets as “require” for a pipeline. The scheduler invokes the pipeline only when all the required assets show up.\n\nNote that the required asset in this example is \u003ccode\u003etradings:1.0:1:/{{dt}}\u003c/code\u003e, which is actually a Jinja template. \"dt\" is part of the rendering context that belongs to the scheduler.\n\n![alt text](http://data-manager-docs.s3-website-us-west-2.amazonaws.com/asset-dependency.png \"Pipeline / Asset Dependency\")\n\n## View Schema for Dataset\n\nIn the Dataset Browser UI, you can see the schema for any dataset. 
Here is an example:\n\n![alt text](http://data-manager-docs.s3-website-us-west-2.amazonaws.com/view-schema.png \"View Schema\")\n\n## Assets are immutable\nAssets are immutable. If you republish an asset (for example, because the earlier asset has a data bug), its revision is bumped. If you cache the asset or data derived from it, you can check the asset revision to decide whether you need to update your cache.\n\nFor each asset, only the latest revision can be active, and the metadata tracks all the revisions.\n\nThis page also shows the revision history, so you can see whether the asset has been corrected, and when.\n\n![alt text](http://data-manager-docs.s3-website-us-west-2.amazonaws.com/asset-revisions.png \"View Schema\")\n\n## Decouple asset’s name from storage location\n\nBelow is a task inside a pipeline. It imports the asset \u003ccode\u003etradings:1.0:1/{{dt}}\u003c/code\u003e, calls it “tradings”, and then runs a SQL statement. Users do not even need to know the storage location of the asset; Data Manager automatically resolves the asset name to its storage location and loads the asset from it.\n\nThis also makes it easy to migrate a data lake between cloud providers, since your pipeline is not tied to the storage location of assets.\n\n![alt text](http://data-manager-docs.s3-website-us-west-2.amazonaws.com/sql-task-basic-info.png \"SQL Task - Basic Info\")\n![alt text](http://data-manager-docs.s3-website-us-west-2.amazonaws.com/sql-task-sql.png \"SQL Task - SQL Statement\")\n\nThe asset page shows the storage location of the asset, for example:\n\n![alt text](http://data-manager-docs.s3-website-us-west-2.amazonaws.com/asset-list.png \"Asset 
List\")\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstonezhong%2FDataManager","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstonezhong%2FDataManager","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstonezhong%2FDataManager/lists"}