{"id":13621801,"url":"https://github.com/SETL-Framework/setl","last_synced_at":"2025-04-15T01:34:08.148Z","repository":{"id":37536255,"uuid":"229248161","full_name":"SETL-Framework/setl","owner":"SETL-Framework","description":"A simple Spark-powered ETL framework that just works 🍺","archived":false,"fork":false,"pushed_at":"2025-03-31T03:20:08.000Z","size":1410,"stargazers_count":181,"open_issues_count":5,"forks_count":32,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-04-10T05:07:46.503Z","etag":null,"topics":["big-data","data-analysis","data-engineering","data-science","data-transformation","dataset","etl","etl-pipeline","framework","machine-learning","modularization","pipeline","scala","setl","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SETL-Framework.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-12-20T10:56:49.000Z","updated_at":"2025-03-21T04:31:39.000Z","dependencies_parsed_at":"2023-02-17T01:25:16.399Z","dependency_job_id":"1bddd160-99a1-4357-a3a1-4c7d9e95fe44","html_url":"https://github.com/SETL-Framework/setl","commit_stats":null,"previous_names":["setl-developers/setl"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SETL-Framework%2Fsetl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SETL-Framework%2Fsetl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/SETL-Framework%2Fsetl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SETL-Framework%2Fsetl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SETL-Framework","download_url":"https://codeload.github.com/SETL-Framework/setl/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248989519,"owners_count":21194606,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","data-analysis","data-engineering","data-science","data-transformation","dataset","etl","etl-pipeline","framework","machine-learning","modularization","pipeline","scala","setl","spark"],"created_at":"2024-08-01T21:01:10.703Z","updated_at":"2025-04-15T01:34:08.141Z","avatar_url":"https://github.com/SETL-Framework.png","language":"Scala","readme":"![logo](docs/img/logo_setl.png)\n----------\n\n[![build](https://github.com/SETL-Framework/setl/workflows/build/badge.svg?branch=master)](https://github.com/SETL-Framework/setl/actions)\n[![codecov](https://codecov.io/gh/SETL-Framework/setl/branch/master/graph/badge.svg)](https://codecov.io/gh/SETL-Framework/setl)\n[![Maven 
Central](https://img.shields.io/maven-central/v/io.github.setl-framework/setl_2.11.svg?label=Maven%20Central\u0026color=blue)](https://mvnrepository.com/artifact/io.github.setl-framework/setl)\n[![javadoc](https://javadoc.io/badge2/io.github.setl-framework/setl_2.12/javadoc.svg)](https://javadoc.io/doc/io.github.setl-framework/setl_2.12)\n[![documentation](https://img.shields.io/badge/docs-passing-1f425f.svg)](https://setl-framework.github.io/setl/)\n\nIf you’re a **data scientist** or **data engineer**, this might sound familiar while working on an **ETL** project: \n\n- Switching between multiple projects is a hassle \n- Debugging others’ code is a nightmare\n- Spending a lot of time solving non-business-related issues \n\n**SETL** (pronounced \"settle\") is a Scala ETL framework powered by [Apache Spark](https://spark.apache.org/) that helps you structure your Spark ETL projects, modularize your data transformation logic and speed up your development.\n\n## Use SETL\n\n### In a new project\n\nYou can start working by cloning [this template project](https://github.com/SETL-Framework/setl-template).\n\n### In an existing project\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003eio.github.setl-framework\u003c/groupId\u003e\n  \u003cartifactId\u003esetl_2.12\u003c/artifactId\u003e\n  \u003cversion\u003e1.0.0-RC2\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nTo use the SNAPSHOT version, add Sonatype snapshot repository to your `pom.xml`\n```xml\n\u003crepositories\u003e\n  \u003crepository\u003e\n    \u003cid\u003eossrh-snapshots\u003c/id\u003e\n    \u003curl\u003ehttps://s01.oss.sonatype.org/content/repositories/snapshots/\u003c/url\u003e\n  \u003c/repository\u003e\n\u003c/repositories\u003e\n\n\u003cdependencies\u003e\n  \u003cdependency\u003e\n    \u003cgroupId\u003eio.github.setl-framework\u003c/groupId\u003e\n    \u003cartifactId\u003esetl_2.12\u003c/artifactId\u003e\n    \u003cversion\u003e1.0.0-SNAPSHOT\u003c/version\u003e\n  
\u003c/dependency\u003e\n\u003c/dependencies\u003e\n```\n\n## Quick Start\n\n### Basic concept\n\nWith SETL, an ETL application can be represented by a `Pipeline`. A `Pipeline` contains multiple `Stages`, and each stage contains one or several `Factories`.\n\nThe class `Factory[T]` is an abstraction of a data transformation that produces an object of type `T`. It has four methods (*read*, *process*, *write* and *get*) that should be implemented by the developer.\n\nThe class `SparkRepository[T]` is a data access layer abstraction. It can be used to read/write a `Dataset[T]` from/to a datastore, and it should be defined in a configuration file. You can have as many SparkRepositories as you want.\n\nThe entry point of a SETL project is the object `io.github.setl.Setl`, which handles the pipeline and SparkRepository instantiation.\n\n### Show me some code\n\nYou can find the following tutorial code in [the starter template of SETL](https://github.com/SETL-Framework/setl-template). Go and clone it :)\n\nHere we show a simple example of creating and saving a **Dataset[TestObject]**. The case class **TestObject** is defined as follows:\n\n```scala\ncase class TestObject(partition1: Int, partition2: String, clustering1: String, value: Long)\n```\n\n#### Context initialization\n\nSuppose that we want to save our output into `src/main/resources/test_csv`. 
We can create a configuration file **local.conf** in `src/main/resources` with the following content that defines the target datastore to save our dataset:\n\n```txt\ntestObjectRepository {\n  storage = \"CSV\"\n  path = \"src/main/resources/test_csv\"\n  inferSchema = \"true\"\n  delimiter = \";\"\n  header = \"true\"\n  saveMode = \"Append\"\n}\n```\n\nIn our `App.scala` file, we build `Setl` and register this data store:\n```scala  \nval setl: Setl = Setl.builder()\n  .withDefaultConfigLoader()\n  .getOrCreate()\n\n// Register a SparkRepository to context\nsetl.setSparkRepository[TestObject](\"testObjectRepository\")\n\n```\n\n#### Implementation of Factory\n\nWe will create our `Dataset[TestObject]` inside a `Factory[Dataset[TestObject]]`. A `Factory[A]` will always produce an object of type `A`, and it contains 4 abstract methods that you need to implement:\n- read\n- process\n- write\n- get\n\n```scala\nclass MyFactory() extends Factory[Dataset[TestObject]] with HasSparkSession {\n  \n  import spark.implicits._\n    \n  // A repository is needed for writing data. It will be delivered by the pipeline\n  @Delivery \n  private[this] val repo = SparkRepository[TestObject]\n\n  private[this] var output = spark.emptyDataset[TestObject]\n\n  override def read(): MyFactory.this.type = {\n    // in our demo we don't need to read any data\n    this\n  }\n\n  override def process(): MyFactory.this.type = {\n    output = Seq(\n      TestObject(1, \"a\", \"A\", 1L),\n      TestObject(2, \"b\", \"B\", 2L)\n    ).toDS()\n    this\n  }\n\n  override def write(): MyFactory.this.type = {\n    repo.save(output)  // use the repository to save the output\n    this\n  }\n\n  override def get(): Dataset[TestObject] = output\n\n}\n```\n\n#### Define the pipeline\n\nTo execute the factory, we should add it into a pipeline.\n\nWhen we call `setl.newPipeline()`, **Setl** will instantiate a new **Pipeline** and configure all the registered repositories as inputs of the pipeline. 
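\n\nAs a rough mental model (plain Scala, no Spark; the names below are illustrative, not SETL's actual internals), a stage drives each of its factories through the four methods in a fixed order:\n\n```scala\n// Hypothetical sketch of the read -\u003e process -\u003e write -\u003e get call order.\ntrait ToyFactory[T] {\n  def read(): this.type\n  def process(): this.type\n  def write(): this.type\n  def get(): T\n}\n\nclass SumFactory(input: Seq[Int]) extends ToyFactory[Int] {\n  private var data: Seq[Int] = Nil\n  private var result: Int = 0\n\n  override def read(): this.type = { data = input; this }         // load the input\n  override def process(): this.type = { result = data.sum; this } // transform it\n  override def write(): this.type = { println(result); this }     // persist (here: just print)\n  override def get(): Int = result                                // expose the output\n}\n\n// A stage runs each factory's methods in order and collects the output.\ndef runStage[T](factory: ToyFactory[T]): T =\n  factory.read().process().write().get()\n\nrunStage(new SumFactory(Seq(1, 2, 3))) // 6\n```\n\n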
Then we can call `addStage` to add our factory to the pipeline.\n\n```scala\nval pipeline = setl\n  .newPipeline()\n  .addStage[MyFactory]()\n```\n\n#### Run our pipeline\n\n```scala\npipeline.describe().run()\n```\nThe dataset will be saved into `src/main/resources/test_csv`.\n\n#### What's more?\n\nAs our `MyFactory` produces a `Dataset[TestObject]`, it can be used by other factories of the same pipeline.\n\n```scala\nclass AnotherFactory extends Factory[String] with HasSparkSession {\n\n  import spark.implicits._\n\n  @Delivery\n  private[this] val outputOfMyFactory = spark.emptyDataset[TestObject]\n\n  override def read(): AnotherFactory.this.type = this\n\n  override def process(): AnotherFactory.this.type = this\n\n  override def write(): AnotherFactory.this.type = {\n    outputOfMyFactory.show()\n    this\n  }\n\n  override def get(): String = \"output\"\n}\n```\n\nAdd this factory to the pipeline:\n\n```scala\npipeline.addStage[AnotherFactory]()\n```\n\n### Custom Connector\n\nYou can implement your own data source connector by implementing the `ConnectorInterface`:\n\n```scala\nclass CustomConnector extends ConnectorInterface with CanDrop {\n  override def setConf(conf: Conf): Unit = ()\n\n  override def read(): DataFrame = {\n    import spark.implicits._\n    Seq(1, 2, 3).toDF(\"id\")\n  }\n\n  override def write(t: DataFrame, suffix: Option[String]): Unit = logDebug(\"Write with suffix\")\n\n  override def write(t: DataFrame): Unit = logDebug(\"Write\")\n\n  /**\n   * Drop the entire table.\n   */\n  override def drop(): Unit = logDebug(\"drop\")\n}\n```\n\nTo use it, just set the storage to **OTHER** and provide the class reference of your connector:\n\n```txt\nmyConnector {\n  storage = \"OTHER\"\n  class = \"com.example.CustomConnector\"  // class reference of your connector \n}\n```\n\n### Generate pipeline diagram\n\nYou can generate a [Mermaid diagram](https://mermaid-js.github.io/mermaid/#/) by 
doing:\n```scala\npipeline.showDiagram()\n```\n\nYou will get log output like this:\n```\n--------- MERMAID DIAGRAM ---------\nclassDiagram\nclass MyFactory {\n  \u003c\u003cFactory[Dataset[TestObject]]\u003e\u003e\n  +SparkRepository[TestObject]\n}\n\nclass DatasetTestObject {\n  \u003c\u003cDataset[TestObject]\u003e\u003e\n  \u003epartition1: Int\n  \u003epartition2: String\n  \u003eclustering1: String\n  \u003evalue: Long\n}\n\nDatasetTestObject \u003c|.. MyFactory : Output\nclass AnotherFactory {\n  \u003c\u003cFactory[String]\u003e\u003e\n  +Dataset[TestObject]\n}\n\nclass StringFinal {\n  \u003c\u003cString\u003e\u003e\n  \n}\n\nStringFinal \u003c|.. AnotherFactory : Output\nclass SparkRepositoryTestObjectExternal {\n  \u003c\u003cSparkRepository[TestObject]\u003e\u003e\n  \n}\n\nAnotherFactory \u003c|-- DatasetTestObject : Input\nMyFactory \u003c|-- SparkRepositoryTestObjectExternal : Input\n\n------- END OF MERMAID CODE -------\n\nYou can copy the previous code to a markdown viewer that supports Mermaid.\n\nOr you can try the live editor: 
https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoiY2xhc3NEaWFncmFtXG5jbGFzcyBNeUZhY3Rvcnkge1xuICA8PEZhY3RvcnlbRGF0YXNldFtUZXN0T2JqZWN0XV0-PlxuICArU3BhcmtSZXBvc2l0b3J5W1Rlc3RPYmplY3RdXG59XG5cbmNsYXNzIERhdGFzZXRUZXN0T2JqZWN0IHtcbiAgPDxEYXRhc2V0W1Rlc3RPYmplY3RdPj5cbiAgPnBhcnRpdGlvbjE6IEludFxuICA-cGFydGl0aW9uMjogU3RyaW5nXG4gID5jbHVzdGVyaW5nMTogU3RyaW5nXG4gID52YWx1ZTogTG9uZ1xufVxuXG5EYXRhc2V0VGVzdE9iamVjdCA8fC4uIE15RmFjdG9yeSA6IE91dHB1dFxuY2xhc3MgQW5vdGhlckZhY3Rvcnkge1xuICA8PEZhY3RvcnlbU3RyaW5nXT4-XG4gICtEYXRhc2V0W1Rlc3RPYmplY3RdXG59XG5cbmNsYXNzIFN0cmluZ0ZpbmFsIHtcbiAgPDxTdHJpbmc-PlxuICBcbn1cblxuU3RyaW5nRmluYWwgPHwuLiBBbm90aGVyRmFjdG9yeSA6IE91dHB1dFxuY2xhc3MgU3BhcmtSZXBvc2l0b3J5VGVzdE9iamVjdEV4dGVybmFsIHtcbiAgPDxTcGFya1JlcG9zaXRvcnlbVGVzdE9iamVjdF0-PlxuICBcbn1cblxuQW5vdGhlckZhY3RvcnkgPHwtLSBEYXRhc2V0VGVzdE9iamVjdCA6IElucHV0XG5NeUZhY3RvcnkgPHwtLSBTcGFya1JlcG9zaXRvcnlUZXN0T2JqZWN0RXh0ZXJuYWwgOiBJbnB1dFxuIiwibWVybWFpZCI6eyJ0aGVtZSI6ImRlZmF1bHQifX0=\n\n```\n\nYou can either copy the code into a Markdown viewer or just copy the link into your browser 
([link](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoiY2xhc3NEaWFncmFtXG5jbGFzcyBNeUZhY3Rvcnkge1xuICA8PEZhY3RvcnlbRGF0YXNldFtUZXN0T2JqZWN0XV0-PlxuICArU3BhcmtSZXBvc2l0b3J5W1Rlc3RPYmplY3RdXG59XG5cbmNsYXNzIERhdGFzZXRUZXN0T2JqZWN0IHtcbiAgPDxEYXRhc2V0W1Rlc3RPYmplY3RdPj5cbiAgPnBhcnRpdGlvbjE6IEludFxuICA-cGFydGl0aW9uMjogU3RyaW5nXG4gID5jbHVzdGVyaW5nMTogU3RyaW5nXG4gID52YWx1ZTogTG9uZ1xufVxuXG5EYXRhc2V0VGVzdE9iamVjdCA8fC4uIE15RmFjdG9yeSA6IE91dHB1dFxuY2xhc3MgQW5vdGhlckZhY3Rvcnkge1xuICA8PEZhY3RvcnlbU3RyaW5nXT4-XG4gICtEYXRhc2V0W1Rlc3RPYmplY3RdXG59XG5cbmNsYXNzIFN0cmluZ0ZpbmFsIHtcbiAgPDxTdHJpbmc-PlxuICBcbn1cblxuU3RyaW5nRmluYWwgPHwuLiBBbm90aGVyRmFjdG9yeSA6IE91dHB1dFxuY2xhc3MgU3BhcmtSZXBvc2l0b3J5VGVzdE9iamVjdEV4dGVybmFsIHtcbiAgPDxTcGFya1JlcG9zaXRvcnlbVGVzdE9iamVjdF0-PlxuICBcbn1cblxuQW5vdGhlckZhY3RvcnkgPHwtLSBEYXRhc2V0VGVzdE9iamVjdCA6IElucHV0XG5NeUZhY3RvcnkgPHwtLSBTcGFya1JlcG9zaXRvcnlUZXN0T2JqZWN0RXh0ZXJuYWwgOiBJbnB1dFxuIiwibWVybWFpZCI6eyJ0aGVtZSI6ImRlZmF1bHQifX0=)) 🍻\n\n### App Configuration\n\nThe configuration system of SETL allows users to execute their Spark application in different execution environments by\nusing environment-specific configurations.\n\nIn the `src/main/resources` directory, you should have at least two configuration files named `application.conf`\nand `local.conf`\n(take a look at this [example](https://github.com/SETL-Framework/setl-template/tree/master/src/main/resources)). These\nare what you need if you only want to run your application in a single environment.\n\nYou can also create other configurations (for example `dev.conf` and `prod.conf`), in which environment-specific\nparameters can be defined.\n\n##### application.conf\n\nThis configuration file should contain universal configurations that can be used regardless of the execution environment.\n\n##### env.conf (e.g. local.conf, dev.conf)\n\nThese files should contain environment-specific parameters. 
By default, `local.conf` will be used.\n\n##### How to use the configuration\n\nImagine we have two environments: a local development environment and a remote production environment. Our application\nneeds a repository for saving and loading data. In this use case, let's prepare `application.conf`, `local.conf`, `prod.conf`\nand `storage.conf`:\n\n```hocon\n# application.conf\nsetl.environment = ${app.environment}\nsetl.config {\n  spark.app.name = \"my_application\"\n  # and other general spark configurations  \n}\n```\n\n```hocon\n# local.conf\ninclude \"application.conf\"\n\nsetl.config {\n  spark.default.parallelism = \"200\"\n  spark.sql.shuffle.partitions = \"200\"\n  # and other local spark configurations  \n}\n\napp.root.dir = \"/some/local/path\"\n\ninclude \"storage.conf\"\n```\n\n```hocon\n# prod.conf\nsetl.config {\n  spark.default.parallelism = \"1000\"\n  spark.sql.shuffle.partitions = \"1000\"\n  # and other production spark configurations  \n}\n\napp.root.dir = \"/some/remote/path\"\n\ninclude \"storage.conf\"\n```\n\n```hocon\n# storage.conf\nmyRepository {\n  storage = \"CSV\"\n  path = ${app.root.dir}  // this path will depend on the execution environment\n  inferSchema = \"true\"\n  delimiter = \";\"\n  header = \"true\"\n  saveMode = \"Append\"\n}\n```\n\nTo compile with the local configuration with Maven, just run:\n```shell\nmvn compile\n```\n\nTo compile with the production configuration, pass the JVM property `app.environment`:\n```shell\nmvn compile -Dapp.environment=prod\n```\n\nMake sure that your resources directory has filtering enabled:\n```xml\n\u003cresources\u003e\n    \u003cresource\u003e\n        \u003cdirectory\u003esrc/main/resources\u003c/directory\u003e\n        \u003cfiltering\u003etrue\u003c/filtering\u003e\n    \u003c/resource\u003e\n\u003c/resources\u003e\n```\n\n## Dependencies\n\n**SETL** currently supports the following data sources. 
You won't need to provide these libraries in your project (except the JDBC driver):\n  - All file formats supported by Apache Spark (CSV, JSON, Parquet, etc.)\n  - Delta\n  - Excel ([crealytics/spark-excel](https://github.com/crealytics/spark-excel))\n  - Cassandra ([datastax/spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector))\n  - DynamoDB ([audienceproject/spark-dynamodb](https://github.com/audienceproject/spark-dynamodb))\n  - JDBC (you have to provide the JDBC driver)\n\nTo read/write data from/to AWS S3 (or other storage services), you should include the\ncorresponding Hadoop library in your project.\n\nFor example:\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eorg.apache.hadoop\u003c/groupId\u003e\n    \u003cartifactId\u003ehadoop-aws\u003c/artifactId\u003e\n    \u003cversion\u003e2.9.2\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nYou should also provide Scala and Spark in your POM file. SETL is tested against the following versions of Spark:\n\n| Spark Version | Scala Version  | Note                         |\n| ------------- | -------------  | -----------------------------|\n|     3.0       |        2.12    | :heavy_check_mark: Ok        |\n|     2.4       |        2.12    | :heavy_check_mark: Ok        |\n|     2.4       |        2.11    | :warning: see *known issues* |\n|     2.3       |        2.11    | :warning: see *known issues* |\n\n## Known issues\n\n### Spark 2.4 with Scala 2.11\n\nWhen using `setl_2.11-1.x.x` with Spark 2.4 and Scala 2.11, you may need to manually include the following dependencies to override the default versions:\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.audienceproject\u003c/groupId\u003e\n    \u003cartifactId\u003espark-dynamodb_2.11\u003c/artifactId\u003e\n    \u003cversion\u003e1.0.4\u003c/version\u003e\n\u003c/dependency\u003e\n\u003cdependency\u003e\n    \u003cgroupId\u003eio.delta\u003c/groupId\u003e\n    
\u003cartifactId\u003edelta-core_2.11\u003c/artifactId\u003e\n    \u003cversion\u003e0.7.0\u003c/version\u003e\n\u003c/dependency\u003e\n\u003cdependency\u003e\n    \u003cgroupId\u003ecom.datastax.spark\u003c/groupId\u003e\n    \u003cartifactId\u003espark-cassandra-connector_2.11\u003c/artifactId\u003e\n    \u003cversion\u003e2.5.1\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### Spark 2.3 with Scala 2.11\n\n- `DynamoDBConnector` doesn't work with Spark version 2.3\n- `Compress` annotation can only be used on Struct field or Array of Struct field with Spark 2.3\n\n## Test Coverage\n\n[![coverage.svg](https://codecov.io/gh/SETL-Framework/setl/branch/master/graphs/sunburst.svg)](https://codecov.io/gh/SETL-Framework/setl)\n\n## Documentation\n\n[https://setl-framework.github.io/setl/](https://setl-framework.github.io/setl/)\n\n## Contributing to SETL\n\n[Check our contributing guide](https://github.com/SETL-Framework/setl/blob/master/CONTRIBUTING.md)\n\n","funding_links":[],"categories":["Scala","大数据"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSETL-Framework%2Fsetl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSETL-Framework%2Fsetl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSETL-Framework%2Fsetl/lists"}