{"id":19955783,"url":"https://github.com/setl-framework/setl-examples","last_synced_at":"2025-10-25T02:18:44.430Z","repository":{"id":37176350,"uuid":"313588303","full_name":"SETL-Framework/setl-examples","owner":"SETL-Framework","description":"Learn SETL with examples, lessons and exercises","archived":false,"fork":false,"pushed_at":"2023-04-21T20:43:52.000Z","size":327,"stargazers_count":8,"open_issues_count":3,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-07T20:56:15.957Z","etag":null,"topics":["etl","framework","scala","setl","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SETL-Framework.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-11-17T10:46:58.000Z","updated_at":"2025-02-15T14:55:22.000Z","dependencies_parsed_at":"2022-08-17T23:40:59.081Z","dependency_job_id":null,"html_url":"https://github.com/SETL-Framework/setl-examples","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SETL-Framework%2Fsetl-examples","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SETL-Framework%2Fsetl-examples/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SETL-Framework%2Fsetl-examples/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SETL-Framework%2Fsetl-examples/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SETL-Framework","download_url":"https://codeload.github.com/SETL-F
ramework/setl-examples/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252242195,"owners_count":21717117,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["etl","framework","scala","setl","spark"],"created_at":"2024-11-13T01:28:49.963Z","updated_at":"2025-10-25T02:18:39.381Z","avatar_url":"https://github.com/SETL-Framework.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SETL examples\n\nLessons and exercises to get familiar with the wonderful [SETL](https://github.com/SETL-Framework/setl) framework!\n\n# Chapters\n\n## 1. Entry Point and configurations\n\n\u003cdetails\u003e \u003csummary\u003e\u003cstrong\u003eLesson\u003c/strong\u003e\u003c/summary\u003e\n\n\u003ch3\u003e1.1. Entry point with basic configurations\u003c/h3\u003e\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nThe entry point is the first thing you need to learn to code with SETL. It is the starting point to run your ETL project.\n\n```\nval setl0: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n```\n\nThis is the minimum code needed to create a `Setl` object. It is the entry point of every SETL app. This will create a SparkSession, which is the entry point of any Spark job. Additionally, the `withDefaultConfigLoader()` method is used. This means that `Setl` will read the default ConfigLoader located in `resources/application.conf`, where `setl.environment` must be set. 
The ConfigLoader will then read the corresponding configuration file `\u003capp_env\u003e.conf` in the `resources` folder, where `\u003capp_env\u003e` is the value set for `setl.environment`.\n\n\u003e `resources/application.conf`:\n\u003e ```\n\u003e setl.environment = \u003capp_env\u003e\n\u003e ```\n\n\u003e `\u003capp_env\u003e.conf`:\n\u003e ```\n\u003e setl.config.spark {\n\u003e    some.config.option = \"some-value\"\n\u003e  }\n\u003e ```\n\nThe configuration file is where you can specify your `SparkSession` options, as you would when creating one in a plain `Spark` application. You must specify your `SparkSession` options under `setl.config.spark`.\n\n\u003c/details\u003e\n\n\u003ch3\u003e1.2. Entry point with specific configurations\u003c/h3\u003e\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nYou can specify the configuration file that the default `ConfigLoader` should read. In the code below, instead of reading `\u003capp_env\u003e.conf` where `\u003capp_env\u003e` is defined in `application.conf`, it will read `own_config_file.conf`.\n\u003e ```\n\u003e val setl1: Setl = Setl.builder()\n\u003e     .withDefaultConfigLoader(\"own_config_file.conf\")\n\u003e     .getOrCreate()\n\u003e ```\n\u003e \n\u003e `resources/own_config_file.conf`:\n\u003e ```\n\u003e setl.config.spark {\n\u003e    some.config.option = \"some-other-value\"\n\u003e  }\n\u003e ```\n\nYou can also set your own `ConfigLoader`. In the code below, `Setl` will load `local.conf`, because the environment is set to `local` with the `setAppEnv()` method. 
If no `\u003capp_env\u003e` is set, it will fetch the environment from the default `ConfigLoader`, located in `resources/application.conf`.\n\u003e ```\n\u003e val configLoader: ConfigLoader = ConfigLoader.builder()\n\u003e     .setAppEnv(\"local\")\n\u003e     .setAppName(\"Setl2_AppName\")\n\u003e     .setProperty(\"setl.config.spark.master\", \"local[*]\")\n\u003e     .setProperty(\"setl.config.spark.custom-key\", \"custom-value\")\n\u003e     .getOrCreate()\n\u003e val setl2: Setl = Setl.builder()\n\u003e     .setConfigLoader(configLoader)\n\u003e     .getOrCreate()\n\u003e ```\n \nYou can also set your own `SparkSession` which will be used by `Setl`, with the `setSparkSession()` method. Please refer to the [documentation](https://setl-framework.github.io/setl/) or the source code of [SETL](https://github.com/SETL-Framework/setl).\n\n\u003c/details\u003e\n\n\u003ch3\u003e1.3 Utilities\u003c/h3\u003e\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\n\u003ch5\u003eHelper methods\u003c/h5\u003e\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nThere are some quick methods that can be used to set your `SparkSession` configurations.\n\u003e ```\n\u003e val setl3: Setl = Setl.builder()\n\u003e     .withDefaultConfigLoader()\n\u003e     .setSparkMaster(\"local[*]\") // set your master URL\n\u003e     .setShufflePartitions(200) // set spark.sql.shuffle.partitions\n\u003e     .getOrCreate()\n\u003e ```\n \n* The `setSparkMaster()` method sets the `spark.master` property of the `SparkSession` in your `Setl` entry point\n* The `setShufflePartitions()` method sets the `spark.sql.shuffle.partitions` property of the `SparkSession` in your `Setl` entry point\n\n\u003c/details\u003e\n\n\u003ch5\u003eSparkSession options\u003c/h5\u003e\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nAs mentioned earlier, the options you want to define in your `SparkSession` must be specified under `setl.config.spark` in your configuration file. 
However, you can change this path by using the `setSetlConfigPath()` method:\n\u003e ```\n\u003e val setl4: Setl = Setl.builder()\n\u003e     .withDefaultConfigLoader(\"own_config_file.conf\")\n\u003e     .setSetlConfigPath(\"myApp\")\n\u003e     .getOrCreate()\n\u003e ```\n\u003e \n\u003e `resources/own_config_file.conf`:\n\u003e ```\n\u003e myApp.spark {\n\u003e     some.config.option = \"my-app-some-other-value\"\n\u003e }\n\u003e ```\n\n\u003c/details\u003e\n\n\u003c/details\u003e\n\n\u003c/details\u003e\n\n##\n\n\u003cdetails\u003e \u003csummary\u003e\u003cstrong\u003eExercises\u003c/strong\u003e\u003c/summary\u003e\n\nNothing too crazy: try to build your own `Setl` object! Run your code and examine the logs to check the options you specified. Make sure it loads the correct configuration file.\n\n\u003c/details\u003e\n\n## 2. Extract\n\n\u003cdetails\u003e \u003csummary\u003e\u003cstrong\u003eLesson\u003c/strong\u003e\u003c/summary\u003e\n\nSETL supports two types of data accessors: Connector and SparkRepository.\n* A Connector is a non-typed abstraction of a data access layer (DAL). For simplicity, you can understand it as a Spark DataFrame.\n* A SparkRepository is a typed abstraction of a data access layer (DAL). 
For simplicity, you can understand it as a Spark Dataset.\n\nFor more information, please refer to the [official documentation](https://setl-framework.github.io/setl/).\n\n`SETL` supports multiple data formats, such as CSV, JSON, Parquet, Excel, Cassandra, DynamoDB, JDBC or Delta.\n\nTo ingest data in the `Setl` object entry point, you must first register the data, using the `setConnector()` or the `setSparkRepository[T]()` methods.\n\n### 2.1 Registration with `Connector`\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\n```\nval setl: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n\nsetl\n    .setConnector(\"testObjectRepository\", deliveryId = \"id\")\n```\n\nThe first argument provided is a `String` that refers to an item in the specified configuration file. The second argument, `deliveryId`, must be specified for data ingestion. We will see in section **2.3** why it is necessary. Just think of it as an ID, and the only way for `SETL` to ingest a `Connector` is with its ID.\n\nNote that `deliveryId` is not necessary for the registration, but it is for the ingestion. However, there is not much use in only registering the data. If you are a beginner in `SETL`, you should assume that setting a `Connector` always comes with a `deliveryId`.\n\n`local.conf`:\n```\nsetl.config.spark {\n  some.config.option = \"some-value\"\n}\n\ntestObjectRepository {\n  storage = \"CSV\"\n  path = \"src/main/resources/test_objects.csv\"\n  inferSchema = \"true\"\n  delimiter = \",\"\n  header = \"true\"\n  saveMode = \"Overwrite\"\n}\n```\n\nAs you can see, `testObjectRepository` defines a configuration for data of type `CSV`. This data is in a file, located in `src/main/resources/test_objects.csv`. Other common read and write options are also configured.\n\nIn summary, to register a `Connector`, you need to:\n1. Specify an item in your configuration file. This item must have a `storage` key, which represents the type of the data. 
Other keys might be mandatory depending on this type.\n2. Register the data in your `Setl` object, using `setConnector(\"\u003citem\u003e\", deliveryId = \"\u003cid\u003e\")`.\n\n\u003c/details\u003e\n\n### 2.2 Registration with `SparkRepository`\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\n```\nval setl: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n\nsetl\n    .setSparkRepository[TestObject](\"testObjectRepository\")\n```\n\nAs with `setConnector()`, the argument provided is a `String` that refers to an item in the specified configuration file.\n\n`local.conf`:\n```\nsetl.config.spark {\n  some.config.option = \"some-value\"\n}\n\ntestObjectRepository {\n  storage = \"CSV\"\n  path = \"src/main/resources/test_objects.csv\"\n  inferSchema = \"true\"\n  delimiter = \",\"\n  header = \"true\"\n  saveMode = \"Overwrite\"\n}\n```\n\nNotice that the above `SparkRepository` is set with the `TestObject` type. In this example, the data we want to register is a CSV file containing two columns: `value1` of type `String` and `value2` of type `Int`. That is why the `TestObject` class should be:\n```\ncase class TestObject(value1: String,\n                      value2: Int)\n```\n\nIn summary, to register a `SparkRepository`, you need to:\n1. Specify an item in your configuration file. This item must have a `storage` key, which represents the type of the data. Other keys might be mandatory depending on this type.\n2. Create a class or a case class representing the object type of your data.\n3. Register the data in your `Setl` object, using `setSparkRepository[T](\"\u003citem\u003e\")`.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n    \n1. `Connector` or `SparkRepository`?\n\n    Sometimes, the data you are ingesting contains irrelevant information that you do not want to keep. 
For example, let's say that the CSV file you want to ingest contains 10 columns: `value1`, `value2`, `value3` and 7 other columns you are not interested in.\n    \n    It is possible to ingest only these 3 columns with a `SparkRepository` if you specify the correct object type of your data:\n    ```\n    case class A(value1: T1,\n                 value2: T2,\n                 value3: T3)\n    \n    setl\n        .setSparkRepository[A](\"itemInConfFile\")\n    ```\n\n    This is not possible with a `Connector`. If you register this CSV file with a `Connector`, all 10 columns will appear.\n\n2. Annotations\n\n* `@ColumnName`\n\n    `@ColumnName` is an annotation used in a case class. When you want to rename some columns in your code for consistency but also keep the original name when writing the data, you can use this annotation.\n    \n    ```\n    case class A(@ColumnName(\"value_one\") valueOne: T1,\n                 @ColumnName(\"value_two\") valueTwo: T2)\n    ```\n  \n  As you probably know, Scala conventionally uses `camelCase` rather than `snake_case`. If you register a `SparkRepository` of type `[A]` in your `Setl` object and read it, the columns will be named `valueOne` and `valueTwo`. The columns in the file you read will still keep their names, i.e. `value_one` and `value_two`.\n\n* `@CompoundKey`\n\n    TODO\n\n* `@Compress`\n\n    TODO\n\n\u003c/details\u003e\n\n### 2.3 Registration of multiple data sources\n\nMost of the time, you will need to register multiple data sources.\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\n#### 2.3.1 Multiple `Connector`\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nLet's start with `Connector`. Note that it is perfectly possible to register multiple `Connector`, as said previously. However, there will be an issue during the ingestion. `Setl` has no way to differentiate one `Connector` from another. 
You will need to set what is called a `deliveryId`.\n\n```\nval setl1: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n \n// /!\\ This will work for data registration here but not for data ingestion later /!\\\nsetl1\n    .setConnector(\"testObjectRepository\")\n    .setConnector(\"pokeGradesRepository\")\n \n// Please get used to setting a `deliveryId` when you register one or multiple `Connector`\nsetl1\n    .setConnector(\"testObjectRepository\", deliveryId = \"testObject\")\n    .setConnector(\"pokeGradesRepository\", deliveryId = \"grades\")\n```\n\n\u003c/details\u003e\n\n#### 2.3.2 Multiple `SparkRepository`\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nLet's now look at how we can register multiple `SparkRepository`. If the `SparkRepository` you register all have different types, there will be no issue during the ingestion. Indeed, `Setl` is capable of differentiating the incoming data by inferring the object type.\n\n```\nval setl2: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n\nsetl2\n    .setSparkRepository[TestObject](\"testObjectRepository\")\n    .setSparkRepository[Grade](\"pokeGradesRepository\")\n```\n\nHowever, if there are multiple `SparkRepository` with the same type, you **must** use a `deliveryId` for each of them. Otherwise, there will be an error during the data ingestion. 
This is the same reasoning as for multiple `Connector`: there is no way to differentiate two `SparkRepository` of the same type.\n\n```\nval setl3: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n\n// /!\\ This will work for data registration here but not for data ingestion later /!\\\nsetl3\n    .setSparkRepository[Grade](\"pokeGradesRepository\")\n    .setSparkRepository[Grade](\"digiGradesRepository\")\n\n// Please get used to setting a `deliveryId` when you register multiple `SparkRepository` of the same type\nsetl3\n    .setSparkRepository[Grade](\"pokeGradesRepository\", deliveryId = \"pokeGrades\")\n    .setSparkRepository[Grade](\"digiGradesRepository\", deliveryId = \"digiGrades\")\n```\n\n\u003c/details\u003e\n\n\u003c/details\u003e\n\n### 2.4 Data Ingestion\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nBefore diving deep into data ingestion, we must first learn how `SETL` organizes an ETL process. `SETL` uses `Pipeline` and `Stage` to organize workflows. A `Pipeline` is where the whole ETL process will be done. The registered data are ingested inside a `Pipeline`, and all transformations and restitution will be done inside it. A `Pipeline` is composed of multiple `Stage`. A `Stage` allows you to modularize your project. It can be composed of multiple `Factory`. You can understand a `Factory` as a module of your ETL process. So in order to \"see\" the data ingestion, we have to create a `Pipeline` and add a `Stage` to it. 
As it may be a little bit theoretical, let's look at some examples.\n\n`App.scala`:\n```\nval setl4: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n\nsetl4\n    .setConnector(\"testObjectRepository\", deliveryId = \"testObjectConnector\")\n    .setSparkRepository[TestObject](\"testObjectRepository\", deliveryId = \"testObjectRepository\")\n\nsetl4\n    .newPipeline() // Creation of a `Pipeline`.\n    .addStage[IngestionFactory]() // Add a `Stage` composed of one `Factory`: `IngestionFactory`.\n    .run()\n```\n\nBefore running the code, let's take a look at `IngestionFactory`.\n\n```\nclass IngestionFactory extends Factory[DataFrame] with HasSparkSession {\n\n    import spark.implicits._\n\n    override def read(): IngestionFactory.this.type = this\n\n    override def process(): IngestionFactory.this.type = this\n\n    override def write(): IngestionFactory.this.type = this\n\n    override def get(): DataFrame = spark.emptyDataFrame\n}\n```\n\nThis is a skeleton of a `SETL Factory`. A `SETL Factory` contains 4 main functions: `read()`, `process()`, `write()` and `get()`. These functions will be executed in this order. These 4 functions are the core of your ETL process. This is where you will write your classic `Spark` code of data transformation.\n\nYou can see that `IngestionFactory` is a child class of `Factory[DataFrame]`. This simply means that the output of this data transformation must be a `DataFrame`. `IngestionFactory` also has the trait `HasSparkSession`. It allows you to access the `SparkSession` easily. Usually, we use it simply to import `spark.implicits`.\n\nWhere is the ingestion? 
\n\n```\nclass IngestionFactory extends Factory[DataFrame] with HasSparkSession {\n\n    import spark.implicits._\n\n    @Delivery(id = \"testObjectConnector\")\n    val testObjectConnector: Connector = Connector.empty\n    @Delivery(id = \"testObjectRepository\")\n    val testObjectRepository: SparkRepository[TestObject] = SparkRepository[TestObject]\n    \n    var testObjectOne: DataFrame = spark.emptyDataFrame\n    var testObjectTwo: Dataset[TestObject] = spark.emptyDataset[TestObject]\n\n    override def read(): IngestionFactory.this.type = this\n\n    override def process(): IngestionFactory.this.type = this\n\n    override def write(): IngestionFactory.this.type = this\n\n    override def get(): DataFrame = spark.emptyDataFrame\n}\n```\n\nThe structure of a `SETL Factory` starts with the `@Delivery` annotation. This annotation is the way `SETL` ingests the corresponding registered data. If you look at `App.scala` where this `IngestionFactory` is called, the associated `Setl` object has registered a `Connector` with id `testObjectConnector` and a `SparkRepository` with id `testObjectRepository`.\n\n\u003e Note that it is not mandatory to use a `deliveryId` in this case, because there is only one `SparkRepository` with `TestObject` as object type. You can try to remove the `deliveryId` when registering the `SparkRepository` and the `id` in the `@Delivery` annotation. The code will still run. The same can be said for the `Connector`.\n\nWith the `@Delivery` annotation, we retrieved a `Connector` and a `SparkRepository`. The data has been correctly ingested, but these are data access layers. To process the data, we have to retrieve the `DataFrame` of the `Connector` and the `Dataset` of the `SparkRepository`. This is why we defined two `var`, one of type `DataFrame` and one of type `Dataset[TestObject]`. We will assign values to them in the `read()` function. 
These `var` are accessible from all 4 core functions, and you will use them for your ETL process.\n\nTo retrieve the `DataFrame` of the `Connector` and the `Dataset` of the `SparkRepository`, we can use the `read()` function.\n\n```\noverride def read(): IngestionFactory.this.type = {\n    testObjectOne = testObjectConnector.read()\n    testObjectTwo = testObjectRepository.findAll()\n\n    this\n}\n```\n\nThe `read()` function is typically where you will do your data preprocessing. Usually, we will simply assign values to our variables. Occasionally, this is also where you would do some filtering on your data.\n\n* To retrieve the `DataFrame` of a `Connector`, use the `read()` method.\n* To retrieve the `Dataset` of a `SparkRepository`, you can use the `findAll()` method, or the `findBy()` method. The latter allows you to do filtering based on `Condition`. More info [here](https://setl-framework.github.io/setl/Condition).\n\nThe registered data is then correctly ingested. It is now ready to be used during the `process()` function.\n\n\u003c/details\u003e\n\n### 2.5 Additional resources\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\n#### 2.5.1 `AutoLoad`\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nIn the previous `IngestionFactory`, we set a `val` of type `SparkRepository` and also a `var` to which we assigned the corresponding `Dataset` in the `read()` function. With `autoLoad = true`, we can skip the first step and directly declare a `Dataset`. 
The `Dataset` of the `SparkRepository` will be automatically assigned in it.\n\n`App.scala`:\n```\nval setl5: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n\nsetl5\n    .setSparkRepository[TestObject](\"testObjectRepository\", deliveryId = \"testObjectRepository\")\n\nsetl5\n    .newPipeline()\n    .addStage[AutoLoadIngestionFactory]()\n    .run()\n```\n\n`AutoLoadIngestionFactory`\n```\nclass AutoLoadIngestionFactory extends Factory[DataFrame] with HasSparkSession {\n\n  import spark.implicits._\n\n  @Delivery(id = \"testObjectRepository\", autoLoad = true)\n  val testObject: Dataset[TestObject] = spark.emptyDataset[TestObject]\n\n  override def read(): AutoLoadIngestionFactory.this.type = {\n    testObject.show(false)\n\n    this\n  }\n\n  override def process(): AutoLoadIngestionFactory.this.type = this\n\n  override def write(): AutoLoadIngestionFactory.this.type = this\n\n  override def get(): DataFrame = spark.emptyDataFrame\n}\n```\n\nNote that there is no way to use the `findBy()` method to filter the data, compared to the previous `Factory`. Also, `autoLoad` is available for `SparkRepository` only, and not for `Connector`.\n\n\u003c/details\u003e\n\n#### 2.5.2 Adding parameters to the `Pipeline`\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nIf you want to set some primary type parameters, you can use the `setInput[T]()` method. 
Those *inputs* are directly set in the `Pipeline`, and there are no registrations like for `Connector` or `SparkRepository`.\n\n`App.scala`:\n```\nval setl5: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n\nsetl5\n    .newPipeline()\n    .setInput[Int](42)\n    .setInput[String](\"SETL\", deliveryId = \"ordered\")\n    .setInput[String](\"LTES\", deliveryId = \"reversed\")\n    .setInput[Array[String]](Array(\"S\", \"E\", \"T\", \"L\"))\n    .addStage[AutoLoadIngestionFactory]()\n    .run()\n```\n\n*Inputs* are retrieved in the same way `Connector` or `SparkRepository` are retrieved: the `@Delivery` annotation, and the `deliveryId` if necessary.\n\n`AutoLoadIngestionFactory.scala`:\n```\nclass AutoLoadIngestionFactory extends Factory[DataFrame] with HasSparkSession {\n\n    import spark.implicits._\n\n    @Delivery\n    val integer: Int = 0\n    @Delivery(id = \"ordered\")\n    val firstString: String = \"\"\n    @Delivery(id = \"reversed\")\n    val secondString: String = \"\"\n    @Delivery\n    val stringArray: Array[String] = Array()\n\n    override def read(): AutoLoadIngestionFactory.this.type = {\n      // Showing that inputs work correctly\n      println(\"integer: \" + integer) // integer: 42\n      println(\"ordered: \" + firstString) // ordered: SETL\n      println(\"reversed: \" + secondString) // reversed: LTES\n      println(\"array: \" + stringArray.mkString(\".\")) // array: S.E.T.L\n\n      this\n    }\n\n    override def process(): AutoLoadIngestionFactory.this.type = this\n\n    override def write(): AutoLoadIngestionFactory.this.type = this\n\n    override def get(): DataFrame = spark.emptyDataFrame\n}\n```\n\n\u003c/details\u003e\n\n\u003c/details\u003e\n\n### 2.6 Summary\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nIn summary, the *extraction* part of an ETL process translates to the following in a `SETL` project:\n1. 
Create a configuration item representing the data you want to ingest in your configuration file.\n2. Register the data in your `Setl` object by using the `setConnector()` or the `setSparkRepository[]()` method. Reminder: the mandatory parameter is the name of your object in your configuration file, and you might want to add a `deliveryId`.\n3. Create a new `Pipeline` in your `Setl` object, then add a `Stage` with a `Factory` in which you want to process your data.\n4. Create a `SETL Factory`, containing the 4 core functions: `read()`, `process()`, `write()` and `get()`.\n5. Retrieve your data using the `@Delivery` annotation.\n6. Your data is ready to be processed. \n\n\u003c/details\u003e\n\n### 2.7 Data format configuration cheat sheet\n\nCheat sheet can be found [here](https://setl-framework.github.io/setl/data_access_layer/configuration_example).\n\n\u003c/details\u003e\n\n##\n\n\u003cdetails\u003e \u003csummary\u003e\u003cstrong\u003eExercises\u003c/strong\u003e\u003c/summary\u003e\n\nIn these exercises, we are going to practice registering and ingesting different types of storage: a CSV file, a JSON file, a Parquet file, an Excel file, a table from DynamoDB, a table from Cassandra, a table from PostgreSQL and a table from Delta.\n\nAn `App.scala` is already prepared. We created a `SETL` entry point and use a configuration file located at `src/main/resources/exercise/extract/extract.conf`. In this file, a configuration object has been created for each storage type, but they are incomplete.\n\nThe goal here is to complete the configuration objects with the help of the [documentation](https://setl-framework.github.io/setl/data_access_layer/configuration_example), register the data as `Connector` in a `Pipeline` and print them in a `Factory` after ingestion.\n\nTo verify registration and ingestion, we prepared `CheckExtractFactory`. To test your code, complete the `???` parts and uncomment the corresponding lines. 
\n\n\u003cdetails\u003e \u003csummary\u003ea) Ingesting a CSV file\u003c/summary\u003e\n\nThe goal of this first exercise is to register and ingest a CSV file.\n\nWe are looking to read the CSV file located at `src/main/resources/exercise/extract/paris-wi-fi-service.csv`.\n1. Complete the configuration object `csvFile` in `src/main/resources/exercise/extract/extract.conf`.\n2. In `App.scala`, register a `Connector` with this data.\n3. You may create your own `Factory` and implement the `read()` function to verify if you can ingest the data. If you are not sure how to do that yet, we already added a `Factory` to a `Stage`, which is added to the `Pipeline`. This `CheckExtractFactory` will ingest a `Connector` named `csvFileConnector`. It will read it into `csvFile` in the `read()` method, and verify the number of lines in this data. Uncomment the corresponding lines and complete the part on the `@Delivery` annotation before running your code to test your implementation of CSV file registration and ingestion.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e \u003csummary\u003eb) Ingesting a JSON file\u003c/summary\u003e\n\nThe goal of this second exercise is to register and ingest a JSON file.\n\nWe are looking to read the JSON file located at `src/main/resources/exercise/extract/paris-notable-trees.json`.\n1. Complete the configuration object `jsonFile` in `src/main/resources/exercise/extract/extract.conf`.\n2. In `App.scala`, register a `Connector` with this data.\n3. You may create your own `Factory` and implement the `read()` function to verify if you can ingest the data. If you are not sure how to do that yet, we already added a `Factory` to a `Stage`, which is added to the `Pipeline`. This `CheckExtractFactory` will ingest a `Connector` named `jsonFileConnector`. It will read it into `jsonFile` in the `read()` method, and verify the number of lines in this data. 
Uncomment the corresponding lines and complete the part on the `@Delivery` annotation before running your code to test your implementation of JSON file registration and ingestion.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e \u003csummary\u003ec) Ingesting a Parquet file\u003c/summary\u003e\n\nThe goal of this third exercise is to register and ingest a Parquet file.\n\nWe are looking to read the Parquet file located at `src/main/resources/exercise/extract/paris-public-toilet.parquet`.\n1. Complete the configuration object `parquetFile` in `src/main/resources/exercise/extract/extract.conf`.\n2. In `App.scala`, register a `Connector` with this data.\n3. You may create your own `Factory` and implement the `read()` function to verify if you can ingest the data. If you are not sure how to do that yet, we already added a `Factory` to a `Stage`, which is added to the `Pipeline`. This `CheckExtractFactory` will ingest a `Connector` named `parquetFileConnector`. It will read it into `parquetFile` in the `read()` method, and verify the number of lines in this data. Uncomment the corresponding lines and complete the part on the `@Delivery` annotation before running your code to test your implementation of Parquet file registration and ingestion.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e \u003csummary\u003ed) Ingesting an Excel file\u003c/summary\u003e\n\nThe goal of this fourth exercise is to register and ingest an Excel file.\n\nWe are looking to read the Excel file located at `src/main/resources/exercise/extract/paris-textile-containers.xlsx`.\n1. Complete the configuration object `excelFile` in `src/main/resources/exercise/extract/extract.conf`.\n2. In `App.scala`, register a `Connector` with this data.\n3. You may create your own `Factory` and implement the `read()` function to verify if you can ingest the data. If you are not sure how to do that yet, we already added a `Factory` to a `Stage`, which is added to the `Pipeline`. 
This `CheckExtractFactory` will ingest a `Connector` named `excelFileConnector`. It will read it into `excelFile` in the `read()` method, and verify the number of lines in this data. Uncomment the corresponding lines and complete the part on the `@Delivery` annotation before running your code to test your implementation of Excel file registration and ingestion.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e \u003csummary\u003ee) Ingesting data from DynamoDB\u003c/summary\u003e\n\nThe goal of this fifth exercise is to register and ingest data from a table in DynamoDB.\n\nTo work on this exercise, we need to host a local [DynamoDB](https://aws.amazon.com/fr/dynamodb/) server. To do that, we prepared a `docker-compose.yml` in the `exercise-environment/` folder. Make sure you have [Docker](https://www.docker.com/) installed. In a terminal, change your directory to `exercise-environment/` and execute `docker-compose up`. It will create a local DynamoDB server at `http://localhost:8000`. It will also create a table `orders_table` in the `us-east-1` region, and populate it with some data.\n\nMake sure you launch the Docker containers before starting this exercise.\n\nWe are looking to read the `orders_table` table from DynamoDB, located at the `us-east-1` region.\n1. Complete the configuration object `dynamoDBData` in `src/main/resources/exercise/extract/extract.conf`.\n2. In `App.scala`, register a `Connector` with this data. We already set the endpoint to be `http://localhost:8000` so that the requests are pointing to your local DynamoDB instance.\n3. You may create your own `Factory` and implement the `read()` function to verify if you can ingest the data. If you are not sure how to do that yet, we already added a `Factory` to a `Stage`, which is added to the `Pipeline`. This `CheckExtractFactory` will ingest a `Connector` named `dynamoDBDataConnector`. It will read it into `dynamoDBData` in the `read()` method, and verify the number of lines in this data. 
Uncomment the corresponding lines and complete the part on the `@Delivery` annotation before running your code to test your implementation of DynamoDB data registration and ingestion.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e \u003csummary\u003ef) Ingesting data from Cassandra\u003c/summary\u003e\n\nThe goal of this sixth exercise is to register and ingest data from a table in Cassandra.\n\nTo work on this exercise, we need to host a local [Cassandra](https://cassandra.apache.org/) server. To do that, we prepared a `docker-compose.yml` in the `exercise-environment/` folder. Make sure you have [Docker](https://www.docker.com/) installed. In a terminal, change your directory to `exercise-environment/` and execute `docker-compose up`. It will create a local Cassandra server at `http://localhost:9042`. It will also create a keyspace `mykeyspace` and a table `profiles`, and populate it with some data.\n\nMake sure you launch the Docker containers before starting this exercise.\n\nWe are looking to read the `profiles` table from Cassandra, located at the `mykeyspace` keyspace.\n1. Complete the configuration object `cassandraDBData` in `src/main/resources/exercise/extract/extract.conf`.\n2. In `App.scala`, register a `Connector` with this data. We already set the endpoint to be `http://localhost:9042` so that the requests are pointing to your local Cassandra instance.\n3. You may create your own `Factory` and implement the `read()` function to verify if you can ingest the data. If you are not sure how to do that yet, we already added a `Factory` to a `Stage`, which is added to the `Pipeline`. This `CheckExtractFactory` will ingest a `Connector` named `cassandraDataConnector`. It will read it into `cassandraData` in the `read()` method, and verify the number of lines in this data. 
Uncomment the corresponding lines and complete the part on the `@Delivery` annotation before running your code to test your implementation of Cassandra data registration and ingestion.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e \u003csummary\u003eg) Ingesting data from PostgreSQL\u003c/summary\u003e\n\nThe goal of this seventh exercise is to register and ingest data from a table in PostgreSQL.\n\nTo work on this exercise, we need to host a local [PostgreSQL](https://www.postgresql.org/) server. To do that, we prepared a `docker-compose.yml` in the `exercise-environment/` folder. Make sure you have [Docker](https://www.docker.com/) installed. In a terminal, change your directory to `exercise-environment/` and execute `docker-compose up`. It will create a local PostgreSQL server at `http://localhost:5432`. It will also create a database `postgres` and a table `products`, and populate it with some data.\n\nYou will also need the PostgreSQL JDBC driver. As specified in the documentation, you must provide a JDBC driver when using the JDBC storage type. To provide the PostgreSQL JDBC driver, head to https://jdbc.postgresql.org/download.html, download the driver, and make the JDBC library jar available to the project. If you are using IntelliJ IDEA, right-click on the jar and click on `Add as Library`.\n\nMake sure you launch the Docker containers before starting this exercise.\n\nWe are looking to read the `products` table from PostgreSQL, located in the `postgres` database.\n1. Complete the configuration object `jdbcDBData` in `src/main/resources/exercise/extract/extract.conf`.\n2. In `App.scala`, register a `Connector` with this data. Remember that the endpoint should be `http://localhost:5432`.\n3. You may create your own `Factory` and implement the `read()` function to verify if you can ingest the data. If you are not sure how to do that yet, we already added a `Factory` to a `Stage`, which is added to the `Pipeline`. 
This `CheckExtractFactory` will ingest a `Connector` named `jdbcDataConnector`. It will read it into `jdbcData` in the `read()` method, and verify the number of lines in this data. Uncomment the corresponding lines and complete the part on the `@Delivery` annotation before running your code to test your implementation of PostgreSQL data registration and ingestion.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e \u003csummary\u003eh) Ingesting a local Delta table\u003c/summary\u003e\n\nThe goal of this eighth exercise is to register and ingest a local Delta table.\n\nWe are looking to read the Delta table located at `src/main/resources/exercise/extract/delta-table`. This table contains two versions, and we will read those two versions.\n1. Complete the two configuration objects `deltaDataVersionZero` and `deltaDataVersionOne` in `src/main/resources/exercise/extract/extract.conf`.\n2. In `App.scala`, register one `Connector` with version zero data and one with version one data.\n3. You may create your own `Factory` and implement the `read()` function to verify if you can ingest the data. If you are not sure how to do that yet, we already added a `Factory` to a `Stage`, which is added to the `Pipeline`. This `CheckExtractFactory` will ingest a `Connector` named `deltaDataVersionZeroConnector` and a `Connector` named `deltaDataVersionOneConnector`. It will read it into `deltaDataVersionZero` and `deltaDataVersionOne` respectively in the `read()` method, and verify the number of lines in this data. Uncomment the corresponding lines and complete the part on the `@Delivery` annotation before running your code to test your implementation of local Delta table registration and ingestion.\n\n\u003c/details\u003e\n\nTo challenge yourself, try to replace the different `Connector` with `SparkRepository`. Make use of what you have learned in the *lesson*!\n\n\u003c/details\u003e\n\n## 3. 
Transform\n\n\u003cdetails\u003e \u003csummary\u003e\u003cstrong\u003eLesson\u003c/strong\u003e\u003c/summary\u003e\n\nTransformations in `SETL` are the easiest part to learn. There is nothing new if you are used to writing ETL jobs with `Spark`. This is where you will transfer the code you write with `Spark` into `SETL`.\n\n### 3.1 `Factory`\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nAfter seeing what the `read()` function in a `Factory` looks like, let's have a look at the `process()` function that is executed right after.\n\n```\nclass ProcessFactory extends Factory[DataFrame] with HasSparkSession {\n\n    @Delivery(id = \"testObject\")\n    val testObjectConnector: Connector = Connector.empty\n\n    var testObject: DataFrame = spark.emptyDataFrame\n\n    var result: DataFrame = spark.emptyDataFrame\n\n    override def read(): ProcessFactory.this.type = {\n      testObject = testObjectConnector.read()\n\n      this\n    }\n\n    override def process(): ProcessFactory.this.type = {\n      val testObjectDate = testObject.withColumn(\"date\", lit(\"2020-11-20\"))\n\n      result = testObjectDate\n        .withColumnRenamed(\"value1\", \"name\")\n        .withColumnRenamed(\"value2\", \"grade\")\n\n      this\n    }\n\n    override def write(): ProcessFactory.this.type = this\n\n    override def get(): DataFrame = spark.emptyDataFrame\n}\n```\n\nYou should understand the first part of the code, the ingestion, thanks to the `@Delivery` annotation and the `read()` function. A `var result` is declared to store the result of the data transformations. It is declared as a field of the class so that it can be accessed later in the `write()` and `get()` functions. The data transformations are what is inside the `process()` function, and you surely know what they do.\n\nAs previously said, there is nothing new to learn here: you just write your `Spark` functions to transform your data, and this is unrelated to `SETL`. 
\n\n\u003c/details\u003e\n\n### 3.2 `Transformer`\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nYou might not learn anything new about data transformations themselves, but `SETL` helps you structure them. We will now take a look at `SETL Transformer`. You already know about `Factory`. A `Factory` can contain multiple `Transformer`. A `Transformer` is a piece of highly reusable code that represents one data transformation. Let's look at how it works.\n\n```\nclass ProcessFactoryWithTransformer extends Factory[DataFrame] with HasSparkSession {\n\n    @Delivery(id = \"testObject\")\n    val testObjectConnector: Connector = Connector.empty\n\n    var testObject: DataFrame = spark.emptyDataFrame\n\n    var result: DataFrame = spark.emptyDataFrame\n\n    override def read(): ProcessFactoryWithTransformer.this.type = {\n        testObject = testObjectConnector.read()\n  \n        this\n    }\n\n    override def process(): ProcessFactoryWithTransformer.this.type = {\n        val testObjectDate = new DateTransformer(testObject).transform().transformed\n        result = new RenameTransformer(testObjectDate).transform().transformed\n  \n        this\n    }\n\n    override def write(): ProcessFactoryWithTransformer.this.type = this\n\n    override def get(): DataFrame = spark.emptyDataFrame\n}\n```\n\nIf you compare this `Factory` with the previous `ProcessFactory` in the last section, it does the same job. However, the workflow is more structured. You can see that in the `process()` function, there are no `Spark` functions for data transformations. Instead, we used `Transformer`. The data transformation is done in each `Transformer`. This makes the code highly reusable and adds a lot more structure to it. The job of the previous `ProcessFactory` can be divided in two: the first process adds a new column, and the second process renames columns.\n\nFirst, we call the first `Transformer`, passing it our input `DataFrame`. 
The `transform()` method is then called, and the result is retrieved with the `transformed` getter. The second data transformation is done with `RenameTransformer`, and the result is assigned to our `result` variable. Let's have a look at each `Transformer`.\n\nA `Transformer` has two core methods:\n* `transform()` which is where the data transformation should happen.\n* `transformed` which is a getter to retrieve the result.\n\nTypically, we will also declare a variable to which we assign the result of the transformation. In this case, `transformedData`. The `transformed` getter returns this variable. This is why in `ProcessFactoryWithTransformer`, the `transform()` method is called before calling the `transformed` getter.\n\n`DateTransformer.scala`:\n```\nclass DateTransformer(testObject: DataFrame) extends Transformer[DataFrame] with HasSparkSession {\n    private[this] var transformedData: DataFrame = spark.emptyDataFrame\n\n    override def transformed: DataFrame = transformedData\n\n    override def transform(): DateTransformer.this.type = {\n      transformedData = testObject\n          .withColumn(\"date\", lit(\"2020-11-20\"))\n\n      this\n    }\n}\n```\n\n`DateTransformer` represents the first data transformation that is done in the `ProcessFactory` in the previous section: adding a new column.\n\n`RenameTransformer.scala`:\n```\nclass RenameTransformer(testObjectDate: DataFrame) extends Transformer[DataFrame] with HasSparkSession {\n    private[this] var transformedData: DataFrame = spark.emptyDataFrame\n\n    override def transformed: DataFrame = transformedData\n\n    override def transform(): RenameTransformer.this.type = {\n      transformedData = testObjectDate\n        .withColumnRenamed(\"value1\", \"name\")\n        .withColumnRenamed(\"value2\", \"grade\")\n\n      this\n    }\n}\n```\n\n`RenameTransformer` represents the second data transformation that is done in the `ProcessFactory` in the previous section: renaming the 
columns.\n\n\u003c/details\u003e\n\n### 3.3 Summary\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nThe classic data transformations happen in the `process()` function of your `Factory`. This is how you write your data transformations in `SETL`, given that you already did what is needed in the Extract part. You have two solutions:\n1. Write all the data transformations with `Spark` functions in the `process()` function of your `Factory`. Remember to set a global variable to store the result so that it can be used in the next functions of the `Factory`.\n2. Organize your workflow with `Transformer`. This is best for code reusability, readability, understanding and structuring. To use a `Transformer`, remember that you need to pass parameters: usually the `DataFrame` or the `Dataset` you want to transform, and possibly some other parameters. You need to add the `transform()` function which is where the core `Spark` functions should be called, and the `transformed` getter to retrieve the result. \n\n\u003c/details\u003e\n\n\u003c/details\u003e\n\n##\n\n\u003cdetails\u003e \u003csummary\u003e\u003cstrong\u003eExercises\u003c/strong\u003e\u003c/summary\u003e\n\nIn this exercise, we are going to practice how to structure a `SETL` project for transformation processes.\n\nAn `App.scala` is already prepared. We created a SETL entry point that uses a configuration file located at `src/main/resources/exercise/transform/transform.conf`. In this file, configuration objects are already created. We will be working with `pokeGrades.csv` and `digiGrades.csv`, both files located at `src/main/resources/`. We are looking to compute the mean score of each \"poke\" and then of each \"digi\".\n\n1. We are going to extract the data: `pokeGrades.csv` and `digiGrades.csv`, from `src/main/resources/`. In `App.scala`, register these two as `SparkRepository`.\n2. The next step is to complete the `Factory`. 
Head over to `MeanGradeFactory` to complete the part about data ingestion.\n3. You should know a `Factory` has 4 core mandatory functions. Leave the `write()` and `get()` functions as they are. Use the `read()` function if necessary. Keep in mind what `autoLoad` does.\n4. In the `process()` function, we are going to compute the mean grade for `pokeGrades` and `digiGrades` data. To do that, we are going to create a `Transformer`, named `MeanGradeTransformer`. This `Transformer` takes a parameter of type `Dataset[Grade]` and outputs an object of type `DataFrame`. There should be two columns: one column `name` and one column `grade` for the mean grade.\n5. In the `process()` function, we can now call the `Transformer` on each dataset, apply transformations and store the result in variables.\n6. Lastly, we can merge the two results and verify the final `DataFrame` by printing it.\n\nFollow the instructions in the code to complete this exercise. If you'd like to challenge yourself, try to write a complete `Pipeline` by yourself, without the help of the prepared code files. For example, you can try to find the top-3 scores of each \"poke\" and each \"digi\".\n\n\u003c/details\u003e\n\n## 4. Load\n\n\u003cdetails\u003e \u003csummary\u003e\u003cstrong\u003eLesson\u003c/strong\u003e\u003c/summary\u003e\n\nThe Load processes with SETL correspond to two key ideas: writing the output, or passing the output. Passing the output lets you hand the result of one `Factory` to another `Factory`, for example. The second `Factory` then uses the result of the previous `Factory` as an input.\n\n### 4.1 Writing an output\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nIn order to write data, you need to register a `Connector` or a `SparkRepository`. As you probably already know, if you want to write a `DataFrame`, register a `Connector`. If you want to write a `Dataset`, register a `SparkRepository`. 
Do not forget that you must create a configuration item in the configuration file. There, you can specify the path of your output.\n\n`App.scala`:\n```\nval setl0: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n\nsetl0\n    .setConnector(\"testObjectRepository\", deliveryId = \"testObject\")\n    .setConnector(\"testObjectWriteRepository\", deliveryId = \"testObjectWrite\")\n\nsetl0\n    .newPipeline()\n    .setInput[String](\"2020-11-23\", deliveryId = \"date\")\n    .addStage[WriteFactory]()\n    .run()\n```\n\n`local.conf`:\n```\ntestObjectRepository {\n  storage = \"CSV\"\n  path = \"src/main/resources/test_objects.csv\"\n  inferSchema = \"true\"\n  delimiter = \",\"\n  header = \"true\"\n  saveMode = \"Overwrite\"\n}\n\ntestObjectWriteRepository {\n  storage = \"EXCEL\"\n  path = \"src/main/resources/test_objects_write.xlsx\"\n  useHeader = \"true\"\n  saveMode = \"Overwrite\"\n}\n```\n\n`WriteFactory.scala`:\n```\nclass WriteFactory extends Factory[DataFrame] with HasSparkSession {\n\n    @Delivery(id = \"date\")\n    val date: String = \"\"\n    @Delivery(id = \"testObject\")\n    val testObjectConnector: Connector = Connector.empty\n    @Delivery(id = \"testObjectWrite\")\n    val testObjectWriteConnector: Connector = Connector.empty\n\n    var testObject: DataFrame = spark.emptyDataFrame\n\n    var result: DataFrame = spark.emptyDataFrame\n\n    override def read(): WriteFactory.this.type = {\n        testObject = testObjectConnector.read()\n\n        this\n    }\n\n    override def process(): WriteFactory.this.type = {\n        result = testObject\n            .withColumn(\"date\", lit(date))\n\n        this\n    }\n  \n    override def write(): WriteFactory.this.type = {\n        testObjectWriteConnector.write(result.coalesce(1))\n\n        this\n    }\n\n    override def get(): DataFrame = spark.emptyDataFrame\n}\n```\n\nNote that among the `@Delivery` declarations, there is one with the ID `testObjectWrite`. 
It has been previously registered in the `Pipeline`. We retrieve it like any other `Connector`, but use it to write our output.\n\nThe `write()` function is the third executed function in a `Factory`, after `read()` and `process()`. The idea is to call the `write()` method of a `Connector` or a `SparkRepository`, and pass the result `DataFrame` or `Dataset` as argument. `SETL` will automatically read the configuration item (storage type, path and options) and write the result there.\n\nThe advantage of using `SETL` for the Load process is that everything you may need to change lives in your configuration item. If you ever want to change the data storage, you only need to modify the value of the corresponding key. The same goes for the path, or other options.\n\n**In summary**, to write an output in `SETL`, you need to:\n1. Create a configuration item in your configuration file\n2. Register the corresponding `Connector` or `SparkRepository`\n3. Ingest it in your `Factory` with the `@Delivery` annotation\n4. Use it in the `write()` function to write your output\n\n\n\u003c/details\u003e\n\n### 4.2 Getting an output\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nAs SETL is organized with `Factory`, it is possible to pass the result of a `Factory` to another. The result of a `Factory` can be of any type, but it is generally a `DataFrame` or a `Dataset`. \n\n#### 4.2.1 Getting a `DataFrame`\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nWe are now going to ingest data and make some transformations in `FirstFactory`, then use the result in `SecondFactory`. 
You can see in the `Pipeline` that `FirstFactory` is before `SecondFactory`.\n\n`App.scala`:\n```\nval setl1: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n\nsetl1.setConnector(\"testObjectRepository\", deliveryId = \"testObject\")\n\nsetl1\n    .newPipeline()\n    .setInput[String](\"2020-12-18\", deliveryId = \"date\")\n    .addStage[FirstFactory]()\n    .addStage[SecondFactory]()\n    .run()\n```\n\n`FirstFactory.scala`:\n```\nclass FirstFactory extends Factory[DataFrame] with HasSparkSession {\n\n    @Delivery(id = \"date\")\n    val date: String = \"\"\n    @Delivery(id = \"testObject\")\n    val testObjectConnector: Connector = Connector.empty\n\n    var testObject: DataFrame = spark.emptyDataFrame\n\n    var result: DataFrame = spark.emptyDataFrame\n\n    override def read(): FirstFactory.this.type = {\n        testObject = testObjectConnector.read()\n\n        this\n    }\n\n  override def process(): FirstFactory.this.type = {\n    result = testObject\n      .withColumn(\"date\", lit(date))\n\n    this\n  }\n\n  override def write(): FirstFactory.this.type = this\n\n  override def get(): DataFrame = result\n}\n```\n\nThis `FirstFactory` is similar to the previous `WriteFactory`. Instead of writing the result, we are going to pass it in the `get()` function. The `get()` function is the fourth executed function in a `Factory`, after `read()`, `process()` and `write()`. In the above example, the output is simply returned.\n\nRemember that the type of the output is defined at the start of the `Factory`, when specifying the parent class. In this case, the output is a `DataFrame`. This output is then injected in the `Pipeline` as a `Deliverable`. 
The other `Factory` can then ingest it.\n\n`SecondFactory.scala`:\n```\nclass SecondFactory extends Factory[DataFrame] with HasSparkSession {\n\n    import spark.implicits._\n\n    @Delivery(producer = classOf[FirstFactory])\n    val firstFactoryResult: DataFrame = spark.emptyDataFrame\n\n    var secondResult: DataFrame = spark.emptyDataFrame\n\n    override def read(): SecondFactory.this.type = this\n\n    override def process(): SecondFactory.this.type = {\n        secondResult = firstFactoryResult\n            .withColumn(\"secondDate\", $\"date\")\n\n        secondResult.show(false)\n\n        this\n    }\n\n    override def write(): SecondFactory.this.type = this\n\n    override def get(): DataFrame = secondResult\n}\n```\n\nIn this `SecondFactory`, we want to retrieve the output produced by `FirstFactory`. Notice that we used the `producer` argument in the `@Delivery` annotation. This is how `SETL Pipeline` retrieves the output of a `Factory`: the result of a `Factory` is injected into the `Pipeline` as a `Deliverable`, which can be ingested with the `@Delivery` annotation. \n\n\u003c/details\u003e\n\n#### 4.2.2 Getting a `Dataset`\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nIn the previous `Pipeline`, we retrieved the result of `FirstFactory` to use it in `SecondFactory`. The result of `FirstFactory` was a `DataFrame`, and we needed to retrieve it in `SecondFactory` by using the `producer` argument in the `@Delivery` annotation. 
In the following `Pipeline`, we are going to produce a `Dataset` from `FirstFactoryBis` and use it in `SecondFactoryBis`.\n\n`App.scala`:\n```\nval setl2: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .getOrCreate()\n\nsetl2\n    .setConnector(\"testObjectRepository\", deliveryId = \"testObject\")\n\nsetl2\n    .newPipeline()\n    .setInput[String](\"2020-12-18\", deliveryId = \"date\")\n    .addStage[FirstFactoryBis]()\n    .addStage[SecondFactoryBis]()\n    .run()\n```\n\n`FirstFactoryBis.scala`:\n```\nclass FirstFactoryBis extends Factory[Dataset[TestObject]] with HasSparkSession {\n\n    import spark.implicits._\n\n    @Delivery(id = \"testObject\")\n    val testObjectConnector: Connector = Connector.empty\n\n    var testObject: DataFrame = spark.emptyDataFrame\n\n    var result: Dataset[TestObject] = spark.emptyDataset[TestObject]\n\n    override def read(): FirstFactoryBis.this.type = {\n        testObject = testObjectConnector.read()\n\n        this\n    }\n\n    override def process(): FirstFactoryBis.this.type = {\n        result = testObject\n            .withColumn(\"value1\", concat($\"value1\", lit(\"42\")))\n            .as[TestObject]\n\n        this\n    }\n\n    override def write(): FirstFactoryBis.this.type = this\n\n    override def get(): Dataset[TestObject] = result\n}\n```\n\nNotice that `FirstFactoryBis` is a child class of `Factory[Dataset[TestObject]]`, meaning that its output must be a `Dataset[TestObject]`. `result` is a variable of type `Dataset[TestObject]`, and the `get()` function returns it. 
This `Dataset` is injected into the `Pipeline`.\n\n`SecondFactoryBis.scala`:\n```\nclass SecondFactoryBis extends Factory[DataFrame] with HasSparkSession {\n\n    import spark.implicits._\n\n    @Delivery(id = \"date\")\n    val date: String = \"\"\n    @Delivery\n    val firstFactoryBisResult: Dataset[TestObject] = spark.emptyDataset\n\n    var secondResult: DataFrame = spark.emptyDataFrame\n\n    override def read(): SecondFactoryBis.this.type = this\n\n    override def process(): SecondFactoryBis.this.type = {\n        secondResult = firstFactoryBisResult\n            .withColumn(\"secondDate\", lit(date))\n\n        secondResult.show(false)\n\n        this\n    }\n\n    override def write(): SecondFactoryBis.this.type = this\n\n    override def get(): DataFrame = secondResult\n}\n```\n\nThe result of `FirstFactoryBis` is a `Dataset[TestObject]`. We used the `@Delivery` annotation to retrieve it. Compared to `SecondFactory`, we did not need to use the `producer` argument in the `@Delivery` annotation. This is because the `Pipeline` can infer the source from the type, and the only `Dataset[TestObject]` that it found is the one produced by `FirstFactoryBis`. So there is no need to specify it. This is the same mechanism that explains why a `Connector` needs a `deliveryId` to be retrieved, while a `SparkRepository[T]` does not if it is the only registered one of type `T`.\n\n\u003c/details\u003e\n\n#### 4.2.3 Summary\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nIn summary, to use the output of a `Factory` in another one:\n1. Check the type of the output.\n2. Make sure that the `Stage` of the first `Factory` is before the `Stage` of the second `Factory`.\n3. The first `Factory` must be a child class of `Factory[T]`, where `T` is the type of its output.\n4. Retrieve the output of the first `Factory` by using the `@Delivery` annotation. 
If it is a `DataFrame`, also use the `producer` argument.\n\nNote: Although it is possible to retrieve the output of a `Factory` in another one, most of the time, we would prefer to save the output of the first `Factory` in a `Connector` or `SparkRepository`, and re-use the same `Connector` or `SparkRepository` in the second `Factory` to retrieve the output.\n\n\u003c/details\u003e\n\n\u003c/details\u003e\n\n\u003c/details\u003e\n\n##\n\n\u003cdetails\u003e \u003csummary\u003e\u003cstrong\u003eExercises\u003c/strong\u003e\u003c/summary\u003e\n\nIn this exercise, we are going to practice the Load processes with SETL, that is, how to pass the result of a `Factory` to another `Factory`, and how to write the result of a `Factory`.\n\nAn `App.scala` is already prepared. We created a SETL entry point that uses a configuration file located at `src/main/resources/exercise/load/load.conf`. In this file, configuration objects are already created. We will be working with `pokeGrades.csv` and `digiGrades.csv`, both files located at `src/main/resources/`. We are going to find out how many exams there are per year. To do that, a first `Factory` will \"compute\" all the dates from the data, and pass this result to the second `Factory`. This second `Factory` ingests the result, extracts the year of each date and counts the number of exams per year.\n\n1. We are going to extract the data: `pokeGrades.csv` and `digiGrades.csv`, from `src/main/resources/`. In `App.scala`, register these two as `SparkRepository`. Also register a `Connector` to write your output to. Remember that a configuration file has been provided.\n2. The next step is to complete the two `Factory`. Head over to `GetExamsDateFactory` first.\n3. Complete the part on the data ingestion by setting the `Delivery`. In `GetExamsDateFactory`, the goal is to get the different exam dates of `pokeGrades.csv` and `digiGrades.csv`. Use the `read()` function if necessary. 
In the `process()` function, concatenate both the \"poke\" and the \"digi\" data. Then, only keep the `date` column, as it is the only relevant column in this exercise. Leave the `write()` function as is, and complete the `get()` function by returning the result of your process.\n4. Now, head over to `ExamStatsFactory`. This `Factory` will ingest the result of `GetExamsDateFactory`. As usual, the first step is to add the `Delivery`. Remember that to write an output, you also have to add a `Connector` or `SparkRepository` for the output, as it defines the storage type and the path. Also remember about `producer`. Go back to the lesson if you forgot about it.\n5. Use the `read()` function if necessary. In the `process()` function, we are looking to compute the number of exams per year. Our input data is a `DataFrame` with a single column, `date`. Replace the `date` column by extracting the year only: it is the first 4 characters of the `date` column. Then, count the number of rows for each year. Use the `groupBy()`, `agg()` and `count()` functions. In our data, each year is duplicated 10 times. Indeed, for each exam, there are always 10 \"poke\" or 10 \"digi\". As a consequence, we need to divide the count by 10.\n6. Use the output Delivery you declared to save the result output in the `write()` function. Complete the `get()` function to return the result, even though it is not used. If you run the code, you should have the number of exams per year, located in `src/main/resources/examsStats/`.\n\nFollow the instructions in the code to complete this exercise. If you'd like to challenge yourself, try to write a complete `Pipeline` by yourself, without the help of the prepared code files. For example, you can try to save in a file the list of \"poke\" and \"digi\".\n\nNote that these exercises are simply used to practise `SETL` and their structure may very well not be optimized for your production workflow. 
These are just simple illustrations of what you can usually do with the framework. \n\n\u003c/details\u003e\n\n## 5. From local to production environment\n\n\u003cdetails\u003e \u003csummary\u003e\u003cstrong\u003eLesson\u003c/strong\u003e\u003c/summary\u003e\n\nThe difference between development environments lies in the location of the files/data we want to read and the location of the files/data we want to write.\n\n### 5.1 Changing the `path`\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nIn order to see how `SETL` handles the switch between local and production environments, we are going to set two `Connector`: one for `local` and one for `prod`.\n\n`App.scala`:\n```\nval setl: Setl = Setl.builder()\n    .withDefaultConfigLoader(\"storage.conf\")\n    .setSparkMaster(\"local[*]\")\n    .getOrCreate()\n\nsetl\n    .setConnector(\"pokeGradesRepository\", deliveryId = \"pokeGradesRepository\")\n    .setConnector(\"pokeGradesRepositoryProd\", deliveryId = \"pokeGradesRepositoryProd\")\n\nsetl\n    .newPipeline()\n    .addStage[ProductionFactory]()\n    .run()\n```\n\n`storage.conf`:\n```\nsetl.config.spark {\n  spark.hadoop.fs.s3a.access.key = \"dummyaccess\" // Used to connect to AWS S3 prod environment\n  spark.hadoop.fs.s3a.secret.key = \"dummysecret\" // Used to connect to AWS S3 prod environment\n  spark.driver.bindAddress = \"127.0.0.1\"\n}\n\npokeGradesRepository {\n  storage = \"CSV\"\n  path = \"src/main/resources/pokeGrades.csv\"\n  inferSchema = \"true\"\n  delimiter = \",\"\n  header = \"true\"\n  saveMode = \"Overwrite\"\n}\n\npokeGradesRepositoryProd {\n  storage = \"CSV\"\n  path = \"s3a://setl-examples/pokeGrades.csv\"\n  inferSchema = \"true\"\n  delimiter = \",\"\n  header = \"true\"\n  saveMode = \"Overwrite\"\n}\n```\n\nThe difference between these two repositories is the path. The first object uses a local path, and the second uses an AWS S3 path, considered as a production environment. Both point to exactly the same file. 
When ingesting these files into a `Factory`, we can retrieve the same `DataFrame`. Thus, in `SETL`, it is possible to switch your development environment **without looking** at the code. You just need to make adjustments to the `path` of your configuration objects.\n\n\u003c/details\u003e\n\n### 5.2 Generalize your configuration\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nMost of the time, you will have a lot of configuration objects for both input and output. Changing the path for all of these objects may not be efficient. Instead of having two configuration objects (`pokeGradesRepository` and `pokeGradesRepositoryProd`) like in the last section, you can simply declare one configuration object, and make it reusable.\n\n`local.conf`:\n```\nsetl.config.spark {\n  some.config.option = \"some-value\"\n}\n\nroot {\n  path = \"src/main/resources\"\n}\n\ninclude \"smartConf.conf\" // /!\\ important\n``` \n\n`prod.conf`:\n```\nsetl.config.spark {\n  spark.hadoop.fs.s3a.endpoint = \"http://localhost:9090\"\n  spark.hadoop.fs.s3a.access.key = \"dummyaccess\"\n  spark.hadoop.fs.s3a.secret.key = \"dummysecret\"\n  spark.hadoop.fs.s3a.path.style.access = \"true\"\n  spark.driver.bindAddress = \"127.0.0.1\"\n}\n\nroot {\n  path = \"s3a://setl-examples\"\n}\n\ninclude \"smartConf.conf\" // /!\\ important\n```\n\n`smartConf.conf`:\n```\nsmartPokeGradesRepository {\n  storage = \"CSV\"\n  path = ${root.path}\"/pokeGrades.csv\"\n  inferSchema = \"true\"\n  delimiter = \",\"\n  header = \"true\"\n  saveMode = \"Overwrite\"\n}\n```\n\nIf you look at `smartConf.conf`, notice the `path` key: it uses the `root.path` key. `smartConf.conf` is included in both `local.conf` and `prod.conf`, which are the configuration files to be loaded. In `local.conf`, `root.path` is set to a value corresponding to a local path, and in `prod.conf`, it is set to a value corresponding to a prod path, which is an S3 path in this example. 
Let's now see how to switch development environments.\n\nNote that the `Setl` object below uses the `withDefaultConfigLoader()` method without an argument. This means that `application.conf` is loaded, and it retrieves the value of `app.environment`. `app.environment` is a VM option. By default, it is set to `local` in the `pom.xml` file. Depending on the value of `app.environment`, the corresponding configuration file, i.e. `\u003capp.environment\u003e.conf`, is loaded.\n\n`App.scala`:\n```\nval smartSetl: Setl = Setl.builder()\n    .withDefaultConfigLoader()\n    .setSparkMaster(\"local[*]\")\n    .getOrCreate()\n\nsmartSetl.setConnector(\"smartPokeGradesRepository\", deliveryId = \"smartPokeGradesRepository\")\nprintln(smartSetl.getConnector[Connector](\"smartPokeGradesRepository\").asInstanceOf[FileConnector].options.getPath)\n```\n\nNow, to see how easy it is to switch development environments with `SETL`, set the VM option `-Dapp.environment` to `local` or `prod`. If you run `App.scala`, you will see that the path changes according to the environment:\n* `src/main/resources/pokeGrades.csv` if `-Dapp.environment=local`\n* `s3a://setl-examples/pokeGrades.csv` if `-Dapp.environment=prod`\n\n\u003c/details\u003e\n\n### 5.3 Summary\n\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\nIn summary, you can change your development environment by changing the path of your configuration objects. However, this can be tedious, especially if you have a lot of input/output storage objects. By writing a general configuration file, you simply need to adjust the VM option to switch your development environment and get the corresponding paths of your data.\n\nRemember that `SETL` aims at simplifying the Extract and Load processes so that Data Scientists can focus on their core job: data transformations. 
On top of that, it gives structure and allows more modularization of your code!\n\n\u003c/details\u003e\n\n\n\u003c/details\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsetl-framework%2Fsetl-examples","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsetl-framework%2Fsetl-examples","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsetl-framework%2Fsetl-examples/lists"}