{"id":15021506,"url":"https://github.com/absaoss/pramen","last_synced_at":"2025-07-29T00:32:47.390Z","repository":{"id":50354036,"uuid":"504494078","full_name":"AbsaOSS/pramen","owner":"AbsaOSS","description":"Resilient data pipeline framework running on Apache Spark","archived":false,"fork":false,"pushed_at":"2025-06-27T07:45:50.000Z","size":4197,"stargazers_count":24,"open_issues_count":31,"forks_count":3,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-06-27T08:38:15.855Z","etag":null,"topics":["big-data","data-pipeline","etl","hacktoberfest","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSES/LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-06-17T10:35:24.000Z","updated_at":"2025-06-27T07:45:52.000Z","dependencies_parsed_at":"2023-02-14T18:16:11.612Z","dependency_job_id":"29894e4e-7298-470f-bc14-7a737d577dd3","html_url":"https://github.com/AbsaOSS/pramen","commit_stats":{"total_commits":927,"total_committers":9,"mean_commits":103.0,"dds":0.2038834951456311,"last_synced_commit":"31bc89f7a6ceb0b6e6acb5242f10130e649b169c"},"previous_names":[],"tags_count":65,"template":false,"template_full_name":null,"purl":"pkg:github/AbsaOSS/pramen","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fpramen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fpramen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fpramen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fpramen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codeload.github.com/AbsaOSS/pramen/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fpramen/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267610344,"owners_count":24115433,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-28T02:00:09.689Z","response_time":68,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","data-pipeline","etl","hacktoberfest","scala","spark"],"created_at":"2024-09-24T19:56:39.433Z","updated_at":"2025-07-29T00:32:47.353Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# About Pramen\n[![ScalaCI](https://github.com/AbsaOSS/pramen/workflows/ScalaCI/badge.svg)](https://github.com/AbsaOSS/pramen/actions)\n[![PythonCI](https://github.com/AbsaOSS/pramen/workflows/python_ci/badge.svg)](https://github.com/AbsaOSS/pramen/actions)\n[![FOSSA Status](https://app.fossa.com/api/projects/custom%2B24661%2Fgithub.com%2FAbsaOSS%2Fpramen.svg?type=shield)](https://app.fossa.com/projects/custom%2B24661%2Fgithub.com%2FAbsaOSS%2Fpramen)\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-api_2.12/badge.svg)](https://mvnrepository.com/artifact/za.co.absa.pramen/pramen-api)\n[![PyPI](https://badge.fury.io/py/pramen-py.svg)](https://badge.fury.io/py/pramen-py)\n\n\nPramen is a framework for defining data pipelines based on Spark and a configuration driven tool to run and\ncoordinate those pipelines. The project focuses around Hadoop and Spark, but can run arbitrary jobs.\n\nThe idea behind Pramen pipelines is simple. A pipeline consists of\n* _Sources_ are the data systems that are not managed by the pipeline. An example could be an operational relational database.\n  - Ingestion jobs are used to get data from external systems into the metastore.\n* _Metastore_ is the data storage managed by the pipeline. Data in the metastore is accessed by table names.\n  The metastore hides the underlying storage and format, which is usually Parquet or Delta on HDFS or S3.\n  - Transformation jobs are used to transform data from the metastore and save the results back to the metastore.\n    Transformers can be written in Scala or in Python.\n* _Sinks_ are targets to send data from the metastore. An example could be a Kafka cluster or a CSV file in a local folder.\n  - Sink jobs are used to send data from the metastore to sinks\n\nThe architecture is customizable, so you can define your own sources, transformers and sinks and deploy it independently\nfrom the framework.\n\n![](resources/concepts.png)\n\nWith Pramen you can:\n* Build a data lake for tabular data.\n  - Define ingestion jobs to get data from external data sources to HDFS or S3.\n  - Organize data by partitioning it according to event or snapshot date.\n* Create ETL data pipelines\n  - Define ingestion jobs to _extract_ data from external sources to the metastore.\n  - Use transformers _transform_ data inside the metastore.\n  - Use sinks to _load_ data from the metastore to the target system.\n* Create ML pipelines\n  - Define ingestion jobs to get raw data to the metastore.\n  - Use transformers to clean, aggregate and extract features from the raw data  in the metastore.\n  - Use sinks to train and deploy models or to send data from the metastore to target systems.\n\nThere are many other data pipeline management tools. Why you would want to use Pramen?\n\n* Declarative pipeline definitions\n  - You define dependencies for transformers and Pramen will resolve them for you making sure a transformation\n    runs only when all dependencies are satisfied.\n* Auto-healing as much as possible\n  - Keeping pipeline state allows quicker recovery from a faulty source or transformation since the framework will\n    automatically determine which jobs to run. Jobs that already succeeded won't run again by default.\n  - Handling of late data and retrospective updates to data in data sources by re-running dependent jobs.\n  - Handling of schema changes from data sources.\n* Functional design\n  - The discipline on restricting mutable operations allows re-playable deterministic pipelines.\n  - Easier to test individual transformers.\n* Language support\n  - You can use Scala and Python transformers and combine them.\n* Extendable\n  - If your data source or sink is not supported by Pramen yet? You can implement your own very easy.\n* Built-in support of various relational database sources\n  - Pramen already supports getting data from the following RDMS: PostgreSQL, MySql, Oracle Data Warehouse, Microsoft SQL Server,\n    Denodo Virtualized and other standard JDBC compliant data sources \n    \n# Typical Use Case and Benefits\n\nMany environments still have numerous heterogeneous data sources that aren't integrated into a central data lake environment.\n\nPramen provides the ability to ingest and manage data pipelines en-masse from sourcing to producing.\n\nPramen assists with simplifying the efforts of ingestion and orchestration to a \"no/low-code\" level:\n - Automatic data loading and recovery (including missed and late data sources)\n - Automatic data reloading (partial or incorrect data load)\n - Automatic orchestration and coordination of dependent jobs (re-run downstream Pramen jobs automatically when upstream jobs are re-executed)\n\nIn addition to basic error notification, typical operational warnings are generated through email notifications such as:\n - Changes to upstream schema (unexpected changes to source data schemas) \n - Sourcing performance thresholds (unexpected slower than expected data throughput)\n\n**With Pramen data engineers and data scientists may focus on development and worry less about monitoring and maintaining existing data and machine learning pipelines.**\n\n# Quick start\n\n1. Get Pramen pipeline runner:\n\n2. You can download Pramen from GitHub releases: [Pramen Releases](https://github.com/AbsaOSS/pramen/releases)\n\n   Or you can build it from source for your Spark environment by running:\n   ```sh\n   git clone https://github.com/AbsaOSS/pramen\n   cd pramen\n   sbt -DSPARK_VERSION=\"3.3.4\" ++2.12.18 assembly \n   ```\n   (You need JDK 1.8 installed to run this)\n \n   You can specify your Spark path and run the full end to end example from (Linux or Mac only): [pramen/examples/combined_example.sh](pramen/examples/combined_example.sh) \n\n   For all possible build options look see [Building Pramen to suite your environment](#building-pramen-to-suite-your-environment)\n\n2. Define an ingestion pipeline\n\n   Paste the contents of [ingestion_pipeline.conf](pramen/examples/ingestion_pipeline/ingestion_pipeline.conf)\n   to a local file.\n\n3. Run the pipeline. Depending on the environment the command may vary. Here is an example for Yarn:\n   ```sh\n   spark-submit --master yarn \\\n     --deploy-mode client \\\n     --num-executors 1 \\\n     --driver-memory 2g \\\n     --executor-memory 2g \\\n     --class za.co.absa.pramen.runner.PipelineRunner \\\n     pipeline-runner-0.12.10.jar \\\n     --workflow ingestion_pipeline.conf \\\n     --rerun 2022-01-01\n   ```\n\n# Building the project\n\nPramen is built using SBT.\n\n**Note** By default `sbt test` runs unit tests and integration tests. In order to run just unit tests, please use\n`sbt t` alias.\n\n- `sbt +t` - runs unit tests only, for all Scala versions\n- `sbt test` - runs all tests (unit and integration)\n- `sbt unit:test` - runs unit tests only\n- `sbt integration:test` - runs integration tests only\n\nInstall locally for `sbt` projects:\n```\nsbt +publishLocal\n```\n\nInstall locally for `Maven` projects:\n```\nsbt +publishM2\n```\n\n## Project structure\nPramen consists of a few components:\n- `pramen-api` - contains traits (interfaces) for defining custom transformations, sources and sinks. \n- `pramen-core` - contains the orchestration and run logic.\n- `pramen-extras` - contains additional sources and sinks that are not part of the core since they add many additional\n  dependencies.\n\nA Pramen's data pipeline runs on a Spark cluster (standalone, Yarn, EMR, Databricks, etc). API and core are provided as\nlibraries to link. Usually to define data pipeline components all you need link is the API. Running a pipeline requires\ncreating an uber jar containing all the dependencies. \n\n## Linking\n\nIn order to implement custom sources, transformers and sinks you need to link Pramen API, and sometimes, the core.\nPramen libraries are available at [Maven Central (API)](https://mvnrepository.com/artifact/za.co.absa.pramen/pramen-api)\nand [Maven Central (framework)](https://mvnrepository.com/artifact/za.co.absa.pramen/pramen-core)\n\nYou can link against Pramen to build your transformers in Scala at the following coordinates:\n\nAPI (for defining custom sources, transformers, and sinks):\n\u003ctable\u003e\n\u003ctr\u003e\u003cth\u003eScala 2.11\u003c/th\u003e\u003cth\u003eScala 2.12\u003c/th\u003e\u003cth\u003eScala 2.13\u003c/th\u003e\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-api_2.11\"\u003e\u003cimg src = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-api_2.11/badge.svg\" alt=\"Maven Central\"\u003e\u003c/a\u003e\u003cbr\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-api_2.12\"\u003e\u003cimg src = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-api_2.12/badge.svg\" alt=\"Maven Central\"\u003e\u003c/a\u003e\u003cbr\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-api_2.13\"\u003e\u003cimg src = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-api_2.13/badge.svg\" alt=\"Maven Central\"\u003e\u003c/a\u003e\u003cbr\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\nFramework core (for advanced usage):\n\u003ctable\u003e\n\u003ctr\u003e\u003cth\u003eScala 2.11\u003c/th\u003e\u003cth\u003eScala 2.12\u003c/th\u003e\u003cth\u003eScala 2.13\u003c/th\u003e\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-core_2.11\"\u003e\u003cimg src = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-core_2.11/badge.svg\" alt=\"Maven Central\"\u003e\u003c/a\u003e\u003cbr\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-core_2.12\"\u003e\u003cimg src = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-core_2.12/badge.svg\" alt=\"Maven Central\"\u003e\u003c/a\u003e\u003cbr\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-core_2.13\"\u003e\u003cimg src = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa.pramen/pramen-core_2.13/badge.svg\" alt=\"Maven Central\"\u003e\u003c/a\u003e\u003cbr\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\nThe following Scala and Spark combinations are supported:\n\n| Scala version | Spark version |\n|:-------------:|:-------------:|\n|     2.11      |     2.4.8     |\n|     2.12      | 3.0 and above |\n|     2.13      | 3.2 and above |\n\nPramen for Python transformers is available in PyPi: [![PyPI](https://badge.fury.io/py/pramen-py.svg)](https://badge.fury.io/py/pramen-py)\n\n## Getting Pramen runner for your environment\n\nPramen is released as a set of thin JAR libraries. When running on a specific environment you might want to include all\ndependencies in an uber jar that you can build for your Scala version. You can do that by either \n- Downloading pre-compiled version of Pramen runners at the [Releases](https://github.com/AbsaOSS/pramen/releases) section of the project.\n- Or by building Pramen from source and creating an uber JAR file that contains all dependencies required to run the pipeline on a Spark cluster (see below).\n\n### Building a Pramen runner JAR from sources\n\nCreating an uber jar for Pramen is very easy. Just clone the repository and run one of the following commands:\n```sh\nsbt ++2.11.12 assembly \nsbt ++2.12.20 assembly\nsbt ++2.13.16 assembly\n```\n\nYou can collect the uber jar of Pramen either at\n- `core/target/scala-2.x/` for the pipeline runner.\n- `extras/target/scala-2.x/` for extra pipeline elements.\n\nSince `1.7.0` Pramen runner bundle does not include Delta Lake format classes since they are most often available in \nSpark distributions. This makes the runner independent of Spark version. But if you want to include Delta Lake files\nin your bundle, use one of example commands specifying your Spark version:\n```sh\nsbt -DSPARK_VERSION=\"2.4.8\" -Dassembly.features=\"includeDelta\" ++2.11.12 assembly \nsbt -DSPARK_VERSION=\"3.3.4\" -Dassembly.features=\"includeDelta\" ++2.12.20 assembly\nsbt -DSPARK_VERSION=\"3.5.5\" -Dassembly.features=\"includeDelta\" ++2.13.16 assembly\n```\n\nThen, run `spark-shell` or `spark-submit` adding the fat jar as the option.\n```sh\n$ spark-shell --jars pramen-runner_2.12-1.7.5-SNAPSHOT.jar\n```\n\n# Creating a data pipeline\n\nPramen is a configuration driven tool. There are plenty of ways you can customize your pipeline. For the definitive set\nof possible options please loot at [reference.conf](pramen/core/src/main/resources/reference.conf).\n\nLet's take a look on components of a data pipeline in more detail.\n\n## Pipeline components\n\nA pipeline consists of _common options_, _sources_, _the metastore_, _sinks_, and _operations_. All these\ndefinitions form the workflow config. For big pipelines these definitions can be split among multiple files. Check out\n`examples/` folder for example workflow definitions. Let's take a look at each section of a workflow separately.\n\nCurrently there are 3 types of jobs:\n- _Ingestion_ jobs to get data from external sources to the metastore.\n- _Transformation jobs_ to transform data inside the metastore.\n- _Sink_ jobs to send data from the metastore to external systems.\n\n### Common options\nPramen pipeline should have several options defined. Here is the minimum configuration. For the list of all options\nand their default values check out [reference.conf](pramen/core/src/main/resources/reference.conf).\n\n```hocon\npramen {\n  environment.name = \"AWS Glue (DEV)\"\n  pipeline.name = \"CDC PoC\"\n\n  bookkeeping.enabled = true\n  bookkeeping.jdbc {\n    driver = \"org.postgresql.Driver\"\n    url = \"jdbc:postgresql://myhost:5432/pramen_database\"\n    user = \"postgresql_user\"\n    password = \"password\"\n  }\n  temporary.directory = \"s3://bucket/prefix/tmp/\"\n}\n```\n\n#### Email notifications\nOne section of config defines options for email notifications. You can define\n```hocon\nmail {\n  # SMTP configuration\n  # Any options from https://javaee.github.io/javamail/docs/api/com/sun/mail/smtp/package-summary.html\n  smtp.host = \"smtp.example.com\"\n  smtp.port = \"25\"\n  smtp.auth = \"false\"\n  smtp.starttls.enable = \"false\"\n  smtp.EnableSSL.enable = \"false\"\n  debug = \"false\"\n  \n  # A custom email sender (optional)\n  send.from = \"Pramen \u003cpramen.noreply@example.com\"\n  \n  # Email recipients\n  send.to = \"user1@example.com, user2@example.com\"\n  \n  # A list of allowed domains (optional)\n  allowed.domains = [ \"example.com\", \"test.com\" ]\n}\n```\n\n### Dates\nBefore diving into pipeline definition it is important to understand how dates are handled. Pramen is a batch data\npipeline manager for input data updates coming from applications which are usually referred to as _source systems_. Pramen is designed\nfor updates coming from source systems daily or less frequently (weekly, monthly, etc). While it is possible to setup\npipelines in which data is updates several times a day, say, hourly, it might be more complicated since the design\ntargets daily or less frequent batch jobs.\n\nDaily batch jobs usually process data generated by a source system on the previous day, also known as 'T+1' data processing,\nmeaning that the pipeline processes data at the state of the end of a day T is processed at the day T+1.\n\nData coming from source systems are usually classified as _entities_ and _events_. An entity type of data contains current\nstate of an application, or a snapshot. An event type of data contains a change in state of an application, such as bank \ntransactions, for example. In both cases input data is stored in a batch-processing-friendly storage that supports\npartitioning by a set of columns for performance. Pramen partitions data by the _information date_ which is a generalized\nconcept that unifies snapshot date for entities and event date for events. Information date of data is then used to define\ntransformations, dependencies etc.\n\nIn order to infer information dates a _run date_ is used. A run date is the date the job is scheduled to run.\nFor example, for daily T+1 jobs, an information date can be defined as the previous days. For monthly job information date\nis determined by a convention you agree for your pipeline. For example, it can be\nagreed that all data that are loaded monthly should have information date as the beginning of that month. But other\nconventions are possible.\n\nHere is how it works:\n![](resources/date_inference.png)\n\nPramen has a flexible [date expression DSL](#date-functions) that allows defining expressions to calculate information date\nfrom run date, and define ranges of data to load from the information date of the job. Using these expressions you can define,\nfor example, \n  - a sourcing job runs on Mondays, loads data from the source database for Monday-Sunday last week partitioning it by\n    information date defined as Saturday previous week.\n    \u003cdetails\u003e\n    \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n    ```hocon\n     schedule.type = \"weekly\"\n     schedule.days.of.week = [ 1 ] # Mondays\n     info.date.expr = \"@runDate - 2\" # Saturday previous week\n     \n     date.from = \"lastMonday(@infoDate)\"\n     date.to = \"lastMonday(@infoDate) + 6\" \n    ```\n    \u003c/details\u003e\n\n  - a sink job that runs on Tuesdays and sends data accumulated between Monday-Sunday previous week, to a Kafka cluster.\n    \u003cdetails\u003e\n    \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n    ```hocon\n     schedule.type = \"weekly\"\n     schedule.days.of.week = [ 2 ] # Tuesdays\n     info.date.expr = \"@runDate - 1\" # Monday\n    \n     date.from = \"@infoDate - 7\"\n     date.to = \"@infoDate - \n    ```\n    \u003c/details\u003e\n\nIn order to understand how Pramen does this, another important concept has to be introduced - _the metastore_.\n\n### Metastore\nA metastore helps to abstract away the tabular data and the underlying storage. The idea is simple: a data pipeline \ndesigner can choose different ways of storing data, but implementers of transformations don't need to worry about it\nsince they can access data by table names.\n\n#### How does this work? \n\nPramen can store data in folders of file systems supported by Hadoop (HDFS, S3, AzureFS,\netc.) in Parquet or Delta format, or as tables in Delta Lake. But implementers of transformations do not need to worry\nabout the underlying storage. They can access it using `getTable()` method of a metastore object provided to them. The\nframework will provide them with a Spark DataFrame. \n\n\u003e The advantage of such approach is that transformations are storage agnostic and can be migrated from one storage\n\u003e system / data format to another seamlessly.\n\n#### Defining a metastore\nA metastore is simply a mapping from a _table name_ to a _path_ where the data is stored.\n\n##### Storage types\nCurrently, the following underlying storage is supported. \n- Parquet files in Hdfs\n- Delta files in Hdfs\n- Delta Lake tables\n\nHere is an example of a metastore configuration with a single table (Parquet format):\n```hocon\npramen.metastore {\n  tables = [\n    {\n      name = \"table_name\"\n      format = \"parquet\"\n      path = \"hdfs://cluster/path/to/parquet/folder\"\n      records.per.partition = 1000000\n      \n      # (Experimental) Save mode to use when writing to partitions.\n      # Supported: overwrite (default), append\n      #save.mode = append\n\n      information.date.column = \"INFORMATION_DATE\"\n      information.date.format = \"yyyy-MM-dd\"\n      \n      # This is partitioning by generated columns. For monthly partitions 2 columns are going to be created:\n      # /path/dataset/pramen_year=2024/pramen_month=12/*.parquet\n      # /path/dataset/pramen_year=2025/pramen_month=1/*.parquet\n      # /path/dataset/pramen_year=2025/pramen_month=2/*.parquet\n      # This is supported only for Delta Lake format (format=\"delta\").\n      information.date.partition.by = true\n      information.date.partition.period = \"month\" # Can be \"day\", \"month\", \"year_month\", \"year\"\n      information.date.partition.year.column = \"pramen_year\"\n      information.date.partition.month.column = \"pramen_month\"\n\n      # You can set the start date beyond which Pramen won't allow writes:\n      information.date.start = \"2022-01-01\"\n\n      # Alternatively, you can set the time window in days that from the current system\n      # date that the table could be written to. This is helpful is the table is archived.\n      #information.date.max.days.behind = 180\n\n      # Optional Spark configuration that will be used when writing to the table\n      # Useful to use Spark Committers (partitioned, directory, magic) only for some of tables. \n      spark.conf {\n        spark.sql.sources.commitProtocolClass = \"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol\"\n        spark.sql.parquet.output.committer.class = \"org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter\"\n      }\n    }\n  ]\n}\n```\n\nMetastore table options:\n\n| Name                                      | Description                                                                                                                                                                               |\n|-------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `name`                                    | Name of the metastore table                                                                                                                                                               |\n| `format`                                  | Storage format (`parquet`, `delta`, `raw` [files], `transient` [do not persist between runs])                                                                                             |\n| `path`                                    | Path to the data in the metastore.                                                                                                                                                        |\n| `table`                                   | Delta Lake table name (if Delta Lake tables are the underlying storage).                                                                                                                  |\n| `cache.policy`                            | For `transient` format only. Cache policy defines how to store transient tables for the duration of the pipeline. Available options: `cache`, `no_cache`, `persist`.----                  |\n| `records.per.partition`                   | Number of records per partition (in order to avoid small files problem).                                                                                                                  |\n| `information.date.column`                 | Name of the column that contains the information date. *                                                                                                                                  |\n| `information.date.format`                 | Format of the information date used for partitioning (in Java format notation). *                                                                                                         |\n| `information.date.start`                  | The earliest date the table contains data for. *                                                                                                                                          |\n| `information.date.max.days.behind`        | The time window in days from the current system date when it is allowed to write/rerun. Useful if the underlying storage archives data automatically.                                     |\n| `information.date.partition.by`           | If `true` (default) the table will be partitioned by the information date. If `false`, the table won't be partitioned (supported for `delta` format only).                                |\n| `information.date.partition.period`       | Can be `day` (default), `month`, `year_month`, `year`. Specifies partition period. If not `day`, the table will be partitioned by a generated column (supported for `delta` format only). |\n| `information.date.partition.year.column`  | The name of generated column when `information.date.partition.period` is `month` or `year`.                                                                                               |\n| `information.date.partition.month.column` | The name of generated column when `information.date.partition.period` is `month` or `year_month`.                                                                                         |\n| `save.moden`                              | (experimental) Save mode to use when writing partitions. Supported: `overwrite` (default), `append`.                                                                                      |\n| `read.option`                             | Arbitrary read options to pass to the Spark reader when reading the table.                                                                                                                |\n| `write.option`                            | Arbitrary write options to pass to the Spark reader when reading the table.                                                                                                               |\n\n`*` - It is recommended to standardize information date column used for partitioning folders in the metastore. You can\ndefine default values for the information date column at the top of configuration and it will be used by default if not\noverridden explicitly for a metastore table.  \n\nDefault information date settings can be set using the following configuration keys:\n\n| Name                             | Default value     | Description                                        |\n|----------------------------------|-------------------|----------------------------------------------------|\n| `pramen.information.date.column` | pramen_info_date  | Default information date column name.              |\n| `pramen.information.date.format` | yyyy-MM-dd        | Default information date format.                   |\n| `pramen.information.date.start`  | 2020-01-01        | Default starting date for tables in the metastore. |\n\nStorage type examples:\n\nA config for a Parquet folder example:\n```hocon\n{\n  name = \"table_name\"\n  format = \"parquet\"\n  path = \"hdfs://cluster/path/to/parquet/folder\"\n  records.per.partition = 1000000\n}\n```\n\nA config for an non-partitioned Delta folder example:\n```hocon\n{\n  name = \"table_name\"\n  format = \"delta\"\n  path = \"s3://cluster/path/to/delta/folder\"\n  information.date.partition.by = false\n}\n```\n\nA config for a Delta Lake table example with a partitioning by a generated column:\n```hocon\n{\n  name = \"table_name\"\n  format = \"delta\"\n  path = \"s3://cluster/path/to/delta/folder\"\n  information.date.partition.by = false\n  information.date.partition.period = \"year_month\"\n  information.date.partition.month.column = \"pramen_year_month\" // s3://cluster/path/to/delta/folder/pramen_year_month=2025-02\n}\n```\n\nA config for a Delta Lake table example with default daily partitioning:\n```hocon\n{\n  name = \"table_name\"\n  format = \"delta\"\n  table = \"delta_lake_table_name\"\n}\n```\n\n### Sources\n\nSources define endpoints and paths go get data into the pipeline. Currently, Pramen supports the following\nbuilt-in sources:\n\n- **JDBC source** - allows fetching data from a relational database. The following RDBMS dialects are supported at\n  the moment:\n   - PostgreSQL\n   - Oracle (a JDBC driver should be provided in the classpath)\n   - Microsoft SQL Server\n   - DB2\n   - Denodo (a JDBC driver should be provided in the classpath)\n   - Hive 1/2\n- **Parquet on Hadoop** - allows fetching data in Parquet format from any Hadoop-compatible store: HDFS, S3, etc.\n\nYou can define your own source by implementing the corresponding interface.\n\nSources are defined like this:\n```hocon\npramen.sources = [\n  {\n    # The name of the source. It will be used to refer to the source in the pipeline.\n    name = \"source1_name\"\n    # The factory class of the source determines the source type.\n    factory.class = \"za.co.absa.pramen.core.source.JdbcSource\"\n    \n    # Depending of the factory source parameters vary.\n  },\n  {\n    name = \"source2_name\"\n    factory.class = \"za.co.absa.pramen.core.source.SparkSource\"\n    format = \"parquet\"\n    # ...\n  }\n  ## etc.\n]\n```\n\nYou can specify a minimum records require to exist at the source in order to consider a table to have data.\nBy default it is set to 1. You can override it for a specific source using `minimum.records` parameter.\nWhen set to 0, the source will be considered to have data even if it is empty. Therefore ingesting of empty\ntables will be allowed.\n\nYou can override this parameter per-table using `minimum.records` parameter in the table definition. See the\nsection on [sourcing jobs](#sourcing-jobs) for more details.\n```hocon\npramen.sources = [\n  {\n    name = \"source1_name\"\n    factory.class = \"za.co.absa.pramen.core.source.SparkSource\"\n    \n    format = \"parquet\"\n    \n    minimum.records = 0\n\n    # If true, fails the pipeline is there is no data any time when it is expected\n    fail.if.no.data = false\n\n    # If true, fails the pipeline is there is no data for jobs trying to catch late data\n    fail.if.no.late.data = false\n\n    # If true, fails the pipeline is there is no data for jobs checking new data as expected\n    fail.if.no.new.data = false\n  }\n]\n```\n\nBuilt-in sources:\n\n| Factory Class                                    | Description                                                                                                               |\n|--------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|\n| `za.co.absa.pramen.core.source.JdbcSource`       | JDBC Source                                                                                                               |\n| `za.co.absa.pramen.core.source.SparkSource`      | Any format supported by Spark on Hadoop source.                                                                           |\n| `za.co.absa.pramen.core.source.LocalSparkSource` | Any format supported by Spark on a local file system on the driver node.                                                  |\n| `za.co.absa.pramen.core.source.RawFileSource  `  | Copies files defined by a pattern to the metastore table in 'raw' format, without looking at the contents of input files. |\n\nHere is how each of these sources can be configured:\n\n#### JDBC source\nHere is how you can configure a JDBC source. The source defines an end point. Which exact table to load\nis determined by the pipeline configuration.\n```hocon\n{\n    name = \"source1_name\"\n    factory.class = \"za.co.absa.pramen.core.source.JdbcSource\"\n\n    jdbc = {\n      # Driver fully qualified class\n      driver = \"org.postgresql.Driver\"\n      \n      # The connection URL \n      url = \"jdbc:postgresql://example1.com:5432/test_db\"\n      \n      # Optional fallback URLs to try in case of a failure of the primary URL\n      fallback.url.1 = \"jdbc:postgresql://example2.com:5432/test_db\"\n      fallback.url.2 = \"jdbc:postgresql://example3.com:5432/test_db\"\n      \n      # Authentication credentials\n      user = \"my_login\"\n      password = \"some_password\"\n      \n      # (Optional) The number of times to retry connecting to the server in case of a failure\n      # If multiple URLs are specified, the retry will be attempted on the next URL each time.\n      # 'retries = 1' means that the connection will be attempted only once.\n      retries = 3\n\n      # (Optional) The timeout for connecting to the JDBC host.\n      connection.timeout = 60\n\n      # (Optional) For built-in JDBC connector the default behavior is sanitize date and timestamp fields\n      # by bounding to the range of 0001-01-01 ... 9999-12-31. This behavior can be switched off like this\n      sanitize.datetime = false\n      \n      # Any option passed as 'option.' will be passed to the JDBC driver. Example:\n      #option.database = \"test_db\"\n      \n      # (Optional) Autocommit, false by default. Used only when 'use.jdbc.native = true'\n      #autocommit = false\n    }\n\n    # Any option passed as '.option' here will be passed to the Spark reader as options. For example,\n    # the following options increase the number of records Spark is going to fetch per batch increasing\n    # the throughput of the sourcing.\n    option.fetchsize = 50000\n    option.batchsize = 50000\n  \n    # If set to true, Pramen will use its built-in JDBC connector instead of Spark built in one.\n    # Pramen JDBC connector is slower, and does not support all data types of various RDMS, but supports\n    # SQL queries that do not start with \"SELECT\".\n    use.jdbc.native = false\n    \n    # Consider the pipeline as failed if at least one table has no data at the scheduled time (new or late).\n    # Useful for auto-retrying ingestion pipelines.\n    fail.if.no.data = false\n\n    # If true, fails the pipeline is there is no data for jobs trying to catch late data\n    fail.if.no.late.data = false\n\n    # If true, fails the pipeline is there is no data for jobs checking new data as expected\n    fail.if.no.new.data = false\n\n    # One of: auto (default), always, never \n    # - When 'auto', an identifier will be quoted if it contains invalid characters. This includes any characters \n    #   outside the scope of A-Z, a-z, 0-9, and underscore (_).\n    # - When 'always', all input table names and column names will be validated and quoted, if not quoted already.\n    # - When 'never', Pramen will use names as configured without changing them.\n    # Keep in mind that quoted identifiers are case sensitive in most relational databases.\n    identifier.quoting.policy = \"auto\"\n\n    # (Optional) Specifies which special characters need to be replaced with '_' character in encountered in colum names.\n    # If not specified, the global default defined at 'pramen.special.characters.in.column.names' is going to be used.\n    #special.characters.in.column.names = \"' :+-=\u003c\u003e()[]{}*?/\\\\\\\"\"\n\n    # Specifies if tables of the data source have an information date colunn\n    has.information.date.column = true\n    \n    # If information column is present, specify its parameters:\n    information.date {\n      column = \"info_date\"\n      # Column format. Can be one of: \"date\", \"string\", \"number\", \"datetime\"\n      date.type = \"date\"\n      \n      # The format of the information date. If date.type = \"date\" the format is usually:\n      date.app.format = \"yyyy-MM-dd\"\n      \n      # When date.type = \"number\" the format is usually:\n      #date.app.format = \"yyyyMMdd\"\n      \n      # When date.type = \"string\" the format may vary significantly \n      # The format should be specified according to `java.time` spec:\n      # https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html\n    }\n\n    # If enabled, additional metadata will be added to the Spark schema\n    # Currently, it includes 'maxLength' for VARCHAR(n) fields.\n    # This is turned off by default because it requires connecting to the database one more time, which slows\n    # the ingestion a little.\n    enable.schema.metadata = false\n\n    # Convert decimals with no scale to integers and longs, fix 'NUMBER' SQL to Spark mapping. \n    correct.decimals.in.schema = true\n    \n    # Fix the input precision interpretation (fixes errors like \"Decimal precision 14 exceeds max precision 13\")\n    correct.decimals.fix.precision = true\n\n    # This is an experimental feature, please use with caution. \n    # When set to true, Pramen won't query the source for the record count as a separate query. It will always fetch\n    # the data first, cache it in temporary directory first. This is used on very large tables for sources that require\n    # full scan on count queries (for example, Hive 1.0 on Map Reduce)\n    # By default, count queries are enabled.\n    #disable.count.query = true\n    \n    # Specifies the maximum number of records to fetch. Good for testing purposes.\n    #limit.records = 100\n\n    # Specify the timezone of the database server, if it is different from the default timezone.\n    # It is needed for incremental ingestion based on offset field that has a timestamp or datetime data type.\n    #server.timezone = \"Africa/Johannesburg\"\n\n    # Optionally, you can specify a class for a custom SQL generator for your RDMS engine.\n    # The class whould extend 'za.co.absa.pramen.api.sql.SqlGenerator'\n    #sql.generator.class = \"com.example.MySqlGenerator\"\n  }\n```\n\nYou can specify more than one JDBC url. Pramen will always try the primary URL first. If connection fails,\nit will try fallback URLs in random order. If the primary URL is not specified, Pramen will try fallback URLs in\nrandom order. You can also specify the number of retries. By default, the number of retries is the same as the number\nof URLs.\n\n```hocon\n    jdbc = {\n      # The primary connection URL \n      url = \"jdbc:postgresql://example1.com:5432/test_db\"\n      fallback.url.1 = \"jdbc:postgresql://example2.com:5432/test_db\"\n      fallback.url.2 = \"jdbc:postgresql://example3.com:5432/test_db\"\n   \n      # (Optional) The number of times to retry connecting to the server in case of a failure\n      # If multiple URLs are specified, the retry will be attempted on the next URL each time.\n      # 'retries = 1' means that the connection will be attempted only once.\n      retries = 5\n\n      # (Optional) The timeout for connecting to the JDBC host.\n      connection.timeout = 60\n}\n```\n\n#### Spark source (CSV example)\nPramen supports loading data to the metastore from any format that Spark directly supports. You can provide\nany format-specific options for the Spark reader (spark.read...). \n\nFor a Spark source you should define:\n- The format (`csv`, `json`, `parquet`, etc.)\n- [Optionally] a schema in a [Spark SQL notation](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html).\n- Format-specific options (for CSV it would be a delimiter character etc.).\n- The presence and the format of the information date column. If no information column is\n  present, Pramen will take the snapshot of all data at scheduled times.\n\nHere is how you can configure a CSV source:\n```hocon\n{\n    name = \"my_csv_source\"\n    factory.class = \"za.co.absa.pramen.core.source.SparkSource\"\n\n    format = \"csv\"\n    \n    # You can define a schema for CSV here or directly at the operation level  \n    schema = \"id int, name string\"\n    \n    option {\n       header = true\n       delimiter = \",\"\n    }\n    \n    minimum.records = 1\n    fail.if.no.data = false\n    \n    has.information.date.column = false\n}\n```\n\nAt the operation level you can define\n- The path to a CSV file or directory.\n- You can override schema and other options.\n\nAn operation for ingesting a CSV file from S3 can look like this:\n```hocon\npramen.operations = [\n  {\n    name = \"Sourcing of a CSV file\"\n    type = \"ingestion\"\n    schedule.type = \"daily\"\n\n    source = \"my_csv_source\"\n    tables = [\n      {\n        input.path = s3a://bucket/path/to/file.csv\n        source {\n          schema = \"id int, name string\"\n        }\n        output.metastore.table = my_table\n      }\n    ]\n  }\n]\n```\n\n#### Spark source (catalog example)\nYou can use `SparkSource` to ingest data available in Spark Catalog (Hive/Glue/etc).\n\nYou can ingest tables and run queries to get the data you want. `input.table` will be read using `spark.table()`, \n`input.sql` will be read using `spark.sql()`. Here is an example:\n\n```hocon\npramen.sources = [\n  {\n    name = \"my_catalog_source\"\n    factory.class = \"za.co.absa.pramen.core.source.SparkSource\"\n\n    minimum.records = 1\n\n    has.information.date.column = true\n    information.date.column = \"info_date\"\n  }\n]\n\npramen.operations = [\n  {\n    name = \"Sourcing of data from the Catalog\"\n    type = \"ingestion\"\n    schedule.type = \"daily\"\n\n    source = \"my_catalog_source\"\n    tables = [\n      {\n        input.table = \"catalog_db.catalog_table1\"\n        output.metastore.table = my_table1\n      },\n      {\n        # You can also run queries against the Spark catalog. \n        input.sql = \"SELECT * FROM catalog_db.catalog_table2 WHERE record_type = 'A'\"\n        output.metastore.table = my_table2\n      }\n    ]\n  }  \n]\n```\n\n#### Local Spark source (CSV example)\nYou can use Pramen to load data from the local filesystem of the Spark driver. This is useful only when the pipeline is\nset up to run in client mode (Yarn). Pramen will move local files to a temporary location in HDFS/S3, and then load them.\nThe Local Spark Source is a wrapper around the Spark Source. It supports all the same options as the Spark Source.\nAlso, it adds a couple of mandatory additional options.\n- A path to a temp folder\n- [Optional] File mask to load.\n- [Optional] A flag for recursive directory search.\n\nHere is how you can configure a source taking data from a local CSV folder:\n```hocon\n{\n    name = \"my_local_csv_source\"\n    factory.class = \"za.co.absa.pramen.core.source.LocalSparkSource\"\n    \n    # Options, specific to the Local Spark Source\n    temp.hadoop.path = \"/temp/path\"\n    file.name.pattern = \"*.csv\"\n    recursive = false\n\n    # Options for the underlying Spark Source\n    format = \"csv\"\n    has.information.date.column = false\n    \n    option {\n       header = true\n       delimiter = \",\"\n    }\n}\n```\n\nAt the operation level you can define the path to load files from.\n\nAn operation for ingesting CSV files from a local directory can look like this:\n```hocon\npramen.operations = [\n  {\n    name = \"Sourcing of Csome SV files\"\n    type = \"ingestion\"\n    schedule.type = \"daily\"\n\n    source = \"my_local_csv_source\"\n    tables = [\n      {\n        input.path = /local/path/to/files\n        source {\n          schema = \"id int, name string\"\n        }\n        output.metastore.table = my_table\n      }\n    ]\n  }\n]\n```\n\n### Incremental Ingestion (experimental)\nPramen `version 1.10` introduces the concept of incremental ingestion. It allows running a pipeline multiple times a day\nwithout reprocessing data that was already processed. In order to enable it, use `incremental` schedule when defining your\ningestion operation:\n```hocon\nschedule = \"incremental\"\n```\n\nIn order for the incremental ingestion to work you need to define a monotonically increasing field, called an offset.\nUsually, this incremental field can be a counter, or a record creation timestamp. You need to define the offset field in\nyour source. The source should support incremental ingestion in order to use this mode.\n```hocon\noffset.column {\n  name = \"created_at\"\n  type = \"datetime\"\n}\n```\n\nOffset types available at the moment:\n\n| Type     | Description                                |\n|----------|--------------------------------------------|\n| integral | Any integral type (`short`, `int`, `long`) |\n| datetime | A `datetime `or `timestamp` fields         |\n| string   | Only `string` / `varchar(n)` types.        |\n\nOnly ingestion jobs support incremental schedule at the moment. Incremental transformations and sinks are planned to be\navailable soon.\n\n### Incremental transformers and sinks (experimental)\nIn order for a transformer or a sink to use a table from metastore in incremental way, the code should invoke \n`metastore.getCurrentBatch()` method instead of `metastore.getTable()`. `metastore.getCurrentBatch()` also works for \nnormal batch pipelines.\n\n- When `getCurrentBatch()` is used with daily, weekly or monthly schedule, it returns data for the information date \n  corresponding to the running job, same as invoking `metastore.getTable(\"my_table\", Some(infoDate), Some(infoDate))`.\n- When `getCurrentBatch()` is used with incremental schedule, it returns only latests non-processed data. The offset \n  management is used to keep tracked of processed data.\n- The column `pramen_batchid` is added automatically to output tables of ingested and transformed data in order to track \n  offsets. The exception is metastore `raw` format, which keeps original files as they are, and so we can't add the \n  `pramen_batchid` column to such tables.\n- The offsets manager updates the offsets only after output of transformers or sinks have succeeded. It does the update \n  in transactional manner. But if update failed in the middle, duplicates are possible on next runs, so we can say that \n  Pramen provides 'AT LEAST ONCE' semantics for incremental transformation pipelines.\n- Reruns are possible for full days to remove duplicates. But for incremental sinks, such ask Kafka sink duplicates still \n  might happen.\n\n### Sinks\nSinks define a way data needs to be sent to a target system. Built-in sinks include:\n- Kafka sink.\n- CSV in a local folder sink.\n- Command Line sink.\n- Spark sink.\n- Dynamic Conformance Engine (Enceladus) sink.\n\nYou can define your own sink by implementing `Sink` trait and providing the corresponding class name in pipeline configuration.\n\n#### Kafka sink\nA Kafka sink allows sending data from a metastore table to a Kafka topic in Avro format.\nYou can define all endpoint and credential options in the sink definitions. The output topic\nname should be defined in the definition of the pipeline operation.\n\nHere is an example of a Kafka sink definition:\n\n```hocon\n{\n  # Define a name to reference from the pipeline:\n  name = \"kafka_avro\"\n  factory.class = \"za.co.absa.pramen.extras.sink.KafkaSink\"\n  \n  writer.kafka {\n    brokers = \"mybroker1:9092,mybroker2:9092\"\n    schema.registry.url = \"https://my.schema.regictry:8081\"\n    \n    # Can be one of: topic.name, record.name, topic.record.name\n    schema.registry.value.naming.strategy = \"topic.name\"\n    \n    # Arbitrary options for creating a Kafka Producer\n    option {\n      kafka.sasl.jaas.config = \"...\"\n      kafka.sasl.mechanism = \"...\"\n      kafka.security.protocol = \"...\"\n      # ...\n    }\n    \n    # Arbitrary options for Schema registry\n    schema.registry.option {\n      basic.auth.credentials.source = \"...\"\n      basic.auth.user.info = \"...\"\n      # ...\n    }\n  }\n}\n```\n\nThe corresponding pipeline operation could look like this:\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\n{\n  name = \"Kafka sink\"\n  type = \"sink\"\n  sink = \"kafka_avro\"\n  schedule.type = \"daily\"\n  # Optional dependencies\n  dependencies = [\n    {\n      tables = [ dependent_table ]\n      date.from = \"@infoDate\"\n    }\n  ]\n  tables = [\n    {\n      input.metastore.table = metastore_table\n      output.topic.name = \"my.topic\"\n      \n      # All following settings are OPTIONAL\n      \n      # Date range to read the source table for. By default the job information date is used.\n      # But you can define an arbitrary expression based on the information date.\n      # More: see the section of documentation regarding date expressions, and the list of functions allowed.\n      date {\n        from = \"@infoDate\"\n        to = \"@infoDate\"\n      }\n      transformations = [\n       { col = \"col1\", expr = \"lower(some_string_column)\" }\n      ],\n      filters = [\n        \"some_numeric_column \u003e 100\"\n      ]\n      columns = [ \"col1\", \"col2\", \"col2\", \"some_numeric_column\" ]\n    }\n  ]\n}\n```\n\u003c/details\u003e\n\n#### CSV sink\nThe CSV sink allows generating CSV files in a local folder (on the edge node) from a table in the metastore. \n\nHere is an example of a CSV sink definition:\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\n{\n  name = \"local_csv\"\n  factory.class = \"za.co.absa.pramen.core.sink.LocalCsvSink\"\n  temp.hadoop.path = \"/tmp/csv_sink\"\n  \n  # This defines output file name pattern.\n  # The below options will produce files like: FILE_20220118_122158.csv\n  file.name.pattern = \"FILE_@timestamp\"\n  file.name.timestamp.pattern = \"yyyyMMdd_HHmmss\"\n  \n  # This can be one of the following: no_change, make_upper, make_lower\n  column.name.transform = \"make_upper\"\n  \n  # This defines the format of date and timestamp columns as they are exported CSV\n  date.format = \"yyyy-MM-dd\"\n  timestamp.format = \"yyyy-MM-dd HH:mm:ss Z\"\n  \n  # This defines arbitrary options passed to the CSV writer. The full list of options is available here:\n  # https://spark.apache.org/docs/latest/sql-data-sources-csv.html\n  option {\n    sep = \"|\"\n    quoteAll = \"false\"\n    header = \"true\"\n  }\n}\n```\n\u003c/details\u003e\n\nThe corresponding pipeline operation could look like this:\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\n{\n  name = \"CSV sink\"\n  type = \"sink\"\n  sink = \"local_csv\"\n  schedule.type = \"daily\"\n  dependencies = [\n    {\n      tables = [ dependent_table ]\n      date.from = \"@infoDate\"\n    }\n  ]\n  tables = [\n    {\n      input.metastore.table = metastore_table\n      output.path = \"/local/csv/path\"\n      # Date range to read the source table for. By default the job information date is used.\n      # But you can define an arbitrary expression based on the information date.\n      # More: see the section of documentation regarding date expressions, an the list of functions allowed.\n      date {\n        from = \"@infoDate\"\n        to = \"@infoDate\"\n      }\n      transformations = [\n       { col = \"col1\", expr = \"lower(some_string_column)\" }\n      ],\n      filters = [\n        \"some_numeric_column \u003e 100\"\n      ]\n      columns = [ \"col1\", \"col2\", \"col2\", \"some_numeric_column\" ]\n    }\n  ]\n}\n```\n\n\u003c/details\u003e\n\n#### Command Line sink\n\nCommand Line sink allows outputting batch data to an application written in any language as long as it can be run from a command line.\nThe way it works as one of the following scenarios:\n\nScenario 1.\n1. Data for the sink will be prepared at a temporary path on Hadoop (HDFS, S3, etc.) in a format of user's choice.\n2. Then, a custom command line will be invoked on the edge node passing the temporary path URI as a parameter.\n3. Once the process has finished, the exit code will determine if the sink succeeded (exit code 0 means success, of course).\n4. After the execution the data in the temporary folder will be cleaned up.\n\nScenario 2.\n1. Pramen runs a command line that processes data in the metastore in any way possible.\n2. The command line tool honors exit codes. Returns 0 on success and non-zero on failure.\n3. [Optionally] A RegEx expression is provided to extract number of records written from\n   program's output (both stdin and stdout) \n\nHere is an example of scenario 1 with a command line sink definition that outputs to a CSV in a temporary folder and runs a command line:\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\n{\n  name = \"cmd_line\"\n  factory.class = \"za.co.absa.pramen.core.sink.CmdLineSink\"\n  \n  # A temporary folder in Hadoop to put data to.\n  temp.hadoop.path = \"/tmp/cmd_line_sink\"\n  \n  # Defines the output data format.\n  format = \"csv\"\n  \n  # The number of command line log lines to include in email notification in case the job fails.\n  include.log.lines = 1000\n  \n  # This defines arbitrary options passed to the CSV writer. The full list of options is available here:\n  option {\n    sep = \"|\"\n    quoteAll = \"false\"\n    header = \"true\"\n  }\n}\n```\n\u003c/details\u003e\n\nThe pipeline operation for this sink could look like this:\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\n{\n  name = \"Command Line sink\"\n  type = \"sink\"\n  sink = \"cmd_line\"\n  schedule.type = \"daily\"\n  \n  # Optional dependency definitions\n  dependencies = [\n    {\n      tables = [ dependent_table ]\n      date.from = \"@infoDate\"\n    }\n  ]\n  \n  tables = [\n    {\n      input.metastore.table = metastore_table\n      # Supported substitutions:\n      # - @dataPath - the path to generated data or to the original metastore table\n      # - @partitionPath - the path to the partition corresponding to the information date being processed\n      # - @bucket - the bucket of the table location if the output is on S3\n      # - @prefix - the prefix on the bucket for tables located on S3\n      # - @partitionPrefix - the prefix to the data for the information date currently being processed\n      # - @infoDate - the information date in yyyy-MM-dd format\n      # - @infoMonth - the information month in yyyy-MM format\n      output.cmd.line = \"/my_apps/cmd_line_tool --path @dataPath --partition-path @partitionPath --date @infoDate\"\n      \n      ## All following settings are OPTIONAL\n      \n      # Date range to read the source table for. By default the job information date is used.\n      # But you can define an arbitrary expression based on the information date.\n      # More: see the section of documentation regarding date expressions, an the list of functions allowed.\n      date {\n        from = \"@infoDate\"\n        to = \"@infoDate\"\n      }\n      \n      transformations = [\n       { col = \"col1\", expr = \"lower(some_string_column)\" }\n      ],\n      \n      filters = [\n        \"some_numeric_column \u003e 100\"\n      ]\n      \n      columns = [ \"col1\", \"col2\", \"col2\", \"some_numeric_column\" ]\n    }\n  ]\n}\n```\n\u003c/details\u003e\n\nHere is an example of scenario 2 with a command line sink runs a command and record count regex\nexpressions are provided. The regex expression searches for \"Records written: nnn\":\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\n{\n  name = \"cmd_line2\"\n  factory.class = \"za.co.absa.pramen.core.sink.CmdLineSink\"\n  \n  # This RegEx ecpression parses the program output for the number of records written.\n  # Example string that would match the expression:\n  # Records written: 1000\n  record.count.regex = \"Records\\\\s*written:\\\\s*(\\d+)\"\n  \n  # [Optional] RegEx expressin of the successful execution that does not produce number of records\n  # You can set it if it is different from 'record.count.regex'. \n  zero.records.success.regex = \"The\\sjob\\shas\\ssucceeded\\..*\"\n  \n  # [Optional] An expression that secified that the job has failed even if the exit status it 0\n  failure.regex = \"FAILED\"\n  \n  # [Optional] RegEx expressions that specify output filters. If an output line matches any\n  # of the expressions, it will be ignored. This is useful for running legacy programs that \n  # produce lots of unnecessary output.\n  # For example, it can be used to filter out progress bar from logs.   \n  output.filter.regex = [\n    \"Filtered\\sout\\sline\\s1\",\n    \"Progress:\\ssomehting\",    \n  ]  \n  \n  # The number of command line log lines to include in email notification in case the job fails.\n  include.log.lines = 1000\n}\n```\n\u003c/details\u003e\n\nThe pipeline operation for this sink could look like this:\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\n{\n  name = \"Command Line sink\"\n  type = \"sink\"\n  sink = \"cmd_line2\"\n  schedule.type = \"daily\"\n  \n  # Optional dependency definitions\n  dependencies = [\n    {\n      tables = [ dependent_table ]\n      date.from = \"@infoDate\"\n    }\n  ]\n  \n  tables = [\n    {\n      # This is still necessary for a sink\n      input.metastore.table = metastore_table\n      \n      # Command line template to run\n      output.cmd.line = \"/my_apps/cmd_line_tool --date @infoDate\"\n    }\n  ]\n}\n```\n\u003c/details\u003e\n\n### Spark sink\n\nThis sink allows writing data using Spark, similarly as you would do using `df.write.format(...).save(...)`.\n\nHere is an example of a Spark sink definition:\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\n{\n    # Define a name to reference from the pipeline:\n    name = \"spark_sink\"\n    factory.class = \"za.co.absa.pramen.core.sink.SparkSink\"\n    \n    # Output format. Can be: csv, parquet, json, delta, etc (anything supported by Spark). Default: parquet\n    format = \"parquet\"\n    \n    # Save mode. Can be overwrite, append, ignore, errorifexists. Default: errorifexists\n    mode = \"overwrite\"\n    \n    ## Only one of these following two options should be specified\n    # Optionally repartition the dataframe according to the specified number of partitions\n    number.of.partitions = 10\n    # Optionally repartition the dataframe according to the number of records per partition\n    records.per.partition = 1000000\n    \n    # If true (default), the data will be saved even if it does not contain any records. If false, the saving will be skipped\n    save.empty = true\n\n    # The number of attempts to make against the target\n    retries = 5\n  \n    # If non-empty, the data will be partitioned by the specified columns at the output path. Default: []\n    partition.by = [ pramen_info_date ]\n    \n    # These are additional option passed to the writer as 'df.write(...).options(...)'\n    option {\n      compression = \"gzip\"\n    }\n}\n```\n\u003c/details\u003e\n\nThe corresponding pipeline operation could look like this:\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\n{\n    name = \"Spark sink\"\n    type = \"sink\"\n    sink = \"spark_sink\"\n    \n    schedule.type = \"daily\"\n    \n    # Optional dependencies\n    dependencies = [\n      {\n        tables = [ dependent_table ]\n        date.from = \"@infoDate\"\n      }\n    ]\n    \n    tables = [\n      {\n        input.metastore.table = metastore_table\n        output.path = \"/datalake/base/path\"\n    \n        # Date range to read the source table for. By default the job information date is used.\n        # But you can define an arbitrary expression based on the information date.\n        # More: see the section of documentation regarding date expressions, an the list of functions allowed.\n        date {\n          from = \"@infoDate\"\n          to = \"@infoDate\"\n        }\n    \n        transformations = [\n         { col = \"col1\", expr = \"lower(some_string_column)\" }\n        ],\n        filters = [\n          \"some_numeric_column \u003e 100\"\n        ]\n        columns = [ \"col1\", \"col2\", \"col2\", \"some_numeric_column\" ]\n      }\n    ]\n}\n```\n\n\u003c/details\u003e\n\n\n\n### Dynamic Conformance Engine (Enceladus) sink\n\nThis sink is used to send data to the landing area of the Enceladus Data Lake (also known as 'raw folder'). You can configure\noutput format, partition patterns and info file generation option for the sink.\n\nHere is an example configuration of a sink:\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\n{\n  # Define a name to reference from the pipeline:\n  name = \"enceladus_raw\"\n  \n  factory.class = \"za.co.absa.pramen.extras.sink.EnceladusSink\"\n  \n  # Output format. Can be: csv, parquet, json, delta, etc (anything supported by Spark). Default: parquet\n  format = \"csv\"\n  \n  # Save mode. Can be overwrite, append, ignore, errorifexists. Default: errorifexists\n  mode = \"overwrite\"\n  \n  # Information date column, default: enceladus_info_date\n  info.date.column = \"enceladus_info_date\"\n  \n  # Partition pattern. Default: {year}/{month}/{day}/v{version}\n  partition.pattern = \"{year}/{month}/{day}/v{version}\"\n  \n  # If true (default), the data will be saved even if it does not contain any records. If false, the saving will be skipped\n  save.empty = true\n  \n  # Optionally repartition te dataframe according to the number of records per partition\n  records.per.partition = 1000000\n  \n  # The timezone used for the info file\n  timezone = \"Africa/Johannesburg\"\n  \n  # Setup Enceladus main class and command line template if you want to run it from Pramen\n  enceladus.run.main.class = \"za.co.absa.enceladus.standardization_conformance.StandardizationAndConformanceJob\"\n  # Command line template for Enceladus\n  # You can use the following variables: @datasetName, @datasetName, @datasetVersion, @infoDate, @infoVersion, @rawPath, @rawFormat.\n  enceladus.command.line.template = \"--dataset-name @datasetName --dataset-version @datasetVersion --report-date @infoDate --menas-auth-keytab menas.keytab --raw-format @rawFormat\"\n  \n  # Output format options\n  option {\n    sep = \"|\"\n    quoteAll = \"false\"\n    header = \"false\"\n  }\n\n  # Optional S3 version buckets cleanup via a special REST API\n  cleanup.api.url = \"https://hostname/api/path\"\n  cleanup.api.key = \"aabbccdd\"\n  cleanup.api.trust.all.ssl.certificates = false\n\n  # Info file options\n  info.file {\n    generate = true\n    source.application = \"Unspecified\"\n    country = \"Africa\"\n    history.type = \"Snapshot\"\n    timestamp.format = \"dd-MM-yyyy HH:mm:ss Z\"\n    date.format = \"yyyy-MM-dd\"\n  }\n\n  # Hive properties\n  hive = {\n    # The API to use to query Hive. Valid values are: \"sql\" (default), \"spark_catalog\"\n    api = \"sql\"\n    database = \"my_hive_db\"\n    ignore.failures = false\n    escape.column.names = true\n  }\n}\n```\n\u003c/details\u003e\n\nThe pipeline operation for this sink could look like this:\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\n{\n  name = \"Enceladus sink\"\n  type = \"sink\"\n  sink = \"enceladus_raw\"\n  \n  schedule.type = \"daily\"\n  \n  tables = [\n    {\n      input.metastore.table = metastore_table\n      output.path = \"/datalake/base/path\"\n      \n      # Optional info version (default = 1)\n      output.info.version = 1\n      \n      # Optional when running Enceladus from Pramen\n      output.dataset.name = \"my_dataset\"\n      output.dataset.version = 2\n      \n      # Optional Hive table to repair after Enceladus is executed\n      hive.table = \"my_database.my_table\"\n    }\n  ]\n}\n```\n\u003c/details\u003e\n\nFull Enceladus ingestion configuration examples: \n - [examples/enceladus_sourcing](examples/enceladus_sourcing)\n - [examples/enceladus_single_config](examples/enceladus_single_config)\n\n## Implementing transformers in Scala\n\nTransformers define transformations on tables in the metastore and outputs it to the metastore in a functional manner.\nThis means if you define transformations in the deterministic way and if it does not contain side effects, it becomes\n'replayable'. \n\nIn order to implement a transformer all you need to do is define a class that implements `Transformer` trait and either\nhas the default constructor or a constructor with one parameter - a TypeSafe configuration object. Example:\n\n```scala\npackage com.example\n\nimport com.typesafe.config.Config\nimport org.apache.spark.sql.DataFrame\nimport org.apache.spark.sql.functions._\nimport za.co.absa.pramen.api.Reason\nimport za.co.absa.pramen.MetastoreReader\nimport za.co.absa.pramen.Transformer\n\nimport java.time.LocalDate\n\nclass ExampleTransformer(conf: Config) extends Transformer {\n  override def validate(metastore: MetastoreReader,\n                        infoDate: LocalDate,\n                        options: Map[String, String]): Reason = {\n    if (/* fatal failure */) {\n      throw new IllegalArgumentException(\"Validation failed\")\n    }\n\n    if (/* no data to run the transformer */) {\n      Reason.NotReady(s\"No data for the transformation at $infoDate\")\n    } else if (/* need to skip this information date and don't check again*/) {\n      Reason.Skip(s\"Empty data for the transformation at $infoDate. Nothing to process\")\n    } else {\n      /* everything is in order */\n      Reason.Ready\n    }\n  }\n\n  override def run(metastore: MetastoreReader, \n                   infoDate: LocalDate,\n                   options: Map[String, String]): DataFrame = {\n    val df = metastore.getTable(\"some_table\", Option(infoDate), Option(infoDate))\n\n    /* Business logic of the transformation */\n    df.withColumn(\"new_column\", rand())\n  }\n}\n```\n(full example: [IdentityTransformer.scala](pramen/core/src/main/scala/za/co/absa/pramen/core/transformers/IdentityTransformer.scala))\n\nYou can refer to the transformer from the pipeline by its fully qualified class name (`com.example.ExampleTransformer` in this case).\n\nIn order to define a transformer you need to define 2 methods:\n- `validate()` Validation allows pre-condition checks and failure to execute gracefully, before the transformer is initialized.\n  Alternativaly, throwing an exception inside this method is considered validation failure. \n\n  Possible validation return reasons:\n  - `Reason.Ready` - the transformer is ready to run.\n  - `Reason.NotReady` - required data is missing to run the transformer. The transformer can possibly run later for the\n    information date when the data is available.\n  - `Reason.Skip` - the requirements for the transformer won't be satisfied for the specified information date. The\n    transformation is skipped (unless forced to run again). This could be used for cases when, say, nothing has arrived\n    so nothing to process.\n\n- `run()` Run the transformation and return a `DataFrame` containing transformation results. Input data can be fetched\n  from the metastore. If an exception is thrown from this method, it is not considered as a failure. The pipeline will try\n  running such transformations again when run again for the same information date.\n\nLet's take a look at parameters passed to the transformer:\n- `conf: Config` This is app's configuration. you can use it to fetch all parameters defined in the config, and you can\n  override them when launching the pipeline. More on TypeSafe config [here](https://github.com/lightbend/config).\n\n  While this is useful, we would recommend avoiding it for passing parameters to transformers. Prefer `options` (below)\n  when possible. \n- `metastore: MetastoreReader` - this is the object you should use to access data. While you can still use `spark.read(...)`,\n  the use of the metastore is strongly preferred in order to make transformers re-playable. \n  - `getTable()` - returns a `DataFrame` for the specified table and information date range. By default fetched data for\n    the current information date.\n  - `getLatest()` - returns a `DataFrame` for the specified table and latest information date for which the data is\n    available. This latest data is no bugger that `infoDate` by default so you can re-run historical jobs that do not\n    depend on the future data. But you can specify the maximum date in the `until` parameter.\n  - `getLatestAvailableDate()` - returns the latest information date the data is available for a given table.\n- `infoDate: LocalDate` - the [output] information date of the transformation.\n- `options: Map[String, String]` - a map of key/value pairs of arbitrary options that you can define for the\n  transformation in the pipeline. \n\n## Implementing transformers in Python\n\nHere is an example transformer implemented in Python:\n```python\n@attrs.define(auto_attribs=True, slots=True)\nclass ExampleTransformation1(Transformation):\n    async def run(\n        self,\n        metastore: MetastoreReader,\n        info_date: datetime.date,\n        options: Dict[str, str],\n        **kwargs: T_EXTRA_OPTIONS,\n    ) -\u003e DataFrame:\n        \"\"\"Example transformation 1.\"\"\"\n        logger.info(\"Hi from ExampleTransformation1!\")\n        dep_table = metastore.get_table(\n            \"table1_sync\",\n            info_date_from=datetime.date(2022, 3, 23),\n            info_date_to=datetime.date(2022, 3, 26),\n        )\n        return dep_table\n```\n\nFull example can be found here: [ToDo](ToDo)\n\n## Setting up a pipeline\nOnce the metastore, sources, transformers and sinks are defined, they can be connected to form a data pipeline. A data\npipeline in Pramen defines a set of jobs that should run together or in a sequence. Your data engineering estate can\nconsist of several pipelines scheduled to run at different times. You can define dependencies between jobs in the pipeline\nand jobs between pipeline as long as these pipelines share the metastore. \n\nHere is how a typical pipeline looks like:\n![](resources/pipeline_example.png)\n\nEvery element is optional. You can have a pipeline without sources if sources are loaded by a different pipeline. You can\nhave a pipeline without transformers if data ingestion is all is needed.\n\nEach pipeline has several mandatory options:\n\n```hocon\npramen {\n  # The environment name and pipeline name are defined to be included in email notifications.\n  # You can reference system environment variables if you want your pipeline config to be deployable\n  # to different envorinments without a change.\n  environment.name = \"MyEnv/UAT\"\n  pipeline.name = \"My Data Pipeline\"\n\n  # Optionally, you can set the Spark Application name. Otherwise the default name will be used.\n  # This does not work when using Yarn in cluster deploy mode. In this case you need to set Spark application name\n  # via the spark-xubmit command line.\n  spark.app.name = \"Pramen - \"${pramen.pipeline.name}\n\n  # The number of tasks to run in parallel. A task is a source, transformer, or sink running for a specified information date.\n  parallel.tasks = 1\n\n  # You can set this option so that Pramen never writes to partitions older than the specified date\n  #information.date.start = \"2010-01-01\"\n\n  # Or you can specify the same option in the number of days from the current calendar date.\n  #information.date.max.days.behind = 30\n\n  # Pramen-Py settings\n  py {\n    # This is mandatory of you want to use Python transformations and run Pramen-Py on the command line\n    location = \"/opt/Pramen-Py/bin\"\n    \n    # Optionally you can specify Pramen-Py executable name\n    executable = \"pramen-py\"\n    \n    # Optionally you can override the default command line pattern for Pramen-Py \n    cmd.line.template = \"@location/@executable transformations run @pythonClass -c @metastoreConfig --info-date @infoDate\"\n    \n    # Optionally you can override the default number of log lines to include in email notifications on a transformation failure.\n    keep.log.lines = 2000\n  }\n}\n```\n\nA pipeline is defined as a set of operations. Each operation is either a source, transformation or a sink job. When a pipeline is\nstarted, Pramen splits operations into jobs, jobs into tasks:\n\n![](resources/ops_jobs_tasks.png)\n\n\nA pipeline is defined as an array of operations. It becomes a DAG (directed acyclic graph) when each operation dependencies\nare evaluated.\n\n```hocon\npramen.operations = [\n  {\n    name = \"Source operation\"\n    type = \"ingestion\"\n    \n    # Can be 'daily', 'weekly', 'monthly'\n    schedule.type = \"daily\" \n    \n    # schedule.type = weekly\n    # 1 - Monday, ..., 7 - Sunday\n    # schedule.days.of.week = [ 7 ]\n\n    # schedule.type = monthly\n    # schedule.days.of.month = [ 1 ]\n    \n    # (optional) Specifies an expression for date of initial sourcing for all tables in this operation.\n    # Overrides 'default.daily.output.info.date.expr'   \n    initial.sourcing.date.expr = \"@runDate - 5\"\n    \n    source = \"my_jdbc_source\"\n    \n    # Specifies an expression to calculate output information date based on the day at which the job has ran.\n    # Optional, the default depends on the schedule.\n    # For daily jobs the default is:   \"@runDate\"\n    # For weekly jobs the default is:  \"lastMonday(@runDate)\"\n    # For monthly jobs the default is: \"beginOfMonth(@runDate)\"\n    info.date.expr = \"@runDate\"\n    \n    # If true (default) jobs in this operation is allowed to run in parallel.\n    # It makes sense to set it to false for jobs that take a lot of cluster resources.\n    allow.parallel = true\n    \n    # If this is true, the operation will run regardless if dependent jobs had failed.\n    # This gives more responsibilities for validation to ensure that the job can run.\n    # Useful for transformations that should still run if they do not strongly need latest\n    # data from previous jobs.\n    always.attempt = false\n    \n    # You can determine number of tasks running in parallel with 'pramen.parallel.tasks' setting. \n    # By setting 'consume.threads' to greater value than 1, the task will appear to require more than 1 thread to run. \n    # Thus, the task will take up multiple \"slots\" in 'pramen.parallel.tasks' setting.\n    # This is useful if some tasks consume lot of memory and CPU and should not be running with other tasks in parallel.\n    consume.threads = 2\n\n    tables = [\n      {\n        input.db.table = table1\n        output.metastore.table = table1\n      },\n      {\n        input.sql = \"SELECT * FROM table2 WHERE info_date = date'@infoDate'\"\n        output.metastore.table = table2\n      }\n    ]\n  },\n {\n    name = \"A transformer\"\n    type = \"transformer\"\n    class = \"za.co.absa.pramen.core.transformers.IdentityTransformer\"\n    schedule.type = \"daily\"\n\n    # Specifies a metastore table to save output data to\n    output.table = \"transformed_table1\"\n\n    # Specifies an expression to calculate output information date based on the day at which the job has ran.\n    info.date.expr = \"@runDate\"\n\n    # Specifies which tables are inputs to the transformer and which date range input tables are expected to have input data.\n    dependencies = [\n      {\n        tables = [ table1 ]\n        date.from = \"@infoDate\"\n        date.to = \"@infoDate\" // optional\n      }\n    ]\n   \n    option {\n      input.table = \"table1\"\n    }\n  },\n  {\n    name = \"A Kafka sink\"\n    type = \"sink\"\n    sink = \"kafka_prod\"\n\n    schedule.type = \"daily\"\n\n    tables = [\n      {\n        input.metastore.table = transformed_table1\n        output.topic = kafka.topic.transformed_table1\"\n      }\n    ]\n  }\n ]\n```\n\nEach operation has the following properties:\n- **Schedule** - (mandatory) defines which days it should run.\n- **Information date expression** - defines an expression to calculate output information date from the date a job actually ran.\n- **Initial sourcing dates** - defines an expression which is evaluated on the initial sourcing of the data. The result is the initial date from which data should be loaded.\n- **Parallelism** - specify if the operation is more or less resource intensive than other operations and if it should be run in parallel or sequentially.\n- **Dependencies** - specify data availability requirements that need to be satisfied for the operation to run.\n- **Filters** - specify post-processing filters for each output table of the operation.\n- **Schema transformations** - specify post-processing operations for the output table, usually related to schema evolution.\n- **Columns selection** - specify post-processing projections (which columns to select) for the output table.\n\n#### Schedule\nA schedule specifies when an operation should run. \n\nPramen does not have a built-in scheduler, so an external scheduler should be used to trigger runs of a pipeline.\nIt can be AirFlow, Dagster, RunDeck, DataBricks job scheduler, or even  local cron. Usually a pipeline runs daily,\nbut each operation can be configured to run only at specific days so some of them won't run each day. The schedule\nsetting specifies exactly that.\n\nA schedule can be daily, weekly, or monthly.\n\nHere are a couple of examples:\n\nDaily:\n```hocon\n    schedule.type = \"daily\" \n```\n\nWeekly, on Sundays:\n```hocon\n    schedule.type = weekly\n    # 1 - Monday, ..., 7 - Sunday\n    schedule.days.of.week = [ 7 ]\n```\n\nTwice a week, on Mondays and Fridays:\n```hocon\n    schedule.type = weekly\n    schedule.days.of.week = [ 1, 5 ] \n```\n\nMonthly (on 1st day of the month):\n```hocon\n    schedule.type = monthly\n    schedule.days.of.month = [ 1 ]\n```\n\nMonthly (on the last day of the month):\n```hocon\n    schedule.type = monthly\n    schedule.days.of.month = [ LAST ]\n```\n\nMonthly (on the second to last day of the month, e.g. Jan 30th or Apr 29th):\n```hocon\n    schedule.type = monthly\n    schedule.days.of.month = [ -2 ]\n```\n\nTwice a month (on 1st and 15th day of each month):\n```hocon\n    schedule.type = monthly\n    schedule.days.of.month = [ 1, 15 ]\n```\n\n#### Output information date expression\nMetastore tables are partitioned by information date. A chunk of data in a metastore table for specific information date is\nconsidered an immutable atomic portion of data and a minimal batch. For event-like data information date may be considered \nthe date of the event. For catalog-like data information date is considered the date of the snapshot.\n\nOutput information date expression allows specifying how the information date is calculated based on the date when the\npipeline is ran at.\n\n- For daily jobs information date is usually calculated as the same day when the job has ran, or a day before.\n- For weekly jobs information date is usually either beginning or end of week.\n- For monthly jobs information date is usually either beginning or end of month.\n\nWell-designed pipelines standardize information dates for weekly and monthly jobs across ingestion and transformation jobs\nso that querying the data is easier.\n\nYou can specify default output information date expressions in the config (usually `common.conf`) like this:\n```hocon\npramen {\n  # Default information date expression for daily jobs\n  default.daily.output.info.date.expr = \"@runDate\"\n\n  # Default information date expression for weekly jobs (Monday of the current week)\n  default.weekly.output.info.date.expr = \"lastMonday(@runDate)\"\n\n  # Default information date expression for monthly jobs (The first day of the month)\n  default.monthly.output.info.date.expr = \"beginOfMonth(@runDate)\"\n}\n```\n\nYou can override defaults for specific operations by changing the definition of the operation as follows:\n```hocon\npramen.operations = [\n  ...\n  {\n    ...\n    info.date.expr = \"@runDate\"\n  }\n  ...\n]\n```\n\n#### Initial sourcing dates\n\nWhen you add a new table to the metastore and have a sourcing job for it, by default Pramen will load only recent data.\nYou can change the behavior by either providing default initial sourcing date expressions or specifying an initial \nsourcing date expression for an operation.\n\nIn the expression you specify an expression that given the current date (@runDate) returns the oldest date to load data for.  \n\nDefault values are configured like this:\n```hocon\npramen {\n  # Default initial sourcing date expression for daily jobs\n  initial.sourcing.date.daily.expr = \"@runDate\"\n\n  # Default initial sourcing date expression for weekly jobs (pick up any information date last week)\n  initial.sourcing.date.weekly.expr = \"@runDate - 6\"\n\n  # Default initial sourcing date expression for monthly jobs (start from the beginning on the current month)\n  initial.sourcing.date.monthly.expr = \"beginOfMonth(@runDate)\"\n}\n```\n\nYou can override defaults for specific operations by changing the definition of the operation as follows:\n```hocon\npramen.operations = [\n  ...\n  {\n    # ...\n    initial.sourcing.date.expr = \"@runDate\"\n  }\n  ...\n]\n```\n\n#### Parallelism\n\nPramen has the ability to run tasks in parallel (configured by `pramen.parallel.tasks`). You can further fine-tune this\nconfiguration using the following options:\n\n\n| Option            | Is Mandatory | Description                                                                                                                                                                                                                                                                                                                                                        |\n|-------------------|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `allow.parallel`  | No           | if `false`, tasks derived from this operation will run sequentially and not in parallel. For example, this is useful when a transformation for 'T-1' depends on data it produced on 'T-2' (the transformation has a self-dependency). It this case, running the transformation for 'T-1' and 'T-2' in parallel would produce incorrect results (default: `true`).  |\n| `consume.threads` | No           | Specify how many threads should a certain task consume with regards to the total number of threads set by `pramen.parallel.tasks` (default: `1`).                                                                                                                                                                                                                  |\n\nHere is an example of fine-tuning certain operations:\n\n```hocon\npramen {\n  # a maximum of 4 tasks running in parallel\n  parallel.tasks = 4\n  # ...\n}\n\npramen.operations = [\n  {\n    name = \"Easy job\"\n    \n    # not a resource intensive task, so Pramen can run 4 of these at one time (if no other tasks are running)\n    consume.threads = 1\n    # ...\n  },\n  {\n    name = \"Hard job\"\n    # run only one instance of this operation at a time (consumes all 4 threads defined by 'pramen.parallel.tasks')\n    consume.threads = 4\n    # ...\n  }\n]\n```\n\nIn reality, a task with `consume.threads = 3` does not really run on 3 threads. It still uses only one thread\nbut the setting gives an indication to Pramen that it is a resource-intensive task and should be run together with less\ndemanding tasks.\n\n#### Dependencies\nDependencies for an operation allow specifying data availability requirements for a particular operation. For example,\n'in order to run transformation T the input data in a table A should be not older than 2 days'. Dependencies determine\norder of execution of operations.\n\nYou can use any expressions from [the date expression reference](#date-functions).\n\nDependency configuration options:\n\n| Option            | Is Mandatory | Description                                                                                                                                                                                                                                              |\n|-------------------|--------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `tables`          | Yes          | The list of tables that the operation uses.                                                                                                                                                                                                              |\n| `date.from`       | Yes          | The date expression specifying the oldest date it is acceptable to run the operation.                                                                                                                                                                    |\n| `date.to`         | No           | The date expression specifying the newest date it is acceptable to run the operation.                                                                                                                                                                    |\n| `trigger.updates` | No           | If `true`, updates to the above tables will cause rerun of this operation (default: `false`).                                                                                                                                                            |\n| `optional`        | No           | If `true`, failing the dependency will only trigger a warning, not an error (default: `false`).                                                                                                                                                          |\n| `passive`         | No           | If `true`, failing dependency will not generate an error, the operation won't run, but will be checked next time. This is useful for operations that need to happen as soon as dependencies are met, but there is no certainty regarding the exact date. |\n\nOptions `optional` and `passive` are mutually exclusive.\n\nHere is a template for a dependency definition:\n```hocon\n{\n  # The list of input tables for which the condition should be satisfied \n  tables = [ table1, table2 ]\n  \n  # Date range expression for which data should be available.\n  # 'date.from' is mandatory, 'date.to' is optional. \n  date.from = \"@infoDate\"\n  date.to = \"@infoDate\"\n  \n  # If true, retrospective changes to any of the tables in the list will cause the operation to rerun.\n  trigger.updates = true\n  \n  # If true, dependency failure will cause a warning in the notification instead of error\n  optional = true\n\n  # If true, the job won't run on dependency failure, but will not be marked as a failure in notifications.\n  # This is useful for operations that need to happen as soon as dependencies are met, but there is no\n  # certainty regarding the exact date.\n  passive = true\n}\n```\n\nHere is an example of dependencies definition:\n```hocon\ndependencies = [\n  {\n    # Tables table1 and table2 should current.\n    # Any retrospective updates to these tables should trigger rerun of the operation. \n    tables = [ table1, table2 ]\n    date.from = \"@infoDate\"\n    trigger.updates = true\n  },\n  {\n    # Table table3 should have data for the previous week from Mon to Sun  \n    tables = [ table3 ]\n    date.from = \"lastMonday(@infoDate) - 7\"\n    date.to = lastMonday(@infoDate) - 1\"\n  },\n  {\n    # Table table4 should be available for the current month, older data will trigger a warning   \n    tables = [ table3 ]\n    date.from = \"beginOfMonth(@infoDate)\"\n    optional = true\n  }\n]\n```\n\n#### Filters\nFilters can be defined for any operation as well as any ingestion on sink table. Filters are applied before saving data\nto the metastore table or before sending data to the sink.\n\nThe purpose of filters is to load or send only portion of the source table. You can use any Spark boolean expressions\nin filters.\n\nExample:\n```hocon\nfilters = [\n  \"some_column1 \u003e 100\",\n  \"some_column2 \u003c 300\",\n  \"some_data_column == @infoDate\"\n]\n```\n\n#### Schema transformations\nSchema transformations can be defined for any operation as well as any ingestion on sink table. Schema transformations\nare applied before saving data to the metastore table or before sending data to the sink.\n\nThe purpose of schema transformations is to adapt to schema changes on data load or before sending data downstream.\n\nYou can create new columns, modify or delete existing columns. If the expression is empty, the column will be dropped. \n\nExample:\n```hocon\ntransformations = [\n  { col = \"new_column\", expr = \"lower(existing_column)\" },\n  { col = \"existing_column\", expr = \"upper(existing_column)\" },\n  { col = \"column_to_delete\", expr = \"\" }\n],\n```\n\n#### Columns selection / projection\nColumns selection or project can be defined for any operation as well as any ingestion on sink table. Columns selection\nare applied before saving data to the metastore table or before sending data to the sink.\n\nThe purpose of columns selection is to define the set and the order of columns to load or send. Similar can be achieved\nby schema transformations, but the only way to guarantee the order of columns (for example for CSV export) is to use\ncolumn selection.\n\nExample:\n```hocon\ncolumns = [ \"column1\", \"column2\", \"column3\", \"column4\" ]\n```\n\n### Sourcing jobs\n\nSourcing jobs synchronize data at external sources with tables at the metastore. You specify an input source, and a mapping\nbetween input tables/queries/paths to a table in the metastore. \n\nHere is an example configuration for a JDBC source:\n```hocon\n{\n  # The name of the ingestion operation will be included in email notifications\n  name = \"JDBC data sourcing\"\n  \n  # The operation type is 'ingestion'\n  type = \"ingestion\"\n  \n  # THe schedule is mandatory\n  schedule.type = \"daily\"\n  \n  # This specifies the source name from `sources.conf`\n  source = \"my_jdbc_source\"\n  \n  # Optionally you can specify an expression for the information date.\n  info.date.expr = \"@runDate\"\n  \n  tables = [\n    {\n      input.db.table = \"table1\"\n      output.metastore.table = \"table1\"\n    },\n    {\n      input.db.table = \"table2\"\n      output.metastore.table = \"table2\"\n    },\n    {\n      input.db.table = \"table3\"\n      output.metastore.table = \"table3\"\n      \n      # Optional filters, schema transformations and column selections\n      filters = [ ]\n      transformations = [ ]\n      columns = [ ]\n    },\n    {\n      input.sql = \"SELECT * FROM table4 WHERE info_date = date'@dateFrom'\"\n      output.metastore.table = \"table4\"\n      \n      # You can define range queries to the input table by providing date expressions like this:\n      date.from = \"@infoDate - 1\"\n      date.to = \"@infoDate\"\n\n      # [Optional] You can specify the maximum about the job should take. If the execution time is bigger than\n      # specified, a warning will be added to notifications.\n      warn.maximum.execution.time.seconds = 3600\n\n      # [Optional] You can specify the maximum about the job should take.\n      # This is the hard timeout. The job will be killed if the timeout is breached\n      # The timeout restriction applies to the full wall time of the task: validation and running.\n      kill.maximum.execution.time.seconds = 7200\n\n      # You can override any of source settings here \n      source {\n        minimum.records = 1000 \n        fail.if.no.new.data = true\n        has.information.date.column = true\n        use.jdbc.native = true\n        information.date.column = \"info_date\"\n      }\n    }\n  ]\n}\n```\n\nYou can use date expressions and formatted dates in sql expressions. You can wrap date expressions in `@{}` and use\nvariables like `@infoDate` and date functions referenced below inside curly braces. And you can apply formatting to variables\nusing `%format%` (like `%yyyy-MM-dd%`) after variables or expressions.\nExamples:\n\nFor\n```hocon\nsql = \"SELECT * FROM my_table_@infoDate%yyyyMMdd% WHERE a = b\"\n```\nthe result would look like:\n```sql\nSELECT * FROM my_table_20220218 WHERE a = b\n```\n\nFor\n```hocon\nsql = \"SELECT * FROM my_table WHERE snapshot_date = date'@{beginOfMonth(minusMonths(@infoDate, 1))}'\"\n```\nthe result would look like:\n```sql\n-- the beginning of the previous month\nSELECT * FROM my_table WHERE snapshot_date = date'2022-01-01'\n```\n\nFor\n```hocon\nsql = \"SELECT * FROM my_table_@{plusMonths(@infoDate, 1)}%yyyyMMdd% WHERE a = b\"\n```\nthe result would look like:\n```sql\nSELECT * FROM my_table_20220318 WHERE a = b\n--                          ^the month is 3 (next month)\n```\n\n\nThe above example also shows how you can add a pre-ingestion validation on the number of records in the table\nusing `minimum.records` parameter.\n\nFull example of JDBC ingestion pipelines: [examples/jdbc_sourcing](examples/jdbc_sourcing)\n\n\nFor example, let's have this range defined for a table:\n```hocon\n   date.from = \"@infoDate-2\" # 2022-07-01\n   date.to   = \"@infoDate\"   # 2022-07-03\n```\n\nWhen you use `input.sql = \"...\"` you can refer to the date range defined for the table `date.from` and `date.to` using the\nfollowing variables:\n\n| Variable    | Example expression                               | Actual substitution                               |\n|-------------|--------------------------------------------------|---------------------------------------------------|\n| `@dateFrom` | SELECT * FROM table WHERE date \u003e **'@dateFrom'** | SELECT * FROM table WHERE date \u003e **'2022-07-01'** |\n| `@dateTo`   | SELECT * FROM table WHERE date \u003e **'@dateTo'**   | SELECT * FROM table WHERE date \u003e **'2022-07-03'** |\n| `@date`     | SELECT * FROM table WHERE date \u003e **'@date'**     | SELECT * FROM table WHERE date \u003e **'2022-07-03'** |\n\nHere is an example configuration for a Parquet on Hadoop source. The biggest difference is that\nthis source uses `input.path` rather than `input.db.table` to refer to the source data. Filters,\nschema transformations, column selection and source setting overrides can apply for this\nsource the same way as for JDBC sources:\n\n```hocon\n{\n  name = \"Parquet on Hadoop data sourcing\"\n  type = \"ingestion\"\n  schedule.type = \"daily\"\n  \n  source = \"my_parquet_source\"\n  \n  tables = [\n    {\n      input.path = \"s3a://my-bucket-data-lake/prefix/table1\"\n      output.metastore.table = \"table1\"\n    }\n  ]\n}\n```\n\n### Transformation jobs (Scala)\nIn order to include a Scala transformer in the pipeline you just need to specify the fully qualified class name\nof the transformer. \n\nHere is a example:\n```hocon\n{\n  name = \"My Scala Transformation\"\n  type = \"transformer\"\n  class = \"com.example.MyTransformer\"\n  \n  schedule.type = \"daily\"\n  \n  output.table = \"my_output_table\"\n  \n  dependencies = [\n    {\n      tables = [ table1 ]\n      date.from = \"@infoDate - 1\"\n      date.to = \"@infoDate\"\n      trigger.updates = true\n      optional = false\n    },\n    {\n      tables = [table2, table3]\n      date.from = \"@infoDate\"\n      optional = true\n    }\n  ]\n  \n  # Arbitrary key/value pairs to be passed to the transformer.\n  # Remember, you can refer environment variables here.\n  option {\n    key1 = \"value1\"\n    key2 = \"value2\"\n    key3 = ${MY_ENV_VARIABLE}\n  }\n\n  # [Optional] You can specify the maximum about the job should take. If the execution time is bigger than\n  # specified, a warning will be added to notifications.\n  warn.maximum.execution.time.seconds = 3600\n  \n  # Optional schema transformations \n  transformations = [\n      {col = \"A\", expr = \"cast(A as decimal(15,5))\"}\n  ]\n  \n  # Optional filters\n  filters = [ \"A \u003e 0\", \"B \u003c 2\" ]\n  \n  # Optional column selection\n  columns = [ \"A\", \"B\", \"C\" ]\n}\n```\n\nRemember that although the dependency section is optional, you can use a table inside in the transformer only if it is \nincluded in dependencies. Even an optional dependency allows using a table inside the transformer.\n\n\n### Transformation jobs (Python)\nPython transformer definition is very similar to Scala transformer definitions. Use 'python_transformer' operation type\nand 'python.class' to refer to the transformer.\n\n```hocon\n{\n  name = \"My Python Transformarion\"\n  type = \"python_transformer\"\n  python.class = \"MyTransformer\"\n  \n  schedule.type = \"daily\"\n  \n  output.table = \"my_output_table\"\n  \n  dependencies = [\n    {\n      tables = [ table1 ]\n      date.from = \"@infoDate - 1\"\n      date.to = \"@infoDate\"\n      trigger.updates = true\n      optional = false\n    },\n    {\n      tables = [table2, table3]\n      date.from = \"@infoDate\"\n      optional = true\n    }\n  ]\n  \n  # Arbitrary Spark configuration\n  # You can use any configuration option from the official documentation: https://spark.apache.org/docs/latest/configuration.html\n  spark.conf {\n    spark.executor.instances = 4\n    spark.executor.cores = 1\n    spark.executor.memory = \"4g\"\n  }\n  \n  # Arbitrary key/value pairs to be passed to the transformer.\n  # Remember, you can refer environment variables here.\n  option {\n    key1 = \"value1\"\n    key2 = \"value2\"\n    key3 = ${MY_ENV_VARIABLE}\n  }\n  \n  # Optional schema transformations \n  transformations = [\n      {col = \"A\", expr = \"cast(A as decimal(15,5))\"}\n  ]\n  \n  # Optional filters\n  filters = [ \"A \u003e 0\", \"B \u003c 2\" ]\n  \n  # Optional column selection\n  columns = [ \"A\", \"B\", \"C\" ]\n}\n```\n\n### Sink jobs\n\nSink jobs allow sending data from the metastore downstream. The following examples may serve as a template for\nsink operation definition.\n\n#### Kafka sink example\n```hocon\n{\n  name = \"Kafka sink\"\n  type = \"sink\"\n  sink = \"kafka_prod_sink\"\n\n  schedule.type = \"daily\"\n\n  tables = [\n    {\n      input.metastore.table = table1\n      output.topic = \"kafka.topic1\"\n      \n      columns = [ \"A\", \"B\", \"C\", \"D\" ]\n      \n      date = {\n        from = \"@infoDate\"\n        to = \"@infoDate\"\n      }\n    }\n  ]\n}\n```\n\n#### Local CSV sink example\n```hocon\n{\n  name = \"CSV sink\"\n  type = \"sink\"\n  sink = \"local_sftp_sink\"\n\n  schedule.type = \"weekly\"\n  schedule.days.of.week = [ 2 ] // Tuesday\n\n  tables = [\n    {\n      input.metastore.table = table1\n      output.path = \"/output/local/path\"\n      \n      columns = [ \"A\", \"B\", \"C\", \"D\" ]\n      \n      date = {\n        from = \"lastMonday(@infoDate) - 7\"\n        to = \"lastSunday(@infoDate)\"\n      }\n\n      # [Optional] You can specify the maximum about the job should take. If the execution time is bigger than\n      # specified, a warning will be added to notifications.\n      warn.maximum.execution.time.seconds = 3600\n    }\n  ]\n}\n```\n\n### Transfer operations\nPramen can be used just for data ingestion to a data lake. In this case, you don't need to use the metastore. Instead, \nyou can send data directly from a source to a sink. Such operations are called 'transfer operations' in Pramen.\n\nYou specify:\n- A source name\n- A sink name\n- And the list of tables/queries/path to transfer\n- [optionally] If the input is not a database table, but a path or a SQL query, you need to specify a metastore table name for job tracking (see the example).\n\nLet's take a look at an example based on the Enceladus sink.\n\n## Bookkeeping\n\nIn order to support auto-recovery from failures, schema tracking and all other nice features, Pramen requires to use a database\nor a storage for keeping the state of the pipeline.\n\n### PostgreSQL database (recommended)\nThis is highly recommended way of storing bookkeeping data since it is the most efficient and feature rich.\n\nConfiguration:\n```hocon\npramen {\n  bookkeeping.enabled = \"true\"\n  \n  bookkeeping.jdbc {\n    driver = \"org.postgresql.Driver\"\n    url = \"jdbc:postgresql://host:5433/pramen\"\n    user = \"username\"\n    password = \"password\"\n  }\n}\n```\n\n### MongoDb database\nHere is how you can use a MongoDB database for storing bookkeeping information:\n\n```hocon\npramen {\n  bookkeeping.enabled = \"true\"\n\n  bookkeeping.mongodb {\n    connection.string = \"mongodb://aaabbb\"\n    database = \"mydb\"\n  }\n}\n```\n\n### Hadoop (CSV+JSON)\nThis is less recommended way, and is quite slow. But the advantage is that you don't need a database.\n\n```hocon\npramen.bookkeeping {\n  enabled = \"true\"\n  location = \"hdfs://path\"\n}\n```\n\n### Delta Lake (experimental)\nThis requires Delta Lake format support from the cluster you are running pipelines at.\n\nYou can use wither a path:\n```hocon\npramen.bookkeeping {\n  enabled = \"true\"\n  hadoop.format = \"delta\"\n  location = \"s3://path\"\n}\n```\n\nor a set of managed tables:\n```hocon\npramen.bookkeeping {\n  enabled = \"true\"\n  hadoop.format = \"delta\"\n  delta.database = \"my_db\"  # Optional. 'default' will be used if not specified\n  delta.table.prefix = \"bk_\"\n}\n```\n\n#### Enceladus ingestion pipelines for the Data Lake\nPramen can help with ingesting data for data lake pipelines of [Enceladus](https://github.com/AbsaOSS/enceladus).\nA special sink (`EnceladusSink`) is used to save data to Enceladus' raw folder.\n\nHere is a template for such a pipeline:\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n\n```hocon\npramen.sources = [\n  {\n    name = \"my_postgre_rds\"\n    factory.class = \"za.co.absa.pramen.core.source.JdbcSource\"\n\n    jdbc = {\n      driver = \"org.postgresql.Driver\"\n      connection.primary.url = \"jdbc:postgresql://connection.host/test_db\"\n      user = \"user\"\n      password = \"mypassword\"\n      \n      # (Optional) The number of times to retry connecting to the server in case of a failure\n      # If multiple URLs are specified, the retry will be attempted on the next URL each time.\n      # 'retries = 1' means that the connection will be attempted only once.\n      retries = 3\n\n      # (Optional) The timeout for connecting to the JDBC host.\n      connection.timeout = 60\n    }\n\n    option.fetchsize = 50000\n    option.batchsize = 50000\n\n    has.information.date.column = true\n    information.date.column = \"info_date\"\n    information.date.type = \"date\"\n    information.date.format = \"yyyy-MM-dd\"\n  }\n]\n\npramen.sinks = [\n  {\n    name = \"my_data_lake\"\n    factory.class = \"za.co.absa.pramen.extras.sink.EnceladusSink\"\n\n    format = \"json\"\n\n    mode = \"overwrite\"\n\n    records.per.partition = 1000000\n    \n    partition.pattern = \"{year}/{month}/{day}/v{version}\"\n\n    info.file {\n      generate = true\n\n      source.application = \"MyApp\"\n      country = \"Africa\"\n      history.type = \"Snapshot\"\n      timestamp.format = \"dd-MM-yyyy HH:mm:ss Z\"\n      date.format = \"yyyy-MM-dd\"\n    }\n  }\n]\n\npramen.operations = [\n{\n    name = \"My database to the data lake load\"\n    type = \"transfer\"\n    schedule.type = \"daily\"\n\n    source = \"my_postgre_rds\"\n    sink = \"my_data_lake\"\n\n    tables = [\n      {\n        # Minimal configuration example\n        input.db.table = table1\n        output.path = /datalake/path/raw/table1\n        output.info.version = 1\n      },\n      {\n        # Full configuration example\n        input.sql = \"SELECT * FROM table2 WHERE info_date = date'@infoDate'\"\n        job.metastore.table = \"table2-\u003emy_data_lake\" # This is needed the input is not a table\n        output.path = /datalake/path/raw/table2\n        \n        # Autodetect info version based on files in the raw and publish folders\n        # Needs 'output.publish.base.path' or 'output.hive.table' to be set\n        output.info.version = auto\n\n        # The rest of the fields are optional\n        date.from = \"@infoDate\"\n        date.to = \"@infoDate\"\n        \n        output {\n           # Optional when running Enceladus from Pramen\n           dataset.name = \"my_dataset\"\n           dataset.version = 2\n           \n           # Optional publish base path (for detecting version number)\n           publish.base.path = \"/bigdata/datalake/publish\"\n           # Optional Hive table to repair after Enceladus is executed\n           hive.table = \"my_database.my_table\"\n        }\n\n        transformations = [\n          {col = \"last_name_u\", expr = \"upper(last_name)\"}\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fpramen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fpramen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fpramen/lists"}