{"id":18810335,"url":"https://github.com/absaoss/atum","last_synced_at":"2025-08-21T15:32:24.350Z","repository":{"id":39586251,"uuid":"147207325","full_name":"AbsaOSS/atum","owner":"AbsaOSS","description":"A dynamic data completeness and accuracy library at enterprise scale for Apache Spark","archived":false,"fork":false,"pushed_at":"2024-11-04T15:05:09.000Z","size":475,"stargazers_count":30,"open_issues_count":20,"forks_count":9,"subscribers_count":16,"default_branch":"master","last_synced_at":"2024-12-10T10:03:59.727Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-03T13:16:02.000Z","updated_at":"2024-11-04T14:58:53.000Z","dependencies_parsed_at":"2024-11-04T13:19:31.883Z","dependency_job_id":"53cc10f1-af46-4adf-b1b8-82227e54a080","html_url":"https://github.com/AbsaOSS/atum","commit_stats":null,"previous_names":[],"tags_count":28,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fatum","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fatum/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fatum/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fatum/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codeload.github.com/AbsaO
SS/atum/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230520393,"owners_count":18238948,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T23:19:50.470Z","updated_at":"2024-12-20T01:15:39.596Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# About Atum\n\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/atum_2.11/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/atum_2.11/)\n\nAtum is a data completeness and accuracy library for Apache Spark.\n\nOne of the challenges regulated industries face is the requirement to track and prove that their systems preserve \nthe accuracy and completeness of data. In an attempt to solve this data processing problem in Apache Spark applications, \nwe propose the approach implemented in this library. \n\nThe purpose of Atum is to add the ability to specify \"checkpoints\" in Spark applications. These checkpoints are used \nto designate when and what metrics are calculated to ensure that critical input values have not been modified as well \nas allow for quick and efficient representation of the completeness of a dataset. 
Additional metrics can also be defined \nat any checkpoint.\n\nAtum adopts a standard JSON message schema for capturing checkpoint data, so it can be extended upstream or downstream,\nproviding flexibility across other computation engines and programming languages.\n\nThe library provides a concise and dynamic way to track completeness and accuracy of data produced from source through\na pipeline of Spark applications. All metrics are calculated at a DataFrame level using various aggregation functions \nand are stored as metadata together with the data between Spark applications in a pipeline. Comparing control metrics \nfor various checkpoints is not only helpful for complying with strict regulatory frameworks, but also helps during \ndevelopment and debugging.\n \n\n## Motivation\n\nA company's Big Data strategy usually includes data gathering and ingestion processes.\nThese define how data from different systems operating inside the company\nare gathered and stored for further analysis and reporting. An ingestion process can involve\nvarious transformations like:\n* Converting between data formats (XML, CSV, etc.)\n* Data type casting, for example converting XML strings to numeric values\n* Joining reference tables. For example, this can include enriching existing\n  data with additional information available through dictionary mappings.\n\nThis constitutes a common ETL (Extract, Transform and Load) process.   \n\nDuring such transformations, data can sometimes get corrupted (e.g. during casting), and records can\nget added or lost. For instance, *outer joining* a table holding duplicate keys can result in a record explosion.\nConversely, *inner joining* a table which has no matching keys for some records will result in loss of records.\n\nIn regulated industries it is crucial to ensure data integrity and accuracy. 
For instance, in the banking industry\nthe BCBS set of regulations requires analysis and reporting to be based on data accuracy and integrity principles.\nThus it is critical at the ingestion stage to preserve the accuracy and integrity of the data gathered from a\nsource system.    \n\nThe purpose of Atum is to provide a means of ensuring that no critical fields have been modified during\nprocessing and that no records are added or lost. To do this, the library provides the ability\nto calculate *hash sums* of explicitly specified columns. We call the set of hash sums at a given time\na *checkpoint* and each hash sum a *control measurement*. Checkpoints can be calculated anytime\nbetween Spark transformations and actions.\n\nWe assume the data for ETL are processed in a series of batch jobs. Let's call each data set for a given batch\njob a *batch*. All checkpoints are calculated for a specific batch.  \n\n## Features\n\nAtum provides means for defining, calculating and storing checkpoints for batches. It does so by keeping additional\nmetadata in what we call an *info file*. An info file is usually named '_INFO' and usually resides in the same\ndirectory as the data of a specific batch. Each time the data progresses through a pipeline of ETL transformations, the info\nfile is extended by additional checkpoints made between processing steps.\n\nA checkpoint can be generated between any Spark transformations and actions by invoking the '.setCheckpoint()' method\non a data frame. Checkpoints are generated eagerly, so invoking '.setCheckpoint()' triggers several Spark actions\ndepending on the number of measurements required. 
When output data are saved, Atum saves an info file along with them.\nIt contains all checkpoints from the input data plus the new checkpoints generated in the Spark job.\n\nFeatures:\n*   Create checkpoints.\n*   Store sequences of checkpoints in info files alongside data.\n*   Automatically infer info file names by analyzing logical execution plans.\n*   Provide an initial info file content generator routine (**ControlMeasureBuilder.forDf(...).build.asJson**) for Spark dataframes.\n*   Field rename is supported, but if a field is part of a control measurement calculation the renaming should be\n    explicitly stated using the **spark.registerColumnRename()** method.\n*   Plugin support.\n \nPlugins are implemented as event listeners. To create a plugin you need to extend the\n**'za.co.absa.atum.plugins.EventListener'** trait and register the plugin by passing it as an argument to\n'PluginManager.loadPlugin()'. After this, Atum will send plugin events to the event listener. This is useful \nfor implementing generic notifications to be sent to a dashboard on checkpoint events.    \n\nLimitations:\n*   If there are several data sources involved in a computation, only one of them should have an _INFO file.\nIf that is not the case, the location of the _INFO file needs to be specified explicitly to resolve the ambiguity.\n*   Several batch blocks, each having an info file, cannot be processed together. 
Batch blocks should be processed\nindependently.\n \n\n## Usage\n\n### Coordinates for Maven POM dependency\nFor projects using Scala 2.11:\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eza.co.absa\u003c/groupId\u003e\n    \u003cartifactId\u003eatum_2.11\u003c/artifactId\u003e\n    \u003cversion\u003eATUM_VERSION_HERE\u003c/version\u003e\n\u003c/dependency\u003e\n```\nFor projects using Scala 2.12:\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eza.co.absa\u003c/groupId\u003e\n    \u003cartifactId\u003eatum_2.12\u003c/artifactId\u003e\n    \u003cversion\u003eATUM_VERSION_HERE\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### Initial info file generation example\n\nAtum provides helper methods for initial creation of info files from a Spark dataframe. They can be used as is or can\nserve as a reference implementation for calculating control measurements.\n\n#### Obtaining Initial ControlMeasure\nThe `ControlMeasureBuilder` can be used to create an initial `ControlMeasure`. The builder instance (obtained by \n`ControlMeasureBuilder.forDf()`) accepts metadata via optional setters. In addition, it accepts the definition of columns \nfor which control measurements should be generated. 
There are multiple ways to define these column settings, and \nthe type of measurement to be computed can also be configured:\n\n```scala\nimport za.co.absa.atum.utils.controlmeasure.ControlMeasureBuilder\nimport ControlMeasureBuilder.ControlTypeStrategy.{Default, Specific}\nimport za.co.absa.atum.core.ControlType.{Count, DistinctCount, AggregatedTotal, AbsAggregatedTotal, HashCrc32}\n\nval controlMeasureBuilder = ControlMeasureBuilder.forDF(df)\n\n// with Default, the ControlType will be chosen based on the field type (AbsAggregatedTotal for numeric, HashCrc32 otherwise)\nval updatedBuilder1 = controlMeasureBuilder.withAggregateColumns(Seq(\"col1\", \"col2\"))\nval updatedBuilder1a = controlMeasureBuilder.withAggregateColumns(Seq(\"col1\", \"col2\"), Default) // an equivalent of the previous\n\n// here: all columns will use HashCrc32\nval updatedBuilder2 = controlMeasureBuilder.withAggregateColumns(Seq(\"col1\", \"col2\"), Specific(HashCrc32))\nval updatedBuilder2a = controlMeasureBuilder.withAggregateColumns(Seq(\"col1\" -\u003e HashCrc32, \"col2\" -\u003e HashCrc32)) // an equivalent of the previous\n\nval iterativelyUpdatedBuilder3 = controlMeasureBuilder\n  .withAggregateColumn(\"col1\", Default) // ControlType.AbsAggregatedTotal used if col1 is numeric, HashCrc32 otherwise\n  .withAggregateColumn(\"col2\", Specific(DistinctCount)) // Specific strategy with DistinctCount for this column's measurement\n\nval iterativelyUpdatedBuilder3a = controlMeasureBuilder // an equivalent of the `iterativelyUpdatedBuilder3`\n  .withAggregateColumn(\"col1\") // equivalent to .withAggregateColumn(\"col1\", Default).\n  .withAggregateColumn(\"col2\", DistinctCount) // DistinctCount controlType can be applied directly\n\n\n```\nThe above excerpt demonstrates that the aggregate columns can be either supplied at once with `.withAggregateColumns` \n(subsequent calls would replace the columns already defined) or using more fine-grained `.withAggregateColumn` where \nthe 
control type strategy can be specified for each column in the input (subsequent calls add to the group).\n\nThe `Default` control type strategy (used when none is specified) selects the `AbsAggregatedTotal` control type (**SUM(ABS(x))**) for numeric fields and\n`HashCrc32` (**SUM(CRC32(x))**) for non-numeric ones. Non-primitive data types are not supported.   \n\nA full example of initial control measure generation could then look as follows:\n```scala\nimport org.apache.spark.sql.{DataFrame, SparkSession}\nimport za.co.absa.atum.model.ControlMeasure\nimport za.co.absa.atum.utils.controlmeasure.ControlMeasureBuilder\n\nval spark = SparkSession.builder()\n  .appName(\"An info file creation job\")\n  .getOrCreate()\n\nval df: DataFrame = spark\n  .read\n  .format(\"csv\").option(\"header\", \"true\") // adjust to your data source format\n  .load(\"path/to/source\")\nval aggregateColumns = List(\"employeeId\", \"address\", \"dealId\") // these columns must exist in the `df`\n\n// builder-like fluent API to construct a ControlMeasureBuilder and yield the `controlMeasure` with `build`\nval controlMeasure: ControlMeasure =\n  ControlMeasureBuilder.forDf(df)\n    .withAggregateColumns(aggregateColumns) // using Default controlType strategy: AbsAggregatedTotal for numeric fields, HashCrc32 otherwise\n    .withInputPath(\"path/to/source\")\n    .withSourceApplication(\"Source Application\")\n    .withReportDate(\"15-10-2017\")\n    .withReportVersion(1)\n    .build\n\n// convert to JSON using .asJson | asJsonPretty\nprintln(\"Generated control measure is: \" + controlMeasure.asJson)\n```\n\n#### Customizing ControlMeasure\nIn case you need to change the data the ControlMeasure holds, you can do this in the usual Scala way: the model consists\nof case classes, so the built-in copy methods are available, e.g. 
`cm.copy(metadata = cm.metadata.copy(sourceApplication = \"UpdatedAppName\"))`\n\nHowever, to make working with checkpoints slightly easier, `ControlMeasure` offers the helper method `withPrecedingCheckpoint()`,\nwhich prepends a custom checkpoint, shifting the existing checkpoints behind it and incrementing their order:\n\n```scala\nimport za.co.absa.atum.model._\n\nval cm: ControlMeasure = ... // existing ControlMeasure with 2 checkpoints\nval cpToBePrepended: Checkpoint = Checkpoint(..., order = 1, ...)\n\ncm.checkpoints.map(_.order) // List(1, 2)\ncpToBePrepended.order // 1\nval updatedCm = cm.withPrecedingCheckpoint(cpToBePrepended)\nupdatedCm.checkpoints.map(_.order) // List(1, 2, 3)\n```\n\n#### Writing an _INFO file with the ControlMeasure to HDFS\n```scala\nimport org.apache.hadoop.fs.{FileSystem, Path}\nimport za.co.absa.atum.utils.controlmeasure.ControlMeasureUtils\n\n// assuming `spark`, `controlMeasure`, and `inputPath` from the previous example block\nimplicit val hdfs = FileSystem.get(spark.sparkContext.hadoopConfiguration)\nControlMeasureUtils.writeControlMeasureInfoFileToHadoopFs(controlMeasure, new Path(inputPath))\n```\n\n#### Writing an _INFO file with the ControlMeasure to S3 (using Hadoop FS)\n```scala\nimport java.net.URI\nimport org.apache.hadoop.fs.{FileSystem, Path}\nimport za.co.absa.atum.utils.controlmeasure.ControlMeasureUtils\n\n// assuming `spark`, `controlMeasure`, and `inputPath` from the previous example block\nval s3Uri = new URI(\"s3://my-awesome-bucket123\") // s3://\u003cbucket\u003e (or s3a://)\nval s3Path = new Path(s\"/$inputPath\") // /\u003ctext-file-object-path\u003e\n\nimplicit val s3fs = FileSystem.get(s3Uri, spark.sparkContext.hadoopConfiguration)\nControlMeasureUtils.writeControlMeasureInfoFileToHadoopFs(controlMeasure, s3Path)\n\n```\n\n### An ETL job example \n\nFor the full example, please see the **SampleMeasurements1** and **SampleMeasurements2** objects from the *atum.examples* project.\nIt uses made-up Wikipedia 
data for computations. The source data has an info file containing the initial checkpoints,\npresumably generated by previous processing.\n\nThe examples are made so they can be run on a user's instance of Spark cluster. Spark and Scala dependencies have 'provided'\nscope. To run them locally please use **SampleMeasurements1Runner** and **SampleMeasurements2Runner** test suites. When\nrunning unit tests Maven loads all provided dependencies so all Scala and Spark libraries needed to run the jobs are\navailable when running unit tests.\n\n\n```scala\nimport org.apache.spark.sql.SparkSession\nimport za.co.absa.atum.AtumImplicits._ // using basic Atum without extensions\nimport org.apache.hadoop.conf.Configuration\nimport org.apache.hadoop.fs.FileSystem\n\nobject ExampleSparkJob {\n  def main(args: Array[String]) {\n    val spark = SparkSession\n      .builder()\n      .appName(\"Example Spark Job\")\n      .getOrCreate()\n\n    import spark.implicits._\n\n    // implicit FS is needed for enableControlMeasuresTracking, setCheckpoint calls, e.g. 
standard HDFS here:\n    implicit val localHdfs = FileSystem.get(spark.sparkContext.hadoopConfiguration)\n\n    // Initializing library to hook up to Apache Spark\n    spark.enableControlMeasuresTracking(sourceInfoFilePath = Some(\"data/input/_INFO\"), destinationInfoFilePath = None)\n      .setControlMeasuresWorkflow(\"Example processing\")\n\n    // Reading data from a CSV file and creating a checkpoint \n    val df = spark.read\n      .option(\"header\", \"true\")\n      .option(\"inferSchema\", \"true\")\n      .csv(\"data/input/mydata.csv\")\n      .as(\"source\")\n      .setCheckpoint(\"Computations Started\") // First checkpoint\n\n    // Business logic of the Spark job ...\n\n    // The df.setCheckpoint() routine can be used as many times as needed.\n    df.setCheckpoint(\"Computations Finished\") // Second checkpoint\n      .write // a DataFrameWriter is needed to save the data\n      .parquet(\"data/output/my_results\")\n  }\n}\n```\n\nIn this example the data is read from the 'data/input/mydata.csv' file. This data file has a precomputed set of checkpoints\nin 'data/input/_INFO'. Two checkpoints are created. Any business logic can be inserted between reading the source data\nand saving it to Parquet format.  \n\n### Storing Measurements in AWS S3\n\n#### AWS S3 via Hadoop FS API\nSince version 3.1.0, persistence support for AWS S3 via Hadoop FS API is available. 
The usage is the same as with \nregular HDFS with the exception of providing a different file system, e.g.:\n```scala\nimport java.net.URI\nimport org.apache.hadoop.fs.FileSystem\nimport org.apache.spark.sql.SparkSession\nimport za.co.absa.atum.AtumImplicits._ // using basic Atum without extensions\n\nval spark = SparkSession\n      .builder()\n      .appName(\"Example Spark Job\")\n      .getOrCreate()\n\nval s3Uri = new URI(\"s3://my-awesome-bucket\")\nimplicit val fs = FileSystem.get(s3Uri, spark.sparkContext.hadoopConfiguration)\n\n```\nThe rest of the usage is the same as in the example listed above.\n\n#### AWS S3 via AWS SDK for S3\nStarting with version 3.3.0, there is also persistence support for AWS S3 via the AWS SDK for S3, provided by an optional dependency:\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eza.co.absa\u003c/groupId\u003e\n    \u003cartifactId\u003eatum-s3-sdk-extension_2.11\u003c/artifactId\u003e \u003c!-- or 2.12 --\u003e\n    \u003cversion\u003e${project.version}\u003c/version\u003e \u003c!-- e.g. 3.3.0 --\u003e\n\u003c/dependency\u003e\n```\n\nThe following example demonstrates the setup:\n```scala\nimport org.apache.spark.sql.SparkSession\nimport software.amazon.awssdk.auth.credentials.{AwsCredentialsProvider, DefaultCredentialsProvider, ProfileCredentialsProvider}\nimport za.co.absa.atum.persistence.{S3KmsSettings, S3Location}\nimport za.co.absa.atum.AtumImplicitsSdkS3._ // using extended Atum\n\nobject S3Example {\n  def main(args: Array[String]) {\n    val spark = SparkSession\n          .builder()\n          .appName(\"Example S3 Atum init showcase\")\n          .getOrCreate()\n\n    // Here we are using the default credentials provider, which relies on its default credentials provider chain to obtain the credentials\n    // (e.g. 
running in EMR/EC2 with the correct role assigned)\n    implicit val defaultCredentialsProvider: AwsCredentialsProvider = DefaultCredentialsProvider.create()\n    // Alternatively, one could pass a specific credentials provider. An example of using a local profile named \"saml\" can be:\n    // implicit val samlCredentialsProvider = ProfileCredentialsProvider.create(\"saml\")\n    \n    val sourceS3Location: S3Location = S3Location(\"my-bucket123\", \"atum/input/my_amazing_measures.csv.info\")\n\n    val kmsKeyId: String = \"arn:aws:kms:eu-west-1:123456789012:key/12345678-90ab-cdef-1234-567890abcdef\" // just an example \n    val destinationS3Config: (S3Location, S3KmsSettings) = (\n      S3Location(\"my-bucket123\", \"atum/output/my_amazing_measures2.csv.info\"),\n      S3KmsSettings(kmsKeyId)\n    )\n\n    import spark.implicits._\n\n    // Initializing library to hook up to Apache Spark with S3 persistence\n    spark.enableControlMeasuresTrackingForS3(\n      sourceS3Location = Some(sourceS3Location),\n      destinationS3Config = Some(destinationS3Config)\n    ).setControlMeasuresWorkflow(\"A job with measurements saved to S3\")\n  }\n}\n\n```\nThe rest of the processing logic and programmatic approach to the library remains unchanged.\n\n\n### Standalone model usage\nIf you only want to work with Atum's model (`ControlMeasure`-related case classes and `S3Location`), you may find\nAtum's model artifact sufficient as your dependency.\n\nFirst, if not provided by Spark or another library, you will need to provide json4s dependencies. 
This project is tested \nwith `3.5.3` and `3.7.0-M15`.\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eorg.json4s\u003c/groupId\u003e\n    \u003cartifactId\u003ejson4s-core_2.11\u003c/artifactId\u003e \u003c!-- or 2.12 --\u003e\n    \u003cversion\u003e${json4s.version}\u003c/version\u003e\n    \u003cscope\u003eprovided\u003c/scope\u003e\n\u003c/dependency\u003e\n\u003cdependency\u003e\n    \u003cgroupId\u003eorg.json4s\u003c/groupId\u003e\n    \u003cartifactId\u003ejson4s-jackson_2.11\u003c/artifactId\u003e \u003c!-- or 2.12 --\u003e\n    \u003cversion\u003e${json4s.version}\u003c/version\u003e\n    \u003cscope\u003eprovided\u003c/scope\u003e\n\u003c/dependency\u003e\n```\n\nThen, just include the model library:\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eza.co.absa\u003c/groupId\u003e\n    \u003cartifactId\u003eatum-model_2.11\u003c/artifactId\u003e \u003c!-- or 2.12 --\u003e\n    \u003cversion\u003e3.5.1\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nThe model module also offers basic JSON (de)serialization functionality, such as:\n```scala\nimport za.co.absa.atum.model._\nimport za.co.absa.atum.utils.SerializationUtils\n\nval measureObject1: ControlMeasure = SerializationUtils.fromJson[ControlMeasure](myJsonStringWithAControlMeasure)\nval jsonString: String = SerializationUtils.asJson(measureObject1)\nval prettyfiedJsonString: String = SerializationUtils.asJsonPretty(measureObject1)\n```\n\n## Atum library routines\n\nThe common control framework routines you can use as Spark and DataFrame implicits are summarized below:\n\n| Routine        | Description          | Example usage  |\n| -------------- |:-------------------- |:---------------|\n| enableControlMeasuresTracking(sourceInfoFilePath: *Option[String]*, destinationInfoFilePath: *Option[String]*) | Enable control measurements tracking. Source and destination info file paths can be omitted. 
If omitted (`None`), they will be automatically inferred from the input/output data sources. | spark.enableControlMeasuresTracking() |\n| enableControlMeasuresTrackingForSdkS3(sourceS3Location: *Option[S3Location]*, destinationS3Config: *Option[(S3Location, S3KmsSettings)]*) | Enable control measurements tracking in S3. Source and destination parameters can be omitted. If omitted, the loading/storing part will not be used. | spark.enableControlMeasuresTrackingForS3(optionalSourceS3Location, optionalDestinationS3Config) |\n| isControlMeasuresTrackingEnabled: *Boolean* | Returns true if control measurements tracking is enabled. |  if (spark.isControlMeasuresTrackingEnabled) {/*do something*/} |\n| disableControlMeasuresTracking() | Explicitly turn off control measurements tracking. | spark.disableControlMeasuresTracking() |\n| setCheckpoint(name: *String*) | Calculates the control measurements and appends a new checkpoint. | df.setCheckpoint(\"Conformance Started\") |\n| writeInfoFile(outputFileName: *String*) | Write only an info file to a given HDFS location (could be a directory or a file). | df.writeInfoFile(\"/project/test/_INFO\") |\n| registerColumnRename(oldName: *String*, newName: *String*) | Register that a column which is part of control measurements is renamed. | df.registerColumnRename(\"tradeNumber\", \"tradeId\") |\n| registerColumnDrop(columnName: *String*) | Register that a column which is part of control measurements is dropped. | df.registerColumnDrop(\"personId\") |\n| setControlMeasuresFileName(fileName: *String*) | Use a specific name for info files instead of the default '_INFO'. | spark.setControlMeasuresFileName(\"_EXAMPLE_INFO\") |\n| setControlMeasuresWorkflow(workflowName: *String*) | Sets workflow name for the set of checkpoints that will follow. 
| spark.setControlMeasuresWorkflow(\"Conformance\") |\n| setControlMeasurementError(jobStep: *String*, errorDescription: *String*, techDetails: *String*) | Sets up an error message that can be used by plugins (e.g. Menas) to track the status of the job. | setControlMeasurementError(\"Conformance\", \"Validation error\", stackTrace) |\n| setAllowUnpersistOldDatasets(allowUnpersist: *Boolean*) | Turns on a performance optimization that unpersists old checkpoints after new ones are materialized. | Atum.setAllowUnpersistOldDatasets(true) |\n| enableCaching(cacheStorageLevel: *StorageLevel*) | Turns on caching that happens every time a checkpoint is generated (default behavior). A specific storage level can be set as well (see `setCachingStorageLevel()`). | enableCaching() |\n| disableCaching() | Turns off caching that happens every time a checkpoint is generated. | disableCaching() |\n| setCachingStorageLevel(cacheStorageLevel: *StorageLevel*) | Specifies a Spark storage level to use for caching. Can be one of the following: `NONE`, `DISK_ONLY`, `DISK_ONLY_2`, `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_ONLY_SER`, `MEMORY_ONLY_SER_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `MEMORY_AND_DISK_SER`, `MEMORY_AND_DISK_SER_2`, `OFF_HEAP`. | setCachingStorageLevel(StorageLevel.MEMORY_AND_DISK) |\n\n## Control measurement types\n\nThe control measurement of a column is a hash sum. It can be calculated differently depending on the column's data type and\non business requirements. 
The following table lists all currently supported measurement types:\n\n| Type                                | Description                                           |\n| ----------------------------------- |:----------------------------------------------------- |\n| controlType.Count                   | Calculates the number of rows in the dataset          |\n| controlType.distinctCount           | Calculates COUNT(DISTINCT()) of the specified column  |\n| controlType.aggregatedTotal         | Calculates SUM() of the specified column              |\n| controlType.absAggregatedTotal      | Calculates SUM(ABS()) of the specified column         |\n| controlType.HashCrc32               | Calculates SUM(CRC32()) of the specified column       |\n| controlType.aggregatedTruncTotal    | Calculates SUM(TRUNC()) of the specified column       |\n| controlType.absAggregatedTruncTotal | Calculates SUM(TRUNC(ABS())) of the specified column  |\n\n## How to generate a code coverage report\n```sbt\nsbt jacoco\n```\nFor example modules:\n```sbt\nsbt examples/jacoco s3sdkExamples/jacoco\n```\nThe code coverage report will be generated at the path:\n```\n{project-root}/atum/{module}/target/scala-{scala_version}/jacoco/report/html\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fatum","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fatum","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fatum/lists"}