{"id":18810386,"url":"https://github.com/absaoss/atum-service","last_synced_at":"2025-06-22T18:08:48.211Z","repository":{"id":41610706,"uuid":"439120334","full_name":"AbsaOSS/atum-service","owner":"AbsaOSS","description":null,"archived":false,"fork":false,"pushed_at":"2025-06-06T17:28:03.000Z","size":44172,"stargazers_count":6,"open_issues_count":36,"forks_count":1,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-06-06T17:35:29.391Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-12-16T20:35:36.000Z","updated_at":"2025-06-06T17:28:04.000Z","dependencies_parsed_at":"2024-03-18T11:56:01.525Z","dependency_job_id":"c2ea62e3-491f-456b-9211-544f5ab0e274","html_url":"https://github.com/AbsaOSS/atum-service","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/AbsaOSS/atum-service","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fatum-service","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fatum-service/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fatum-service/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fatum-service/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","down
load_url":"https://codeload.github.com/AbsaOSS/atum-service/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fatum-service/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261338999,"owners_count":23143900,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T23:20:02.767Z","updated_at":"2025-06-22T18:08:43.199Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Atum Service\n\n[![Build](https://github.com/AbsaOSS/spark-commons/actions/workflows/build.yml/badge.svg)](https://github.com/AbsaOSS/spark-commons/actions/workflows/build.yml)\n[![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)\n[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity)\n\n| Atum Server                                                                                                                                                                                                         | Atum Agent                                                                                                                                                                                                        | Atum Model | Atum Reader                                                                                                                                                    
                                              |\n|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| [![GitHub release](https://img.shields.io/github/release/AbsaOSS/atum-service.svg)](https://GitHub.com/AbsaOSS/atum-service/releases/) | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa.atum-service/atum-agent-spark3_2.13/badge.svg)](https://central.sonatype.com/search?q=atum-agent\u0026namespace=za.co.absa.atum-service) | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa.atum-service/atum-model_2.13/badge.svg)](https://central.sonatype.com/search?q=atum-model\u0026namespace=za.co.absa.atum-service) | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa.atum-service/atum-reader_2.13/badge.svg)](https://central.sonatype.com/search?q=atum-reader\u0026namespace=za.co.absa.atum-service) |                                                                             \n\n\n\n\n- [Atum Service](#atum-service)\n    - [Motivation](#motivation)\n    - [Features](#features)\n    - [Modules](#modules)\n        - [Agent `agent/`](#agent-agent)\n        - [Reader `reader/`](#reader-reader)\n        - [Server `server/`](#server-server)\n        - [Data Model `model/`](#data-model-model)\n        - [Database `database/`](#database-database)\n    - [Vocabulary](#vocabulary)\n        - [Atum Agent](#atum-agent)\n        - 
[Partitioning](#partitioning)\n        - [Atum Context](#atum-context)\n        - [Measure](#measure)\n        - [Measurement](#measurement)\n        - [Checkpoint](#checkpoint)\n        - [Data Flow](#data-flow)\n    - [Usage](#usage)\n        - [Atum Agent routines](#atum-agent-routines)\n        - [Control measurement types](#control-measurement-types)\n    - [How to generate Code coverage report](#how-to-generate-code-coverage-report)\n    - [How to Run in IntelliJ](#how-to-run-in-intellij)\n    - [How to Run Tests](#how-to-run-tests)\n        - [Test controls](#test-controls)\n        - [Run Unit Tests](#run-unit-tests)\n        - [Run Integration Tests](#run-integration-tests)\n    - [How to Release](#how-to-release)\n\nAtum Service is a data completeness and accuracy application meant to be used for data processed by Apache Spark.\n\nOne of the challenges regulated industries face is the requirement to track and prove that their systems preserve\nthe accuracy and completeness of data. In an attempt to solve this data processing problem in Apache Spark applications,\nwe propose the approach implemented in this application.\n\nThe purpose of Atum Service is to add the ability to specify \"checkpoints\" in Spark applications. These checkpoints \nare used to designate when and what metrics are calculated to ensure that critical input values have not been modified \nas well as allow for quick and efficient representation of the completeness of a dataset. This application does not \nimplement any checks or validations against these control measures, i.e. it does not act on them - Atum Service is, \nrather, solely focused on capturing them.\n\nThe application provides a concise and dynamic way to track completeness and accuracy of data produced from source \nthrough a pipeline of Spark applications. All metrics are calculated at a DataFrame level using various aggregation \nfunctions and are stored on a single central place, in a relational database. 
Comparing control metrics for various \ncheckpoints is not only helpful for complying with strict regulatory frameworks, but also helps during development \nand debugging of your Spark-based data processing.\n\n## Motivation\n\nA Big Data strategy for a company usually includes data gathering and ingestion processes.\nThese define how data from different systems operating inside a company\nare gathered and stored for further analysis and reporting. An ingestion process can involve\nvarious transformations like:\n* Converting between data formats (XML, CSV, etc.)\n* Data type casting, for example converting XML strings to numeric values\n* Joining reference tables. For example, this can include enriching existing\n  data with additional information available through dictionary mappings.\n  This constitutes a common ETL (Extract, Transform and Load) process.\n\nDuring such transformations, data can sometimes get corrupted (e.g. during casting), and records can\nget added or lost. For instance, *outer joining* a table holding duplicate keys can result in an explosion of records,\nand *inner joining* a table which has no matching keys for some records will result in loss of records.\n\nIn regulated industries it is crucial to ensure data integrity and accuracy. For instance, in the banking industry\nthe BCBS set of regulations requires analysis and reporting to be based on data accuracy and integrity principles.\nThus it is critical at the ingestion stage to preserve the accuracy and integrity of the data gathered from a\nsource system.\n\nThe purpose of Atum is to provide a means of ensuring that no critical fields have been modified during processing and no \nrecords are added or lost. To do this, the library provides the ability to calculate *control numbers* of explicitly \nspecified columns using a selection of aggregate functions. 
We call the set of such measurements at a given time\na *checkpoint* and each value - a result of the function computation - we call a *control measurement*. Checkpoints can \nbe calculated anytime between Spark transformations and actions, as well as at the start of the process or after its end.\n\nWe assume the data for ETL are processed in a series of batch jobs. Let's call each data set for a given batch\njob a *batch*. All checkpoints are calculated for a specific batch.\n\n## Features\n\nTBD\n\n## Modules\n\n### Agent `agent/`\nThis module is intended to replace the current [Atum](https://github.com/AbsaOSS/atum) repository. \nIt provides functionality for computing and pushing control metrics to the API located in `server/`.\n\nFor more information, see the [Vocabulary section](#vocabulary), or `agent/README.md` for more technical documentation.\n\n#### Spark 2.4 support\nBecause there are some Java-level incompatibilities between Spark 2.4 and Spark 3.x when built on Java 11+, we have to \ndrop support for Spark 2.4. If you need the agent to work with Spark 2.4, follow these steps:\n* Switch to Java 8\n* In `build.sbt`, change the matrix rows to Spark 2.4 and Scala 2.11 for the modules _agent_ and _model_\n* Build these two modules and use them in your project\n\n### Reader `reader/`\n**NB!**  \n_This module is not yet fully operational and has therefore not been released._\n\nThis module is intended to be used whenever an application wants to read the metrics stored by the _Atum Service_. It\noffers classes and methods to read the metrics from the database, hiding the complexity of accessing the _Atum Server_\ndirectly.\n\n### Server `server/`\nAn API under construction that communicates with the Agent and with the persistent storage. It also provides measure \nconfiguration to the agent.\n\nThe server accepts metrics, potentially from several agents, and saves them into the database. 
In the future, it will also be \nable to send the metric definitions back if requested. \n\nImportant note: the server never receives any real data - it only works with the metadata and metrics defined \nby the agent! \n\nSee `server/README.md` for more technical documentation.\n\n### Data Model `model/`\n\nThis module defines a set of Data Transfer Objects. These are Atum-specific objects that carry the data \npassed from agent to server and vice versa.\n\n### Database `database/`\n\nThis module contains a set of scripts that are used to create and maintain the database models. It also contains \nintegration tests that are used to verify the logic of our database functions.\nThe database tests are integration tests by nature. Therefore, a few conditions apply:\n* The tests are excluded from task `test` and are run only by a dedicated `dbTest` task (`sbt dbTest`).\n* The database structures must exist on the target database \n  (follow the [deployment instructions in database module](database/README.md#Deployment)).\n* The connection information to the DB is provided in file `database/src/test/resources/database.properties` \n  (See `database.properties.template` for syntax).\n\n## Vocabulary\n\nThis section defines a vocabulary of words and phrases used across the codebase or this documentation.\n\n### Atum Agent\n\nThe agent is meant to be embedded into your application, and its responsibility is to measure the\ngiven metrics and send the results to the server. It acts as an entity responsible for spawning the `Atum Context` \nand communicating with the server.\n\nA user of the Atum Agent must provide a `Partitioning` with a set of `Measures` they want to calculate, \nand execute the `Checkpoint` operation. 
The server details also need to be configured.\n\n### Partitioning\n\n`Atum Partitioning` uniquely defines a particular dataset (or a subset of a dataset, using Sub-Partitions) that we\nwant to apply particular metrics on. It's similar to data partitioning in HDFS or Data Lake.\nThe order of individual `Partitions` in a given `Partitioning` matters. It's a map-like structure in which the order\nof keys (partition names) matters.\n\nIt's possible to define additional metadata along with a `Partitioning`, as a map-like structure in which \nyou can store various attributes associated with the given `Partitioning` that you can potentially \nuse later in your application. Some ideas for these: \n* a name of your application, ETL Pipeline, or your Spark job\n* a list of owners of your application or your dataset\n* source system of a given dataset\n* and more\n\n### Atum Context\n\nThis is the main entity responsible for actually performing calculations on a Spark DataFrame. Each `Atum Context` is \nrelated to a particular `Partitioning`; in other words, each `Atum Context` contains all `Measures` \nthat are supposed to be calculated for the specific data defined by a given `Partitioning`.\n\n### Measure\n\nA `Measure` defines what a single metric is and how it should be calculated. It is a type of control metric to compute, \nsuch as count, sum, or hash, that also defines a list of columns (if applicable) that should be used when actually \nexecuting the calculation against a given Spark DataFrame.\n\nSome `Measures` define no columns (such as `count`), some require exactly one column (such as `sum` of values for a \nparticular column), and some require more columns (such as the `hash` function).\n\n### Measurement\n\nPractically speaking, a single `Measurement` contains a `Measure` and the result associated with it. 
\n\n### Checkpoint\n\nEach `Checkpoint` defines a sequence of `Measurements` (containing individual `Measures` and their results) that are \nassociated with a certain `Partitioning`.\n\nA `Checkpoint` is defined on the agent side; the server only accepts it.\n\n`Atum Context` stores information about a set of `Measures` associated with a specific `Partitioning`, but the\ncalculations of individual metrics are performed only when the `Checkpoint` operation is called.\nWe can even say that a `Checkpoint` is the result of particular `Measurements` (verb).\n\n### Data Flow\n\nThe journey of a dataset throughout various data transformations and pipelines. It captures the whole journey,\neven if it involves multiple applications or ETL pipelines.\n\n## Usage\n\n### Atum Agent routines\n\nTBD\n\n### Control measurement types\n\nThe control measurement of one or more columns is the result of an aggregation function executed over the dataset. It can be \ncalculated differently depending on the column's data type, on business requirements, and on the function used. 
This table \nrepresents all currently supported measurement types (aka measures):\n\n| Type                               | Description                                                   |\n|------------------------------------|:--------------------------------------------------------------|\n| AtumMeasure.RecordCount            | Calculates the number of rows in the dataset                  |\n| AtumMeasure.DistinctRecordCount    | Calculates COUNT(DISTINCT()) of the specified column          |\n| AtumMeasure.SumOfValuesOfColumn    | Calculates SUM() of the specified column                      |\n| AtumMeasure.AbsSumOfValuesOfColumn | Calculates SUM(ABS()) of the specified column                 |\n| AtumMeasure.SumOfHashesOfColumn    | Calculates SUM(CRC32()) of the specified column               |\n| Measure.UnknownMeasure             | Custom measure where the data are provided by the application |\n\n[//]: # (| controlType.aggregatedTruncTotal    | Calculates SUM\u0026#40;TRUNC\u0026#40;\u0026#41;\u0026#41; of the specified column       |)\n\n[//]: # (| controlType.absAggregatedTruncTotal | Calculates SUM\u0026#40;TRUNC\u0026#40;ABS\u0026#40;\u0026#41;\u0026#41;\u0026#41; of the specified column  |)\n\n\n## How to generate Code coverage report\n```sbt\nsbt jacoco\n```\n\nThe code coverage report will be generated at the path:\n```\n{project-root}/{module}/target/jvm-{scala_version}/jacoco/report/html\n```\n\n## How to Run in IntelliJ\n\nTo make this project runnable via IntelliJ, do the following:\n- Make sure that the configuration in `server/src/main/resources/reference.conf` \n  is set according to your needs\n- When building within an IDE, be sure to enable the `-language:higherKinds` option in the compiler options, as it's often not picked up from the SBT project settings.\n\n## How to Run Tests\n\n### Test controls\n\nSee the commands configured in the `.sbtrc` [(link)](https://www.scala-sbt.org/1.x/docs/Best-Practices.html#.sbtrc) file to provide different testing 
profiles.\n\n### Run Unit Tests\nUse the `test` command to execute all unit tests, skipping all other types of tests. \n```sbt\nsbt test\n```\n\n### Run Integration Tests\nUse the `testIT` command to execute all Integration tests, skipping all other test types.\n```sbt\nsbt testIT\n```\n\nUse the `testDB` command to execute all Integration tests in `database` module, skipping all other tests and modules.\n- Hint: project custom command\n```sbt\nsbt testDB\n```\n\nIf you want to run all tests, use the following command.\n- Hint: project custom command\n```sbt\nsbt testAll\n```\n\n\n## How to Release\n\nPlease see [this file](RELEASE.md) for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fatum-service","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fatum-service","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fatum-service/lists"}