Delta lake and filesystem helper methods
- Host: GitHub
- URL: https://github.com/mrpowers-io/jodie
- Owner: mrpowers-io
- License: mit
- Created: 2021-06-09T14:09:37.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-02-29T04:36:22.000Z (11 months ago)
- Last Synced: 2024-10-13T00:11:46.326Z (3 months ago)
- Language: Scala
- Size: 105 KB
- Stars: 48
- Watchers: 11
- Forks: 11
- Open Issues: 14
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
README
# jodie
This library provides helpful Delta Lake and filesystem utility functions.
![jodie](images/jodie.jpeg)
## Accessing the library
Fetch the JAR file from Maven.
```scala
libraryDependencies += "com.github.mrpowers" %% "jodie" % "0.0.3"
```
You can find the jodie releases for different Scala versions:
* [Scala 2.12 versions here](https://repo1.maven.org/maven2/com/github/mrpowers/jodie_2.12/)
* [Scala 2.13 versions here](https://repo1.maven.org/maven2/com/github/mrpowers/jodie_2.13/)
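Jodie runs on top of Spark and Delta Lake, so those also need to be on your classpath. Here's a minimal `build.sbt` sketch; the Spark and Delta Lake versions shown are assumptions, so pick versions that are compatible with each other and with your cluster:

```scala
// Sketch of a build.sbt for a project that uses jodie.
// The Spark and Delta Lake versions below are assumptions -- adjust to your environment.
libraryDependencies ++= Seq(
  "org.apache.spark"    %% "spark-sql"  % "3.3.2" % "provided",
  "io.delta"            %% "delta-core" % "2.3.0" % "provided",
  "com.github.mrpowers" %% "jodie"      % "0.0.3"
)
```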
## Delta Helpers
### Type 2 SCDs
This library provides an opinionated, convention-over-configuration approach to Type 2 SCD management. Let's look at an example before covering the conventions required to take advantage of the functionality.
Suppose you have the following SCD table with the `pkey` primary key:
```
+----+-----+-----+----------+-------------------+--------+
|pkey|attr1|attr2|is_current| effective_time|end_time|
+----+-----+-----+----------+-------------------+--------+
| 1| A| A| true|2019-01-01 00:00:00| null|
| 2| B| B| true|2019-01-01 00:00:00| null|
| 4| D| D| true|2019-01-01 00:00:00| null|
+----+-----+-----+----------+-------------------+--------+
```
You'd like to perform an upsert with this data:
```
+----+-----+-----+-------------------+
|pkey|attr1|attr2| effective_time|
+----+-----+-----+-------------------+
| 2| Z| null|2020-01-01 00:00:00| // upsert data
| 3| C| C|2020-09-15 00:00:00| // new pkey
+----+-----+-----+-------------------+
```
Here's how to perform the upsert:
```scala
Type2Scd.upsert(deltaTable, updatesDF, "pkey", Seq("attr1", "attr2"))
```
Here's the table after the upsert:
```
+----+-----+-----+----------+-------------------+-------------------+
|pkey|attr1|attr2|is_current| effective_time| end_time|
+----+-----+-----+----------+-------------------+-------------------+
| 2| B| B| false|2019-01-01 00:00:00|2020-01-01 00:00:00|
| 4| D| D| true|2019-01-01 00:00:00| null|
| 1| A| A| true|2019-01-01 00:00:00| null|
| 3| C| C| true|2020-09-15 00:00:00| null|
| 2| Z| null| true|2020-01-01 00:00:00| null|
+----+-----+-----+----------+-------------------+-------------------+
```
You can leverage the upsert code if your SCD table meets these requirements:
* Contains a unique primary key column
* Any change in an attribute column triggers an upsert
* SCD logic is exposed via `effective_time`, `end_time`, and `is_current` columns

`merge` logic can get really messy, so it's easiest to follow these conventions. See [this blog post](https://mungingdata.com/delta-lake/type-2-scd-upserts/) if you'd like to build a SCD with custom logic.
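For reference, here's a minimal sketch of how the `deltaTable` and `updatesDF` used above might be built; the table path, the `spark` session, and the package name in the import are assumptions:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.to_timestamp
import mrpowers.jodie.Type2Scd // package name is an assumption

// Assumes an existing SparkSession named `spark` with the Delta Lake extensions enabled
val deltaTable = DeltaTable.forPath(spark, "/tmp/scd_table") // hypothetical path

import spark.implicits._
val updatesDF = Seq(
  (2, "Z", null.asInstanceOf[String], "2020-01-01 00:00:00"),
  (3, "C", "C", "2020-09-15 00:00:00")
).toDF("pkey", "attr1", "attr2", "effective_time")
  .withColumn("effective_time", to_timestamp($"effective_time"))

Type2Scd.upsert(deltaTable, updatesDF, "pkey", Seq("attr1", "attr2"))
```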
### Kill Duplicates
The `killDuplicateRecords` function deletes every record that is duplicated on a given set of columns, including the original occurrences.
Suppose you have the following table:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 1| Benito| Jackson| # duplicate
| 2| Maria| Willis|
| 3| Jose| Travolta| # duplicate
| 4| Benito| Jackson| # duplicate
| 5| Jose| Travolta| # duplicate
| 6| Maria| Pitt|
| 9| Benito| Jackson| # duplicate
+----+---------+---------+
```
We can run the following function to remove all duplicates:
```scala
DeltaHelpers.killDuplicateRecords(
deltaTable = deltaTable,
duplicateColumns = Seq("firstname","lastname")
)
```
The result of running the previous function is the following table:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 2| Maria| Willis|
| 6| Maria| Pitt|
+----+---------+---------+
```
### Remove Duplicates
The `removeDuplicateRecords` function deletes duplicates but keeps one occurrence of each record that was duplicated. There are two versions of this function; let's look at an example of each.
#### Let's see an example of how to use the first version
Suppose you have the following table:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 2| Maria| Willis|
| 3| Jose| Travolta|
| 4| Benito| Jackson|
| 1| Benito| Jackson| # duplicate
| 5| Jose| Travolta| # duplicate
| 6| Maria| Willis| # duplicate
| 9| Benito| Jackson| # duplicate
+----+---------+---------+
```
We can run the following function to remove all duplicates:
```scala
DeltaHelpers.removeDuplicateRecords(
deltaTable = deltaTable,
duplicateColumns = Seq("firstname","lastname")
)
```
The result of running the previous function is the following table:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 2| Maria| Willis|
| 3| Jose| Travolta|
| 4| Benito| Jackson|
+----+---------+---------+
```
#### Now let's see an example of how to use the second version
Suppose you have a similar table:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 2| Maria| Willis|
| 3| Jose| Travolta| # duplicate
| 4| Benito| Jackson| # duplicate
| 1| Benito| Jackson| # duplicate
| 5| Jose| Travolta| # duplicate
| 6| Maria| Pitt|
| 9| Benito| Jackson| # duplicate
+----+---------+---------+
```
This time the function takes an additional input parameter: a primary key that is used to sort the duplicated records in ascending order and keep only the first record of each group.
```scala
DeltaHelpers.removeDuplicateRecords(
deltaTable = deltaTable,
primaryKey = "id",
duplicateColumns = Seq("firstname","lastname")
)
```
The result of running the previous function is the following:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 1| Benito| Jackson|
| 2| Maria| Willis|
| 3| Jose| Travolta|
| 6| Maria| Pitt|
+----+---------+---------+
```
These functions come in handy when you are doing data cleansing.
### Copy Delta Table
The `copyTable` function takes an existing Delta table and makes a copy of all of its data, properties, and partitions to a new Delta table. The new table can be created at a specified path or under a given table name.

Copying does not include the Delta log, which means that you will not be able to restore the new table to an old version of the original table.
Here's how to perform the copy to a specific path:
```scala
DeltaHelpers.copyTable(deltaTable = deltaTable, targetPath = Some(targetPath))
```
Here's how to perform the copy using a table name:
```scala
DeltaHelpers.copyTable(deltaTable = deltaTable, targetTableName = Some(tableName))
```
Note that in this last call the location where the table is stored is determined by the Spark conf property `spark.sql.warehouse.dir`.
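For example, here's a quick sketch of how you might check or set that property when building the session (the warehouse path is a made-up example):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session configuration: copies created by table name will land under this directory
val spark = SparkSession.builder()
  .appName("jodie-copy-example")
  .config("spark.sql.warehouse.dir", "/data/warehouse")
  .getOrCreate()

println(spark.conf.get("spark.sql.warehouse.dir")) // /data/warehouse
```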
### Validate append
The `validateAppend` function provides a mechanism for allowing some columns for schema evolution while rejecting appends with columns that aren't specifically allowlisted.
Suppose you have the following Delta table:
```
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 2| b| B|
| 1| a| A|
+----+----+----+
```
Here's how to call `validateAppend` so that only the allowlisted columns can be appended:
```scala
DeltaHelpers.validateAppend(
deltaTable = deltaTable,
appendDF = appendDf,
requiredCols = List("col1", "col2"),
optionalCols = List("col4")
)
```
You can append the following DataFrame that contains the required columns and the optional columns:
```
+----+----+----+
|col1|col2|col4|
+----+----+----+
| 3| c| cat|
| 4| d| dog|
+----+----+----+
```
Here's what the Delta table will contain after that data is appended:
```
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 3| c|null| cat|
| 4| d|null| dog|
| 2| b| B|null|
| 1| a| A|null|
+----+----+----+----+
```
You cannot append the following DataFrame, which contains the required columns but also contains another column (`col5`) that's not specified as an optional column.
```
+----+----+----+
|col1|col2|col5|
+----+----+----+
| 4| b| A|
| 5| y| C|
| 6| z| D|
+----+----+----+
```
Here's the error you'll get when you attempt this write: "The following columns are not part of the current Delta table. If you want to add these columns to the table, you must set the optionalCols parameter: List(col5)"
You also cannot append the following DataFrame which is missing one of the required columns.
```
+----+----+
|col1|col4|
+----+----+
| 4| A|
| 5| C|
| 6| D|
+----+----+
```
Here's the error you'll get: "The base Delta table has these columns List(col1, col4), but these columns are required List(col1, col2)"
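If you'd rather handle a rejected append than let the job fail, here's a hedged sketch; the exact exception type thrown by `validateAppend` is an assumption, so the catch is kept generic:

```scala
import scala.util.{Failure, Success, Try}

Try(
  DeltaHelpers.validateAppend(deltaTable, appendDf, List("col1", "col2"), List("col4"))
) match {
  case Success(_) => println("append accepted")
  case Failure(e) => println(s"append rejected: ${e.getMessage}") // route the bad batch elsewhere
}
```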
### Latest Version of Delta Table
The `latestVersion` function returns the latest version number of a Delta table given its storage path.

Here's how to use the function:
```scala
DeltaHelpers.latestVersion(path = "file:/path/to/your/delta-lake/table")
```
### Insert Data Without Duplicates
The function `appendWithoutDuplicates` inserts data into an existing delta table and prevents data duplication in the process.
Let's see an example of how it works. Suppose we have the following table:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 1| Benito| Jackson|
| 4| Maria| Pitt|
| 6| Rosalia| Pitt|
+----+---------+---------+
```
And we want to insert this new DataFrame:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 6| Rosalia| Pitt| # duplicate
| 2| Maria| Willis|
| 3| Jose| Travolta|
| 4| Maria| Pitt| # duplicate
+----+---------+---------+
```
We can use the following function to insert new data and avoid data duplication:
```scala
DeltaHelpers.appendWithoutDuplicates(
deltaTable = deltaTable,
appendData = newDataDF,
compositeKey = Seq("firstname","lastname")
)
```
The resulting table will be the following:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 1| Benito| Jackson|
| 4| Maria| Pitt|
| 6| Rosalia| Pitt|
| 2| Maria| Willis|
| 3| Jose| Travolta|
+----+---------+---------+
```
### Generate MD5 from columns
The `withMD5Columns` function appends an MD5 hash of the specified columns to the DataFrame. This hash can be used as a unique key if the selected columns form a composite key. Here is an example. Suppose we have the following table:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 1| Benito| Jackson|
| 4| Maria| Pitt|
| 6| Rosalia| Pitt|
+----+---------+---------+
```
We use the function in this way:
```scala
DeltaHelpers.withMD5Columns(
dataFrame = inputDF,
cols = List("firstname","lastname"),
newColName = "unique_id"
)
```
The resulting table will be the following:
```
+----+---------+---------+----------------------------------+
| id|firstname| lastname| unique_id |
+----+---------+---------+----------------------------------+
| 1| Benito| Jackson| 3456d6842080e8188b35f515254fece8 |
| 4| Maria| Pitt| 4fd906b56cc15ca517c554b215597ea1 |
| 6| Rosalia| Pitt| 3b3814001b13695931b6df8670172f91 |
+----+---------+---------+----------------------------------+
```
You can use this function with the columns identified by `findCompositeKeyCandidate` to append a unique key to the DataFrame.
### Find Composite Key
The `findCompositeKeyCandidate` function helps you find a composite key that uniquely identifies the rows of your Delta table. It returns a list of columns that can be used as a composite key.

Suppose we have the following table:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 1| Benito| Jackson|
| 4| Maria| Pitt|
| 6| Rosalia| Pitt|
+----+---------+---------+
```
Now execute the function:
```scala
val result = DeltaHelpers.findCompositeKeyCandidate(
deltaTable = deltaTable,
excludeCols = Seq("id")
)
```
The result will be the following:
```scala
Seq("firstname","lastname")
```
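As suggested above, you can chain this with `withMD5Columns` to stamp a surrogate key onto the data; here's a short sketch (the column names come from the example above):

```scala
// Derive a candidate composite key, then hash it into a single surrogate key column
val keyCols = DeltaHelpers.findCompositeKeyCandidate(deltaTable, excludeCols = Seq("id"))
val withKey = DeltaHelpers.withMD5Columns(deltaTable.toDF, keyCols.toList, "unique_id")
```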
### Validate Composite Key
The `isCompositeKeyCandidate` function verifies whether a given set of columns qualifies as a unique composite key for your Delta table. It returns `true` if the key is a valid composite key candidate, and `false` otherwise.

Suppose we have the following table:
```
+----+---------+---------+
| id|firstname| lastname|
+----+---------+---------+
| 1| Benito| Jackson|
| 4| Maria| Pitt|
| 6| Rosalia| Travolta|
+----+---------+---------+
```
Now execute the function:
```scala
val result = DeltaHelpers.isCompositeKeyCandidate(
deltaTable = deltaTable,
cols = Seq("id", "firstName")
)
```
The result will be the following:
```scala
true
```
## Delta File Sizes
The `deltaFileSizes` function returns a `Map[String,Long]` that contains the total size in bytes, the number of files, and the average file size for a given Delta table.

Suppose you have the following Delta table, partitioned by `col1`:
```
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| A| A|
| 2| A| B|
+----+----+----+
```
Running `DeltaHelpers.deltaFileSizes(deltaTable)` on that table will return:
```scala
Map("size_in_bytes" -> 1320,
"number_of_files" -> 2,
"average_file_size_in_bytes" -> 660)
```
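Here's a small usage sketch for reading those values back out of the returned `Map` (the key names are taken from the example output above):

```scala
val sizes: Map[String, Long] = DeltaHelpers.deltaFileSizes(deltaTable)
val avgMb = sizes("average_file_size_in_bytes").toDouble / (1024 * 1024)
println(f"${sizes("number_of_files")} files, average $avgMb%.2f MB each")
```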
## Show Delta File Sizes
The `showDeltaFileSizes` function displays the size, average file size, and number of files of a Delta table in a human-readable format.
Suppose you have the following table, partitioned by `col1`:
```
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| A| A|
| 2| A| B|
+----+----+----+
```
Running `DeltaHelpers.showDeltaFileSizes` on that table will print the following to the console:
`"The delta table contains 2 files with a size of 1.32 kB.The average file size is 660 B"`
## Humanize Bytes
The `humanizeBytes` function formats an integer representing a number of bytes into a human-readable string.
```scala
DeltaHelpers.humanizeBytes(1234567890L)    // "1.23 GB"
DeltaHelpers.humanizeBytes(1234567890000L) // "1.23 TB"
```
## Delta Table File Size Distribution
The `deltaFileSizeDistributionInMB` function returns a `DataFrame` that contains the following stats, in megabytes, about the file sizes in a Delta table: number of Parquet files, mean file size, standard deviation, minimum file size, maximum file size, and the 10th, 25th, 50th (median), 75th, 90th, and 95th percentiles.

This function also accepts a partition condition. For example, if you have a Delta table partitioned by `country` and you want to know the file size distribution for `country = 'Australia'`, you can run the following:
```scala
DeltaHelpers.deltaFileSizeDistributionInMB(path, Some("country='Australia'"))
```
This will return a `DataFrame` with the following columns:
```scala
+------------------------------------------------+--------------------+------------------+--------------------+--------------------+------------------+-----------------------------------------------------------------------------------------------------------------------+
|partitionValues |num_of_parquet_files|mean_size_of_files|stddev |min_file_size |max_file_size |Percentile[10th, 25th, Median, 75th, 90th, 95th] |
+------------------------------------------------+--------------------+------------------+--------------------+--------------------+------------------+-----------------------------------------------------------------------------------------------------------------------+
|[{country, Australia}] |1429 |30.205616120778238|0.3454942220373272 |17.376179695129395 |30.377344131469727|[30.132079124450684, 30.173019409179688, 30.215540885925293, 30.25797176361084, 30.294878005981445, 30.318415641784668]|
+------------------------------------------------+--------------------+------------------+--------------------+--------------------+------------------+-----------------------------------------------------------------------------------------------------------------------+
```
If no partition condition is provided, the function returns the file size distribution for the whole Delta table, with one row per partition for partitioned tables.
```scala
DeltaHelpers.deltaFileSizeDistributionInMB(path)

+------------------------------------------------+--------------------+------------------+--------------------+--------------------+------------------+-----------------------------------------------------------------------------------------------------------------------+
|partitionValues |num_of_parquet_files|mean_size_of_files|stddev |min_file_size |max_file_size |Percentile[10th, 25th, Median, 75th, 90th, 95th] |
+------------------------------------------------+--------------------+------------------+--------------------+--------------------+------------------+-----------------------------------------------------------------------------------------------------------------------+
|[{country, Mauritius}] |2502 |28.14731636093103 |0.7981461034111957 |0.005436897277832031|28.37139320373535 |[28.098042488098145, 28.12824249267578, 28.167524337768555, 28.207666397094727, 28.246790885925293, 28.265881538391113]|
|[{country, Malaysia}] |3334 |34.471798611888644|0.4018671378261647 |11.515838623046875 |34.700727462768555|[34.40602779388428, 34.43935298919678, 34.47779560089111, 34.51614856719971, 34.55129528045654, 34.57488822937012] |
|[{country, GrandDuchyofLuxembourg}] |808 |2.84647535569597 |0.5369371124495063 |0.006397247314453125|3.0397253036499023|[2.8616743087768555, 2.8840208053588867, 2.9723005294799805, 2.992110252380371, 3.0045957565307617, 3.0115060806274414]|
|[{country, Argentina}] |3372 |36.82978148392511 |5.336511210904255 |0.010506629943847656|99.95287132263184 |[36.29576301574707, 36.33060932159424, 36.369083404541016, 36.406826972961426, 36.442559242248535, 36.4655065536499] |
|[{country, Australia}] |1429 |30.205616120778238|0.3454942220373272 |17.376179695129395 |30.377344131469727|[30.132079124450684, 30.173019409179688, 30.215540885925293, 30.25797176361084, 30.294878005981445, 30.318415641784668]|
+------------------------------------------------+--------------------+------------------+--------------------+--------------------+------------------+-----------------------------------------------------------------------------------------------------------------------+
```
A similar function `deltaFileSizeDistribution` is provided which returns the same stats in bytes.
## Delta Table Number of Records Distribution
The `deltaNumRecordDistribution` function returns a `DataFrame` that contains the following stats about the number of records in the Parquet files of a Delta table: number of Parquet files, mean number of records, standard deviation, minimum and maximum number of records in a file, and the 10th, 25th, 50th (median), 75th, 90th, and 95th percentiles.

This function also accepts a partition condition. For example, if you have a Delta table partitioned by `country` and you want to know the record-count distribution for `country = 'Australia'`, you can run the following:
```scala
DeltaHelpers.deltaNumRecordDistribution(path, Some("country='Australia'"))
```
This will return a `DataFrame` with the following columns:
```scala
+------------------------------------------------+--------------------+------------------+--------------------+--------------------+------------------+---------------------------------------------------------+
|partitionValues |num_of_parquet_files|mean_num_records_in_files|stddev |min_num_records|max_num_records|Percentile[10th, 25th, Median, 75th, 90th, 95th] |
+------------------------------------------------+--------------------+-------------------------+------------------+---------------+---------------+------------------------------------------------------------+
|[{country, Australia}] |1429 |354160.2757172848 |4075.503669047513 |201823.0 |355980.0 |[353490.0, 353907.0, 354262.0, 354661.0, 355024.0, 355246.0]|
+------------------------------------------------+--------------------+------------------+--------------------+--------------------+------------------+---------------------------------------------------------+
```
If no partition condition is provided, the function returns the record-count distribution for the whole Delta table, with one row per partition for partitioned tables.
```scala
DeltaHelpers.deltaNumRecordDistribution(path)

+------------------------------------------------+--------------------+------------------+--------------------+--------------------+------------------+---------------------------------------------------------+
|partitionValues |num_of_parquet_files|mean_num_records_in_files|stddev |min_num_records|max_num_records|Percentile[10th, 25th, Median, 75th, 90th, 95th] |
+------------------------------------------------+--------------------+-------------------------+------------------+---------------+---------------+------------------------------------------------------------+
|[{country, Mauritius}] |2502 |433464.051558753 |12279.532110752265|1.0 |436195.0 |[432963.0, 433373.0, 433811.0, 434265.0, 434633.0, 434853.0]|
|[{country, Malaysia}] |3334 |411151.4946010798 |4797.137407595447 |136777.0 |413581.0 |[410390.0, 410794.0, 411234.0, 411674.0, 412063.0, 412309.0]|
|[{country, GrandDuchyofLuxembourg}] |808 |26462.003712871287 |5003.8118076056935|6.0 |28256.0 |[26605.0, 26811.0, 27635.0, 27822.0, 27937.0, 28002.0] |
|[{country, Argentina}] |3372 |461765.5604982206 |79874.3727926887 |61.0 |1403964.0 |[453782.0, 454174.0, 454646.0, 455103.0, 455543.0, 455818.0]|
|[{country, Australia}] |1429 |354160.2757172848 |4075.503669047513 |201823.0 |355980.0 |[353490.0, 353907.0, 354262.0, 354661.0, 355024.0, 355246.0]|
+------------------------------------------------+--------------------+------------------+--------------------+--------------------+------------------+---------------------------------------------------------+
```
## Number of Shuffle Files in Merge & Other Filter Conditions
The function `getNumShuffleFiles` gets the number of shuffle files (think of part files in parquet) that will be pulled into memory for a given filter condition. This is particularly useful to estimate memory requirements in a Delta Merge operation where the number of shuffle files can be a bottleneck.
To better tune your jobs, you can use this function to get the number of shuffle files for different kinds of filter conditions, and then perform operations like merge, Z-ordering, or compaction to see whether you reach the desired number of shuffle files.

For example, if the condition is `country = 'GBR' and age >= 30 and age <= 40 and firstname like '%Jo%'` and `country` is the partition column,
```scala
DeltaHelpers.getNumShuffleFiles(path, "country = 'GBR' and age >= 30 and age <= 40 and firstname like '%Jo%' ")
```
then the output might look like the following (each part of the condition becomes a key in the `Map`, and the value contains the file count):
```scala
Map(
// number of files that will be pulled into memory for the entire provided condition
"OVERALL RESOLVED CONDITION => [ (country = 'GBR') and (age >= 30) and" +
" (age = 40) and firstname LIKE '%Joh%' ]" -> 18,
// number of files signifying the greater than/less than part => "age >= 30 and age <= 40"
"GREATER THAN / LESS THAN PART => [ (age >= 30) and (age = 40) ]" -> 100,
// number of files signifying the equals part => "country = 'GBR'
"EQUALS/EQUALS NULL SAFE PART => [ (country = 'GBR') ]" -> 300,
// number of files signifying the like (or any other) part => "firstname like '%Jo%' "
"LEFT OVER PART => [ firstname LIKE '%Joh%' ]" -> 600,
// number of files signifying any other part. This is mostly a failsafe
// 1. to capture any other condition that might have been missed
// 2. If wrong attribute names or conditions are provided like snapshot.id = source.id (usually found in merge conditions)
"UNRESOLVED PART => [ (snapshot.id = update.id) ]" -> 800,
// Total no. of files in the Delta Table
"TOTAL_NUM_FILES_IN_DELTA_TABLE =>" -> 800,
// List of unresolved columns/attributes in the provided condition.
// Will be empty if all columns are resolved.
"UNRESOLVED_COLUMNS =>" -> List())
```
Another important use case this method can help with is checking min-max range overlap. Adding a min-max filter on a high-cardinality column like `id`, say `id >= 900 and id <= 5000`, can actually help reduce the number of shuffle files Delta Lake pulls into memory. However, such an operation is not guaranteed to help, and you can observe its effect by running this method.
This function works only on the Delta Log and does not scan any data in the Delta Table.
If you want more information about these individual files and their metadata, consider using the `getShuffleFileMetadata` function.
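Here's a quick sketch of dumping the result for inspection; the filter string is only an illustration and `path` points at your Delta table:

```scala
// Print each part of the condition alongside its file count (or the unresolved-column list)
val shuffleStats = DeltaHelpers.getNumShuffleFiles(path, "country = 'GBR' and age >= 30 and age <= 40")
shuffleStats.foreach { case (part, value) => println(s"$part $value") }
```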
## Change Data Feed Helpers
### CASE I - When the Delta log (aka the transaction log) gets purged
`getVersionsForAvailableDeltaLog` helps you find the versions within the `[startingVersion, endingVersion]` range for which the Delta log is present and a CDF read is possible (CDF must be enabled at the start version):
```scala
ChangeDataFeedHelper(deltaPath, 0, 5).getVersionsForAvailableDeltaLog
```
The result will be the same versions, `Some(0,5)`, if the Delta logs are present. Otherwise it will return something like `Some(10,15)`: the earliest queryable start version and the latest snapshot version as the ending version. If it finds that CDF was disabled at any point within the range, it returns `None`.

`readCDFIgnoreMissingDeltaLog` returns an Option of a Spark DataFrame for all the versions provided by the above method:
```scala
ChangeDataFeedHelper(deltaPath, 11, 13).readCDFIgnoreMissingDeltaLog.get.show(false)

+---+------+---+----------------+---------------+-------------------+
|id |gender|age|_change_type |_commit_version|_commit_timestamp |
+---+------+---+----------------+---------------+-------------------+
|4 |Female|25 |update_preimage |11 |2023-03-13 14:21:58|
|4 |Other |45 |update_postimage|11 |2023-03-13 14:21:58|
|2 |Male |45 |update_preimage |13 |2023-03-13 14:22:05|
|2 |Other |67 |update_postimage|13 |2023-03-13 14:22:05|
|2 |Other |67 |update_preimage |12 |2023-03-13 14:22:01|
|2 |Male |45 |update_postimage|12 |2023-03-13 14:22:01|
+---+------+---+----------------+---------------+-------------------+
```
The resulting DataFrame is the same as the result of a CDF time travel query.
### CASE II - When CDC data gets purged from the `_change_data` directory
`getVersionsForAvailableCDC` helps you find the versions within the `[startingVersion, endingVersion]` range for which the underlying CDC data is present under the `_change_data` directory. Call this method when a `java.io.FileNotFoundException` is encountered during time travel:
```scala
ChangeDataFeedHelper(deltaPath, 0, 5).getVersionsForAvailableCDC
```
The result will be the same versions, `Some(0,5)`, if CDC data is present for the given versions under the `_change_data` directory. Otherwise it will return something like `Some(2,5)`: the earliest queryable start version for which CDC data is present and the given ending version. If no version with CDC data is found, it returns `None`.
`readCDFIgnoreMissingCDC` returns an Option of a Spark DataFrame for all the versions provided by the above method:
```scala
ChangeDataFeedHelper(deltaPath, 11, 13).readCDFIgnoreMissingCDC.get.show(false)

+---+------+---+----------------+---------------+-------------------+
|id |gender|age|_change_type |_commit_version|_commit_timestamp |
+---+------+---+----------------+---------------+-------------------+
|4 |Female|25 |update_preimage |11 |2023-03-13 14:21:58|
|4 |Other |45 |update_postimage|11 |2023-03-13 14:21:58|
|2 |Male |45 |update_preimage |13 |2023-03-13 14:22:05|
|2 |Other |67 |update_postimage|13 |2023-03-13 14:22:05|
|2 |Other |67 |update_preimage |12 |2023-03-13 14:22:01|
|2 |Male |45 |update_postimage|12 |2023-03-13 14:22:01|
+---+------+---+----------------+---------------+-------------------+
```
The resulting DataFrame is the same as the result of a CDF time travel query.
### CASE III - Enable-Disable-Re-enable CDF
`getRangesForCDFEnabledVersions` skips all versions for which CDF was disabled and returns all ranges within the `[startingVersion, endingVersion]` range for which CDF was enabled and time travel is possible:
```scala
ChangeDataFeedHelper(writePath, 0, 30).getRangesForCDFEnabledVersions
```
The result will look like `List((0, 3), (7, 8), (12, 20))`, signifying all version ranges for which CDF is enabled. The companion function `getRangesForCDFDisabledVersions` returns the same kind of `List`, but for the disabled version ranges.

`readCDFIgnoreMissingRangesForEDR` returns an Option of a unioned Spark DataFrame for all the version ranges provided by the above method:
```scala
ChangeDataFeedHelper(writePath, 0, 30).readCDFIgnoreMissingRangesForEDR
+---+------+---+----------------+---------------+-------------------+
|id |gender|age|_change_type |_commit_version|_commit_timestamp |
+---+------+---+----------------+---------------+-------------------+
|2 |Male |25 |update_preimage |2 |2023-03-13 14:40:48|
|2 |Male |100|update_postimage|2 |2023-03-13 14:40:48|
|1 |Male |25 |update_preimage |1 |2023-03-13 14:40:44|
|1 |Male |35 |update_postimage|1 |2023-03-13 14:40:44|
|2 |Male |100|update_preimage |3 |2023-03-13 14:40:52|
|2 |Male |101|update_postimage|3 |2023-03-13 14:40:52|
|1 |Male |25 |insert |0 |2023-03-13 14:40:34|
|2 |Male |25 |insert |0 |2023-03-13 14:40:34|
|3 |Female|35 |insert |0 |2023-03-13 14:40:34|
|2 |Male |101|update_preimage |8 |2023-03-13 14:41:07|
|2 |Other |66 |update_postimage|8 |2023-03-13 14:41:07|
|2 |Other |66 |update_preimage |13 |2023-03-13 14:41:24|
|2 |Other |67 |update_postimage|13 |2023-03-13 14:41:24|
|2 |Other |67 |update_preimage |14 |2023-03-13 14:41:27|
|2 |Other |345|update_postimage|14 |2023-03-13 14:41:27|
|2 |Male |100|update_preimage |20 |2023-03-13 14:41:46|
|2 |Male |101|update_postimage|20 |2023-03-13 14:41:46|
|4 |Other |45 |update_preimage |15 |2023-03-13 14:41:30|
|4 |Female|678|update_postimage|15 |2023-03-13 14:41:30|
|1 |Other |55 |update_preimage |18 |2023-03-13 14:41:40|
|1 |Male |35 |update_postimage|18 |2023-03-13 14:41:40|
|2 |Other |345|update_preimage |19 |2023-03-13 14:41:43|
|2 |Male |100|update_postimage|19 |2023-03-13 14:41:43|
+---+------+---+----------------+---------------+-------------------+
```
The resulting DataFrame is the same as the result of a CDF time travel query, but this time it only contains CDC for the enabled versions, ignoring all versions for which CDF was disabled.
### Dry Run
`dryRun` works as a fail-safe to check whether there are any CDF-related issues. If it doesn't throw any errors, you can be certain the above-mentioned issues do not occur in your Delta table for the given versions. When it does fail, it throws either an `AssertionError` or an `IllegalStateException` with an appropriate error message.

`readCDF` is a plain old time travel query; this is literally the method definition:
```scala
spark.read.format("delta").option("readChangeFeed","true").option("startingVersion",0).option("endingVersion",20).load(gcs_path)
```
Pair `dryRun` with `readCDF` to detect any CDF errors in your Delta table:
```scala
ChangeDataFeedHelper(writePath, 9, 13).dryRun().readCDF
```
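Since `dryRun` surfaces problems as exceptions, here's a hedged sketch of guarding the read:

```scala
import scala.util.{Failure, Success, Try}

Try(ChangeDataFeedHelper(writePath, 9, 13).dryRun().readCDF) match {
  case Success(df) => df.show(false) // CDF between versions 9 and 13
  case Failure(e)  => println(s"CDF read not possible: ${e.getMessage}")
}
```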
If no errors are found, it will return a Spark DataFrame with the CDF between the given versions.
## Operation Metric Helpers
### Count Metrics on Delta Table between 2 versions
This function displays all the count metrics stored in the Delta logs across versions for the entire Delta table. It skips versions which do not record these count metrics and presents a unified view. It shows the growth of a Delta table by providing the record counts **deleted**, **updated**, and **inserted** against each **version**. For a **merge** operation, the source DataFrame can additionally be tallied as **source rows = (deleted + updated + inserted) rows**. Please note that you need enough driver memory to process the Delta logs at the driver level.
```scala
OperationMetricHelper(path,0,6).getCountMetricsAsDF()
```
The result will be the following:
```scala
+-------+-------+--------+-------+-----------+
|version|deleted|inserted|updated|source_rows|
+-------+-------+--------+-------+-----------+
|6 |0 |108 |0 |108 |
|5 |12 |0 |0 |0 |
|4 |0 |0 |300 |300 |
|3 |0 |100 |0 |100 |
|2 |0 |150 |190 |340 |
|1 |0 |0 |200 |200 |
|0 |0 |400 |0 |400 |
+-------+-------+--------+-------+-----------+
```
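Here's a short sketch of aggregating that output, for example to total the rows written across versions (column names are taken from the output above):

```scala
import org.apache.spark.sql.functions.sum

val metricsDF = OperationMetricHelper(path, 0, 6).getCountMetricsAsDF()
metricsDF
  .agg(sum("inserted").as("total_inserted"), sum("updated").as("total_updated"))
  .show()
```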
### Count Metrics at partition level of Delta Table
This function provides the same count metrics as the above function, but at a partition level. If operations like **MERGE, DELETE**, and **UPDATE** are executed **at a partition level**, then this function can help in visualizing the count metrics for such a partition. However, **it will not provide correct count metrics if these operations are performed across partitions**. This is because the Delta log does not store this information at a partition level, so it would need to be computed separately (we intend to take this up in the future). Please note that you need enough driver memory to process the Delta logs at the driver level.
```scala
OperationMetricHelper(path).getCountMetricsAsDF(
Some(" country = 'USA' and gender = 'Female'"))// The same metric can be obtained generally without using spark dataframe
def getCountMetrics(partitionCondition: Option[String] = None)
: Seq[(Long, Long, Long, Long, Long)]
```
The result will be the following:
```scala
+-------+-------+--------+--------+-----------+
|version|deleted|inserted| updated|source_rows|
+-------+-------+--------+--------+-----------+
| 27| 0| 0|20635530| 20635524|
| 14| 0| 0| 1429460| 1429460|
| 13| 0| 0| 4670450| 4670450|
| 12| 0| 0|20635530| 20635524|
| 11| 0| 0| 5181821| 5181821|
| 10| 0| 0| 1562046| 1562046|
| 9| 0| 0| 1562046| 1562046|
| 6| 0| 0|20635518| 20635512|
| 3| 0| 0| 5181821| 5181821|
| 0| 0|56287990| 0| 56287990|
+-------+-------+--------+--------+-----------+
```
Supported Partition condition types
```scala
// Single Partition
Some(" country = 'USA'")
// Multiple Partition with AND condition. OR is not supported.
Some(" country = 'USA' and gender = 'Female'")
// Without Single Quotes
Some(" country = USA and gender = Female")
```
## How to contribute
We welcome contributions to this project. To contribute, check out our [CONTRIBUTING.md](CONTRIBUTING.md) file.
## How to build the project
### Prerequisites
* SBT 1.8.2
* Java 8
* Scala 2.12.12
### Building
To compile, run
`sbt compile`

To test, run

`sbt test`

To generate artifacts, run

`sbt package`
## Project maintainers
* Matthew Powers aka [MrPowers](https://github.com/MrPowers)
* Brayan Jules aka [brayanjuls](https://github.com/brayanjuls)
* Joydeep Banik Roy aka [joydeepbroy-zeotap](https://github.com/joydeepbroy-zeotap)
## More about Jodie
See [this video](https://www.youtube.com/watch?v=llHKvaV0scQ) for more info about the awesomeness of Jodie!