# spark-metadata-tool
Tool to fix _spark_metadata from Structured Streaming queries

## Motivation
Spark Structured Streaming references data files using absolute paths, which makes it impossible to move the data to a different location without breaking functionality.
The tool solves this issue by fixing all paths in Spark metadata files to point to the current location of the data.

## Features
The tool currently provides four run modes: `fix-paths`, `merge`, `compare-metadata-with-data`, and `create-metadata`.

### fix-paths
- Fixes all paths in metadata files to point to the current location of the data

Note that the tool doesn't perform any validation and assumes files are in the correct state. Consider the following example:

- Old data location: `hdfs://old/path/old_root`
- Current data location: `s3://bucket/new_root`

In this case Spark metadata files might contain paths like
```
hdfs://old/path/old_root/partition1=x/partition2=y/file.parquet
```
After the fix
```
s3://bucket/new_root/partition1=x/partition2=y/file.parquet
```

Only the base path is changed; **no checks are performed** to verify that the `.parquet` files or partition folders actually exist.
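For instance, a run fixing the metadata for the example above could look roughly like the following invocation (the bucket and folder names are illustrative; the `--path` flag is described in the usage section below):
```
java -jar spark-metadata-tool_2.13-x.y.z-assembly.jar fix-paths --path "s3://bucket/new_root"
```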

### merge
- Merges the content of the metadata files from one Spark metadata directory into the files of another

Example:
- Old `_spark_metadata` directory containing metadata files `0`, `1`, `2`, `3`, `3.compact`, `4`, `5`
- New `_spark_metadata` directory containing metadata files `0`, `1`, `2`, `3`, `4`, `5`, `5.compact`, `6`

1. The tool finds the target file into which it will write the data. This is either the latest `.compact` file, or the earliest regular file if no `.compact` files are present.
In the above example, the merged data would be written into the file `5.compact`.
2. The tool determines which files from the old `_spark_metadata` directory to use for merging. These are either the latest `.compact` file and all subsequent regular files in order,
or simply all regular files if no `.compact` files are present.
3. The contents of the old metadata files are merged into the target file. For the example above, the result would be as follows:
```
version // First line taken from the target file, i.e. `5.compact`.
lines from 3.compact // Contents of the latest .compact file from the old metadata directory. Version is omitted.
lines from 4 // Contents of the following regular file from the old metadata directory. Version is omitted.
lines from 5 // Same as with `4`.
lines from 5.compact // Remaining contents of the target metadata file, i.e. 5.compact from the new metadata directory.
```
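A merge for the example above could be invoked roughly as follows (the folder names are illustrative; the `--old` and `--new` flags are described in the usage section below):
```
java -jar spark-metadata-tool_2.13-x.y.z-assembly.jar merge \
  --old "s3://bucket/foo/old" \
  --new "s3://bucket/foo/new"
```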

In every run mode, the tool offers the following universal features:
- Creates a backup of each file before processing
- The backup is deleted after a successful run (this can be overridden to keep the backup; see the example after this list)
- Currently supported filesystems:
  - S3
  - Unix
  - HDFS
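For example, the universal flags can be combined with any command, e.g. to keep the backup files after a successful run or to preview the changes with a dry run (illustrative invocation):
```
java -jar spark-metadata-tool_2.13-x.y.z-assembly.jar fix-paths --path "s3://bucket/foo/root" --keep-backup --dry-run
```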

### compare-metadata-with-data
- Compares metadata records with the data files and logs all inconsistencies

Note that this mode does not perform any modifying operations on the file system.
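A typical invocation could look like this (the path is illustrative; the `--path` flag is described in the usage section below):
```
java -jar spark-metadata-tool_2.13-x.y.z-assembly.jar compare-metadata-with-data --path "s3://bucket/foo/root"
```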

### create-metadata
- Generates a `_spark_metadata` directory with Spark streaming metadata from the latest compaction file

Example:
* Let `hdfs://authority/path/to/data/` be a path with partitioned or unpartitioned data
(e.g. `hdfs://authority/path/to/data/p1=k11/.../pn=kn1/part-wxyz-.c001.snappy.parquet`,
or a simpler version without partitioning, `hdfs://authority/path/to/data/part-wxyz-.c001.snappy.parquet`)

1. The tool will create the `hdfs://authority/path/to/data/_spark_metadata` directory in the data root directory
2. The tool will search for all data files in the directory tree and extract their metadata (e.g. `size`, `path`, `lastModified`, ...).
The returned set will be alphanumerically **ordered** and **trimmed** to match `--max-micro-batch-number`
3. It will calculate the last compaction from the provided parameters `--max-micro-batch-number` and `--compaction-number`
4. Metadata **up to the last compaction** will be written to a `.compact` file in `hdfs://authority/path/to/data/_spark_metadata/`
5. The **rest** of the metadata will be written to **non-compacted** metadata files, **up to the max micro batch number**, in
`hdfs://authority/path/to/data/_spark_metadata/`

> **NOTE**
>
> Metadata are aligned to the `--max-micro-batch-number`, so if the `--compaction-number` is higher than
> the number of metadata files, the tool can produce empty, but still valid, metadata files.
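An illustrative invocation for the example above could look like the following; the values passed to `--max-micro-batch-number` and `--compaction-number` are placeholders chosen for illustration:
```
java -jar spark-metadata-tool_2.13-x.y.z-assembly.jar create-metadata \
  --path "hdfs://authority/path/to/data" \
  --max-micro-batch-number 10 \
  --compaction-number 5
```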

## Usage
### Obtaining
The application is published as a standalone executable JAR. Simply download the most recent version of the file `spark-metadata-tool_2.13-x.y.z-assembly.jar` from the [package repository](https://github.com/orgs/AbsaOSS/packages?repo_name=spark-metadata-tool).

### Building
To build the package locally, use the command
```
sbt clean assembly
```

### Running
Run the application by executing the JAR with the desired arguments, e.g.
```
java -jar spark-metadata-tool_2.13-x.y.z-assembly.jar fix-paths --path "s3://bucket/foo/baz"
```

The target filesystem is derived automatically from the provided path:
- `s3://` for S3 storage
- `/` for Unix filesystem
- `hdfs://` for HDFS filesystem
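For example, the same command can target HDFS or a local Unix path simply by changing the `--path` value (the paths are illustrative):
```
java -jar spark-metadata-tool_2.13-x.y.z-assembly.jar fix-paths --path "hdfs://authority/foo/baz"
java -jar spark-metadata-tool_2.13-x.y.z-assembly.jar fix-paths --path "/data/foo/baz"
```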

### Complete list of allowed arguments:
```
Usage: spark-metadata-tool [fix-paths|merge|compare-metadata-with-data|create-metadata] [options]

Command: fix-paths [options]
Fix paths in Spark metadata files to match current location
-p, --path full path to the data folder, including filesystem (e.g. s3://bucket/foo/root)

Command: merge [options]
Merge Spark metadata files from 2 directories
-o, --old full path to the old data folder, including filesystem (e.g. s3://bucket/foo/old)
-n, --new full path to the new data folder, including filesystem (e.g. s3://bucket/foo/new)

Command: compare-metadata-with-data [options]
Compares metadata records with data and log all inconsistencies
-p, --path full path to the data folder, including filesystem (e.g. s3://bucket/foo/root)

Command: create-metadata [options]
Create Spark structured streaming metadata
-p, --path full path to data folder, including filesystem (e.g. s3://bucket/foo/root)
-m, --max-micro-batch-number
set max batch number
-c, --compaction-number
set compaction number

Other options:
-k, --keep-backup persist backup files after successful run
-v, --verbose increase verbosity of application logging
--log-to-file enable logging to a file
--dry-run enable dry run mode
--help print this usage text
```

### S3 Credentials
To be able to perform any operation on S3, you must provide AWS credentials. The easiest way to do so is to set the environment variables
`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; the application will read them automatically. For more information, as well as other
ways to provide credentials, see [Using credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
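For example, in a Unix-like shell the credentials can be exported before running the tool (the values below are placeholders):
```
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
java -jar spark-metadata-tool_2.13-x.y.z-assembly.jar fix-paths --path "s3://bucket/foo/root"
```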

### HDFS Set up
To be able to perform any operation on HDFS, you must set the environment variable `HADOOP_CONF_DIR` to point to your Hadoop configuration directory.
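For example (the configuration directory below is an assumption and depends on your Hadoop installation):
```
export HADOOP_CONF_DIR=/etc/hadoop/conf
java -jar spark-metadata-tool_2.13-x.y.z-assembly.jar fix-paths --path "hdfs://authority/foo/root"
```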

## Linking
This project is not meant to be linked against, as it is published as an executable fat JAR.

## Publishing artifacts
Artifacts are published into [GitHub Packages Apache Maven registry](https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages).

To publish a package, you will need a Personal Access Token with `read:packages` and `write:packages` permissions, stored in the environment variable `GITHUB_TOKEN`.
For other ways to provide PAT, see [sbt-github-packages](https://github.com/djspiewak/sbt-github-packages) plugin page.

- To create a new release, check out a new release branch and push it to the remote.
Then switch to the SBT project `publishing` and run the `release` task:
```
sbt clean release
```

- To simply publish the current version of the artifact, run
```
sbt clean publishSigned
```

Note that automatic overwriting of packages (including `-SNAPSHOT` versions) is currently not supported by [GitHub Packages](https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages),
so publishing will fail unless the previous package is manually deleted.

## How to generate a code coverage report
```sbt
sbt jacoco
```
The code coverage report will be generated at:
```
{project-root}/target/scala-{scala_version}/jacoco/report/html
```