{"id":18810393,"url":"https://github.com/absaoss/spark-metadata-tool","last_synced_at":"2025-04-13T20:30:57.660Z","repository":{"id":41817791,"uuid":"402073228","full_name":"AbsaOSS/spark-metadata-tool","owner":"AbsaOSS","description":"Tool to fix _spark_metadata from Structured Streaming queries","archived":false,"fork":false,"pushed_at":"2023-11-06T13:50:37.000Z","size":135,"stargazers_count":6,"open_issues_count":3,"forks_count":1,"subscribers_count":11,"default_branch":"develop","last_synced_at":"2024-04-12T07:05:56.180Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null}},"created_at":"2021-09-01T13:35:21.000Z","updated_at":"2023-09-28T13:59:59.000Z","dependencies_parsed_at":"2023-02-16T05:45:31.280Z","dependency_job_id":"7ce7640c-6236-400c-a680-c583fbeb3d30","html_url":"https://github.com/AbsaOSS/spark-metadata-tool","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-metadata-tool","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-metadata-tool/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-metadata-tool/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-metadata-tool/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codeload
.github.com/AbsaOSS/spark-metadata-tool/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223603266,"owners_count":17172072,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T23:20:04.026Z","updated_at":"2024-11-07T23:20:04.678Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# spark-metadata-tool\nTool to fix _spark_metadata from Structured Streaming queries\n\n## Motivation\nSpark Structured Streaming references data files using absolute paths, which makes it impossible to move the data to a different location without breaking functionality.\nThe tool solves this issue by fixing all paths in Spark metadata files to point to the current location of the data.\n\n## Features\nThe tool currently provides 4 run modes: `fix-paths`, `merge`, `compare-metadata-with-data` and `create-metadata`.\n\n### fix-paths\n- Fixes all paths in metadata files to point to the current location of the data\n\nNote that the tool doesn't perform any validation and assumes files are in the correct state. 
Consider the following example:\n\n- Old data location: `hdfs://old/path/old_root`\n- Current data location: `s3://bucket/new_root`\n\nIn this case Spark metadata files might contain paths like\n```\nhdfs://old/path/old_root/partition1=x/partition2=y/file.parquet\n```\nAfter the fix\n```\ns3://bucket/new_root/partition1=x/partition2=y/file.parquet\n```\n\nOnly the base path is changed; **no checks are performed** on whether the `.parquet` files or partition folders actually exist.\n\n### merge\n- Merges content of the metadata files into the files in another Spark metadata directory\n\nExample:\n- Old `_spark_metadata` directory containing metadata files `0`, `1`, `2`, `3`, `3.compact`, `4`, `5`\n- New `_spark_metadata` directory containing metadata files `0`, `1`, `2`, `3`, `4`, `5`, `5.compact`, `6`\n\n1. The tool finds the target file, into which it will write the data. This is either the latest `.compact` file, or the earliest regular file if no `.compact` files are present.\nIn the above example, merged data would be written into file `5.compact`.\n2. The tool determines which files from the old `_spark_metadata` directory to use for merging. These are either the latest `.compact` file and all following regular files in order,\nor simply all regular files if no `.compact` files are present.\n3. The contents of the old metadata files are merged into the target file. For the example above, the result would be as follows:\n```\nversion                 // First line taken from target file, i.e. `5.compact`.\nlines from 3.compact    // Contents of the latest .compact file from the old metadata directory. Version is omitted.\nlines from 4            // Contents of the following regular file from the old metadata directory. Version is omitted.\nlines from 5            // Same as with `4`.\nlines from 5.compact    // Remaining contents of the target metadata file, i.e. 
5.compact from the new metadata directory.\n```\n\nIn every run mode, the tool offers the following universal features:\n- Creates a backup of each file before processing\n- Backup is deleted after a successful run (can be overridden to keep the backup)\n- Currently supported file systems:\n    - S3\n    - Unix\n    - HDFS\n\n### compare-metadata-with-data\n- Compares metadata records with data and logs all inconsistencies\n\nNote that the tool does not perform any operation on the file system\n\n### create-metadata\n- Generates a `_spark_metadata` directory with Spark streaming metadata from the latest compaction file\n\nExample:\n* Let `hdfs://authority/path/to/data/` be a path with partitioned or unpartitioned data,\n  (i.e. `hdfs://authority/path/to/data/p1=k11/.../pn=kn1/part-wxyz-\u003cuuid\u003e.c001.snappy.parquet`,\n  or simpler version without partitioning `hdfs://authority/path/to/data/part-wxyz-\u003cuuid\u003e.c001.snappy.parquet`)\n\n1. The tool will create the `hdfs://authority/path/to/data/_spark_metadata` directory in the data root directory\n2. The tool will search for all data files in the directory tree and extract their metadata (e.g. `size`, `path`, `lastModified`, ...).\n   The returned set will be alphanumerically **ordered** and **trimmed** to match `--max-micro-batch-number`\n3. It will calculate the last compaction from the provided parameters `--max-micro-batch-number` and `--compaction-number`\n4. Metadata **up to last compaction** will be written to `hdfs://authority/path/to/data/_spark_metadata/\u003clast_compaction\u003e.compact`\n5. 
**Rest** of the metadata will be written to **non-compacted** metadata files **up to max micro batch number**:\n   `hdfs://authority/path/to/data/_spark_metadata/\u003cmax_micro_batch_number\u003e`\n\n\u003e **NOTE**\n\u003e\n\u003e Metadata are aligned to the `--max-micro-batch-number`, so if the `--compaction-number` is higher than \n\u003e the number of metadata files, it can produce empty, but still valid, metadata files.\n\n## Usage\n### Obtaining\nThe application is published as a standalone executable JAR. Simply download the most recent version of the file `spark-metadata-tool_2.13-x.y.z-assembly.jar` from the [package repository](https://github.com/orgs/AbsaOSS/packages?repo_name=spark-metadata-tool).\n\n### Building\nTo build the package locally, use the command\n```\nsbt clean assembly\n```\n\n### Running\nRun the application by executing the JAR with the desired arguments, e.g.\n```\njava -jar spark-metadata-tool_2.13-x.y.z-assembly.jar fix-paths --path \"s3://bucket/foo/baz\"\n```\n\nThe target filesystem is derived automatically from the provided path:\n- `s3://`             for S3 storage\n- `/`                 for Unix filesystem\n- `hdfs://\u003curl:port\u003e` for HDFS filesystem\n\n### Complete list of allowed arguments:\n```\nUsage: spark-metadata-tool [fix-paths|merge|create-metadata] [options]\n\nCommand: fix-paths [options]\nFix paths in Spark metadata files to match current location\n  -p, --path \u003cvalue\u003e       full path to the data folder, including filesystem (e.g. s3://bucket/foo/root)\n\n\nCommand: merge [options]\nMerge Spark metadata files from 2 directories\n  -o, --old \u003cvalue\u003e        full path to the old data folder, including filesystem (e.g. s3://bucket/foo/old)\n  -n, --new \u003cvalue\u003e        full path to the new data folder, including filesystem (e.g. 
s3://bucket/foo/new)\n\nCommand: compare-metadata-with-data [options]\nCompares metadata records with data and log all inconsistencies\n  -p, --path \u003cvalue\u003e       full path to the data folder, including filesystem (e.g. s3://bucket/foo/root)\n  \nCommand: create-metadata [options]\nCreate Spark structured streaming metadata\n  -p, --path \u003cvalue\u003e      full path to data folder, including filesystem (e.g. s3://bucket/foo/root)\n  -m, --max-micro-batch-number \u003cvalue\u003e\n                          set max batch number\n  -c, --compaction-number \u003cvalue\u003e\n                          set compaction number\n\n\nOther options:\n  -k, --keep-backup        persist backup files after successful run\n  -v, --verbose            increase verbosity of application logging\n  --log-to-file            enable logging to a file\n  --dry-run                enable dry run mode\n  --help                   print this usage text\n```\n\n### S3 Credentials\nTo be able to perform any operation on S3 you must provide AWS credentials. The easiest way to do so is to set the environment variables\n`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`. The application will read them automatically. For more information, as well as other\nways to provide credentials, see [Using credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).\n\n### HDFS Setup\nTo be able to perform any operation on HDFS you must set the environment variable `HADOOP_CONF_DIR`.\n\n## Linking\nThis project is not meant to be linked against, as it is published as an executable fat JAR. 
\n\n## Publishing artifacts\nArtifacts are published into [GitHub Packages Apache Maven registry](https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages).\n\nTo publish a package, you will need a Personal Access Token with `read:packages` and `write:packages` permissions, stored in the environment variable `GITHUB_TOKEN`.\nFor other ways to provide a PAT, see the [sbt-github-packages](https://github.com/djspiewak/sbt-github-packages) plugin page.\n\n- To create a new release, check out a new release branch and push it to the remote.\nThen switch to the SBT project `publishing` and run the `release` task:\n```\nsbt clean release\n```\n\n- To simply publish the current version of the artifact, run\n```\nsbt clean publishSigned\n```\n\nNote that automatic overwriting of packages (including `-SNAPSHOT` versions) is currently not supported by [GitHub Packages](https://docs.github.com/en/packages/learn-github-packages/introduction-to-github-packages)\nand will fail unless the previous package is manually deleted.\n\n## How to generate a code coverage report\n```sbt\nsbt jacoco\n```\nThe code coverage report will be generated at:\n```\n{project-root}/target/scala-{scala_version}/jacoco/report/html\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspark-metadata-tool","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fspark-metadata-tool","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspark-metadata-tool/lists"}