{"id":19705542,"url":"https://github.com/treeverse/hadoop-router-fs","last_synced_at":"2025-04-29T15:30:46.845Z","repository":{"id":38242156,"uuid":"470896187","full_name":"treeverse/hadoop-router-fs","owner":"treeverse","description":"RouterFileSystem is a Hadoop FileSystem implementation that transforms URIs at runtime according to provided configurations. It then routes file system operations to another Hadoop file system that executes it against the underlying object store.","archived":false,"fork":false,"pushed_at":"2023-09-18T09:45:15.000Z","size":495,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-04-16T10:58:31.149Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/treeverse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-03-17T07:50:59.000Z","updated_at":"2024-04-16T10:58:31.150Z","dependencies_parsed_at":"2022-09-05T09:41:42.038Z","dependency_job_id":null,"html_url":"https://github.com/treeverse/hadoop-router-fs","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Fhadoop-router-fs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Fhadoop-router-fs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Fhadoop-router-fs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Fhadoop-router-fs/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/treeverse","download_url":"https://codeload.github.com/treeverse/hadoop-router-fs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224178299,"owners_count":17268862,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T21:28:51.409Z","updated_at":"2024-11-11T21:28:52.066Z","avatar_url":"https://github.com/treeverse.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# hadoop-router-fs\n\n[RouterFileSystem](src/main/java/io/lakefs/RouterFileSystem.java) is a Hadoop [FileSystem](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html) \nimplementation that transforms URIs at runtime according to provided configurations. It then routes file system operations to \nanother Hadoop file system that executes it against the underlying object store. \n\n## Use-cases \n\n- Interact with multiple storages side-by-side, without making any changes to your code.    \n- Migrate a collection to a new storage location without changing your Spark application code, or breaking it.\n\n## Build instructions \n\n#### Pre-requisites \n- Install maven \n\n#### Steps\n\n1. Clone the repo:\n    ```shell\n    git clone git@github.com:treeverse/hadoop-router-fs.git\n    ```\n\n2. 
Build with maven:\n   ```shell\n   mvn clean install\n   ```\n\n## How to configure RouterFS \n\n### Configure Spark to use RouterFS\n\nInstruct Spark to use RouterFS as the file system implementation for the URIs you would like to transform at runtime by adding the following property to your Spark configurations: \n```properties\nfs.${fromFsScheme}.impl=io.lakefs.routerfs.RouterFileSystem\n```\n\nFor example, by adding `fs.s3a.impl=io.lakefs.routerfs.RouterFileSystem` you are instructing Spark to use RouterFS as the file system for any \nURI with `scheme=s3a`.\n\n### Add custom mapping configurations\n\nRouterFS consumes your mapping configurations to understand which paths it needs to modify and how to modify them. It then \nperforms a simple prefix replacement accordingly.  \nMapping configurations are Hadoop properties of the form:\n```properties\nrouterfs.mapping.${fromFsScheme}.${mappingIdx}.(replace|with)=${path-prefix}\n```  \nFor a given URI, RouterFS scans the mapping configurations defined for the URI's scheme, searches for the first mapping\nconfiguration that matches the URI prefix, and transforms the URI according to the matching configuration.\n\n#### Notes about mapping configurations:\n\n* Make sure your source prefix ends with a slash when needed.\n* Mapping configurations are applied in order, and it is up to you to create non-conflicting configurations.\n\n### Default file system\n\nFor each mapped scheme you should configure a default file system implementation to use in case no mapping is found.  
\nAdd the following configuration for each scheme you configured RouterFS to handle:\n```properties\nrouterfs.default.fs.${fromFsScheme}=${the file system you used for this scheme without routerFS}\n```\nFor example, by adding:\n```properties\nrouterfs.default.fs.s3a=org.apache.hadoop.fs.s3a.S3AFileSystem\n```\nyou are instructing RouterFS to use `S3AFileSystem` for any URI with `scheme=s3a` for which RouterFS did not find\na mapping configuration.\n\n#### When no mapping was found\n\nIf RouterFS can't find a matching mapping configuration, it hands the URI to the [default\nfile system](#default-file-system) configured for the URI's scheme.\n\n**Example**\n\nGiven the following mapping configurations:\n```properties\nfs.s3a.impl=io.lakefs.routerfs.RouterFileSystem\n# mapping 1: source prefix (replace) and destination prefix (with)\nrouterfs.mapping.s3a.1.replace=s3a://bucket/dir1/\nrouterfs.mapping.s3a.1.with=lakefs://repo/main/\n# mapping 2: source prefix (replace) and destination prefix (with)\nrouterfs.mapping.s3a.2.replace=s3a://bucket/dir2/\nrouterfs.mapping.s3a.2.with=lakefs://example-repo/dev/\n# default file system implementation for the `s3a` scheme\nrouterfs.default.fs.s3a=org.apache.hadoop.fs.s3a.S3AFileSystem\n```\n\n* For the URI `s3a://bucket/dir1/foo.parquet`, RouterFS performs the following steps:\n  1. Scan all `routerfs` mapping configurations that include the `s3a` scheme in their key: `routerfs.mapping.s3a.${mappingIdx}.replace`.\n  2. Iterate over the configurations in the priority order specified by `${mappingIdx}` and try to match the URI prefix against the configuration values. The iteration stops on reaching the `s3a://bucket/dir1/` prefix, which matches the URI `s3a://bucket/dir1/foo.parquet`.\n  3. Replace the matched prefix with the destination mapping value `lakefs://repo/main/` to create the desired URI: `lakefs://repo/main/foo.parquet`.\n\n\n* For the URI `s3a://bucket/dir3/bar.parquet`, RouterFS performs the following steps:\n  1. 
Scan all `routerfs` mapping configurations that include the `s3a` scheme in their key: `routerfs.mapping.s3a.${mappingIdx}.replace`.\n  2. Iterate over the configurations in the priority order specified by `${mappingIdx}` and try to match the URI prefix against the configuration values. The iteration stops with no matching mapping.\n  3. Fall back to the [default file system](#default-file-system) implementation (`S3AFileSystem`) and leave the URI as it is.\n\n### Configure File System Implementations\n\nThe final configuration step is to instruct Spark which file system to use for each URI scheme. Make sure to \nadd this configuration for any URI scheme you defined a mapping configuration for.\nFor example, to instruct Spark to use `S3AFileSystem` for any URI with `scheme=lakefs`:\n```properties\nfs.lakefs.impl=org.apache.hadoop.fs.s3a.S3AFileSystem\n```\n\n## Usage\n\n### Run your Spark Application with RouterFS \n\nAfter [building](#build-instructions) RouterFS, the build artifact is a jar under the `target` directory. \nSupply this jar to your Spark application when running it, or place it under your `$SPARK_HOME/jars` directory. \n\n### Usage with lakeFS \n\nThe current version of RouterFS only works for Spark applications that interact with lakeFS via the [S3 Gateway](https://docs.lakefs.io/integrations/spark.html#access-lakefs-using-the-s3a-gateway). \nThat is, you can't use both RouterFS and LakeFSFileSystem together, but we have [concrete plans](https://github.com/treeverse/lakeFS/issues/3058) to make this work.\n\n### `S3AFileSystem`\n\nThe current version of RouterFS requires the use of S3AFileSystem's [per-bucket configuration](https://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets) functionality to support multiple mappings that use \nS3AFileSystem as their file system implementation. That means that the compiled Hadoop version should be \u003e= 2.8.0.  
\nThe per-bucket configurations treat the first part of the path (also called the \"authority\") as the bucket for which the S3A file system properties are configured.  \nFor example, for the following configurations:\n```properties\nfs.s3a.impl=io.lakefs.routerfs.RouterFileSystem\nrouterfs.mapping.s3a.1.replace=s3a://bucket/dir/\nrouterfs.mapping.s3a.1.with=lakefs://repo/branch/\nrouterfs.default.fs.s3a=org.apache.hadoop.fs.s3a.S3AFileSystem\n\nfs.lakefs.impl=org.apache.hadoop.fs.s3a.S3AFileSystem\n\n# The following configs are used for URIs of the form `lakefs://repo/...`\nfs.s3a.bucket.repo.endpoint=https://lakefs.example.com\nfs.s3a.bucket.repo.access.key=AKIAlakefs12345EXAMPLE\nfs.s3a.bucket.repo.secret.key=abc/lakefs/1234567bPxRfiCYEXAMPLEKEY\n...\n# The following configs are used for any non-mapped s3a URIs\nfs.s3a.endpoint=https://s3.us-east-1.amazonaws.com\nfs.s3a.access.key=...\nfs.s3a.secret.key=...\n```\nthe configurations that begin with `fs.s3a.bucket.repo` are used when accessing `lakefs://repo/\u003cpath\u003e`.  \nAll other `fs.s3a.\u003cconf\u003e` properties are used for the general case.\n\n### Working example\n\nPlease refer to the [sample app](./sample_app/README.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftreeverse%2Fhadoop-router-fs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftreeverse%2Fhadoop-router-fs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftreeverse%2Fhadoop-router-fs/lists"}