{"id":17132209,"url":"https://github.com/bitsofinfo/s3-bucket-loader","last_synced_at":"2025-04-13T07:55:46.299Z","repository":{"id":21693789,"uuid":"25015120","full_name":"bitsofinfo/s3-bucket-loader","owner":"bitsofinfo","description":"Utility for quickly loading or copying massive amount of files into S3, optionally via yas3fs or any other S3 filesystem abstraction; as well from s3 bucket to bucket (mirroring/copy)","archived":false,"fork":false,"pushed_at":"2014-11-10T19:55:28.000Z","size":1228,"stargazers_count":42,"open_issues_count":0,"forks_count":7,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-13T07:55:41.422Z","etag":null,"topics":["aws","bulk-loader","bulkimport","copy-files","file-transfer","s3","yas3fs"],"latest_commit_sha":null,"homepage":"https://bitsofinfo.wordpress.com/2014/11/10/copying-lots-of-files-into-s3-and-within-s3-using-s3-bucket-loader/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bitsofinfo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-10-10T01:13:49.000Z","updated_at":"2024-08-12T19:15:03.000Z","dependencies_parsed_at":"2022-08-18T05:10:30.328Z","dependency_job_id":null,"html_url":"https://github.com/bitsofinfo/s3-bucket-loader","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitsofinfo%2Fs3-bucket-loader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitsofinfo%2Fs3-bucket-loader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitsofinfo%2Fs3-bucket
-loader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bitsofinfo%2Fs3-bucket-loader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bitsofinfo","download_url":"https://codeload.github.com/bitsofinfo/s3-bucket-loader/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248681491,"owners_count":21144700,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","bulk-loader","bulkimport","copy-files","file-transfer","s3","yas3fs"],"created_at":"2024-10-14T19:26:22.618Z","updated_at":"2025-04-13T07:55:46.262Z","avatar_url":"https://github.com/bitsofinfo.png","language":"Java","readme":"s3-bucket-loader\n================\n\nThis project originated out of a need to quickly import (and back up) a massive number of files (hundreds of gigabytes) into an AWS S3 bucket, \nwith the ultimate intent that this bucket be managed going forward via the S3 distributed file-system \n[yas3fs](https://github.com/danilop/yas3fs). Initial attempts at doing this the traditional way \n(i.e. rsyncing or copying from source to destination) quickly became impractical due to the sheer\n amount of time that single-threaded, and even limited multi-threaded, copiers would take.\n\ns3-bucket-loader leverages a simple master/worker paradigm to get economies of scale when copying many files from sourceA to targetB. 
\n\"sourceA\" and \"targetB\" could be two S3 buckets, or a file-system and an S3 bucket (via an S3 file-system abstraction like yas3fs, s3fs, etc.).\nEven though this is coded with S3 as the ultimate destination, it could be used for other targets as well, including other shared file-systems.\nThe speed at which you can import a given file-set into S3 (through yas3fs in this case) is limited only by how much money you \nwant to spend on worker hardware. For example, this has been used to import and validate in S3 over 35k files (11gb total) \nin roughly 16 minutes, using 40 EC2 t2.medium instances as workers. In another scenario, it was used to import and validate\nover 800k files totaling roughly 600gb in under 8 hours. This program has also been used to copy the previously imported\nbuckets to secondary 'backup' buckets in under an hour.\n\n\n![Alt text](/diagram1.png \"Diagram1\")\n\n![Alt text](/diagram2.png \"Diagram2\")\n\n## How it works\n\nThis is a multi-threaded Java program that can be launched in two modes: `master` or `worker`. The `master` is \nresponsible for determining a table of contents (TOC) (i.e. file paths) whose entries are candidates for WRITE to the \ndestination and subsequently VALIDATED. The `master` node streams these TOC events over an SQS queue which is \nconsumed by one or more `workers`. Each `worker` must also have access to the `source` from which the TOC \nwas generated. The `source` data could be the same physical set of files, an S3 bucket, a copy of them, or whatever... it really \ndoes not matter; they just need to be accessible from each `worker` (i.e. via a SAN/NAS/NFS share, source S3 bucket, etc.). 
\nThe `worker` then copies each item, either via an S3 key-copy or (in the case of files) via rsync (or cp) to S3 through an S3 FS abstraction.\nIt uses rsync to preserve uid/gid information, which is important for the ultimate consumer, and ensures preservation \nif written to S3 via S3 file-system abstractions like [yas3fs](https://github.com/danilop/yas3fs). \nIt is also important to note that each `worker` leverages N threads to increase parallelism and maximize the \nthroughput to S3. The more `workers` you have, the faster it goes.\n\nPlease see [s3BucketLoader.sample.properties](https://github.com/bitsofinfo/s3-bucket-loader/blob/master/src/main/resources/s3BucketLoader.sample.properties) for\nmore details on configuration options and how to use them.\n\n## Flow overview\n\n1. The end user starts the Master, which creates the SNS control-channel and SQS TOC queue\n\n2. The Master (optionally) launches N worker nodes on EC2\n\n3. As each worker node initializes, it subscribes to the control-channel and publishes that it is INITIALIZED\n\n4. Once the master sees all of its workers in the INITIALIZED state, the master changes the state to WRITE\n\n5. The master begins creating the TOC (consisting of path, isDirectory and size), and sends an SQS message for each file to the TOC queue. Again, the 'source' for these\nTOC entries could be a path realized via the file-system, or a file-like key name in a source S3 bucket.\n\n6. Workers begin consuming TOC messages off the queue and execute their TOCPayloadHandler, which might do an S3 key-copy or \nan rsync (or cp) from the source -\u003e destination through an S3 file-system abstraction. As workers are consuming, they periodically \nsend CURRENT SUMMARY updates to the master. If `failfast` is configured and any failures are detected, the master can \nswitch the cluster to ERROR_REPORT mode immediately (see below). Depending on the handler, they can also do chowns, chmods, etc. \n\n7. 
When workers are complete, they publish their WRITE SUMMARY and go into an IDLE state\n\n8. The master receives all WRITE SUMMARYs from the workers\n  * If no errors, the master transitions to the VALIDATE state, and sends the TOC to the queue again\n  * If errors, the master transitions to the ERROR_REPORT state, and requests error details from the workers\n\n9. In the VALIDATE state, all workers consume TOC file paths from the SQS queue and attempt to verify that each file exists \nand its size matches the expected TOC size (locally and/or via S3 object meta-data calls). When complete, they go into the IDLE state and publish their VALIDATE SUMMARY\n\n10. After receiving all VALIDATE SUMMARYs from the workers\n  * If no errors, the master issues a shutdown command to all workers, then optionally terminates all instances\n  * If errors, the master transitions to the ERROR_REPORT state, and requests error details from the workers\n\n11. In the ERROR_REPORT state, workers summarize and publish their errors from either the WRITE or VALIDATE state; \nthe master aggregates them and reports them to the master log file for analysis. All workers are then shut down.\n\n12. At any stage, issuing a control-C on the master triggers a shutdown of the entire cluster, \nincluding EC2 worker termination if configured in the properties file\n\n\n## How to run\n\n* Clone this repository\n\n* You need a Java JDK installed, preferably 1.6+\n\n* You need [Maven](http://maven.apache.org/) installed\n\n* Change dir to the root of the project and run `mvn package` (this will build a runnable jar under target/)\n\n* Copy the [s3BucketLoader.sample.properties](https://github.com/bitsofinfo/s3-bucket-loader/blob/master/src/main/resources/s3BucketLoader.sample.properties) \nfile under src/main/resources, make your own copy, and customize it. 
\n\n* Run the command below to launch, first on the MASTER, and then on the WORKERS (which the Master can do itself...)\n```\njava -jar -DisMaster=true|false -Ds3BucketLoaderHome=/some/dir -DconfigFilePath=s3BucketLoader.properties s3-bucket-loader-0.0.1-SNAPSHOT.jar\n```\n\n* The sample properties should be fairly self-explanatory. It's important to understand that it is up \nto YOU to properly configure your environment for both the master and worker(s). The `master` needs access to the \ngold-copy \"source\" files that you want to get into S3. The `workers` need access to both the \"source\" files and \nsome sort of S3 target (via an S3 file-system abstraction like yas3fs). Note that s3-bucket-loader can automatically \nconfigure your workers for you... you just need to configure a 'user-data' startup script for the EC2 instances \nthat your `master` will launch. An example/sample one that I have used previously is provided under \n[ec2-init-s3BucketLoader.sample.py](src/main/resources/ec2-init-s3BucketLoader.sample.py). For example, when EC2 launches your\n workers, a startup script can pull all packages needed to prepare the environment from another S3 bucket, install things, \n configure, and even pull down the latest s3-bucket-loader jar file and the worker properties file, and finally launch the worker.\n\nEnjoy. \n\n\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitsofinfo%2Fs3-bucket-loader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbitsofinfo%2Fs3-bucket-loader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbitsofinfo%2Fs3-bucket-loader/lists"}