{"id":24344394,"url":"https://github.com/xtream1101/s3-tar","last_synced_at":"2025-04-09T17:15:18.090Z","repository":{"id":46070209,"uuid":"239206288","full_name":"xtream1101/s3-tar","owner":"xtream1101","description":"Stream s3 data into a tar file in s3","archived":false,"fork":false,"pushed_at":"2021-11-16T17:11:15.000Z","size":93,"stargazers_count":26,"open_issues_count":6,"forks_count":11,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-09T17:15:10.047Z","etag":null,"topics":["archive","cli","python","s3","stream","tar"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xtream1101.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-08T21:23:03.000Z","updated_at":"2025-01-31T02:18:44.000Z","dependencies_parsed_at":"2022-09-26T18:30:43.660Z","dependency_job_id":null,"html_url":"https://github.com/xtream1101/s3-tar","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtream1101%2Fs3-tar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtream1101%2Fs3-tar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtream1101%2Fs3-tar/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xtream1101%2Fs3-tar/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xtream1101","download_url":"https://codeload.github.com/xtream1101/s3-tar/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com"
,"kind":"github","repositories_count":248074921,"owners_count":21043490,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archive","cli","python","s3","stream","tar"],"created_at":"2025-01-18T09:35:41.453Z","updated_at":"2025-04-09T17:15:18.071Z","avatar_url":"https://github.com/xtream1101.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# s3-tar\n\n[![PyPI](https://img.shields.io/pypi/v/s3-tar.svg)](https://pypi.python.org/pypi/s3-tar)\n[![PyPI](https://img.shields.io/pypi/l/s3-tar.svg)](https://pypi.python.org/pypi/s3-tar)  \n\n\nCreate a `tar`/`tar.gz`/`tar.bz2` file from many s3 files and stream it back into s3.   \n\n## Install\n`pip install s3-tar`\n\n\n## Usage\n\nSet up s3 credentials on your system using any of these options:\n- Environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`\n- Shared credential file `~/.aws/credentials`\n- AWS config file `~/.aws/config`\n\nFor details, see the AWS docs: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html  \n\nSet the environment variable `S3_ENDPOINT_URL` to use a custom s3 host (minio, etc.). This is not needed when using AWS s3. \n\nThis will use very little RAM. As it downloads files, it streams up the tar'd pieces as it goes.  \nYou can use more or less RAM by adjusting the options `cache_size` \u0026 `part_size_multiplier`.  
\n\n\n\n### Import\n```python\nfrom s3_tar import S3Tar\n\n# Init the job\njob = S3Tar(\n    'YOUR_BUCKET_NAME',\n    'FILE_TO_SAVE_TO.tar',  # Use `tar.gz` or `tar.bz2` to enable compression\n    # target_bucket=None,  # Default: source bucket. Can be used to save the archive into a different bucket\n    # min_file_size='50MB',  # Default: None. The min size to make each tar file [B,KB,MB,GB,TB]. If set, a number will be added to each file name\n    # save_metadata=False,  # If True, and the file has metadata, save a file with the same name using the suffix of `.metadata.json`\n    # remove_keys=False,  # If True, will delete s3 files after the tar is created\n\n    # ADVANCED USAGE\n    # allow_dups=False,  # When False, will raise ValueError if a file will overwrite another in the tar file, set to True to ignore\n    # cache_size=5,  # Default 5. Number of files to hold in memory to be processed\n    # s3_max_retries=4,  # Default is 4. This value is passed into boto3.client's s3 botocore config as the `max_attempts`\n    # part_size_multiplier=10,  # Multiplied by 5 MB to set how large each uploaded part should be\n    # session=boto3.session.Session(),  # For custom aws session\n)\n# Add files, can call multiple times to add files from other directories\njob.add_files(\n    'FOLDER_IN_S3/',\n    # folder='',  # If a folder is set, then all files from this directory will be added into that folder in the tar file\n    # preserve_paths=False,  # If True, it will use the dir paths relative to the input path inside the tar file\n)\n# Add a single file at a time\njob.add_file(\n    'some/file_key.json',\n    # folder='',  # If a folder is set, then the file will be added into that folder in the tar file\n)\n# Start the tar'ing job after files have been added\njob.tar()\n```\n\n\n### Command Line\nTo see all command line options run:  \n```\ns3-tar -h\nusage: s3-tar [-h] --source-bucket 
SOURCE_BUCKET --folder FOLDER --filename FILENAME [--target-bucket TARGET_BUCKET] [--min-filesize MIN_FILESIZE] [--save-metadata] [--remove]\n              [--preserve-paths] [--allow-dups] [--cache-size CACHE_SIZE] [--s3-max-retries S3_MAX_RETRIES] [--part-size-multiplier PART_SIZE_MULTIPLIER]\n\nTar (and compress) files in s3\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --source-bucket SOURCE_BUCKET\n                        base bucket to use\n  --folder FOLDER       folder whose contents should be combined\n  --filename FILENAME   Output filename for the tar file. Extension: tar, tar.gz, or tar.bz2\n  --target-bucket TARGET_BUCKET\n                        Bucket that the tar will be saved to. Only needed if different then source bucket\n  --min-filesize MIN_FILESIZE\n                        Use to create multiple files if needed. Min filesize of the tar'd files in [B,KB,MB,GB,TB]. e.x. 5.2GB\n  --save-metadata       If a file has metadata, save it to a .metadata.json file\n  --remove              Delete files that were added to the tar file\n  --preserve-paths      Preserve the path layout relative to the input folder\n  --allow-dups          ADVANCED: Allow duplicate filenames to be saved into the tar file\n  --cache-size CACHE_SIZE\n                        ADVANCED: Number of files to download into memory at a time\n  --s3-max-retries S3_MAX_RETRIES\n                        ADVANCED: Max retries for each request the s3 client makes\n  --part-size-multiplier PART_SIZE_MULTIPLIER\n                        ADVANCED: Multiplied by 5MB to set the max size of each upload chunk\n```\n\n\n#### CLI Examples\nThis example will take all the files in the bucket `my-data` in the folder `2020/07/01` and save them into a gzip-compressed tar file in the same bucket, in the directory `Archive`.\n```\ns3-tar --source-bucket my-data --folder 2020/07/01 --filename Archive/2020-07-01.tar.gz\n```\n\nNow let's say you have a large amount of data 
and it would create a tar file too large to work with. This example will create files that are ~2.5GB each and save them into a different bucket. Inside each tar file it will also preserve the folder structure as it is in s3.\n```\ns3-tar --preserve-paths --source-bucket my-big-data --folder 2009 --target-bucket my-archived-data --filename big_data/2009-archive.tar.gz --min-filesize 2.5GB\n```\nIn the bucket `my-archived-data`, in the folder `big_data/` there will be multiple files named:\n- 2009-archive-1.tar.gz\n- 2009-archive-2.tar.gz\n- 2009-archive-3.tar.gz\n\n\n#### Notes\n\n- If you know the files you are adding will not have any duplicate names (or you are ok with duplicates), you can set `--allow-dups` in the cli or pass `allow_dups=True` to the `S3Tar` class to get better performance, since it will not have to check each file's name.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxtream1101%2Fs3-tar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxtream1101%2Fs3-tar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxtream1101%2Fs3-tar/lists"}