{"id":27334831,"url":"https://github.com/dmuth/tarsplit","last_synced_at":"2025-07-21T03:33:50.656Z","repository":{"id":57473476,"uuid":"315516060","full_name":"dmuth/tarsplit","owner":"dmuth","description":"A utility to split tarballs into smaller pieces while keeping files intact.","archived":false,"fork":false,"pushed_at":"2022-06-19T21:50:14.000Z","size":4921,"stargazers_count":18,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-07-18T08:13:05.753Z","etag":null,"topics":["docker","tar"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dmuth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null},"funding":{"github":"dmuth"}},"created_at":"2020-11-24T04:14:59.000Z","updated_at":"2025-07-07T05:40:27.000Z","dependencies_parsed_at":"2022-09-26T17:40:53.900Z","dependency_job_id":null,"html_url":"https://github.com/dmuth/tarsplit","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/dmuth/tarsplit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Ftarsplit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Ftarsplit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Ftarsplit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Ftarsplit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dmuth","download_url":"https://codeload.github.com/dmuth/tarsplit/tar.gz/refs/heads
/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dmuth%2Ftarsplit/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266235510,"owners_count":23897181,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","tar"],"created_at":"2025-04-12T14:46:26.322Z","updated_at":"2025-07-21T03:33:50.629Z","avatar_url":"https://github.com/dmuth.png","language":"Shell","funding_links":["https://github.com/sponsors/dmuth"],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"./img/tarsplit.png\" width=\"300\" align=\"right\" /\u003e\n\n# Tarsplit\n\nA utility to split tarballs into smaller pieces along file boundaries.\n\nThis is useful for gigantic tarballs that need to be split up so that they can fit on USB sticks, more reasonably sized Docker layers, or whatever.\n\n\n\n## Installation\n\n\n### Preferred method\n\n```python3 -m pip install tarsplit```\n\n\n### Manually\n\n```python3 -m pip install git+https://github.com/dmuth/tarsplit.git```\n\n\n## Usage\n\n`tarsplit [ --dry-run ] tarball num_files`\n\nExample run:\n\n\u003cimg src=\"./img/tarsplit-run.png\" /\u003e\n\n\n## FAQ\n\n### How does it work?\n\nThis script is written in Python, and uses the \u003ca href=\"https://docs.python.org/3/library/tarfile.html\"\u003etarfile module\u003c/a\u003e \nto read and write tarfiles.  
This has the advantage of not having to extract the entire tarball,\nunlike the previous version of this app, which was written as a Bash shell script.\n\n\n### Why?\n\nWhile working on \u003ca href=\"https://github.com/dmuth/splunk-lab\"\u003eSplunk Lab\u003c/a\u003e, I kept running into\nan issue where a particular layer in the Docker image was a gigabyte in size.  This was a challenge because\na number of wall-clock seconds were wasted when processing the large layer after a push or pull.  If \nonly there were a way to split that layer up into multiple smaller layers, which Docker would then \ntransfer in parallel...\n\nWhile investigating, I found that the culprit was a very large tarball.  I wanted a way to split that\ntarball into multiple smaller tarballs, each of which contained a portion of the filesystem.  Then, I could\nbuild multiple Docker images, each with a portion of the original tarball's files, with each image\ninheriting from the previous one.  This would leverage one of the things Docker is good at: layered filesystems.\n\n\n### This is slow on large files.  Ever hear of multithreading?\n\nYeah, I tried that after release 1.0.  It turns out that, even when using every trick I knew,\na multithreaded approach consisting of one thread per chunk to be written was *slower* than just\ndoing everything in a single thread.  I observed this on a 10-core machine with an SSD, so I'm\njust gonna go ahead and point the finger at the GIL and remind myself that threading in Python is cursed.\n\n\n### What about asyncio?\n\nI used asyncio successfully for another project and haven't ruled it out.  I am, however, skeptical because of the\nvery high level of disk usage.  
Async I/O would be more appropriate for dozens/hundreds of writers hitting\nthe disk occasionally, and this is not the case here.\n\n\n## Development\n\n### Support scripts\n\n- `bin/create-test-tarball.sh` - Create a test tarball with directories and files inside.\n- `sha1-from-directory.sh` - Get a recursive list of all files in a directory, sort it, SHA1 each file, then concatenate all SHA1s and SHA1 that!\n- `sha1-from-tarball.sh` - Extract a tarball, then do the same thing to the contents as `sha1-from-directory.sh`.\n\n\n### Publishing a new package\n\n- `rm -rfv dist`\n- Bump version number in `setup.py`\n- `python3 ./setup.py sdist`\n- `twine upload dist/*`\n\n\n### Tests\n\nTests can be run with `tests.sh`.  A successful run looks something like this:\n\n\u003cimg src=\"./img/tests.png\" /\u003e\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmuth%2Ftarsplit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdmuth%2Ftarsplit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdmuth%2Ftarsplit/lists"}