{"id":13574291,"url":"https://github.com/webdataset/tarp","last_synced_at":"2025-12-15T17:44:05.415Z","repository":{"id":44471023,"uuid":"250972860","full_name":"webdataset/tarp","owner":"webdataset","description":"Fast and simple stream processing of files in tar files, useful for deep learning, big data, and many other applications.","archived":false,"fork":false,"pushed_at":"2023-12-10T21:07:49.000Z","size":9555,"stargazers_count":128,"open_issues_count":11,"forks_count":15,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-11T01:40:04.278Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/webdataset.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-29T06:58:06.000Z","updated_at":"2025-04-11T01:22:20.000Z","dependencies_parsed_at":"2024-06-20T06:08:05.204Z","dependency_job_id":null,"html_url":"https://github.com/webdataset/tarp","commit_stats":null,"previous_names":["tmbdev/tarp"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/webdataset/tarp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/webdataset%2Ftarp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/webdataset%2Ftarp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/webdataset%2Ftarp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/webdataset%2Ftarp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/webdataset","download_url":"https://codeload.github.com/webdataset/tarp/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/webdataset%2Ftarp/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265625587,"owners_count":23800625,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T15:00:49.754Z","updated_at":"2025-12-15T17:44:00.343Z","avatar_url":"https://github.com/webdataset.png","language":"Go","funding_links":[],"categories":["Go"],"sub_categories":[],"readme":"[code metrics](https://goreportcard.com/report/github.com/tmbdev/tarp)\n\n# The `tarp` Utility\n\nTarfiles are commonly used for storing large amounts of data in an efficient,\nsequential access, compressed file format, in particular for deep learning\napplications. For processing and data transformation,\npeople usually unpack them, operate over the files, and tar up the result again.\n\nThe `tarp` utility is a port of the Python [tarproc](http://github.com/tmbdev/tarproc)\nutilities to Go. The `tarp` utility is a single executable, a \"Swiss army knife\"\nfor dataset transformations.\n\nAvailable commands are:\n\n- create: create tar files from a list of tar paths and corresponding data sources\n- cat: concatenate tar files\n- proc: process tar files\n- sort: sort tar files\n- split: split tar files\n\nFor `tarp cat`, sources and destinations can be ZMQ URLs (specified using zpush/zpull,\nzpub/zsub, or zr versions that reverse connect/bind). 
This permits very large\nsorting, processing, and shuffling networks to be set up (Kubernetes is a good platform\nfor this).\n\nAll commands require a \"-o\" option for the output in order to avoid accidental\nfile clobbering. You can specify \"-\" if you want to output to stdout.\n\n# Installation\n\nThe `tarp` command line utility is a standard Golang command line program. You need to\n[install Go](https://golang.org/doc/install). Afterwards, you can install `tarp` with:\n\n\t$ go get -v github.com/tmbdev/tarp/tarp\n\t\nAlternatively, you can also install from a local clone:\n\n\tgit clone https://github.com/tmbdev/tarp.git\n\tcd tarp\n\tmake bin/tarp\n\tsudo make install\n\n# Examples\n\nDownload a dataset from Google Cloud, shuffle it, and split it into shards containing\n1000 training samples each:\n\n```Bash\ngsutil cat gs://bucket/file.tar | tarp sort - -o - | tarp split -c 1000 -o 'output-%06d.tar'\n```\n\nCreate a dataset for images stored in directories whose names represent class labels,\ncreate shards consisting of 1000 images each, and upload them to Google Cloud:\n\n```Bash\nfor classdir in *; do\n    test -d \"$classdir\" || continue\n    for image in \"$classdir\"/*.png; do\n        imageid=$(basename \"$image\" .png)\n        echo \"$imageid.txt text:$classdir\"\n        echo \"$imageid.png file:$image\"\n    done\ndone |\nsort |\ntarp create -o - - |\ntarp split -c 1000 -o 'dataset-%06d.tar' \\\n    -p 'gsutil cp %s gs://mybucket/; rm %s'\n```\n\n(Note that in an actual application, you probably want to shuffle the\nsamples in the text file you create after the sort command.)\n\n\n# Internals\n\nInternally, data processing is handled using goroutines and channels passing\naround samples. Samples are simple key/value stores of type `map[string][]byte`.\nMost processing steps are pipeline elements. The general programming style is:\n\n```Go\nfunc ProcessSamples(parameters...) 
func(inch Pipe, outch Pipe) {\n\treturn func(inch Pipe, outch Pipe) {\n\t\t...\n\t\tfor sample := range inch {\n\t\t\t...\n\t\t}\n\t\t...\n\t\tclose(outch)\n\t}\n}\n```\n\nNote that unlike simple Golang pipeline examples, the caller\nallocates the output channel; this gives code building pipelines\nout of processing stages a bit more control.\nFurthermore, construction of pipeline elements\ninvolves an outer and an inner function (\"currying\"). This lets us\nwrite pipelines more naturally.\nFor example, you can write code like this:\n\n```Go\nsource := TarSource(fname)\nsink := TarSink(fname)\npipeline := Pipeline(\n\tSliceSamples(0, 100),\n\tLogProgress(10, \"progress\"),\n\tRenameSamples(renamings, false),\n)\nProcessing(source, pipeline, sink)\n```\n\nThe main processing library is in the `datapipes` subdirectory;\ntests for the library functions are also found there (run with\n`go test` in that subdirectory).\nThe top-level command and its subcommands are defined in `cmd`.\nTests for the command line functions can be executed with `./run-tests`\nfrom the top of the source tree.\n\n# Status\n\nThis is fairly new software. 
The command line interface is fairly stable,\nbut the internal APIs may still change substantially.\n\nFuture work:\n\n- high priority\n    - add GitHub testing and release workflows\n    - make function/library naming more consistent\n    - add ParallelMapSamples\n    - more documentation\n    - add 'tarp sendeof'\n    - add tensorcom tensor outputs\n    - refactor GOpen/GCreate\n    - create dispatch for GOpen/GCreate\n    - add key rewriting / key grouping via regexp\n    - integrate go-tfdata\n- medium priority\n    - switch sort backend from sqlite3 to bbolt or badger\n    - performance optimizations (remove needless copying)\n    - add different FnameSplit options\n    - close to 100% test coverage for Go\n    - more command line tests\n    - Kubernetes examples for large scale processing\n    - add basic image processing, decompression, etc. functionality\n    - add Lua scripting to `tarp proc` for fast internal processing\n    - switch to interface and registry for GOpen (from current ad hoc code)\n    - spec: JSON files for inputs\n    - replace mpio with cbor\n    - add multiple ZMQ sources as options\n- low priority\n    - use Go libraries for accessing cloud/object storage directly\n    - TFRecord/tf.Example interoperability\n    - add JSON input to \"tarp create\"\n    - add separator option to \"tarp create\"\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwebdataset%2Ftarp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwebdataset%2Ftarp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwebdataset%2Ftarp/lists"}