{"id":15649152,"url":"https://github.com/sd2k/ttv","last_synced_at":"2025-04-07T08:23:38.365Z","repository":{"id":33275253,"uuid":"150728667","full_name":"sd2k/ttv","owner":"sd2k","description":"A command line tool for splitting files into test, train, and validation sets.","archived":false,"fork":false,"pushed_at":"2025-03-24T12:04:29.000Z","size":600,"stargazers_count":40,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-31T07:06:15.328Z","etag":null,"topics":["command-line","hacktoberfest","split","test","train","validation"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sd2k.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-28T11:13:21.000Z","updated_at":"2025-03-24T12:04:42.000Z","dependencies_parsed_at":"2023-10-04T22:20:20.373Z","dependency_job_id":"c9a9fdd3-a54e-4c17-a1d1-f4439d2bfff1","html_url":"https://github.com/sd2k/ttv","commit_stats":{"total_commits":281,"total_committers":9,"mean_commits":31.22222222222222,"dds":0.604982206405694,"last_synced_commit":"1780f408116c0c7bfd819ced3f3640158af127c8"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sd2k%2Fttv","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sd2k%2Fttv/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sd2k%2Fttv/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/sd2k%2Fttv/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sd2k","download_url":"https://codeload.github.com/sd2k/ttv/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247616069,"owners_count":20967321,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["command-line","hacktoberfest","split","test","train","validation"],"created_at":"2024-10-03T12:28:20.777Z","updated_at":"2025-04-07T08:23:38.340Z","avatar_url":"https://github.com/sd2k.png","language":"Rust","readme":"[![Dependabot Status](https://api.dependabot.com/badges/status?host=github\u0026repo=sd2k/ttv)](https://dependabot.com)\n\nttv - create train, test, validation sets\n=========================================\n\nttv is a command line tool for splitting large files up into chunks suitable for train/test/validation splits for machine learning. It arose from the need to split files that were too large to fit into memory, and the desire to do so in a clean way.\n\n`ttv` requires Rust 2021.\n\nInstallation\n------------\n\nBuild using `cargo build --release` to get a binary at `./target/release/ttv`. Copy it to a directory on your `PATH` to use it.\n\nUsage\n-----\n\nRun `ttv --help` to get help, or infer what you can from one of these examples:\n\n    # Split CSV file into two sets of a fixed number of rows\n    $ ttv split data.csv --rows=train=9000 --rows=test=1000\n\n    # Accepts gzipped data (no flag required). Shorthand argument version. 
As many splits as you like!\n    $ ttv split data.csv.gz --rows=train=65000,validation=15000,test=15000 -d\n\n    # Alternatively, specify proportion-based splits.\n    $ ttv split data.csv --prop=train=0.8,test=0.2\n\n    # When using proportions, include the total rows to get a progress bar\n    $ ttv split data.csv --prop=train=0.8,test=0.2 --total-rows=1234\n\n    # Accepts data from stdin, compressed or not (must give a filename)\n    $ cat data.csv | ttv split --rows=test=10000,train=90000 --output-prefix data -u\n    $ cat data.csv.gz | ttv split --rows=test=10000,train=90000 --output-prefix data -d\n\n    # Using pigz for faster decompression\n    $ pigz -dc data.csv.gz | ttv split --prop=test=0.1,train=0.9 --chunk-size 5000 --output-prefix data\n\n    # Split outputs into chunks for faster writing/reading later\n    $ ttv split data.csv.gz --rows=test=100000,train=900000 --chunk-size 5000 -d\n\n    # Write outputs uncompressed\n    $ ttv split data.csv.gz --prop=test=0.5,train=0.5\n\n    # Reproducible splits using seed\n    $ ttv split data.csv.gz --prop=test=0.5,train=0.5 --chunk-size 1000 --seed 5330 -d\n\nDevelopment\n-----------\n\nYou'll need a recent version of the Rust nightly toolchain and Cargo. Then just hack away as normal.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsd2k%2Fttv","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsd2k%2Fttv","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsd2k%2Fttv/lists"}