https://github.com/sd2k/ttv
A command line tool for splitting files into test, train, and validation sets.
- Host: GitHub
- URL: https://github.com/sd2k/ttv
- Owner: sd2k
- Created: 2018-09-28T11:13:21.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2024-10-23T12:01:39.000Z (14 days ago)
- Last Synced: 2024-10-29T19:43:45.136Z (8 days ago)
- Topics: command-line, hacktoberfest, split, test, train, validation
- Language: Rust
- Size: 556 KB
- Stars: 40
- Watchers: 4
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
README
[![Dependabot Status](https://api.dependabot.com/badges/status?host=github&repo=sd2k/ttv)](https://dependabot.com)
ttv - create train, test, validation sets
=========================================

ttv is a command line tool for splitting large files into chunks suitable for train/test/validation splits for machine learning. It arose from the need to split files that were too large to fit into memory, and the desire to do it in a clean way.
`ttv` requires Rust 2021.
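
To illustrate how a file can be split without ever loading it into memory, here is a minimal sketch in Rust. It is not ttv's actual implementation: the names `SplitMix64`, `pick_split`, and `split_stream` are hypothetical, and the hand-rolled PRNG is an assumption made only to keep the example free of external crates. The sketch assigns each line independently according to the given proportions, so split sizes are approximate, and it ignores CSV headers, compression, and chunking.

```rust
use std::io::{self, BufRead, Write};

/// Tiny deterministic PRNG (SplitMix64). Used here only so the sketch
/// needs no external crates; this is an assumption, not ttv's generator.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Pick a split index for one row by walking the cumulative proportions.
fn pick_split(props: &[f64], u: f64) -> usize {
    let mut cumulative = 0.0;
    for (i, p) in props.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    // Guard against rounding when the proportions sum to roughly 1.0.
    props.len() - 1
}

/// Stream `input` line by line, writing each line to one of `outputs`.
/// Only the current line is held in memory. Returns rows written per split.
fn split_stream<R: BufRead, W: Write>(
    input: R,
    outputs: &mut [W],
    props: &[f64],
    seed: u64,
) -> io::Result<Vec<usize>> {
    assert_eq!(outputs.len(), props.len());
    let mut rng = SplitMix64(seed);
    let mut counts = vec![0usize; props.len()];
    for line in input.lines() {
        let line = line?;
        let idx = pick_split(props, rng.next_f64());
        writeln!(outputs[idx], "{}", line)?;
        counts[idx] += 1;
    }
    Ok(counts)
}
```

Calling `split_stream` twice with the same input, proportions, and seed produces identical outputs, which is the property a seeded split relies on. Any `BufRead` works as input (a file, stdin, or a byte slice), and any `Write` works as an output.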
Installation
------------

Build using `cargo build --release` to get a binary at `./target/release/ttv`. Copy this into your path to use it.
Usage
-----

Run `ttv --help` to get help, or infer what you can from one of these examples:

    # Split CSV file into two sets of a fixed number of rows
    $ ttv split data.csv --rows=train=9000 --rows=test=1000

    # Accepts gzipped data (no flag required). Shorthand argument version. As many splits as you like!
    $ ttv split data.csv.gz --rows=train=65000,validation=15000,test=15000 -d

    # Alternatively, specify proportion-based splits.
    $ ttv split data.csv --prop=train=0.8,test=0.2

    # When using proportions, include the total rows to get a progress bar
    $ ttv split data.csv --prop=train=0.8,test=0.2 --total-rows=1234

    # Accepts data from stdin, compressed or not (must give a filename)
    $ cat data.csv | ttv split --rows=test=10000,train=90000 --output-prefix data -u
    $ cat data.csv.gz | ttv split --rows=test=10000,train=90000 --output-prefix data -d

    # Using pigz for faster decompression
    $ pigz -dc data.csv.gz | ttv split --prop=test=0.1,train=0.9 --chunk-size 5000 --output-prefix data

    # Split outputs into chunks for faster writing/reading later
    $ ttv split data.csv.gz --rows=test=100000,train=900000 --chunk-size 5000 -d

    # Write outputs uncompressed
    $ ttv split data.csv.gz --prop=test=0.5,train=0.5

    # Reproducible splits using seed
    $ ttv split data.csv.gz --prop=test=0.5,train=0.5 --chunk-size 1000 --seed 5330 -d

Development
-----------

You'll need a recent version of the Rust nightly toolchain and Cargo. Then just hack away as normal.