{"id":13429172,"url":"https://github.com/sstadick/hck","last_synced_at":"2025-05-16T05:04:10.297Z","repository":{"id":40412602,"uuid":"379939977","full_name":"sstadick/hck","owner":"sstadick","description":"A sharp cut(1) clone.","archived":false,"fork":false,"pushed_at":"2025-03-14T13:37:33.000Z","size":927,"stargazers_count":711,"open_issues_count":6,"forks_count":18,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-05-16T05:03:21.401Z","etag":null,"topics":["command-line","rust","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sstadick.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE-MIT","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-06-24T13:48:14.000Z","updated_at":"2025-05-10T20:32:15.000Z","dependencies_parsed_at":"2024-01-15T16:59:03.426Z","dependency_job_id":"12197cf0-f60c-495d-b1a9-1e3dd23e52c5","html_url":"https://github.com/sstadick/hck","commit_stats":null,"previous_names":[],"tags_count":61,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sstadick%2Fhck","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sstadick%2Fhck/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sstadick%2Fhck/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sstadick%2Fhck/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sstadick","download_url":"https://codeload.github.com/sstadi
ck/hck/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254471061,"owners_count":22076585,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["command-line","rust","text-processing"],"created_at":"2024-07-31T02:00:28.908Z","updated_at":"2025-05-16T05:04:10.276Z","avatar_url":"https://github.com/sstadick.png","language":"Rust","readme":"# 🪓 hck\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/sstadick/hck/actions?query=workflow%3ACheck\"\u003e\u003cimg src=\"https://github.com/sstadick/hck/workflows/Check/badge.svg\" alt=\"Build Status\"\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/crates/l/hck.svg\" alt=\"license\"\u003e\n  \u003ca href=\"https://crates.io/crates/hck\"\u003e\u003cimg src=\"https://img.shields.io/crates/v/hck.svg?colorB=319e8c\" alt=\"Version info\"\u003e\u003c/a\u003e\u003cbr\u003e\n  A sharp \u003ci\u003ecut(1)\u003c/i\u003e clone.\n\u003c/p\u003e\n\n_`hck` is a shortening of `hack`, a rougher form of `cut`._\n\nA close to drop in replacement for cut that can use a regex delimiter instead of a fixed string.\nAdditionally this tool allows for specification of the order of the output columns using the same column selection syntax as cut (see below for examples).\n\nNo single feature of `hck` on its own makes it stand out over `awk`, `cut`, `xsv` or other such tools. 
Where `hck` excels is making common things easy, such as reordering output fields, or splitting records on a weird delimiter.\nIt is meant to be simple and easy to use while exploring datasets.\nThink of this as filling a gap between `cut` and `awk`.\n\n`hck` is dual-licensed under MIT or the [UNLICENSE](https://unlicense.org/).\n\n## Features\n\n- Reordering of output columns! i.e. if you use `-f4,2,8` the output columns will appear in the order `4`, `2`, `8`\n- Delimiter treated as a regex, i.e. you can split on multiple spaces without an extra pipe to `tr`!\n- Specification of output delimiter\n- Selection of columns by header string literal with the `-F` option, or by regex by setting the `-r` flag\n- Input files will be automatically decompressed if their file extension is recognizable and a local binary exists to perform the decompression (similar to ripgrep). See [Decompression](#decompression).\n- Output can be gzip compressed using the multi-threaded compressors from [`gzp`](https://github.com/sstadick/gzp) with `-Z` flag\n  - This gzipped output is in BGZF format and can be indexed and queried with `tabix`\n- Exclude fields by index or by header.\n- Speed\n\n## Non-goals\n\n- `hck` does not aim to be a complete CSV / TSV parser a la `xsv` which will respect quoting rules. It acts similar to `cut` in that it will split on the delimiter no matter where in the line it is.\n- Delimiters cannot contain newlines... well they can, they will just never be seen. 
`hck` will always be a line-by-line tool where newlines are the standard `\\n` or `\\r\\n`.\n\n## Install\n\n- Homebrew / Linuxbrew\n\n```bash\nbrew tap sstadick/hck\nbrew install hck\n```\n\n- Conda\n\n```bash\n# Note, this version lags by about a day\nconda install -c conda-forge hck\n```\n\n- MacPorts\n\n```bash\n# Note, version may lag behind latest\nsudo port selfupdate\nsudo port install hck\n```\n\n- Debian (Ubuntu)\n\n```bash\ncurl -LO https://github.com/sstadick/hck/releases/download/\u003clatest\u003e/hck-linux-amd64.deb\nsudo dpkg -i hck-linux-amd64.deb\n```\n\n\\* Built with profile guided optimizations\n\n- With the Rust toolchain:\n\n```bash\nexport RUSTFLAGS='-C target-cpu=native'\ncargo install hck\n```\n\n- From the [releases page](https://github.com/sstadick/hck/releases) (the binaries have been built with profile guided optimizations)\n\n- Or, if you want the absolute fastest possible build that makes use of profile guided optimizations AND native cpu features:\n\n```bash\n# Assumes you are on stable rust\n# NOTE: this won't work on windows, see CI for linked issue\ncargo install just\ngit clone https://github.com/sstadick/hck\ncd hck\njust install-native\n```\n\n- PRs are both welcome and encouraged for adding more packaging options and build types! 
I'd especially welcome PRs for the windows family of package managers / general making sure things are windows friendly.\n\n## Packaging status\n\n[![Packaging status](https://repology.org/badge/vertical-allrepos/hck.svg)](https://repology.org/project/hck/versions)\n\n## Examples\n\n### Splitting with a string literal\n\n```bash\n❯ hck -Ld' ' -f1-3,5- ./README.md | head -n4\n#       🪓      hck\n\n\u003cp      align=\"center\"\u003e\n                \u003ca      src=\"https://github.com/sstadick/hck/workflows/Check/badge.svg\" alt=\"Build      Status\"\u003e\u003c/a\u003e\n```\n\n### Splitting with a regex delimiter\n\n```bash\n# note, '\\s+' is the default\n❯ ps aux | hck -f1-3,5- | head -n4\nUSER    PID     %CPU    VSZ     RSS     TTY     STAT    START   TIME    COMMAND\nroot    1       0.0     169452  13472   ?       Ss      Jun21   0:19    /sbin/init      splash\nroot    2       0.0     0       0       ?       S       Jun21   0:00    [kthreadd]\nroot    3       0.0     0       0       ?       I\u003c      Jun21   0:00    [rcu_gp]\n```\n\n### Reordering output columns\n\n```bash\n❯ ps aux | hck -f2,1,3- | head -n4\nPID     USER    %CPU    %MEM    VSZ     RSS     TTY     STAT    START   TIME    COMMAND\n1       root    0.0     0.0     169452  13472   ?       Ss      Jun21   0:19    /sbin/init      splash\n2       root    0.0     0.0     0       0       ?       S       Jun21   0:00    [kthreadd]\n3       root    0.0     0.0     0       0       ?       I\u003c      Jun21   0:00    [rcu_gp]\n```\n\n### Excluding output columns\n\n```bash\n❯ ps aux | hck -e3,5 | head -n4\nUSER    PID     %MEM    RSS     TTY     STAT    START   TIME    COMMAND\nroot    1       0.0     14408   ?       Ss      Jun21   0:27    /sbin/init      splash\nroot    2       0.0     0       ?       S       Jun21   0:01    [kthreadd]\nroot    3       0.0     0       ?       
I\u003c      Jun21   0:00    [rcu_gp]\n```\n\n### Excluding output columns by header regex\n\n```bash\n❯  ps aux | hck -r -E \"CPU\" -E \"^ST.*\" | head -n4\nUSER    PID     %MEM    VSZ     RSS     TTY     TIME    COMMAND\nroot    1       0.0     170224  14408   ?       0:27    /sbin/init      splash\nroot    2       0.0     0       0       ?       0:01    [kthreadd]\nroot    3       0.0     0       0       ?       0:00    [rcu_gp]\n```\n\n### Changing the output record separator\n\n```bash\n❯ ps aux | hck -D'___' -f2,1,3 | head -n4\nPID___USER___%CPU\n1___root___0.0\n2___root___0.0\n3___root___0.0\n```\n\n### Select columns with regex\n\n```bash\n# Note the order matches the order of the -F args\nps aux | hck -r -F '^ST.*' -F '^USER$' | head -n4\nSTAT    START   USER\nSs      Jun21   root\nS       Jun21   root\nI\u003c      Jun21   root\n```\n\n### Automagic decompression\n\n```bash\n❯ gzip ./README.md\n❯ hck -Ld' ' -f1-3,5- -z ./README.md.gz | head -n4\n#       🪓      hck\n\n\u003cp      align=\"center\"\u003e\n                \u003ca      src=\"https://github.com/sstadick/hck/workflows/Check/badge.svg\" alt=\"Build      Status\"\u003e\u003c/a\u003e\n```\n\n### Splitting on multiple characters\n\n```bash\n# with string literal\n❯ printf 'this$;$is$;$a$;$test\\na$;$b$;$3$;$four\\n' \u003e test.txt\n❯ hck -Ld'$;$' -f3,4 ./test.txt\na       test\n3       four\n# with an interesting regex\n❯ printf 'this123__is456--a789-test\\na129_-b849-_3109_-four\\n' \u003e test.txt\n❯ hck -d'\\d{3}[-_]+' -f3,4 ./test.txt\na       test\n3       four\n```\n\n### Splitting by-index and by-header\n\nThis one requires some explaining first. Basically, by-index and by-header selections each have their own \"order\", and then the orders are merged, for example:\n\n```bash\n❯ printf 'a,b,c,d,e\\n1,2,3,4,5\\n' | hck -d, -D: -f3 -F 'b' -F 'a'\nb:c:a\n2:3:1\n```\n\nIn the by-index group, we've specified column 3 to be in output position 0. 
In the by-header group, we've specified that column `b` be in position 0. The by-index and by-header selections are then merged; when two outputs are assigned the same output position, they are emitted in input order (that is, the order of the columns in the input data).\n\nThis can lead to unexpected outcomes, such as the following example where `a` now comes first in the output when compared to the example above.\n\n```bash\n❯ printf 'a,b,c,d,e\\n1,2,3,4,5\\n' | hck -d, -D: -f3 -F 'a'\na:c\n1:3\n```\n\nTakeaway: be careful when a specific output order is desired and you are mixing and matching by-index and by-header field selections.\n\n## Benchmarks\n\nThis set of benchmarks is simply meant to show that `hck` is in the same ballpark as other tools. These are meant to capture real world usage of the tools, so in the multi-space delimiter benchmark for `gcut`, for example, we use `tr` to convert the space runs to a single space and then pipe to `gcut`.\n\n*Note* this is not meant to be an authoritative set of benchmarks; it is just meant to give a relative sense of performance of different ways of accomplishing the same tasks.\n\n#### Hardware\n\nUbuntu 20 AMD Ryzen 9 3950X 16-Core Processor w/ 64 GB DDR4 memory and 1TB NVMe Drive\n\n#### Data\n\nThe [all_train.csv](https://archive.ics.uci.edu/ml/machine-learning-databases/00347/all_train.csv.gz) data is used.\n\nThis is a CSV dataset with 7 million lines. 
We test it both using `,` as the delimiter, and then also using `\\s\\s\\s` as a delimiter.\n\nPRs are welcome for benchmarks with more tools, or improved (but still realistic) pipelines for commands.\n\n#### Tools\n\n`cut`:\n  - https://www.gnu.org/software/coreutils/manual/html_node/The-cut-command.html\n  - 8.30\n\n`mawk`:\n  - https://invisible-island.net/mawk/mawk.html\n  - v1.3.4\n\n`xsv`:\n  - https://github.com/BurntSushi/xsv\n  - v0.13.0 (compiled locally with optimizations)\n\n`tsv-utils`:\n  - https://github.com/eBay/tsv-utils\n  - v2.2.0 (ldc2, compiled locally with optimizations)\n\n`choose`:\n  - https://github.com/theryangeary/choose\n  - v1.3.2 (compiled locally with optimizations)\n\n### Single character delimiter benchmark\n\n| Command                                                      |      Mean [s] | Min [s] | Max [s] |    Relative |\n| :----------------------------------------------------------- | ------------: | ------: | ------: | ----------: |\n| `hck -Ld, -f1,8,19 ./hyper_data.txt \u003e /dev/null`             | 1.198 ± 0.015 |   1.185 |   1.215 |        1.00 |\n| `hck -Ld, -f1,8,19 --no-mmap ./hyper_data.txt \u003e /dev/null`   | 1.349 ± 0.029 |   1.320 |   1.389 | 1.13 ± 0.03 |\n| `hck -d, -f1,8,19  ./hyper_data.txt \u003e /dev/null`             | 1.649 ± 0.023 |   1.624 |   1.673 | 1.38 ± 0.03 |\n| `hck -d, -f1,8,19  --no-mmap ./hyper_data.txt \u003e /dev/null`   | 1.869 ± 0.019 |   1.842 |   1.894 | 1.56 ± 0.02 |\n| `tsv-select -d, -f 1,8,19 ./hyper_data.txt \u003e /dev/null`      | 1.702 ± 0.021 |   1.687 |   1.734 | 1.42 ± 0.02 |\n| `choose -f , -i ./hyper_data.txt 0 7 18  \u003e /dev/null`        | 4.285 ± 0.092 |   4.214 |   4.428 | 3.58 ± 0.09 |\n| `xsv select -d, 1,8,19 ./hyper_data.txt \u003e /dev/null`         | 5.693 ± 0.042 |   5.635 |   5.745 | 4.75 ± 0.07 |\n| `awk -F, '{print $1, $8, $19}' ./hyper_data.txt \u003e /dev/null` | 4.993 ± 0.029 |   4.959 |   5.030 | 4.17 ± 0.06 |\n| `cut -d, -f1,8,19 ./hyper_data.txt \u003e 
/dev/null`              | 7.541 ± 1.250 |   6.827 |   9.769 | 6.30 ± 1.05 |\n\n\n### Multi-character delimiter benchmark\n\n| Command                                                                                       |       Mean [s] | Min [s] | Max [s] |     Relative |\n| :-------------------------------------------------------------------------------------------- | -------------: | ------: | ------: | -----------: |\n| `hck -Ld'   ' -f1,8,19 ./hyper_data_multichar.txt \u003e /dev/null`                                |  1.718 ± 0.003 |   1.715 |   1.722 |         1.00 |\n| `hck -Ld'   ' -f1,8,19 --no-mmap ./hyper_data_multichar.txt \u003e /dev/null`                      |  2.191 ± 0.072 |   2.135 |   2.291 |  1.28 ± 0.04 |\n| `hck -d'   ' -f1,8,19 ./hyper_data_multichar.txt \u003e /dev/null`                                 |  2.180 ± 0.029 |   2.135 |   2.208 |  1.27 ± 0.02 |\n| `hck -d'   ' --no-mmap -f1,8,19 ./hyper_data_multichar.txt \u003e /dev/null`                       |  2.542 ± 0.014 |   2.529 |   2.565 |  1.48 ± 0.01 |\n| `hck -d'[[:space:]]+' -f1,8,19 ./hyper_data_multichar.txt \u003e /dev/null`                        |  8.597 ± 0.023 |   8.575 |   8.631 |  5.00 ± 0.02 |\n| `hck -d'[[:space:]]+' --no-mmap -f1,8,19 ./hyper_data_multichar.txt \u003e /dev/null`              |  8.890 ± 0.013 |   8.871 |   8.903 |  5.17 ± 0.01 |\n| `hck -d'\\s+' -f1,8,19 ./hyper_data_multichar.txt \u003e /dev/null`                                 | 10.014 ± 0.247 |   9.844 |  10.449 |  5.83 ± 0.14 |\n| `hck -d'\\s+' -f1,8,19 --no-mmap ./hyper_data_multichar.txt \u003e /dev/null`                       | 10.173 ± 0.035 |  10.111 |  10.193 |  5.92 ± 0.02 |\n| `choose -f '   ' -i ./hyper_data_multichar.txt 0 7 18  \u003e /dev/null`                           |  6.537 ± 0.148 |   6.452 |   6.799 |  3.80 ± 0.09 |\n| `choose -f '[[:space:]]' -i ./hyper_data_multichar.txt 0 7 18  \u003e /dev/null`                   | 10.656 ± 0.219 |  10.484 |  10.920 |  6.20 ± 0.13 |\n| `choose 
-f '\\s' -i ./hyper_data_multichar.txt 0 7 18  \u003e /dev/null`                            | 37.238 ± 0.153 |  37.007 |  37.383 | 21.67 ± 0.10 |\n| `awk -F' ' '{print $1, $8 $19}' ./hyper_data_multichar.txt \u003e /dev/null`                       |  6.673 ± 0.064 |   6.595 |   6.734 |  3.88 ± 0.04 |\n| `awk -F'   ' '{print $1, $8, $19}' ./hyper_data_multichar.txt \u003e /dev/null`                    |  5.947 ± 0.098 |   5.896 |   6.121 |  3.46 ± 0.06 |\n| `awk -F'[:space:]+' '{print $1, $8, $19}' ./hyper_data_multichar.txt \u003e /dev/null`             | 11.080 ± 0.215 |  10.881 |  11.376 |  6.45 ± 0.13 |\n| `\u003c ./hyper_data_multichar.txt tr -s ' ' \\| cut -d ' ' -f1,8,19 \u003e /dev/null`                   |  7.471 ± 0.066 |   7.397 |   7.561 |  4.35 ± 0.04 |\n| `\u003c ./hyper_data_multichar.txt tr -s ' ' \\| xsv select -d ' ' 1,8,19 --no-headers \u003e /dev/null` |  6.172 ± 0.068 |   6.071 |   6.235 |  3.59 ± 0.04 |\n| `\u003c ./hyper_data_multichar.txt tr -s ' ' \\| hck -Ld' ' -f1,8,19 \u003e /dev/null`                   |  6.171 ± 0.112 |   5.975 |   6.243 |  3.59 ± 0.07 |\n| `\u003c ./hyper_data_multichar.txt tr -s ' ' \\| tsv-select -d ' ' -f 1,8,19 \u003e /dev/null`           |  6.202 ± 0.130 |   5.984 |   6.290 |  3.61 ± 0.08 |\n\n## Decompression\n\nThe following table indicates the file extension / binary pairs that are used to try to decompress a file when the `-z` option is specified:\n\n| Extension | Binary                   | Type       |\n| :-------- | :----------------------- | :--------- |\n| `*.gz`    | Native                   | gzip       |\n| `*.tgz`   | `gzip -d -c`             | gzip       |\n| `*.bz2`   | `bzip2 -d -c`            | bzip2      |\n| `*.tbz2`  | `bzip2 -d -c`            | bzip2      |\n| `*.xz`    | `xz -d -c`               | xz         |\n| `*.txz`   | `xz -d -c`               | xz         |\n| `*.lz4`   | `lz4 -d -c`              | lz4        |\n| `*.lzma`  | `xz --format=lzma -d -c` | lzma       |\n| `*.br`    | 
`brotli -d -c`           | brotli     |\n| `*.zst`   | `zstd -d -c`             | zstd       |\n| `*.zstd`  | `zstd -q -d -c`          | zstd       |\n| `*.Z`     | `uncompress -c`          | uncompress |\n\nWhen a file with one of the extensions above is found, `hck` will open a subprocess running the decompression tool listed above and read from the output of that tool. If the binary can't be found then `hck` will try to read the compressed file as is. See [`grep_cli`](https://github.com/BurntSushi/ripgrep/blob/9eddb71b8e86a04d7048b920b9b50a2e97068d03/crates/cli/src/decompress.rs#L468) for source code. The end goal is to add a preprocessor similar to the one in [ripgrep](https://github.com/BurntSushi/ripgrep/blob/master/GUIDE.md#preprocessor). Where there are multiple binaries for a given type, they are tried in the order listed above.\n\n## Profile Guided Optimization\n\nSee the `pgo*.sh` scripts for how to build this with optimizations. You will need to install the llvm tools via `rustup component add llvm-tools-preview` for this to work. Building with PGO seems to improve performance anywhere from 5-30% depending on the platform and codepath. i.e. 
on macOS it seems to have a larger effect, and on the regex codepath it also seems to have a greater effect.\n\n## TODO\n\n- Add output compression detection when writing to a file\n- Don't reparse fields / headers for each new file\n- Figure out how to better reuse / share a vec\n- Support indexing from the end (unlikely though)\n- Bake in grep / filtering somehow (this will not be done at the expense of the primary utility of `hck`)\n- Move tests from main to core\n- Add more tests all around\n- Experiment with a parallel parser as described [here](https://www.semanticscholar.org/paper/Instant-Loading-for-Main-Memory-Databases-M%C3%BChlbauer-R%C3%B6diger/a1b067fc941d6727169ec18a882080fa1f074595?p2df). This should be very doable given we don't care about escaping quotes and such.\n\n## More packages and builds\n\nhttps://github.com/sharkdp/bat/blob/master/.github/workflows/CICD.yml\n\n## References\n\n- [rust-coreutils-cut](https://github.com/uutils/coreutils/blob/e48ff9dd9ee0d55da285f99d75f6169a5e4e7acc/src/uu/cut/src/cut.rs)\n- [ripgrep](https://github.com/BurntSushi/ripgrep/tree/master/crates/searcher)\n","funding_links":[],"categories":["PGO Showcases","Command Line","Rust","Applications","应用程序 Applications","Tech","Text Processing","\u003ca name=\"text-processing\"\u003e\u003c/a\u003eText processing","File","Other"],"sub_categories":["Other","Dependency Management","Text processing","文本处理 Text processing","Tools","System tools"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsstadick%2Fhck","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsstadick%2Fhck","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsstadick%2Fhck/lists"}