{"id":20504191,"url":"https://github.com/stefan-schroedl/tabulator","last_synced_at":"2025-04-13T20:48:05.132Z","repository":{"id":21814010,"uuid":"25136754","full_name":"stefan-schroedl/tabulator","owner":"stefan-schroedl","description":"A set of Unix shell command line tools for quick and convenient batch processing of tabular text files (a.k.a., tab-delimited, tsv, csv, or flat data file format) with a header line. Provides column reference by name, automatic delimiter and compression detection for per-line transformations, sql-like group-by operation and relational join.","archived":false,"fork":false,"pushed_at":"2015-05-27T10:45:37.000Z","size":320,"stargazers_count":35,"open_issues_count":0,"forks_count":7,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-13T20:47:58.075Z","etag":null,"topics":["comma-separated-values","command-line","csv","csv-files","data","delimited-files","join","tab-separated","tsv","unix"],"latest_commit_sha":null,"homepage":"","language":"Perl","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stefan-schroedl.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-10-13T00:41:32.000Z","updated_at":"2025-02-07T19:34:48.000Z","dependencies_parsed_at":"2022-07-22T03:47:07.901Z","dependency_job_id":null,"html_url":"https://github.com/stefan-schroedl/tabulator","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-schroedl%2Ftabulator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-schroedl%2Ftabulator/tags","releases
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-schroedl%2Ftabulator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-schroedl%2Ftabulator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stefan-schroedl","download_url":"https://codeload.github.com/stefan-schroedl/tabulator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248782280,"owners_count":21160716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["comma-separated-values","command-line","csv","csv-files","data","delimited-files","join","tab-separated","tsv","unix"],"created_at":"2024-11-15T19:36:49.620Z","updated_at":"2025-04-13T20:48:05.111Z","avatar_url":"https://github.com/stefan-schroedl.png","language":"Perl","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tabulator: Shell scripts for delimited data files\n\nURL: https://github.com/stefan-schroedl/tabulator\n\nAuthor: Stefan Schroedl\n\n## Date\n* 2015/04/03 release 1.2.1\n* 2015/03/21 release 1.2\n* 2014/10/12 release 1.1.2\n* 2012/01/24 release 1.1.1\n* 2011/11/25 release 1.1.0\n* 2009/06/16 release 1.0.0\n\n\n## Purpose\n\n\nUnix/Linux comes with several tools, such as `cut,` `paste,` `join,` `sort,` to\nprocess tabular text data files (a.k.a., *tab-delimited*, *csv*, *tsv*,\nor *flat file* format). 
However, they have shortcomings that often require\nadditional scripting and prevent them from being directly used in one-liners.\nFor example, having to count columns to pass as arguments is cumbersome and\nerror-prone; `join` needs presorted files, and works only with a single key column;\nthere is no direct 'group-by' functionality.\n\nOne remedy would be to load the data into a relational database or NoSQL system\n(e.g., Hadoop/Pig) first, but these might not be available, or might be more\ntime-consuming for short, ad-hoc tasks.\n\n`Tabulator` is a collection of command-line shell tools for Unix/Linux platforms\nthat build on native shell programs and can be used as filters, but make them\neasier and more flexible to use. In particular, they\n\n* allow referencing columns by name rather than by position, as indicated by the\n  first line in a file; this makes scripts more readable and robust to changes\n  in the input data format.\n* automate file format recognition (delimiter, compression, etc.).\n* check file format (e.g., consistent number of columns).\n* offer expanded functionality, such as SQL-like *relational join* and *group-by*\n  operations.\n\n## Installation\n\nInstallation is easy -- just unpack the tarball and add the unpacked directory to\nyour PATH.\n\n## License\n\nTabulator is licensed under the MIT License; see LICENSE for details.\n\n## Documentation\n\nHere is a brief list of the programs together with their main functionality.\nEach one provides more documentation and examples when called with the `-m` or\n`-h` options. A common assumption is that the first line in input files contains\nthe column names.\n\n* `tblcat:` concatenate files of the same data format without header repetition.\n* `tblcmd:` execute a program on the body of a file (e.g., `sort`, `uniq`),\n  without affecting the header.\n* `tbldesc:` for each column, summarize its type (e.g., char, int, float),\n  percentage of undefined values, min/max/mean/median/std, etc. 
Can also provide\n  correlation coefficients with a target column.\n\n\u003e Example: Suppose `file` is\n\n        name,house_nr,height,shoe_size\n        arthur,42,6.1,11.5\n        berta,101,5.5,8\n        chris,333,5.9,10\n        don,77,5.9,12.5\n\n\u003e Then `tbldesc file` prints:\n\n        summarizing file_desc (4 lines, target column: shoe_size)\n        field name     type char% uniq min max avg  std    mse  corr   prob%\n        1 name       char 100    4 [arthur; berta; chris; don]\n        2 house_nr   int    0    4 42   333 138   114    172   -0.287 71.25\n        3 height     float  0    3 5.5  6.1 5.85  0.218  4.89   0.812 18.82\n        4 shoe_size  float  0    4 8   12.5 10.5  1.7     0.0   1.0    0.00\n\n* `tblmap:` simple line-wise (\"map\") computation similar to awk.\n\n\u003e Example: Compute the ratio of columns `sales` and `clients` for lines where column\n  `region` has value `us`:\n\n    tblmap -s'region==\"us\"' -c'sales_per_client=sales/clients' file\n\n* `tblred:` compute (\"reduce\") aggregations (e.g., `sum`, `min`, `max`, `avg`,\n   etc.) over groups of keys -- similar to the SQL `group by` operator.\n\n\u003e Example:\n\n    tblred -k'region' 'sales_ratio=sales/sum(sales)'\n\u003e computes, for each line, the proportion of column `sales` to the total sales for all\n\u003e lines with the same value of column `region`.\n\n* `tbljoin:` Relational join. 
In contrast to Unix `join`, input files don't\n  need to be pre-sorted, and multiple join columns can be specified.\n\n\u003e Example: Suppose `file1` is\n\n    name,street,house\n    zorro,desert road,5\n    john,main st,2\n    arthur,pan-galactic bypass,42\n    arthur,main st,15\n\n\u003e and `file2` is\n\n    name,street,phone\n    john,main st,654-321\n    arthur,main st,121-212\n    john,round cir,123-456\n\n\u003e Then `tbljoin file1 file2` gives\n\n    name,street,house,phone\n    arthur,main st,15,121-212\n    john,main st,2,654-321\n\n* `tblhist:` computation and ascii-plotting of the histogram of column values.\n* `tblsplit:` split a file into several ones based on a column value.\n\n\u003e Example: Suppose `file` is\n\n        continent,country\n        americas,us\n        americas,mx\n        europe,de\n        europe,fr\n\n\u003e Then `tblsplit -rk'continent' file` generates two files,\n\u003e `file.select.continent=americas`:\n\n        country\n        us\n        mx\n\u003e and `file.select.continent=europe`:\n\n        country\n        de\n        fr\n\n* `tblsort:` interface for unix sort.\n* `tbltex:` formatting for latex tables.\n* `tbltranspose:` transposition of rows and columns.\n* `tbluniq:` check for and cut out duplicate columns; also, discover functional\n     value dependencies.\n* `tblcolumn:` format columns in a more readable way, using the unix 'column'\n     program, and aligning and shortening numbers.\n* `tblless:` page through formatted column output (calls tblcolumn).\n\n## Implementation Notes\n\n* The scripts have been developed over time to help with various data\n  processing tasks, and were not designed from the outset to be released in one\n  package. Therefore, some scripts are implemented in Python, and some in Perl;\n  and there is a small amount of overlapping functionality.\n* For very large files, it is crucial to be able to process them in *streaming*\n  mode, i.e., without keeping them entirely in main memory. 
This is the case for\n  `tblmap` (since it translates into a `cut`/`awk` script) and for `tbljoin`\n  (using `sort` and `join`). For `tblred`, presort the file first, then run it\n  with option `-s`.\n* `tblmap`, `tbljoin`, and `tblcmd` work by first translating the command into\n  a standalone shell script.\n\n## Limitations\n\n* There is no special interpretation of block delimiters like `'` or `\"`; it is\n  the user's responsibility to ensure that the column delimiter cannot occur\n  within column values.\n* `tbljoin`, `tblred`, `tblhist`, `tbluniq`, `tblcat`, and `tbltex` have some\n  restrictions when run as a filter (repeated reading is necessary in some cases).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefan-schroedl%2Ftabulator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstefan-schroedl%2Ftabulator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefan-schroedl%2Ftabulator/lists"}