{"id":13942495,"url":"https://github.com/rufuspollock/command-line-data-wrangling","last_synced_at":"2026-01-02T19:53:24.053Z","repository":{"id":55089197,"uuid":"9858659","full_name":"rufuspollock/command-line-data-wrangling","owner":"rufuspollock","description":"Wrangling data using only (unix) shell commands.","archived":false,"fork":false,"pushed_at":"2021-01-11T15:55:47.000Z","size":4,"stargazers_count":45,"open_issues_count":0,"forks_count":21,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-01-23T10:31:24.127Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rufuspollock.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-05-04T18:52:01.000Z","updated_at":"2024-11-01T16:38:28.000Z","dependencies_parsed_at":"2022-08-14T11:40:24.390Z","dependency_job_id":null,"html_url":"https://github.com/rufuspollock/command-line-data-wrangling","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rufuspollock%2Fcommand-line-data-wrangling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rufuspollock%2Fcommand-line-data-wrangling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rufuspollock%2Fcommand-line-data-wrangling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rufuspollock%2Fcommand-line-data-wrangling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rufuspollock","download_url":"https://codel
oad.github.com/rufuspollock/command-line-data-wrangling/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243955719,"owners_count":20374372,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-08T02:01:54.143Z","updated_at":"2026-01-02T19:53:24.027Z","avatar_url":"https://github.com/rufuspollock.png","language":null,"funding_links":[],"categories":["Others"],"sub_categories":[],"readme":"# Command Line Data Wrangling\n\nWrangling data using only (unix) shell commands.\n\nThe focus is on plain text and CSV (interpreted broadly in that the C(omma) can\nbe any other kind of delimiter)!\n\n## Tools of the trade\n\n* `cut` = filter columns\n* `sed` = replace (and much more)\n* `grep` = filter rows\n* `sort` = sort!\n* `uniq` = count duplicates (with sort = crude group by)\n* `paste` = join 2 files (line by line)\n* `wc` = count lines or \"words\"\n* `split` = split a file into pieces (less useful)\n\nThe limitations of most shell utilities for CSV are that:\n\n* they are fundamentally line oriented\n* they are limited to a fairly naive approach to delimiters\n\nIf your CSVs are well behaved and do not:\n\n* include line terminators within fields (which is allowed by CSV!)\n* have the delimiter occurring within values (e.g. having `,` appear in a field value -\n  again, this is allowed as long as the field value is quoted)\n\nthen there is no problem. If your CSVs are not like this, these tools may still\nbe useful, but you will have to be more careful.\n\n## Key Concepts\n\nPipes! 
One of the great (computing design) ideas of all time:\n\n\u003e 1. We should have some ways of coupling programs like garden hose--screw in\n\u003e another segment when it becomes when it becomes necessary to massage data in\n\u003e another way.  This is the way of IO also. [Doug McIlroy, 1964] ([source](http://doc.cat-v.org/unix/pipes/))\n\nBasic point: the unix shell (and all of the above commands) has great support\nfor feeding the output of one command into the input of another and doing so in\na streaming \"pipe-like\" manner.\n\nExample:\n```bash\nhead -n20 file.txt | tail -n5 | cut -c1-10\n```\n\nThis takes the first 20 lines of file.txt and pipes them into tail, which takes the last 5\nlines of that and pipes them into cut, which takes characters 1-10 of each line. The\nresult: characters 1-10 of lines 16-20.\n\n\n## Examples\n\n### Delete lines at Start or End\n\nIf the lines to delete form a contiguous block at the top or bottom of the file, use head or tail.\n\nDelete first line:\n```bash\ntail -n +2 {file}\n```\n\nDelete last line (a negative count requires GNU head):\n```bash\nhead -n -1 {file}\n```\n\n### Delete Lines Generally\n\nDelete the nth line (replace n with the line number):\n```bash\nsed 'nd' {file}\n```\n\nDelete a range of lines (n-m):\n```bash\nsed 'n,md' {file}\n```\n\nMultiple deletes (first, third, n-m):\n```bash\nsed '1d;3d;n,md' {file}\n```\n\n### Sort\n\nSort a CSV file by the 2nd, then the 1st, then the 3rd column:\n```bash\nsort --field-separator=',' --key=2,2 --key=1,1 --key=3,3 {file}\n```\n\n### Group By\n\n### View a CSV file on the command line\n\nYou can just do a simple cat:\n\n```bash\ncat somefile.csv\n```\n\nFor a nice view with proper spacing:\n\n```bash\ncolumn -s, -t -n \u003c somefile.csv | less -#2 -N -S\n```\n\nNote: when you have empty fields, you will need to put some kind of placeholder in them; otherwise the empty columns get merged with the following columns. 
The following example demonstrates how to use sed to insert a placeholder:\n\n```console\n$ cat data.csv\n1,2,3,4,5\n1,,,,5\n$ column -s, -t -n \u003c data.csv\n1  2  3  4  5\n1  5\n$ sed 's/,,/, ,/g;s/,,/, ,/g' data.csv | column -s, -t -n\n1  2  3  4  5\n1           5\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frufuspollock%2Fcommand-line-data-wrangling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frufuspollock%2Fcommand-line-data-wrangling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frufuspollock%2Fcommand-line-data-wrangling/lists"}