{"id":16277526,"url":"https://github.com/rbitr/command-line-data-sci","last_synced_at":"2026-01-21T03:02:29.253Z","repository":{"id":184726756,"uuid":"163766387","full_name":"rbitr/command-line-data-sci","owner":"rbitr","description":"Basic data science at the command line","archived":false,"fork":false,"pushed_at":"2019-01-05T20:29:05.000Z","size":1999,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-08T17:26:59.882Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rbitr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-01-01T21:03:38.000Z","updated_at":"2019-01-05T20:29:07.000Z","dependencies_parsed_at":"2023-07-29T23:09:40.508Z","dependency_job_id":"b8fafc48-c663-487c-8448-53d0451e01ab","html_url":"https://github.com/rbitr/command-line-data-sci","commit_stats":null,"previous_names":["rbitr/command-line-data-sci"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rbitr/command-line-data-sci","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbitr%2Fcommand-line-data-sci","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbitr%2Fcommand-line-data-sci/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbitr%2Fcommand-line-data-sci/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbitr%2Fcommand-line-data-sci/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rbitr","download_url":"
https://codeload.github.com/rbitr/command-line-data-sci/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rbitr%2Fcommand-line-data-sci/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28624341,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-21T02:47:06.670Z","status":"ssl_error","status_checked_at":"2026-01-21T02:45:44.886Z","response_time":86,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-10T18:55:17.666Z","updated_at":"2026-01-21T03:02:29.238Z","avatar_url":"https://github.com/rbitr.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Basic data science at the command line\n\nLinux command line tools can be used to perform many of the key data cleaning and exploration activities that make up data science workflow.\n\nIf you ssh into another server, this is a fast way to take a look at data. It also lets you do quick processing without the overhead of writing a full python program. 
And parsing big files at the command line can speed up subsequent data processing and reduce memory usage by discarding unnecessary data.\n\nThe main tools are grep, awk, sed, tr, and a few others, all of which come with Ubuntu.\n\ngrep is a utility for finding patterns in a file.\n\nsed is a stream editor that lets you find and replace text in a file; tr is a simpler utility for translating or deleting individual characters.\n\nawk is a programming language that operates on individual lines in a data stream.\n\nLinux also has commands like sort and head that can do simple operations on data.\n\nI will go through an example and explain what the commands and options are doing as they are used:\n\n## Weather data example \n\nEnvironment Canada has an API that lets you download a .csv file of weather data. The URL for more information is ftp://ftp.tor.ec.gc.ca/Pub/Get_More_Data_Plus_de_donnees/Readme.txt\n\n__Getting a station code__\n\nIn order to download the data, you need to specify the code for the weather station. These codes are in the file ftp://client_climate@ftp.tor.ec.gc.ca/Pub/Get_More_Data_Plus_de_donnees/Station%20Inventory%20EN.csv\n\nYou can use curl to download this list and save it in a file:\n```bash\n$ curl -s ftp://client_climate@ftp.tor.ec.gc.ca/Pub/Get_More_Data_Plus_de_donnees/Station%20Inventory%20EN.csv \u003e stations.csv\n```\nThe file is a CSV, so we would expect a row of headers and then some data. This one turns out to have some other information at the top as well. 
Using head to look at the first 5 lines, we get:\n\n```bash\n$ \u003c stations.csv head -5\nModified Date: 2018-12-31 23:33 UTC\n\"Station Inventory Disclaimer: Please note that this inventory list is a snapshot of stations on our website as of the modified date, and may be subject to change without notice.\"\n\"Station ID Disclaimer: Station IDs are an internal index numbering system and may be subject to change without notice.\"\n\"Name\",\"Province\",\"Climate ID\",\"Station ID\",\"WMO ID\",\"TC ID\",\"Latitude (Decimal Degrees)\",\"Longitude (Decimal Degrees)\",\"Latitude\",\"Longitude\",\"Elevation (m)\",\"First Year\",\"Last Year\",\"HLY First Year\",\"HLY Last Year\",\"DLY First Year\",\"DLY Last Year\",\"MLY First Year\",\"MLY Last Year\"\n\"ACTIVE PASS\",\"BRITISH COLUMBIA\",\"1010066\",\"14\",\"\",\"\",\"48.87\",\"-123.28\",\"485200000\",\"-1231700000\",\"4\",\"1984\",\"1996\",\"\",\"\",\"1984\",\"1996\",\"1984\",\"1996\"\n```\n\nThe fourth line contains the headers and the fifth shows what the first row of data looks like.\n\n__Selecting a row__\n\nNow that we know the headers are on line 4, let's print them out in a way that is easier to see. There are lots of different ways to do this but I like awk. We can start by piping the file into awk and using the NR selector to only show the fourth row:\n\n```bash\n$ \u003c stations.csv awk 'NR==4'\n\"Name\",\"Province\",\"Climate ID\",\"Station ID\",\"WMO ID\",\"TC ID\",\"Latitude (Decimal Degrees)\",\"Longitude (Decimal Degrees)\",\"Latitude\",\"Longitude\",\"Elevation (m)\",\"First Year\",\"Last Year\",\"HLY First Year\",\"HLY Last Year\",\"DLY First Year\",\"DLY Last Year\",\"MLY First Year\",\"MLY Last Year\"\n```\n\n__Splitting up text__\n\nEspecially in a terminal that wraps, it is nicer to see these as a list. 
The quickest way is to replace the commas with newlines:\n\n```bash\n$ \u003c stations.csv awk 'NR==4' | tr ',' '\\n'\n\"Name\"\n\"Province\"\n\"Climate ID\"\n\"Station ID\"\n\"WMO ID\"\n\"TC ID\"\n\"Latitude (Decimal Degrees)\"\n\"Longitude (Decimal Degrees)\"\n\"Latitude\"\n\"Longitude\"\n\"Elevation (m)\"\n\"First Year\"\n\"Last Year\"\n\"HLY First Year\"\n\"HLY Last Year\"\n\"DLY First Year\"\n\"DLY Last Year\"\n\"MLY First Year\"\n\"MLY Last Year\"\n```\n\nAnd for readability, let's add row numbers. I will show two ways just for fun. The lazy way, since we already have the command above, is to pipe back into awk and print the row number:\n\n```bash\n$ \u003c stations.csv awk 'NR==4' | tr ',' '\\n' | awk '{print NR,$0}'\n1 \"Name\"\n2 \"Province\"\n3 \"Climate ID\"\n4 \"Station ID\"\n5 \"WMO ID\"\n6 \"TC ID\"\n7 \"Latitude (Decimal Degrees)\"\n8 \"Longitude (Decimal Degrees)\"\n9 \"Latitude\"\n10 \"Longitude\"\n11 \"Elevation (m)\"\n12 \"First Year\"\n13 \"Last Year\"\n14 \"HLY First Year\"\n15 \"HLY Last Year\"\n16 \"DLY First Year\"\n17 \"DLY Last Year\"\n18 \"MLY First Year\"\n19 \"MLY Last Year\"\n```\n\nawk has a built-in variable called NR that holds the number of the row being processed, and $0 is the text of the whole line.\n\nThe command above is probably the most natural way to do this, because we are discovering what we want to do as we go. 
If you already knew this was what you wanted to do, you could use awk in one go:\n\n```bash\n$ \u003c stations.csv awk -F, 'NR==4 { for (i=1;i\u003c=NF;i++) { print i, $i}}'\n1 \"Name\"\n2 \"Province\"\n3 \"Climate ID\"\n4 \"Station ID\"\n5 \"WMO ID\"\n6 \"TC ID\"\n7 \"Latitude (Decimal Degrees)\"\n8 \"Longitude (Decimal Degrees)\"\n9 \"Latitude\"\n10 \"Longitude\"\n11 \"Elevation (m)\"\n12 \"First Year\"\n13 \"Last Year\"\n14 \"HLY First Year\"\n15 \"HLY Last Year\"\n16 \"DLY First Year\"\n17 \"DLY Last Year\"\n18 \"MLY First Year\"\n19 \"MLY Last Year\"\n```\n\nThis way is actually longer, but it illustrates a couple of things. awk separates each line into columns that can be accessed as `$c`, where `c` is the column number; the exception is `$0`, which gives the whole line as in the previous version. The overall syntax of awk is to combine a condition, here `NR==4`, with an action to run when that condition is met. The built-in variable `NF` holds the number of fields (columns) in the current line. Lastly, the switch `-F,` (or `-F ','`) tells awk to use a comma as the field separator, because the default is whitespace.\n\n__Searching within text__\n\nNow that we know the fields, let's look up stations for a particular city. 
We can do this easily by using grep to match the city name:\n\n```bash\n$ \u003c stations.csv grep MONTREAL\n\"MONTREAL LAKE\",\"SASKATCHEWAN\",\"4065260\",\"3390\",\"\",\"\",\"53.62\",\"-105.67\",\"533700000\",\"-1054000000\",\"490.4\",\"1959\",\"1959\",\"\",\"\",\"1959\",\"1959\",\"1959\",\"1959\"\n\"SOUTH MONTREAL LAKE DNR\",\"SASKATCHEWAN\",\"4067670\",\"3396\",\"\",\"\",\"54.05\",\"-105.8\",\"540300000\",\"-1054800000\",\"490.1\",\"1960\",\"1960\",\"\",\"\",\"1960\",\"1960\",\"1960\",\"1960\"\n\"MONTREAL FALLS\",\"ONTARIO\",\"6055300\",\"4084\",\"\",\"\",\"47.25\",\"-84.4\",\"471500000\",\"-842400000\",\"408.4\",\"1932\",\"1955\",\"\",\"\",\"1932\",\"1955\",\"1932\",\"1955\"\n\"MONTREAL FALLS\",\"ONTARIO\",\"6055302\",\"4085\",\"\",\"\",\"47.27\",\"-84.43\",\"471600000\",\"-842600000\",\"306.3\",\"1976\",\"1999\",\"\",\"\",\"1976\",\"1999\",\"1976\",\"1999\"\n...\n```\n\nThere is a long list of stations for Montreal. Only the first few are shown above. First, the quotation marks in every line are annoying, so let's remove them with sed:\n\n__Replacing text with sed__\n\n```bash\n$ \u003c stations.csv sed -e 's/\"//g' | grep MONTREAL\nMONTREAL LAKE,SASKATCHEWAN,4065260,3390,,,53.62,-105.67,533700000,-1054000000,490.4,1959,1959,,,1959,1959,1959,1959\nSOUTH MONTREAL LAKE DNR,SASKATCHEWAN,4067670,3396,,,54.05,-105.8,540300000,-1054800000,490.1,1960,1960,,,1960,1960,1960,1960\nMONTREAL FALLS,ONTARIO,6055300,4084,,,47.25,-84.4,471500000,-842400000,408.4,1932,1955,,,1932,1955,1932,1955\n...\n```\n\nsed uses the `s` (substitute) command to replace every quotation mark (the pattern between the first two slashes) with nothing (the empty replacement between the second and third slashes). The trailing `g` applies the substitution to every match on each line rather than only the first.\n\n__Displaying certain columns with awk__\n\nWhat we really care about is the station number and, if we want a station that is still in service, the years the station was active. 
\n\n```bash\n$ \u003c stations.csv sed -e 's/\"//g' | grep MONTREAL | awk -F, '{print $1, $4, $12, $13}'\nMONTREAL LAKE 3390 1959 1959\nSOUTH MONTREAL LAKE DNR 3396 1960 1960\nMONTREAL FALLS 4084 1932 1955\nMONTREAL FALLS 4085 1976 1999\nMONTREAL RIVER (AUT) 41595 2000 2007\nMONTREAL RIVER 4166 1910 1967\nMONTREAL ADAC A 8343 1974 1976\nMONTREAL ICE CONTROL 5414 1967 1970\nMONTREAL/PIERRE ELLIOTT TRUDEAU INTL A 5415 1941 2013\n```\n\nHere we show the first, fourth, twelfth and thirteenth columns to get the name, the ID, and the first and last years of operation. The list of stations is still long (a few more are shown this time). So finally let's look at only those still operating in 2018:\n\n__Filtering with awk__\n\n```bash\n$ \u003c stations.csv sed -e 's/\"//g' | grep MONTREAL | awk -F, '$13\u003e=\"2018\" {print $1, $4, $12, $13}'\nMONTREAL INTL A 51157 2013 2018\nMONTREAL/ST-HUBERT 48374 2009 2018\nMONTREAL/PIERRE ELLIOTT TRUDEAU INTL 30165 2002 2018\nMONTREAL MIRABEL INTL A 49608 2012 2018\n```\n\nWe used the `'$13\u003e=\"2018\"'` condition with awk in order to only display stations recording in 2018 or later, and now the list is short. Because `\"2018\"` is quoted, awk compares the values as strings, which works here since every year has four digits.\n\n## Downloading the weather data\n\nNow that we have the code, we can use it to download a spreadsheet of weather data from the Environment Canada API. The documentation explains that we can get the data from the following URL:\n\nhttp://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv\u0026stationID=${ID}\u0026Year=${year}\u0026Month=${month}\u0026Day=${day}\u0026timeframe=${tf}\u0026submit=Download+Data\n\nThe station ID, year, month, and day are specified as shown, along with a timeframe: 1=hourly, 2=daily, 3=monthly. 
\n\n__Checking out the data__\n\nMaking the substitutions for Trudeau Airport (ID=30165) gives us:\n\n```bash\n$ URL=\"http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv\u0026stationID=30165\u0026Year=2018\u0026Month=1\u0026Day=1\u0026timeframe=2\u0026submit=Download+Data\" \n$ curl -s $URL | head\n\"Station Name\",\"MONTREAL/PIERRE ELLIOTT TRUDEAU INTL\"\n\"Province\",\"QUEBEC\"\n\"Current Station Operator\",\"Environment and Climate Change Canada - Meteorological Service of Canada\"\n\"Latitude\",\"45.47\"\n\"Longitude\",\"-73.74\"\n\"Elevation\",\"32.10\"\n\"Climate Identifier\",\"702S006\"\n\"WMO Identifier\",\"71183\"\n\"TC Identifier\",\"WTQ\"\n```\n\nThis is a block of identifying information about the station. We also set a variable for the URL so we don't have to keep retyping it. Further down, the file contains rows of data:\n\n```bash\n$ curl -s $URL | tail\n\"2018-12-22\",\"2018\",\"12\",\"22\",\"\",\"8.5\",\"\",\"-6.1\",\"\",\"1.2\",\"\",\"16.8\",\"\",\"0.0\",\"\",\"\",\"\",\"\",\"M\",\"2.0\",\"\",\"1\",\"\",\"28\",\"\",\"56\",\"\"\n\"2018-12-23\",\"2018\",\"12\",\"23\",\"\",\"-6.1\",\"\",\"-11.0\",\"\",\"-8.5\",\"\",\"26.5\",\"\",\"0.0\",\"\",\"\",\"\",\"\",\"M\",\"0.2\",\"\",\"1\",\"\",\"25\",\"\",\"39\",\"\"\n\"2018-12-24\",\"2018\",\"12\",\"24\",\"\",\"-6.4\",\"\",\"-11.3\",\"\",\"-8.8\",\"\",\"26.8\",\"\",\"0.0\",\"\",\"\",\"\",\"\",\"M\",\"0.0\",\"\",\"1\",\"\",\"\",\"\",\"\",\"\"\n\"2018-12-25\",\"2018\",\"12\",\"25\",\"\",\"-8.2\",\"\",\"-13.5\",\"\",\"-10.8\",\"\",\"28.8\",\"\",\"0.0\",\"\",\"\",\"\",\"\",\"M\",\"0.0\",\"\",\"1\",\"\",\"\",\"\",\"\",\"\"\n\"2018-12-26\",\"2018\",\"12\",\"26\",\"\",\"-4.3\",\"\",\"-12.1\",\"\",\"-8.2\",\"\",\"26.2\",\"\",\"0.0\",\"\",\"\",\"\",\"\",\"M\",\"0.2\",\"\",\"1\",\"\",\"\",\"\",\"\",\"\"\n\"2018-12-27\",\"2018\",\"12\",\"27\",\"\",\"-8.4\",\"\",\"-14.4\",\"\",\"-11.4\",\"\",\"29.4\",\"\",\"0.0\",\"\",\"\",\"\",\"\",\"M\",\"0.6\",\"\",\"2\",\"\",\"5\",\"\",\"31\",\"\"\n\"2018-12-28\",\"2018
\",\"12\",\"28\",\"\",\"3.0\",\"\",\"-8.7\",\"\",\"-2.9\",\"\",\"20.9\",\"\",\"0.0\",\"\",\"\",\"\",\"\",\"M\",\"14.6\",\"\",\"2\",\"\",\"16\",\"\",\"32\",\"\"\n\"2018-12-29\",\"2018\",\"12\",\"29\",\"\",\"6.4\",\"\",\"-12.2\",\"\",\"-2.9\",\"\",\"20.9\",\"\",\"0.0\",\"\",\"\",\"\",\"\",\"M\",\"0.6\",\"\",\"2\",\"\",\"25\",\"\",\"51\",\"\"\n\"2018-12-30\",\"2018\",\"12\",\"30\",\"\",\"-6.9\",\"\",\"-12.7\",\"\",\"-9.8\",\"\",\"27.8\",\"\",\"0.0\",\"\",\"\",\"\",\"\",\"M\",\"0.5\",\"\",\"2\",\"\",\"\",\"\",\"\",\"\"\n\"2018-12-31\",\"2018\",\"12\",\"31\",\"\",\"2.2\",\"\",\"-8.4\",\"\",\"-3.1\",\"\",\"21.1\",\"\",\"0.0\",\"\",\"\",\"\",\"\",\"M\",\"2.5\",\"\",\"2\",\"\",\"15\",\"\",\"38\",\"\"\n```\n\nThis is the end of 2018. We will have to play around a bit to figure out where the columns headers are in the file:\n\n```bash\n$ curl -s $URL | head -30 | tail -5\n\"Date/Time\",\"Year\",\"Month\",\"Day\",\"Data Quality\",\"Max Temp (°C)\",\"Max Temp Flag\",\"Min Temp (°C)\",\"Min Temp Flag\",\"Mean Temp (°C)\",\"Mean Temp Flag\",\"Heat Deg Days (°C)\",\"Heat Deg Days Flag\",\"Cool Deg Days (°C)\",\"Cool Deg Days Flag\",\"Total Rain (mm)\",\"Total Rain Flag\",\"Total Snow (cm)\",\"Total Snow Flag\",\"Total Precip (mm)\",\"Total Precip Flag\",\"Snow on Grnd (cm)\",\"Snow on Grnd Flag\",\"Dir of Max Gust (10s deg)\",\"Dir of Max Gust Flag\",\"Spd of Max Gust (km/h)\",\"Spd of Max Gust 
Flag\"\n\"2018-01-01\",\"2018\",\"01\",\"01\",\"\",\"-18.8\",\"\",\"-25.3\",\"\",\"-22.1\",\"\",\"40.1\",\"\",\"0.0\",\"\",\"\",\"M\",\"\",\"M\",\"0.0\",\"\",\"18\",\"\",\"25\",\"\",\"32\",\"\"\n\"2018-01-02\",\"2018\",\"01\",\"02\",\"\",\"-14.0\",\"\",\"-24.9\",\"\",\"-19.5\",\"\",\"37.5\",\"\",\"0.0\",\"\",\"\",\"M\",\"\",\"M\",\"2.5\",\"\",\"18\",\"\",\"13\",\"\",\"32\",\"\"\n\"2018-01-03\",\"2018\",\"01\",\"03\",\"\",\"-10.2\",\"\",\"-16.2\",\"\",\"-13.2\",\"\",\"31.2\",\"\",\"0.0\",\"\",\"\",\"M\",\"\",\"M\",\"0.2\",\"\",\"22\",\"\",\"\",\"\",\"\u003c31\",\"\"\n\"2018-01-04\",\"2018\",\"01\",\"04\",\"\",\"-6.7\",\"\",\"-14.1\",\"\",\"-10.4\",\"\",\"28.4\",\"\",\"0.0\",\"\",\"\",\"M\",\"\",\"M\",\"0.9\",\"\",\"23\",\"\",\"27\",\"\",\"46\",\"\"\n```\n\nA lucky guess. The 26th row contains the headers.\n\n__Breaking a line into fields__\n\n```bash\n$ curl -s $URL | sed 's/\"//g' | awk -F, 'NR==26 {for (i=1;i\u003c=NF;i++) {print i, $i}}'\n1 Date/Time\n2 Year\n3 Month\n4 Day\n5 Data Quality\n6 Max Temp (°C)\n7 Max Temp Flag\n8 Min Temp (°C)\n9 Min Temp Flag\n10 Mean Temp (°C)\n11 Mean Temp Flag\n12 Heat Deg Days (°C)\n13 Heat Deg Days Flag\n14 Cool Deg Days (°C)\n15 Cool Deg Days Flag\n16 Total Rain (mm)\n17 Total Rain Flag\n18 Total Snow (cm)\n19 Total Snow Flag\n20 Total Precip (mm)\n21 Total Precip Flag\n22 Snow on Grnd (cm)\n23 Snow on Grnd Flag\n24 Dir of Max Gust (10s deg)\n25 Dir of Max Gust Flag\n26 Spd of Max Gust (km/h)\n27 Spd of Max Gust Flag\n```\n\nWe followed the same pattern as with the station names file to display the fields.\n\n__Select some rows with sed and print some columns with awk__\n\nHere are a few of the mean temperatures (column 10) by date:\n\n```bash\n$ curl -s $URL | sed 's/\"//g' | sed -n '27,36p' | awk -F, '{print $1, $10}'\n2018-01-01 -22.1\n2018-01-02 -19.5\n2018-01-03 -13.2\n2018-01-04 -10.4\n2018-01-05 -18.3\n2018-01-06 -21.9\n2018-01-07 -17.7\n2018-01-08 -7.1\n2018-01-09 -5.3\n2018-01-10 -8.0\n```\n\n__Data 
consolidation__\n\nHaving read in some data, we want to do some calculations on it, for example computing the average temperature for each month. Consider:\n\n```bash\n$ curl -s $URL | sed -e 's/\"//g' | awk -F, 'NR\u003e26 {sum[$3]+=$10; num[$3]+=1} END {for (k in sum) print k,sum[k]/num[k]}' | sort -n\n01 -9.71613\n02 -4.56429\n03 -0.919355\n04 3.92\n05 14.8161\n06 18.3167\n07 24.2129\n08 22.2645\n09 17.63\n10 6.05161\n11 -0.716667\n12 -4.81613\n```\n\nThis statement uses two new features of awk. One is array indexing. The statement `sum[$3]+=$10` uses the month (field 3) as an index into an array called `sum`. Elements not previously encountered start out empty, which awk treats as 0 in arithmetic. The mean temperature that day (field 10) is then added to the sum. At the same time, we use `num[$3]+=1` to count the number of days summed for each month.\n\nThe second feature is the END block in awk. This is what is executed after we have processed all lines in the file. In this case, we are printing out the averages for each month, obtained by dividing the sums by the counts.\n\nThe output is piped to sort because awk does not necessarily step through the array indices in order - they are strings, not numbers.\n\n__An aside about arrays__\n\nAs an aside, awk's array indexing works with any strings:\n\n```bash\n$ curl -s http://www.gutenberg.org/files/108/108-0.txt | tr '[:upper:]' '[:lower:]' | tr -cd '[a-z]\\n ' | tr -s '\\n' | tr ' ' '\\n' | awk '{words[$1]+=1} END { for (w in words) print w, words[w]}' | sort -k 2 -r -n | head\nthe 6430\nand 2955\nof 2927\ni 2910\na 2721\nto 2682\nthat 2107\nin 1900\nwas 1816\nit 1814\n```\n\nThe line above downloads the text of a book, uses tr to change all letters to lowercase and remove non-letters, puts one word on each line (by replacing spaces with newlines) and then uses awk to count the occurrences of each word. 
The statement `words[$1]+=1` is all it takes to use the word on the current line as an index into the array and add one to the count for that word.\n\n__More complex calculations with multiple passes__\n\nLastly, we can do more complicated things if we want. It may be better to use Python at this point, but for big files or remote access the following idea may still make sense. Here we calculate the standard deviation for each month (I adapted this from another tutorial available at http://john-hawkins.blogspot.com/2013/09/using-awk-for-data-science.html).\n\nFirst save the data in a file and get some of the preprocessing out of the way:\n\n```bash\n$ curl -s $URL | sed -e 's/\"//g' | awk 'NR\u003e26' \u003e TMPFILE\n```\n\nNow, compute the standard deviation in two passes:\n\n```bash\n$ awk -F, 'pass==1 {sum[$3]+=$10; num[$3]+=1} pass==2 { mean=sum[$3]/num[$3]; ssd[$3]+=($10-mean)*($10-mean)} END {for (k in sum) print k,sum[k]/num[k], sqrt(ssd[k]/num[k])}' pass=1 TMPFILE pass=2 TMPFILE | sort\n01 -9.71613 7.37997\n02 -4.56429 5.49586\n03 -0.919355 4.14825\n04 3.92 4.79794\n05 14.8161 5.24731\n06 18.3167 4.86991\n07 24.2129 2.57716\n08 22.2645 4.70857\n09 17.63 4.76873\n10 6.05161 4.88473\n11 -0.716667 5.41738\n12 -4.81613 4.82006\n```\n\nYou can see we pass the data to awk twice along with a `pass` variable. On the first pass, we accumulate the sum and count for each month as before. On the second pass, we compute each month's mean from those totals, add the squared residual for each day to an array for its month ( `ssd[$3]+=($10-mean)*($10-mean)` ), and in the END block take the square root of the average squared residual to get the standard deviation.\n\nThis gives an idea of how a more complex analysis could take place. For example, the tutorial linked above shows how to use awk to calculate the correlation between two variables. 
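
As a rough sketch of that idea (not the code from the linked tutorial), the function below computes the Pearson correlation coefficient in a single pass from running sums. The name `corr` and the two-column comma-separated input layout are assumptions made for this illustration.

```bash
# corr is a hypothetical helper: it reads lines of the form x,y on stdin
# and prints the Pearson correlation between the two columns.
corr() {
  awk -F, '
    # accumulate the count and the five running sums the formula needs
    { n++; sx += $1; sy += $2; sxx += $1*$1; syy += $2*$2; sxy += $1*$2 }
    END { print (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy)) }'
}

# Toy input in which y is exactly twice x, so the correlation is 1
corr <<EOF
1,2
2,4
3,6
EOF
# prints 1
```

With a preprocessed file like TMPFILE, something along the lines of `cut -d, -f6,8 TMPFILE | corr` should correlate the daily maximum and minimum temperatures. Note that rows with missing values would be counted as zeros in this sketch, so they may need to be filtered out first.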
\n\n__Conclusions__\n\nWhile awk does most of the heavy lifting, combining it with sed, grep, and a few other utilities creates a simple and powerful workflow that can handle many simple data manipulation and analysis tasks.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frbitr%2Fcommand-line-data-sci","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frbitr%2Fcommand-line-data-sci","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frbitr%2Fcommand-line-data-sci/lists"}