{"id":17383640,"url":"https://github.com/pgdr/ph","last_synced_at":"2025-04-15T10:04:06.441Z","repository":{"id":40681239,"uuid":"246001661","full_name":"pgdr/ph","owner":"pgdr","description":"ph — the tabular data shell tool","archived":false,"fork":false,"pushed_at":"2023-05-15T19:32:33.000Z","size":736,"stargazers_count":17,"open_issues_count":0,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-28T19:07:43.543Z","etag":null,"topics":["csv","pandas","pipeline","plot","shell","tabular-data"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/ph/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pgdr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-03-09T10:07:10.000Z","updated_at":"2025-03-11T20:27:44.000Z","dependencies_parsed_at":"2023-12-11T21:48:31.865Z","dependency_job_id":null,"html_url":"https://github.com/pgdr/ph","commit_stats":{"total_commits":256,"total_committers":3,"mean_commits":85.33333333333333,"dds":0.109375,"last_synced_commit":"e51a1930f6342f48617329a356295d319045b809"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgdr%2Fph","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgdr%2Fph/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgdr%2Fph/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgdr%2Fph/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pgdr","download_url":"https://codeload.gi
thub.com/pgdr/ph/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248750074,"owners_count":21155685,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","pandas","pipeline","plot","shell","tabular-data"],"created_at":"2024-10-16T07:43:21.737Z","updated_at":"2025-04-15T10:04:06.422Z","avatar_url":"https://github.com/pgdr.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ph (pronounced _φ_) - the tabular data shell tool ![ph tests](https://github.com/pgdr/ph/workflows/ph%20tests/badge.svg?branch=master)\n\n\nSpoiler: Working with tabular data (csv) in the command line is difficult.\n\n`ph` makes it easy:\n\n```bash\n$ pip install ph\n$ cat iris.csv | ph columns\n150\n4\nsetosa\nversicolor\nvirginica\n$ cat iris.csv | ph columns setosa versicolor | ph head 15 | ph tail 5 | ph show\n      setosa    versicolor\n--  --------  ------------\n 0       1.5           0.2\n 1       1.6           0.2\n 2       1.4           0.1\n 3       1.1           0.1\n 4       1.2           0.2\n```\n\n```bash\n$ cat iris.csv | ph describe\n              150           4      setosa  versicolor   virginica\ncount  150.000000  150.000000  150.000000  150.000000  150.000000\nmean     5.843333    3.057333    3.758000    1.199333    1.000000\nstd      0.828066    0.435866    1.765298    0.762238    0.819232\nmin      4.300000    2.000000    1.000000    0.100000    0.000000\n25%      5.100000    2.800000    1.600000    0.300000    0.000000\n50%      5.800000    3.000000    4.350000    1.300000    
1.000000\n75%      6.400000    3.300000    5.100000    1.800000    2.000000\nmax      7.900000    4.400000    6.900000    2.500000    2.000000\n```\n\nOccasionally you would like to plot a CSV file real quick, in which case you can\nsimply pipe it to `ph plot`:\n\nSuppose you have a dataset `covid.csv`\n\n```csv\nSK,Italy,Iran,France,Spain,US\n51,79,95,57,84,85\n104,150,139,100,125,111\n204,227,245,130,169,176\n433,320,388,191,228,252\n602,445,593,212,282,352\n833,650,978,285,365,495\n977,888,1501,423,430,640\n1261,1128,2336,613,674,926\n1766,1694,2922,949,1231,NaN\n2337,2036,3513,1126,1696,NaN\n3150,2502,4747,1412,NaN,NaN\n4212,3089,5823,1748,NaN,NaN\n4812,3858,6566,NaN,NaN,NaN\n5328,4638,7161,NaN,NaN,NaN\n5766,5883,8042,NaN,NaN,NaN\n6284,7375,NaN,NaN,NaN,NaN\n6767,9172,NaN,NaN,NaN,NaN\n7134,10149,NaN,NaN,NaN,NaN\n7382,NaN,NaN,NaN,NaN,NaN\n7513,NaN,NaN,NaN,NaN,NaN\n```\n\nWith this simple command, you get a certified _\"So fancy\" plot_.\n\n```bash\n$ cat covid.csv | ph plot\n```\n\n![So fancy covid plot](https://raw.githubusercontent.com/pgdr/ph/master/assets/covid-plot.png)\n\n\n_(Notice that this needs [matplotlib](https://matplotlib.org/): `pip install ph[plot]`)_\n\n\n---\n\n## Raison d'être\n\nUsing the _pipeline_ in Linux is nothing short of a dream in the life of the\ncomputer super user.\n\nHowever the pipe is clearly most suited for a stream of lines of textual data,\nand not when the stream is actually tabular data.\n\nTabular data is much more complex to work with due to its dual indexing and the\nfact that we often read horizontally and often read vertically.\n\nThe defacto format for tabular data is `csv`\n([comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values),\nwhich is not perfect in any sense\nof the word), and the defacto tool for working with tabular data in Python is\nPandas.\n\nThis is a shell utility `ph` (pronounced _phi_)\nthat reads tabular data from\n[_standard 
in_](https://en.wikipedia.org/wiki/Standard_streams#Standard_input_(stdin))\nand allows\nyou to perform a pandas function on the data, before writing it to standard out\nin `csv` format.\n\nThe goal is to create a tool which makes it nicer to work with tabular data in a\npipeline.\n\nTo achieve the goal, `ph` then reads csv data, does some manipulation,\nand prints out csv data.  With csv as the invariant, `ph` can be used in\na pipeline.\n\n---\n\nA very quick introduction to what `ph` can do for you,\nrun this in your shell:\n\n```bash\nph open csv https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/archived/ecdc/total_cases.csv \\\n    | ph slugify                                                       \\\n    | ph columns date norway sweden denmark                            \\\n    | ph diff norway sweden denmark                                    \\\n    | ph spencer norway sweden denmark                                 \\\n    | ph rolling 7 norway sweden denmark --how=mean                    \\\n    | ph dropna                                                        \\\n    | ph slice 50:                                                     \\\n    | ph plot --linewidth=3 --savefig=cases.svg --index=date\n```\n\n![cases](https://raw.githubusercontent.com/pgdr/ph/master/assets/cases.png)\n\n---\n\n## Table of contents\n\n1. [Getting started](#getting-started)\n1. [Example usage](#example-usage)\n1. [The tools](#the-tools)\n   1. [Concatenating, merging, filtering](#concatenating-merging-filtering)\n      1. [`cat`, `open`, `from`](#cat-open-from)\n      1. [`dropna` and `fillna`](#dropna-and-fillna)\n      1. [`head` and `tail`](#head-and-tail)\n      1. [`date`](#date)\n      1. [`merge`](#merge)\n   1. [Editing the csv](#editing-the-csv)\n      1. [`columns`, listing, selecting and re-ordering of](#columns-listing-selecting-and-re-ordering-of)\n      1. [`rename`](#rename)\n      1. [`replace`](#replace)\n      1. 
[`slice`](#slice)\n      1. [`eval`; Mathematipulating and creating new columns](#eval-mathematipulating-and-creating-new-columns)\n      1. [`normalize`](#normalize)\n      1. [`query`](#query)\n      1. [`grep`](#grep)\n      1. [`strip`](#strip)\n      1. [`removeprefix` and `removesuffix`](#removeprefix-and-removesuffix)\n   1. [Analyzing the csv file](#analyzing-the-csv-file)\n      1. [`describe`](#describe)\n      1. [`show`](#show)\n      1. [`tabulate`](#tabulate)\n      1. [`sort` values by column](#sort-values-by-column)\n      1. [`plot`](#plot)\n      1. [`groupby`](#groupby)\n      1. [`rolling`, `ewm`, `expanding`](#rolling-ewm-expanding)\n      1. [`index`](#index)\n      1. [`polyfit`](#polyfit)\n1. [Working with different formats](#working-with-different-formats)\n   1. [`open`](#open)\n   1. [`to` and `from`; Exporting and importing](#to-and-from-exporting-and-importing)\n   1. [Supported formats](#supported-formats)\n\n\n---\n\n\n## Getting started\n\nIf you have installed `ph[data]`, you can experiment using `ph dataset` if you\ndon't have an appropriate csv file available.\n\n\n```bash\nph dataset boston | ph describe\n```\n\nAvailable datasets are from\n[scikit-learn.datasets](https://scikit-learn.org/stable/datasets/index.html)\n\nToy datasets:\n\n* `boston`\n* `iris`\n* `diabetes`\n* `digits`\n* `linnerud`\n* `wine`\n* `breast_cancer`\n\n\nReal world:\n\n* `olivetti_faces`\n* `lfw_people`\n* `lfw_pairs`\n* `rcv1`\n* `kddcup99`\n* `california_housing`\n\n\n## Example usage\n\nSuppose you have a csv file `a.csv` that looks like this:\n\n```csv\nx,y\n3,8\n4,9\n5,10\n6,11\n7,12\n8,13\n```\n\nTranspose:\n\n```bash\n$ cat a.csv | ph transpose\n0,1,2,3,4,5\n3,4,5,6,7,8\n8,9,10,11,12,13\n```\n\n`median` (as well as many others, e.g.  
`abs`, `corr`, `count`, `cov`, `cummax`,\n`cumsum`, `diff`, `max`, `product`, `quantile`, `rank`, `round`, `sum`, `std`,\n`var` etc.):\n\n```bash\n$ cat a.csv | ph median\nx,y\n5.5,10.5\n```\n\n**_Use `ph help` to list all commands_**\n\n\n## The tools\n\n### Concatenating, merging, filtering\n\n#### `cat`, `open`, `from`\n\n**cat**\n\nIt is possible to _concatenate_ (`cat`) multiple csv-files with `ph cat`:\n\n```bash\n$ ph cat a.csv b.csv --axis=index\n```\n\n```bash\n$ ph cat a.csv b.csv --axis=columns\n```\n\nThe functionality is described in\n[`pandas.concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html).\n\n\n**open**\n\nYou can open a csv, json, excel, gpx (etc., see [_supported\nformats_](#supported-formats)) using `ph open type file`:\n\n```bash\n$ ph open excel a.xlsx\n```\n\n```bash\n$ ph open excel a.xlsx --sheet_name=0 --skiprows=3\n```\n\n\n```bash\n$ ph open tsv a.tsv\n```\n\n```bash\n$ ph open csv a.csv\n```\n\nIn the event that the csv data starts on the first line (i.e. no\nheader is present), use `--header=None`:\n\n```bash\n$ ph open csv a.csv --header=None\n```\n\n\n\n**from**\n\nThe `ph from` command works similarly to `ph open` but reads from stdin\ninstead of opening a file.  It therefore does not take a filename\nargument:\n\n```bash\n$ cat /etc/passwd | ph from csv --sep=':' --header=None\n```\n\n\n#### `dropna` and `fillna`\n\n\nConsider again the `covid.csv` file from above.\n\n```bash\n$ cat covid.csv | ph dropna\n```\n\nwill remove all rows that contain N/A (`nan`) values.  If we want to keep all\nrows with at least 5 non-N/A values, we can use\n\n```bash\n$ cat covid.csv | ph dropna --thresh=5\n```\n\nIf we want to drop all _columns_ with N/A values instead of all _rows_, we use\n`--axis=1`.\n\nIf we want to drop only columns (resp. 
rows) with _all n/a_ values, we use\n`--how=all`.\n\n\nTo _replace_ N/A values with other values, we can simply run\n\n```bash\ncat covid.csv | ph fillna 999.75\n```\n\nIf we instead want to _pad_ the N/A values, we use `--method=pad`\n\n```bash\ncat covid.csv | ph fillna --method=pad\n```\n\nWe can limit the number of consecutive N/A values that are filled by using\n(e.g.) `--limit=7`.\n\n\n\n\n\n\n\n\n#### `head` and `tail`\n\n`head` and `tail` work much like their shell equivalents,\nbut they preserve the header if there is one, e.g.\n\n```bash\n$ cat a.csv | ph head 7 | ph tail 3\nx,y\n6,11\n7,12\n8,13\n```\n\n#### `date`\n\nIf the `csv` file contains a column, e.g. named `x` containing\ntimestamps, it can be parsed as such with `ph date x`:\n\n```bash\n$ cat a.csv | ph date x\nx,y\n1970-01-04,8\n1970-01-05,9\n1970-01-06,10\n1970-01-07,11\n1970-01-08,12\n1970-01-09,13\n```\n\nIf your column is formatted day-first, `dd/mm/yyyy`, you can\nuse the flag `--dayfirst=True`:\n\n```csv\ndateRep,geoId\n01/04/2020,US\n31/03/2020,US\n30/03/2020,US\n29/03/2020,US\n28/03/2020,US\n```\n\n```bash\n$ cat ~/cov.csv | ph date dateRep --dayfirst=True\ndateRep,geoId\n2020-04-01,US\n2020-03-31,US\n2020-03-30,US\n2020-03-29,US\n2020-03-28,US\n```\n\n\n\nTo get a column with integers (e.g. 3-8) parsed as, e.g. 2003 - 2008, some\namount of hacking is necessary.  
We will go into details later on the `eval` and\n`appendstr`.\n\n```bash\n$ cat a.csv | ph eval \"x = 2000 + x\" | ph appendstr x - | ph date x\nx,y\n2003-01-01,8\n2004-01-01,9\n2005-01-01,10\n2006-01-01,11\n2007-01-01,12\n2008-01-01,13\n```\n\nHowever, it is possible to provide a `--format` instruction to `date`:\n\n```bash\n$ cat a.csv | ph eval \"x = 2000 + x\"  | ph date x --format=\"%Y\"\nx,y\n2003-01-01,8\n2004-01-01,9\n2005-01-01,10\n2006-01-01,11\n2007-01-01,12\n2008-01-01,13\n```\n\nUnder some very special circumstances, we may have a `unix timestamp` in\na column, in which the `--utc=True` handle becomes useful:\n\nConsider `utc.csv`:\n\n```csv\ndate,x,y\n1580601600,3,8\n1580688000,4,9\n1580774400,5,10\n1580860800,6,11\n1580947200,7,12\n1581033600,8,13\n```\n\nwhere you get the correct dates:\n\n```bash\n$ cat utc.csv | ph date date --utc=True\ndate,x,y\n2020-02-02,3,8\n2020-02-03,4,9\n2020-02-04,5,10\n2020-02-05,6,11\n2020-02-06,7,12\n2020-02-07,8,13\n```\n\n\n#### `merge`\n\nMerging two csv files is made available through `ph merge f1 f2`.\n\nConsider `left.csv`\n\n```csv\nkey1,key2,A,B\nK0,K0,A0,B0\nK0,K1,A1,B1\nK1,K0,A2,B2\nK2,K1,A3,B3\n```\n\nand `right.csv`\n\n```csv\nkey1,key2,C,D\nK0,K0,C0,D0\nK1,K0,C1,D1\nK1,K0,C2,D2\nK2,K0,C3,D3\n```\n\nWe can merge them using (default to `--how=inner`)\n\n```bash\n$ ph merge left.csv right.csv\nkey1,key2,A,B,C,D\nK0,K0,A0,B0,C0,D0\nK1,K0,A2,B2,C1,D1\nK1,K0,A2,B2,C2,D2\n```\n\nor using an _outer_ join:\n\n```bash\n$ ph merge left.csv right.csv --how=outer\nkey1,key2,A,B,C,D\nK0,K0,A0,B0,C0,D0\nK0,K1,A1,B1,,\nK1,K0,A2,B2,C1,D1\nK1,K0,A2,B2,C2,D2\nK2,K1,A3,B3,,\nK2,K0,,,C3,D3\n```\n\nand we can specify on which column to join:\n\n```bash\n$ ph merge left.csv right.csv --on=key1 --how=outer\nkey1,key2_x,A,B,key2_y,C,D\nK0,K0,A0,B0,K0,C0,D0\nK0,K1,A1,B1,K0,C0,D0\nK1,K0,A2,B2,K0,C1,D1\nK1,K0,A2,B2,K0,C2,D2\nK2,K1,A3,B3,K0,C3,D3\n```\n\n\nIn the case when the two files do not share a common column key, we can\njoin 
them on key1 from the left file and key2 from the right file by specifying\n\n```bash\n$ ph merge mergel.csv merger.csv --left=key1 --right=key2\n```\n\n\n\n### Editing the csv\n\n#### `columns`, listing, selecting and re-ordering of\n\nConsider `c.csv`:\n\n```csv\nit,fr,de\n79,57,79\n157,100,130\n229,130,165\n323,191,203\n470,212,262\n655,285,545\n889,423,670\n1128,653,800\n1701,949,1040\n2036,1209,1224\n2502,1412,1565\n3089,1784,1966\n3858,2281,2745\n4636,2876,3675\n5883,3661,4181\n```\n\nPrint the column names:\n\n```bash\n$ cat c.csv | ph columns\nit\nfr\nde\n```\n\nSelecting only certain columns, e.g. `de` and `it`\n\n```bash\n$ cat c.csv | ph columns de it | ph tail 3\nde,it\n2745,3858\n3675,4636\n4181,5883\n```\n\n\n#### `rename`\n\n```bash\n$ cat c.csv | ph rename de Germany | ph rename it Italy | ph columns Italy Germany\nItaly,Germany\n79,79\n157,130\n229,165\n323,203\n470,262\n655,545\n889,670\n1128,800\n1701,1040\n2036,1224\n2502,1565\n3089,1966\n3858,2745\n4636,3675\n5883,4181\n```\n\nIn addition to `rename` there is an auxiliary function `slugify` that\nlets you _slugify_ the column names.  Consider `slugit.csv`\n\n```csv\n  Stupid column 1,  Jerky-column No. 2\n3,8\n4,9\n5,10\n6,11\n7,12\n8,13\n```\n\n```bash\n$ cat slugit.csv | ph slugify\nstupid_column_1,jerky_column_no_2\n3,8\n4,9\n5,10\n6,11\n7,12\n8,13\n```\n\nThen you can do\n\n```bash\n$ cat slugit.csv | ph slugify | ph rename stupid_column_1 first | ph rename jerky_column_no_2 second\nfirst,second\n3,8\n4,9\n5,10\n6,11\n7,12\n8,13\n```\n\n\n#### `replace`\n\nWe can replace values in the data (or in a single column) using `ph\nreplace`.  
The syntax is\n`ph replace old new [--column=x [--newcolumn=xp]]`:\n\n```bash\n$ cat a.csv| ph replace 8 100\nx,y\n3,100\n4,9\n5,10\n6,11\n7,12\n100,13\n```\n\n```bash\n$ cat a.csv| ph replace 8 100 --column=x\nx,y\n3,8\n4,9\n5,10\n6,11\n7,12\n100,13\n```\n\n```bash\n$ cat a.csv| ph replace 8 100 --column=x --newcolumn=xp\nx,y,xp\n3,8,3\n4,9,4\n5,10,5\n6,11,6\n7,12,7\n8,13,100\n```\n\n\n\n#### `slice`\n\nSlicing in Python is essential, and occasionally, we want to slice\ntabular data, e.g. look at only the 100 first, or 100 last rows, or\nperhaps we want to look at only every 10th row.  All of this is achieved\nusing `ph slice start:end:step` with standard Python slice syntax.\n\n```bash\n$ cat a.csv | ph slice 1:9:2\nx,y\n4,9\n6,11\n8,13\n```\n\nReversing:\n\n```\n$ cat a.csv|ph slice ::-1\nx,y\n8,13\n7,12\n6,11\n5,10\n4,9\n3,8\n```\n\nSee also `ph head` and `ph tail`.\n\n```bash\n$ cat a.csv | ph slice :3\nx,y\n3,8\n4,9\n5,10\n```\n\nequivalent to\n\n```bash\n$ cat a.csv | ph head 3\nx,y\n3,8\n4,9\n5,10\n```\n\n\n\n#### `eval`; Mathematipulating and creating new columns\n\nYou can sum columns and place the result in a new column using\n`eval` (from\n[`pandas.DataFrame.eval`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.eval.html#pandas.DataFrame.eval)).\n\n```bash\n$ cat c.csv | ph eval \"total = it + fr + de\" | ph tail 3\nit,fr,de,total\n3858,2281,2745,8884\n4636,2876,3675,11187\n5883,3661,4181,13725\n```\n\n\n```bash\n$ cat a.csv | ph eval \"z = x**2 + y\"\nx,y,z\n3,8,17\n4,9,25\n5,10,35\n6,11,47\n7,12,61\n8,13,77\n```\n\n\nIf you only want the result, you leave the `eval` expression without assignment\n\n```bash\n$ cat a.csv | ph eval \"x**2\"\nx\n9\n16\n25\n36\n49\n64\n```\n\n\n#### `normalize`\n\nYou can normalize a column using `ph normalize col`.\n\n```bash\n$ cat a.csv | ph eval \"z = x * y\" | ph normalize z\nx,y,z\n3,8,0.0\n4,9,0.15\n5,10,0.325\n6,11,0.525\n7,12,0.75\n8,13,1.0\n```\n\n\n\n#### `query`\n\nWe can 
[query](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) data using `ph query expr`.\n\n```bash\n$ cat a.csv | ph query \"x \u003e 5\"\nx,y\n6,11\n7,12\n8,13\n```\n\n\n```bash\n$ ph open csv 'http://bit.ly/2cLzoxH' | ph query \"country == 'Norway'\" | ph tabulate --headers\n    country      year          pop  continent      lifeExp    gdpPercap\n--  ---------  ------  -----------  -----------  ---------  -----------\n 0  Norway       1952  3.32773e+06  Europe          72.67       10095.4\n 1  Norway       1957  3.49194e+06  Europe          73.44       11654\n 2  Norway       1962  3.63892e+06  Europe          73.47       13450.4\n 3  Norway       1967  3.78602e+06  Europe          74.08       16361.9\n 4  Norway       1972  3.933e+06    Europe          74.34       18965.1\n 5  Norway       1977  4.04320e+06  Europe          75.37       23311.3\n 6  Norway       1982  4.11479e+06  Europe          75.97       26298.6\n 7  Norway       1987  4.18615e+06  Europe          75.89       31541\n 8  Norway       1992  4.28636e+06  Europe          77.32       33965.7\n 9  Norway       1997  4.40567e+06  Europe          78.32       41283.2\n10  Norway       2002  4.53559e+06  Europe          79.05       44684\n11  Norway       2007  4.62793e+06  Europe          80.196      49357.2\n```\n\n\n\n#### `grep`\n\nThe powerful `grep` is one of the most used command line tools, and it\nwould be silly to not ship a version of it ourselves.  Using `ph grep`\nis rarely necessary, but helps when you want to ensure the header is\nkept.\n\n```bash\n$ cat txtfile.csv | ph grep \"a|b\" --case=False --column=Text_Column --regex=False\n```\n\nThe arguments denote\n\n* `--case` should be case sensitive?\n* `--column` grep only in given column\n* `--regex` use regex for pattern?\n\n\n\n#### `strip`\n\nOccasionally csv files come with additional spaces which can lead to\ndifficulties in parsing the cells' contents.  
A csv file should be\nformatted without spaces after the comma: `42,17` rather than `42, 17`.  But\nsince we are human, we sometimes make mistakes.\n\nIf we want to _strip_, or _trim_, the contents of a column, we use `ph\nstrip`:\n\n```bash\n$ cat txtfile.csv | ph strip col1 col2\n```\n\n\n\n#### `removeprefix` and `removesuffix`\n\nIf `strip` is not sufficiently powerful, it is possible to\n`removeprefix` or `removesuffix` using\n\n```bash\n$ cat txtfile.csv | ph removeprefix col1 pattern\n```\n\nand similarly for `removesuffix`.\n\n\n\n\n\n### Analyzing the csv file\n\n\n#### `describe`\n\nThe normal Pandas `describe` is of course available:\n\n```bash\n$ cat a.csv | ph describe\n              x          y\ncount  6.000000   6.000000\nmean   5.500000  10.500000\nstd    1.870829   1.870829\nmin    3.000000   8.000000\n25%    4.250000   9.250000\n50%    5.500000  10.500000\n75%    6.750000  11.750000\nmax    8.000000  13.000000\n```\n\n\n#### `show`\n\nThe shorthand `ph show` simply calls `ph tabulate --headers`, described below.\n\n```bash\n$ cat a.csv | ph show\n      x    y\n--  ---  ---\n 0    3    8\n 1    4    9\n 2    5   10\n 3    6   11\n 4    7   12\n 5    8   13\n```\n\n#### `tabulate`\n\nThe amazing _tabulate_ tool comes from the Python package\n[tabulate on PyPI](https://pypi.org/project/tabulate/).\n\nThe `tabulate` command takes arguments `--headers` to toggle printing of header\nrow, `--format=[grid,...]` to modify the table style and `--noindex` to remove\nthe running index (leftmost column in the example above).\n\nAmong the supported format styles are\n\n* `plain`, `simple`,\n* `grid`, `fancy_grid`, `pretty`,\n* `github`, `rst`, `mediawiki`, `html`, `latex`,\n* ... 
(See the full list at the project homepage,\n  [python-tabulate](https://github.com/astanin/python-tabulate).)\n\n\n#### `sort` values by column\n\nYou can sort the rows of the csv data by a certain column:\n\n```bash\n$ cat iris.csv  | ph sort setosa | ph tail 5\n150,4,setosa,versicolor,virginica\n7.9,3.8,6.4,2.0,2\n7.6,3.0,6.6,2.1,2\n7.7,3.8,6.7,2.2,2\n7.7,2.8,6.7,2.0,2\n7.7,2.6,6.9,2.3,2\n```\n\n#### `plot`\n\nYou can plot data using `ph plot [--index=col]`.\n\n```bash\n$ ph open parquet 1A_2019.parquet | ph columns Time Value | ph plot --index=Time\n```\n\nThis will take the columns `Time` and `Value` from the timeseries provided by\nthe given `parquet` file and plot the `Value` series using `Time` as _index_.\n\n\nThe following example plots the life expectancy in Norway using `year` as _index_:\n\n```bash\n$ ph open csv http://bit.ly/2cLzoxH  | ph query \"country == 'Norway'\" | ph appendstr year -01-01 | ph columns year lifeExp | ph plot --index=year\n```\n\n![life-expectancy over time](https://raw.githubusercontent.com/pgdr/ph/master/assets/lifeexp.png)\n\n\u003e _Note:_ The strange `ph appendstr year -01-01` turns the items `1956` into\n\u003e `\"1956-01-01\"` and `2005` into `\"2005-01-01\"`.  This is necessary to make\n\u003e pandas interpret `1956` as a _year_ and not as a _millisecond_.\n\u003e\n\u003e The command `ph appendstr col str [newcol]` takes a string and appends it to a\n\u003e column, overwriting the original column, or writing it to `newcol` if provided.\n\n**Advanced plotting**\n\nYou can choose the _kind_ of plotting (‘line’, ‘bar’, ‘barh’, ‘hist’, ‘box’,\n‘kde’, ‘density’, ‘area’, ‘pie’, ‘scatter’, ‘hexbin’), the _style_ of plotting\n(e.g. 
`--style=o`), and in case of scatter plot, you need to specify `--x=col1`\nand `--y=col2`, e.g.:\n\n```bash\n$ ph open csv http://bit.ly/2cLzoxH | ph query \"continent == 'Europe'\" | ph plot --kind=scatter --x=lifeExp --y=gdpPercap\n```\n\n![life-expectancy vs gdp](https://raw.githubusercontent.com/pgdr/ph/master/assets/scatter.png)\n\n\n\n\n\nTo specify the styling `k--` gives a black dashed line:\n\n```bash\n$ ph open csv http://bit.ly/2cLzoxH  | ph query \"country == 'Norway'\" | ph appendstr year -01-01 | ph columns year lifeExp | ph plot --index=year --style=k--\n```\n\n\n**Using `plot` headless**\n\nOccasionally we would like to generate a plot to an image(-like) file on\nthe command line or in a script, without necessarily launching any\ngraphic user interface.\n\nCalling `ph plot` with the argument `--savefig=myfile.png` will create a\nPNG file called `myfile.png` instead of opening the matplotlib window.\nIt is also possible to get other formats by using different extensions,\nlike `eps`, `pdf`, `pgf`, `png`, `ps`, `raw`, `rgba`, `svg`, `svgz`.\n\n\n**_`iplot`_ with `plotly` and `cufflinks`**\n\nInstead of using the `matplotlib` backend, there is an option for using `plotly`\nand [`cufflinks`](https://github.com/santosjorge/cufflinks) to generate\ninteractive plots.\nThis depends on `cufflinks`, and can be installed with `pip install ph[iplot]`.\n\n```bash\n$ cat a.csv | ph iplot --kind=bar --barmode=stack\n```\n\n```bash\n$ cat a.csv | ph iplot --kind=scatter --mode=markers\n```\n\n\n#### `groupby`\n\nSuppose you have a csv file\n\n```csv\nAnimal,Max Speed\nFalcon,380.0\nFalcon,370.0\nParrot,24.0\nParrot,26.0\n```\n\nYou can use Pandas' `groupby` functionality to get the aggregated `sum`,\n`mean`, or `first` value:\n\n```bash\n$ cat group.csv | ph groupby Animal --how=mean\nMax Speed\n375.0\n25.0\n```\n\nIf you want to retain the index column,\n\n```bash\n$ cat group.csv | ph groupby Animal --how=mean --as_index=False\nAnimal,Max 
Speed\nFalcon,375.0\nParrot,25.0\n```\n\n\n\n#### `rolling`, `ewm`, `expanding`\n\n**rolling**\n\nCompute rolling averages/sums using `ph rolling 3 --how=mean`\n\nConsider again `a.csv`:\n\n```csv\nx,y\n3,8\n4,9\n5,10\n6,11\n7,12\n8,13\n```\n\nMoving average with window size 3:\n\n```bash\n$ cat a.csv|ph rolling 3 --how=mean | ph dropna\nx,y\n4.0,9.0\n5.0,10.0\n6.0,11.0\n7.0,12.0\n```\n\n\nRolling sum with window size 2:\n\n```bash\n$ cat a.csv|ph rolling 2 --how=sum | ph dropna\nx,y\n7.0,17.0\n9.0,19.0\n11.0,21.0\n13.0,23.0\n15.0,25.0\n```\n\n\n**ewm — exponentially weighted methods**\n\n```bash\n$ cat a.csv | ph ewm --com=0.5 --how=mean | ph show\n          x         y\n--  -------  --------\n 0  3         8\n 1  3.75      8.75\n 2  4.61538   9.61538\n 3  5.55     10.55\n 4  6.52066  11.5207\n 5  7.50824  12.5082\n```\n\nUse either `com` (center of mass), `span`, `halflife`, or `alpha`,\ntogether with `--how=mean`, `--how=std`, `--how=var`, etc.\n\n\n**expanding — expanding window**\n\n\u003e A common alternative to rolling statistics is to use an expanding\n\u003e window, which yields the value of the statistic with all the data\n\u003e available up to that point in time.\n\n```bash\n$ cat a.csv | ph expanding 3\nx,y\n,\n,\n12.0,27.0\n18.0,38.0\n25.0,50.0\n33.0,63.0\n```\n\n\n**Spencer's 15-weight average**\n\nWe also support an experimental and slow version of Spencer's 15-weight\naverage.  
This method takes a window of size 15, multiplies it pointwise\nwith the following vector (normalized)\n\n```\n(-3, -6, -5, 3, 21, 46, 67, 74, 67, 46, 21, 3, -5, -6, -3)\n```\n\nand then takes the sum of the resulting vector.\n\nSpencer's 15-weight average is an interesting (impulse response) filter\nthat preserves all polynomial functions up to degree three.\n\n\n#### `index`\n\nOccasionally you need to have an index, in which case `ph index` is your tool:\n\n```bash\n$ cat a.csv | ph index\nindex,x,y\n0,3,8\n1,4,9\n2,5,10\n3,6,11\n4,7,12\n5,8,13\n```\n\n#### `polyfit`\n\nYou can perform **linear regression** and **polynomial regression** on a certain\nindex column `x` and a `y = f(x)` column using `ph polyfit`.  It takes two\narguments, the `x` column name and the `y` column name, and an optional\n`--deg=\u003cdegree\u003e`, the degree of the polynomial.  The default is `--deg=1`,\nwhich corresponds to a linear regression.\n\nSuppose you have a csv file `lr.csv` with content\n\n```csv\nx,y\n4,12\n5,19\n6,17\n7,24\n8,28\n9,34\n```\n\nWith linear (polynomial) regression, you get an extra column, `polyfit_{deg}`:\n\n```bash\n$ cat lr.csv | ph polyfit x y | ph astype int\nx,y,polyfit_1\n4,12,12\n5,19,16\n6,17,20\n7,24,24\n8,28,28\n9,34,32\n```\n\nUsing `ph plot --index=x` results in this plot:\n\n![polyfit](https://raw.githubusercontent.com/pgdr/ph/master/assets/polyfit.png)\n\n## Working with different formats\n\n\n### `open`\n\nPandas provides a multitude of [readers](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).\n\nTo read an Excel file and pipe the stream, you can use `ph open`.\n\nThe syntax of `ph open` is `ph open ftype fname`, where `fname` is the\nfile you want to stream and `ftype` is the type of the file.\n\nA list of all available formats is given below.\n\n```bash\n$ ph open xls a.xlsx\nx,y\n3,8\n4,9\n5,10\n6,11\n7,12\n8,13\n```\n\n\nYou can open a _semicolon separated values_ file using `--sep=\";\"`\n\n```bash\n$ ph open csv 
--sep=\";\" fname.csv\n```\n\n\n\n### `to` and `from`; Exporting and importing\n\nObserve the following:\n\n```json\n{\"x\":{\"0\":3,\"1\":4,\"2\":5,\"3\":6,\"4\":7,\"5\":8},\n \"y\":{\"0\":8,\"1\":9,\"2\":10,\"3\":11,\"4\":12,\"5\":13}}\n```\n\nOf course, then,\n\n```bash\n$ cat a.csv | ph to json | ph from json\nx,y\n3,8\n4,9\n5,10\n6,11\n7,12\n8,13\n```\n\nThis also means that\n\n```bash\n$ cat a.csv | ph to json \u003e a.json\n$ cat a.json\n{\"x\":{\"0\":3,\"1\":4,\"2\":5,\"3\":6,\"4\":7,\"5\":8},\n \"y\":{\"0\":8,\"1\":9,\"2\":10,\"3\":11,\"4\":12,\"5\":13}}\n$ cat a.json | ph from json\nx,y\n3,8\n4,9\n5,10\n6,11\n7,12\n8,13\n```\n\nYou can open Excel-like formats using `ph open excel fname.xls[x]`, `parquet`\nfiles with `ph open parquet data.parquet`.  Note that these two examples require\n`xlrd` and `pyarrow`, respectively, or simply\n\n```\npip install ph[complete]\n```\n\n\n### Supported formats\n\n* `csv` / `tsv` (the latter for tab-separated values)\n* `fwf` (fixed-width file format)\n* `json`\n* `html`\n* `clipboard` (pastes tab-separated content from clipboard)\n* `xls`\n* `odf`\n* `hdf5`\n* `feather`\n* `parquet`\n* `orc`\n* `stata`\n* `sas`\n* `spss`\n* `pickle`\n* `sql`\n* `gbq` / `google` / `bigquery`\n\nWe also support reading GPX files with `ph open gpx`.\nThis uses the GPX Python library [gpxpy](https://github.com/tkrajina/gpxpy).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpgdr%2Fph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpgdr%2Fph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpgdr%2Fph/lists"}