{"id":13465703,"url":"https://github.com/kieferk/dfply","last_synced_at":"2025-03-25T16:32:44.841Z","repository":{"id":11352499,"uuid":"68752303","full_name":"kieferk/dfply","owner":"kieferk","description":"dplyr-style piping operations for pandas dataframes","archived":false,"fork":false,"pushed_at":"2022-05-08T03:25:22.000Z","size":2187,"stargazers_count":893,"open_issues_count":42,"forks_count":103,"subscribers_count":40,"default_branch":"master","last_synced_at":"2025-03-09T18:16:15.673Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kieferk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-09-20T20:51:27.000Z","updated_at":"2025-03-09T03:07:40.000Z","dependencies_parsed_at":"2022-08-08T15:30:15.810Z","dependency_job_id":null,"html_url":"https://github.com/kieferk/dfply","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kieferk%2Fdfply","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kieferk%2Fdfply/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kieferk%2Fdfply/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kieferk%2Fdfply/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kieferk","download_url":"https://codeload.github.com/kieferk/dfply/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245500455,"owners_count":20625585
,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T15:00:34.052Z","updated_at":"2025-03-25T16:32:44.325Z","avatar_url":"https://github.com/kieferk.png","language":"Python","funding_links":[],"categories":["Python","Uncategorized","Libraries"],"sub_categories":["Uncategorized"],"readme":"# dfply\n\n### Version: 0.3.2\n\n\u003e Note: Version 0.3.0 is the first big update in awhile, and changes a lot of\nthe \"base\" code. The `pandas-ply` package is no longer being imported. I have coded\nmy own version of the \"symbolic\" objects that I was borrowing from `pandas-ply`. Also,\nI am no longer supporting Python 2, sorry!\n\n\u003e **In v0.3 `groupby` has been renamed to `group_by` to mirror the `dplyr` function.\nIf this breaks your legacy code, one possible fix is to have `from dfply.group import group_by as groupby`\nin your package imports.**\n\nThe `dfply` package makes it possible to do R's `dplyr`-style data manipulation with pipes\nin python on pandas DataFrames.\n\nThis is an alternative to [`pandas-ply`](https://github.com/coursera/pandas-ply)\nand [`dplython`](https://github.com/dodger487/dplython), which both engineer `dplyr`\nsyntax and functionality in python. There are probably more packages that attempt\nto enable `dplyr`-style dataframe manipulation in python, but those are the two I\nam aware of.\n\n`dfply` uses a decorator-based architecture for the piping functionality and\nto \"categorize\" the types of data manipulation functions. 
The goal of this  \narchitecture is to make `dfply` concise and easily extensible, simply by chaining\ntogether different decorators that each have a distinct effect on the wrapped\nfunction. There is a more in-depth overview of the decorators and how `dfply` can be\ncustomized below.\n\n`dfply` is intended to mimic the functionality of `dplyr`. The syntax\nis the same for the most part, but will vary in some cases as Python is a\nconsiderably different programming language than R.\n\nA good amount of the core functionality of `dplyr` is complete, and the remainder is\nactively being added in. Going forward I hope functionality that is not\ndirectly part of `dplyr` to be incorporated into `dfply` as well. This is not\nintended to be an absolute mimic of `dplyr`, but instead a port of the _ease,\nconvenience and readability_ the `dplyr` package provides for data manipulation\ntasks.\n\n**Expect frequent updates to the package version as features are added and\nany bugs are fixed.**\n\n\u003c!-- START doctoc generated TOC please keep comment here to allow auto update --\u003e\n\u003c!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --\u003e\n\n\n- [Overview of functions](#overview-of-functions)\n  - [The `\u003e\u003e` and `\u003e\u003e=` pipe operators](#the--and--pipe-operators)\n  - [The `X` DataFrame symbol](#the-x-dataframe-symbol)\n  - [Selecting and dropping](#selecting-and-dropping)\n    - [`select()` and `drop()` functions](#select-and-drop-functions)\n    - [Selection using the inversion `~` operator on symbolic columns](#selection-using-the-inversion--operator-on-symbolic-columns)\n    - [Selection filter functions](#selection-filter-functions)\n  - [Subsetting and filtering](#subsetting-and-filtering)\n    - [`row_slice()`](#row_slice)\n    - [`sample()`](#sample)\n    - [`distinct()`](#distinct)\n    - [`mask()`](#mask)\n  - [DataFrame transformation](#dataframe-transformation)\n    - [`mutate()`](#mutate)\n    - [`transmute()`](#transmute)\n  
- [Grouping](#grouping)\n    - [`group_by()` and `ungroup()`](#group_by-and-ungroup)\n  - [Reshaping](#reshaping)\n    - [`arrange()`](#arrange)\n    - [`rename()`](#rename)\n    - [`gather()`](#gather)\n    - [`spread()`](#spread)\n    - [`separate()`](#separate)\n    - [`unite()`](#unite)\n  - [Joining](#joining)\n    - [`inner_join()`](#inner_join)\n    - [`outer_join()` or `full_join()`](#outer_join-or-full_join)\n    - [`left_join()`](#left_join)\n    - [`right_join()`](#right_join)\n    - [`semi_join()`](#semi_join)\n    - [`anti_join()`](#anti_join)\n  - [Set operations](#set-operations)\n    - [`union()`](#union)\n    - [`intersect()`](#intersect)\n    - [`set_diff()`](#set_diff)\n  - [Binding](#binding)\n    - [`bind_rows()`](#bind_rows)\n    - [`bind_cols()`](#bind_cols)\n  - [Summarization](#summarization)\n    - [`summarize()`](#summarize)\n    - [`summarize_each()`](#summarize_each)\n- [Embedded column functions](#embedded-column-functions)\n  - [Window functions](#window-functions)\n    - [`lead()` and `lag()`](#lead-and-lag)\n    - [`between()`](#between)\n    - [`dense_rank()`](#dense_rank)\n    - [`min_rank()`](#min_rank)\n    - [`cumsum()`](#cumsum)\n    - [`cummean()`](#cummean)\n    - [`cummax()`](#cummax)\n    - [`cummin()`](#cummin)\n    - [`cumprod()`](#cumprod)\n  - [Summary functions](#summary-functions)\n    - [`mean()`](#mean)\n    - [`first()`](#first)\n    - [`last()`](#last)\n    - [`nth()`](#nth)\n    - [`n()`](#n)\n    - [`n_distinct()`](#n_distinct)\n    - [`IQR()`](#iqr)\n    - [`colmin()`](#colmin)\n    - [`colmax()`](#colmax)\n    - [`median()`](#median)\n    - [`var()`](#var)\n    - [`sd()`](#sd)\n- [Extending `dfply` with custom functions](#extending-dfply-with-custom-functions)\n  - [Case 1: A custom \"pipe\" function with `@dfpipe`](#case-1-a-custom-pipe-function-with-dfpipe)\n  - [Case 2: A function that works with symbolic objects using 
`@make_symbolic`](#case-2-a-function-that-works-with-symbolic-objects-using-make_symbolic)\n    - [Without symbolic arguments, `@make_symbolic` functions work like normal functions!](#without-symbolic-arguments-make_symbolic-functions-work-like-normal-functions)\n- [Advanced: understanding base `dfply` decorators](#advanced-understanding-base-dfply-decorators)\n  - [The `Intention` class](#the-intention-class)\n  - [`@pipe`](#pipe)\n  - [`@group_delegation`](#group_delegation)\n  - [`@symbolic_evaluation`](#symbolic_evaluation)\n    - [Controlling `@symbolic_evaluation` with the `eval_symbols` argument](#controlling-symbolic_evaluation-with-the-eval_symbols-argument)\n  - [`@dfpipe`](#dfpipe)\n  - [`@make_symbolic`](#make_symbolic)\n- [Contributing](#contributing)\n\n\u003c!-- END doctoc generated TOC please keep comment here to allow auto update --\u003e\n\n## Overview of functions\n\n### The `\u003e\u003e` and `\u003e\u003e=` pipe operators\n\ndfply works directly on pandas DataFrames, chaining operations on the data with\nthe `\u003e\u003e` operator, or alternatively starting with `\u003e\u003e=` for inplace operations.\n\n```python\nfrom dfply import *\n\ndiamonds \u003e\u003e head(3)\n\n   carat      cut color clarity  depth  table  price     x     y     z\n0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43\n1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31\n2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31\n```\n\nYou can chain piped operations, and of course assign the output to a new\nDataFrame.\n\n```python\nlowprice = diamonds \u003e\u003e head(10) \u003e\u003e tail(3)\n\nlowprice\n\n   carat        cut color clarity  depth  table  price     x     y     z\n7   0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53\n8   0.22       Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49\n9   0.23  Very Good     H     VS1   59.4   61.0    338  4.00  4.05  2.39\n```\n\nInplace 
operations are done with the first pipe as `\u003e\u003e=` and subsequent pipes\nas `\u003e\u003e`.\n\n```python\ndiamonds \u003e\u003e= head(10) \u003e\u003e tail(3)\n\ndiamonds\n\n   carat        cut color clarity  depth  table  price     x     y     z\n7   0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53\n8   0.22       Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49\n9   0.23  Very Good     H     VS1   59.4   61.0    338  4.00  4.05  2.39\n```\n\nWhen using the inplace pipe, the DataFrame is not required on the left hand\nside of the `\u003e\u003e=` pipe and the DataFrame variable is overwritten with the\noutput of the operations.\n\n\n### The `X` DataFrame symbol\n\nThe DataFrame as it is passed through the piping operations is represented\nby the symbol `X`. It records the actions you want to take (represented by\nthe `Intention` class), but does not evaluate them until the appropriate time.\nOperations on the DataFrame are deferred. Selecting\ntwo of the columns, for example, can be done using the symbolic `X` DataFrame\nduring the piping operations.\n\n```python\ndiamonds \u003e\u003e select(X.carat, X.cut) \u003e\u003e head(3)\n\n   carat      cut\n0   0.23    Ideal\n1   0.21  Premium\n2   0.23     Good\n```\n\n\n### Selecting and dropping\n\n#### `select()` and `drop()` functions\n\nThere are two functions for selection, inverse of each other: `select` and\n`drop`. The `select` and `drop` functions accept string labels, integer positions,\nand/or symbolically represented column names (`X.column`). 
They also accept symbolic \"selection\nfilter\" functions, which will be covered shortly.\n\nThe example below selects \"cut\", \"price\", \"x\", and \"y\" from the diamonds dataset.\n\n```python\ndiamonds \u003e\u003e select(1, X.price, ['x', 'y']) \u003e\u003e head(2)\n\n       cut  price     x     y\n0    Ideal    326  3.95  3.98\n1  Premium    326  3.89  3.84\n```\n\nIf you were instead to use `drop`, you would get back all columns besides those specified.\n\n```python\ndiamonds \u003e\u003e drop(1, X.price, ['x', 'y']) \u003e\u003e head(2)\n\n   carat color clarity  depth  table     z\n0   0.23     E     SI2   61.5   55.0  2.43\n1   0.21     E     SI1   59.8   61.0  2.31\n```\n\n\n#### Selection using the inversion `~` operator on symbolic columns\n\nOne particularly nice thing about `dplyr`'s selection functions is that you can\ndrop columns inside of a select statement by putting a subtraction sign in front,\nlike so: `... %\u003e% select(-col)`. The same can be done in `dfply`, but instead of\nthe subtraction operator you use the tilde `~`.\n\nFor example, let's say I wanted to select any column _except_ carat, color, and\nclarity in my dataframe. One way to do this is to specify those for removal using\nthe `~` operator like so:\n\n\n```python\ndiamonds \u003e\u003e select(~X.carat, ~X.color, ~X.clarity) \u003e\u003e head(2)\n\n       cut  depth  table  price     x     y     z\n0    Ideal   61.5   55.0    326  3.95  3.98  2.43\n1  Premium   59.8   61.0    326  3.89  3.84  2.31\n```\n\nNote that if you are going to use the inversion operator, you _must_ place it\nprior to the symbolic `X` (or a symbolic such as a selection filter function, covered\nnext). 
For example, using the inversion operator on a list of columns will\nresult in an error:\n\n```python\ndiamonds \u003e\u003e select(~[X.carat, X.color, X.clarity]) \u003e\u003e head(2)\n\nTypeError: bad operand type for unary ~: 'list'\n```\n\n\n#### Selection filter functions\n\nThe vanilla `select` and `drop` functions are useful, but there are a variety of\nselection functions inspired by `dplyr` available to make selecting and dropping\ncolumns a breeze. These functions are intended to be put inside of the `select` and\n`drop` functions, and can be paired with the `~` inverter.\n\nFirst, a quick rundown of the available functions:\n- `starts_with(prefix)`: find columns that start with a string prefix.\n- `ends_with(suffix)`: find columns that end with a string suffix.\n- `contains(substr)`: find columns that contain a substring in their name.\n- `everything()`: all columns.\n- `columns_between(start_col, end_col, inclusive=True)`: find columns between a specified start and end column.\nThe `inclusive` boolean keyword argument indicates whether the end column should be included or not.\n- `columns_to(end_col, inclusive=True)`: get columns up to a specified end column. The `inclusive`\nargument indicates whether the ending column should be included or not.\n- `columns_from(start_col)`: get the columns starting at a specified column.\n\nThe selection filter functions are best explained by example. Let's say I wanted to\nselect only the columns that started with a \"c\":\n\n```python\ndiamonds \u003e\u003e select(starts_with('c')) \u003e\u003e head(2)\n\n   carat      cut color clarity\n0   0.23    Ideal     E     SI2\n1   0.21  Premium     E     SI1\n```\n\nThe selection filter functions are instances of the class `Intention`, just like the\n`X` placeholder, and so I can also use the inversion operator with them. 
For example,\nI can alternatively select the columns that do not start with \"c\":\n\n```python\ndiamonds \u003e\u003e select(~starts_with('c')) \u003e\u003e head(2)\n\n   depth  table  price     x     y     z\n0   61.5   55.0    326  3.95  3.98  2.43\n1   59.8   61.0    326  3.89  3.84  2.31\n```\n\nThey work the same inside the `drop` function, but with the intention of removal.\nI could, for example, use the `columns_from` selection filter to drop all columns\nfrom \"price\" onwards:\n\n```python\ndiamonds \u003e\u003e drop(columns_from(X.price)) \u003e\u003e head(2)\n\n   carat      cut color clarity  depth  table\n0   0.23    Ideal     E     SI2   61.5   55.0\n1   0.21  Premium     E     SI1   59.8   61.0\n```\n\nAs the example above shows, you can use symbolic column names inside of the\nselection filter function! You can also mix together selection filters and standard\nselections inside of the same `select` or `drop` command.\n\nFor my next trick, I will select the first two columns, the last two columns, and\nthe \"depth\" column using a mixture of selection techniques:\n\n```python\ndiamonds \u003e\u003e select(columns_to(1, inclusive=True), 'depth', columns_from(-2)) \u003e\u003e head(2)\n\n   carat      cut  depth     y     z\n0   0.23    Ideal   61.5  3.98  2.43\n1   0.21  Premium   59.8  3.84  2.31\n```\n\n\n### Subsetting and filtering\n\n#### `row_slice()`\n\nSlices of rows can be selected with the `row_slice()` function. You can pass\nsingle integer indices or a list of indices to select rows. 
This is\ngoing to be the same as using pandas' `.iloc`.\n\n```python\ndiamonds \u003e\u003e row_slice([10,15])\n\n    carat      cut color clarity  depth  table  price     x     y     z\n10   0.30     Good     J     SI1   64.0   55.0    339  4.25  4.28  2.73\n15   0.32  Premium     E      I1   60.9   58.0    345  4.38  4.42  2.68\n```\n\nNote that this can also be used with the `group_by` function, and will operate\nlike a call to `.iloc` on each group. The `group_by` pipe function is\ncovered later, but it essentially works the same as pandas `.groupby` (with a\nfew subtle differences).\n\n```python\ndiamonds \u003e\u003e group_by('cut') \u003e\u003e row_slice(5)\n\n     carat        cut color clarity  depth  table  price     x     y     z\n128   0.91       Fair     H     SI2   64.4   57.0   2763  6.11  6.09  3.93\n20    0.30       Good     I     SI2   63.3   56.0    351  4.26  4.30  2.71\n40    0.33      Ideal     I     SI2   61.2   56.0    403  4.49  4.50  2.75\n26    0.24    Premium     I     VS1   62.5   57.0    355  3.97  3.94  2.47\n21    0.23  Very Good     E     VS2   63.8   55.0    352  3.85  3.92  2.48\n```\n\n\n\n#### `sample()`\n\nThe `sample()` function works exactly the same as pandas' `.sample()` method\nfor DataFrames. 
Arguments and keyword arguments will be passed through to the\nDataFrame sample method.\n\n```python\ndiamonds \u003e\u003e sample(frac=0.0001, replace=False)\n\n       carat        cut color clarity  depth  table  price     x     y     z\n19736   1.02      Ideal     E     VS1   62.2   54.0   8303  6.43  6.46  4.01\n37159   0.32    Premium     D     VS2   60.3   60.0    972  4.44  4.42  2.67\n1699    0.72  Very Good     E     VS2   63.8   57.0   3035  5.66  5.69  3.62\n20955   1.71  Very Good     J     VS2   62.6   55.0   9170  7.58  7.65  4.77\n5168    0.91  Very Good     E     SI2   63.0   56.0   3772  6.12  6.16  3.87\n\n\ndiamonds \u003e\u003e sample(n=3, replace=True)\n\n       carat        cut color clarity  depth  table  price     x     y     z\n52892   0.73  Very Good     G     SI1   60.6   59.0   2585  5.83  5.85  3.54\n39454   0.57      Ideal     H     SI2   62.3   56.0   1077  5.31  5.28  3.30\n39751   0.43      Ideal     H    VVS1   62.3   54.0   1094  4.84  4.85  3.02\n```\n\n#### `distinct()`\n\nSelection of unique rows is done with `distinct()`, which similarly passes\narguments and keyword arguments through to the DataFrame's `.drop_duplicates()`\nmethod.\n\n```python\ndiamonds \u003e\u003e distinct(X.color)\n\n    carat        cut color clarity  depth  table  price     x     y     z\n0    0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43\n3    0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63\n4    0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75\n7    0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53\n12   0.22    Premium     F     SI1   60.4   61.0    342  3.88  3.84  2.33\n25   0.23  Very Good     G    VVS2   60.4   58.0    354  3.97  4.01  2.41\n28   0.23  Very Good     D     VS2   60.5   61.0    357  3.96  3.97  2.40\n```\n\n\n#### `mask()`\n\nFiltering rows with logical criteria is done with `mask()`, which accepts\nboolean arrays \"masking out\" False labeled rows 
and keeping True labeled rows.\nThese are best created with logical statements on symbolic Series objects as\nshown below. Multiple criteria can be supplied as arguments and their intersection\nwill be used as the mask.\n\n```python\ndiamonds \u003e\u003e mask(X.cut == 'Ideal') \u003e\u003e head(4)\n\n    carat    cut color clarity  depth  table  price     x     y     z\n0    0.23  Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43\n11   0.23  Ideal     J     VS1   62.8   56.0    340  3.93  3.90  2.46\n13   0.31  Ideal     J     SI2   62.2   54.0    344  4.35  4.37  2.71\n16   0.30  Ideal     I     SI2   62.0   54.0    348  4.31  4.34  2.68\n\ndiamonds \u003e\u003e mask(X.cut == 'Ideal', X.color == 'E', X.table \u003c 55, X.price \u003c 500)\n\n       carat    cut color clarity  depth  table  price     x     y     z\n26683   0.33  Ideal     E     SI2   62.2   54.0    427  4.44  4.46  2.77\n32297   0.34  Ideal     E     SI2   62.4   54.0    454  4.49  4.52  2.81\n40928   0.30  Ideal     E     SI1   61.6   54.0    499  4.32  4.35  2.67\n50623   0.30  Ideal     E     SI2   62.1   54.0    401  4.32  4.35  2.69\n50625   0.30  Ideal     E     SI2   62.0   54.0    401  4.33  4.35  2.69\n```\n\nAlternatively, `mask()` can also be called using the alias `filter_by()`:\n\n```python\ndiamonds \u003e\u003e filter_by(X.cut == 'Ideal', X.color == 'E', X.table \u003c 55, X.price \u003c 500)\n\n       carat    cut color clarity  depth  table  price     x     y     z\n26683   0.33  Ideal     E     SI2   62.2   54.0    427  4.44  4.46  2.77\n32297   0.34  Ideal     E     SI2   62.4   54.0    454  4.49  4.52  2.81\n40928   0.30  Ideal     E     SI1   61.6   54.0    499  4.32  4.35  2.67\n50623   0.30  Ideal     E     SI2   62.1   54.0    401  4.32  4.35  2.69\n50625   0.30  Ideal     E     SI2   62.0   54.0    401  4.33  4.35  2.69\n```\n\n#### `pull()`\n\n`pull` simply retrieves a column and returns it as a pandas series, in case you only care about one particular column at 
the end of your pipeline. Columns can be specified either by their name (string) or an integer.\n\nThe default returns the last column (on the assumption that's the column you've created most recently).\n\nExample:\n\n```python\n(diamonds\n \u003e\u003e filter_by(X.cut == 'Ideal', X.color == 'E', X.table \u003c 55, X.price \u003c 500)\n \u003e\u003e pull('carat'))\n\n 26683    0.33\n 32297    0.34\n 40928    0.30\n 50623    0.30\n 50625    0.30\n Name: carat, dtype: float64\n```\n\n\n### DataFrame transformation\n\n#### `mutate()`\n\nNew variables can be created with the `mutate()` function (named that way to match\n`dplyr`).\n\n```python\ndiamonds \u003e\u003e mutate(x_plus_y=X.x + X.y) \u003e\u003e select(columns_from('x')) \u003e\u003e head(3)\n\n      x     y     z  x_plus_y\n0  3.95  3.98  2.43      7.93\n1  3.89  3.84  2.31      7.73\n2  4.05  4.07  2.31      8.12\n```\n\nMultiple variables can be created in a single call.\n\n```python\ndiamonds \u003e\u003e mutate(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) \u003e\u003e select(columns_from('x')) \u003e\u003e head(3)\n\n      x     y     z  x_plus_y   y_div_z\n0  3.95  3.98  2.43      7.93  1.637860\n1  3.89  3.84  2.31      7.73  1.662338\n2  4.05  4.07  2.31      8.12  1.761905\n```\n\n\u003e Note: In Python the new variables created with mutate may not be guaranteed\nto be created in the same order that they are input into the function call, though\nthis may have been changed in Python 3...\n\n\n#### `transmute()`\n\nThe `transmute()` function is a combination of a mutate and a selection of the\ncreated variables.\n\n```python\ndiamonds \u003e\u003e transmute(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) \u003e\u003e head(3)\n\n   x_plus_y   y_div_z\n0      7.93  1.637860\n1      7.73  1.662338\n2      8.12  1.761905\n```\n\n\n### Grouping\n\n#### `group_by()` and `ungroup()`\n\nDataFrames are grouped along variables using the `group_by()` function and\nungrouped with the `ungroup()` function. 
Functions chained after grouping a\nDataFrame are applied by group until returning or ungrouping. Hierarchical/multiindexing\nis automatically removed.\n\n\u003e Note: In the example below, the `lead()` and `lag()` functions are dfply convenience\nwrappers around the pandas `.shift()` Series method.\n\n\n```python\n(diamonds \u003e\u003e group_by(X.cut) \u003e\u003e\n mutate(price_lead=lead(X.price), price_lag=lag(X.price)) \u003e\u003e\n head(2) \u003e\u003e select(X.cut, X.price, X.price_lead, X.price_lag))\n\n          cut  price  price_lead  price_lag\n8        Fair    337      2757.0        NaN\n91       Fair   2757      2759.0      337.0\n2        Good    327       335.0        NaN\n4        Good    335       339.0      327.0\n0       Ideal    326       340.0        NaN\n11      Ideal    340       344.0      326.0\n1     Premium    326       334.0        NaN\n3     Premium    334       342.0      326.0\n5   Very Good    336       336.0        NaN\n6   Very Good    336       337.0      336.0\n```\n\n\n### Reshaping\n\n#### `arrange()`\n\nSorting is done by the `arrange()` function, which wraps around the pandas\n`.sort_values()` DataFrame method. 
Arguments and keyword arguments are passed\nthrough to that function.\n\n```python\ndiamonds \u003e\u003e arrange(X.table, ascending=False) \u003e\u003e head(5)\n\n       carat   cut color clarity  depth  table  price     x     y     z\n24932   2.01  Fair     F     SI1   58.6   95.0  13387  8.32  8.31  4.87\n50773   0.81  Fair     F     SI2   68.8   79.0   2301  5.26  5.20  3.58\n51342   0.79  Fair     G     SI1   65.3   76.0   2362  5.52  5.13  3.35\n52860   0.50  Fair     E     VS2   79.0   73.0   2579  5.21  5.18  4.09\n49375   0.70  Fair     H     VS1   62.0   73.0   2100  5.65  5.54  3.47\n\n\n(diamonds \u003e\u003e group_by(X.cut) \u003e\u003e arrange(X.price) \u003e\u003e\n head(3) \u003e\u003e ungroup() \u003e\u003e mask(X.carat \u003c 0.23))\n\n    carat      cut color clarity  depth  table  price     x     y     z\n8    0.22     Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49\n1    0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31\n12   0.22  Premium     F     SI1   60.4   61.0    342  3.88  3.84  2.33\n```\n\n#### `rename()`\n\nThe `rename()` function will rename columns provided as values to what you set\nas the keys in the keyword arguments. 
You can indicate columns with symbols or\nwith their labels.\n\n```python\ndiamonds \u003e\u003e rename(CUT=X.cut, COLOR='color') \u003e\u003e head(2)\n\n   carat      CUT COLOR clarity  depth  table  price     x     y     z\n0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43\n1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31\n```\n\n#### `gather()`\n\nTransforming between \"wide\" and \"long\" format is a common pattern in data munging.\nThe `gather(key, value, *columns)` function melts the specified columns in your\nDataFrame into two key:value columns.\n\n```python\ndiamonds \u003e\u003e gather('variable', 'value', ['price', 'depth','x','y','z']) \u003e\u003e head(5)\n\n   carat      cut color clarity  table variable  value\n0   0.23    Ideal     E     SI2   55.0    price  326.0\n1   0.21  Premium     E     SI1   61.0    price  326.0\n2   0.23     Good     E     VS1   65.0    price  327.0\n3   0.29  Premium     I     VS2   58.0    price  334.0\n4   0.31     Good     J     SI2   58.0    price  335.0\n```\n\nWithout any columns specified, your entire DataFrame will be transformed into\ntwo key:value pair columns.\n\n```python\ndiamonds \u003e\u003e gather('variable', 'value') \u003e\u003e head(5)\n\n  variable value\n0    carat  0.23\n1    carat  0.21\n2    carat  0.23\n3    carat  0.29\n4    carat  0.31\n```\n\nIf the `add_id` keyword argument is set to true, an id column is added to the\nnew elongated DataFrame that acts as a row id from the original wide DataFrame.\n\n```python\nelongated = diamonds \u003e\u003e gather('variable', 'value', add_id=True)\nelongated \u003e\u003e head(5)\n\n   _ID variable value\n0    0    carat  0.23\n1    1    carat  0.21\n2    2    carat  0.23\n3    3    carat  0.29\n4    4    carat  0.31\n```\n\n#### `spread()`\n\nLikewise, you can transform a \"long\" DataFrame into a \"wide\" format with the\n`spread(key, values)` function. 
Converting the previously created elongated\nDataFrame for example would be done like so.\n\n```python\nwidened = elongated \u003e\u003e spread(X.variable, X.value)\nwidened \u003e\u003e head(5)\n\n    _ID carat clarity color        cut depth price table     x     y     z\n0     0  0.23     SI2     E      Ideal  61.5   326    55  3.95  3.98  2.43\n1     1  0.21     SI1     E    Premium  59.8   326    61  3.89  3.84  2.31\n2    10   0.3     SI1     J       Good    64   339    55  4.25  4.28  2.73\n3   100  0.75     SI1     D  Very Good  63.2  2760    56   5.8  5.75  3.65\n4  1000  0.75     SI1     D      Ideal  62.3  2898    55  5.83   5.8  3.62\n```\n\nIn this case the `_ID` column comes in handy since it is necessary to not have\nany duplicated identifiers.\n\nIf you have a mixed datatype column in your long-format DataFrame then the\ndefault behavior is for the spread columns to be of type object.\n\n```python\nwidened.dtypes\n\n_ID         int64\ncarat      object\nclarity    object\ncolor      object\ncut        object\ndepth      object\nprice      object\ntable      object\nx          object\ny          object\nz          object\ndtype: object\n```\n\nIf you want to try to convert dtypes when spreading, you can set the `convert`\nkeyword argument in spread to True like so.\n\n```python\nwidened = elongated \u003e\u003e spread(X.variable, X.value, convert=True)\nwidened.dtypes\n\n_ID          int64\ncarat      float64\nclarity     object\ncolor       object\ncut         object\ndepth      float64\nprice        int64\ntable      float64\nx          float64\ny          float64\nz          float64\ndtype: object\n```\n\n#### `separate()`\n\nColumns can be split into multiple columns with the\n`separate(column, into, sep=\"[\\W_]+\", remove=True, convert=False,\nextra='drop', fill='right')` function. 
`separate()` takes a variety of arguments:\n\n- `column`: the column to split.\n- `into`: the names of the new columns.\n- `sep`: either a regex string or integer positions to split the column on.\n- `remove`: boolean indicating whether to remove the original column.\n- `convert`: boolean indicating whether the new columns should be converted to\nthe appropriate type (same as in `spread` above).\n- `extra`: either `drop`, where split pieces beyond the specified new columns\nare dropped, or `merge`, where the final split piece contains the remainder of\nthe original column.\n- `fill`: either `right`, where `np.nan` values are filled in the right-most\ncolumns for missing pieces, or `left` where `np.nan` values are filled in the\nleft-most columns.\n\n```python\nprint(d)\n\n         a\n0    1-a-3\n1      1-b\n2  1-c-3-4\n3    9-d-1\n4       10\n\nd \u003e\u003e separate(X.a, ['col1', 'col2'], remove=True, convert=True,\n              extra='drop', fill='right')\n\n   col1 col2\n0     1    a\n1     1    b\n2     1    c\n3     9    d\n4    10  NaN\n\nd \u003e\u003e separate(X.a, ['col1', 'col2'], remove=True, convert=True,\n              extra='drop', fill='left')\n\n   col1 col2\n0   1.0    a\n1   1.0    b\n2   1.0    c\n3   9.0    d\n4   NaN   10\n\nd \u003e\u003e separate(X.a, ['col1', 'col2'], remove=False, convert=True,\n              extra='merge', fill='right')\n\n         a  col1   col2\n0    1-a-3     1    a-3\n1      1-b     1      b\n2  1-c-3-4     1  c-3-4\n3    9-d-1     9    d-1\n4       10    10    NaN\n\nd \u003e\u003e separate(X.a, ['col1', 'col2', 'col3'], sep=[2,4], remove=True, convert=True,\n              extra='merge', fill='right')\n\n  col1 col2 col3\n0   1-   a-    3\n1   1-    b  NaN\n2   1-   c-  3-4\n3   9-   d-    1\n4   10  NaN  NaN\n```\n\n#### `unite()`\n\nThe `unite(colname, *args, sep='_', remove=True, na_action='maintain')` function\ndoes the inverse of `separate()`, joining columns together by a separator. 
Any\ncolumns that are not strings will be converted to strings. The arguments for\n`unite()` are:\n\n- `colname`: the name of the new joined column.\n- `*args`: list of columns to be joined, which can be strings, symbolic, or\ninteger positions.\n- `sep`: the string separator to join the columns with.\n- `remove`: boolean indicating whether or not to remove the original columns.\n- `na_action`: can be one of `\"maintain\"` (the default), `\"ignore\"`, or\n`\"as_string\"`. The default `\"maintain\"` will make the new column row a `NaN` value\nif any of the original column cells at that row contained `NaN`. `\"ignore\"` will\ntreat any `NaN` value as an empty string during joining. `\"as_string\"` will convert\nany `NaN` value to the string `\"nan\"` prior to joining.\n\n```python\n\nprint(d)\n\n   a  b      c\n0  1  a   True\n1  2  b  False\n2  3  c    NaN\n\nd \u003e\u003e unite('united', X.a, 'b', 2, remove=False, na_action='maintain')\n\n   a  b      c     united\n0  1  a   True   1_a_True\n1  2  b  False  2_b_False\n2  3  c    NaN        NaN\n\nd \u003e\u003e unite('united', ['a','b','c'], remove=True, na_action='ignore', sep='*')\n\n      united\n0   1*a*True\n1  2*b*False\n2        3*c\n\nd \u003e\u003e unite('united', d.columns, remove=True, na_action='as_string')\n\n      united\n0   1_a_True\n1  2_b_False\n2    3_c_nan\n```\n\n\n### Joining\n\nCurrently implemented joins are:\n\n- 
`inner_join(other, by='column')`\n- `outer_join(other, by='column')` (which works the same as `full_join()`)\n- `right_join(other, by='column')`\n- `left_join(other, by='column')`\n- `semi_join(other, by='column')`\n- `anti_join(other, by='column')`\n\nThe behavior of the join functions is illustrated with the toy example\nDataFrames below.\n\n```python\na = pd.DataFrame({\n        'x1':['A','B','C'],\n        'x2':[1,2,3]\n    })\nb = pd.DataFrame({\n    'x1':['A','B','D'],\n    'x3':[True,False,True]\n})\n```\n\n#### `inner_join()`\n\n`inner_join()` joins on values present in both DataFrames' `by` columns.\n\n```python\na \u003e\u003e inner_join(b, by='x1')\n\n  x1  x2     x3\n0  A   1   True\n1  B   2  False\n```\n\n#### `outer_join()` or `full_join()`\n\n`outer_join()` merges DataFrames together on values present in either frame's\n`by` columns.\n\n```python\na \u003e\u003e outer_join(b, by='x1')\n\n  x1   x2     x3\n0  A  1.0   True\n1  B  2.0  False\n2  C  3.0    NaN\n3  D  NaN   True\n```\n\n#### `left_join()`\n\n`left_join()` merges on the values present in the left DataFrame's `by` columns.\n\n```python\na \u003e\u003e left_join(b, by='x1')\n\n  x1  x2     x3\n0  A   1   True\n1  B   2  False\n2  C   3    NaN\n```\n\n#### `right_join()`\n\n`right_join()` merges on the values present in the right DataFrame's `by` columns.\n\n```python\na \u003e\u003e right_join(b, by='x1')\n\n  x1   x2     x3\n0  A  1.0   True\n1  B  2.0  False\n2  D  NaN   True\n```\n\n#### `semi_join()`\n\n`semi_join()` returns all of the rows in the left DataFrame that have a match\nin the right DataFrame in the `by` columns.\n\n```python\na \u003e\u003e semi_join(b, by='x1')\n\n  x1  x2\n0  A   1\n1  B   2\n```\n\n#### `anti_join()`\n\n`anti_join()` returns all of the rows in the left DataFrame that do not have a\nmatch in the right DataFrame within the `by` columns.\n\n```python\na \u003e\u003e anti_join(b, by='x1')\n\n  x1  x2\n2  C   3\n```\n\n\n### Set operations\n\nThe set operation 
functions filter a DataFrame based on row comparisons with\nanother DataFrame.\n\nEach of the set operation functions `union()`, `intersect()`, and `set_diff()`\ntakes the same arguments:\n\n- `other`: the DataFrame to compare to\n- `index`: a boolean (default `False`) indicating whether to consider the pandas\nindex during comparison.\n- `keep`: string (default `\"first\"`) to be passed through to `.drop_duplicates()`\ncontrolling how to handle duplicate rows.\n\nWith set operations, columns are expected to be in the same order in both\nDataFrames.\n\nThe function examples use the following two toy DataFrames.\n\n```python\na = pd.DataFrame({\n        'x1':['A','B','C'],\n        'x2':[1,2,3]\n    })\nc = pd.DataFrame({\n      'x1':['B','C','D'],\n      'x2':[2,3,4]\n})\n```\n\n#### `union()`\n\nThe `union()` function returns rows that appear in either DataFrame.\n\n```python\na \u003e\u003e union(c)\n\n  x1  x2\n0  A   1\n1  B   2\n2  C   3\n2  D   4\n```\n\n#### `intersect()`\n\n`intersect()` returns rows that appear in both DataFrames.\n\n```python\na \u003e\u003e intersect(c)\n\n  x1  x2\n0  B   2\n1  C   3\n```\n\n\n#### `set_diff()`\n\n`set_diff()` returns the rows in the left DataFrame that do not appear in the\nright DataFrame.\n\n```python\na \u003e\u003e set_diff(c)\n\n  x1  x2\n0  A   1\n```\n\n\n### Binding\n\n`dfply` comes with convenience wrappers around `pandas.concat()` for joining\nDataFrames by rows or by columns.\n\nThe toy DataFrames below (`a` and `b`) are the same as the ones used to display\nthe join functions above.\n\n#### `bind_rows()`\n\nThe `bind_rows(other, join='outer', ignore_index=False)` function is a direct\ncall to `pandas.concat([df, other], join=join, ignore_index=ignore_index, axis=0)`,\njoining two DataFrames \"vertically\".\n\n```python\na \u003e\u003e bind_rows(b, join='inner')\n\n  x1\n0  A\n1  B\n2  C\n0  A\n1  B\n2  D\n\na \u003e\u003e bind_rows(b, join='outer')\n\n  x1   x2     x3\n0  A  1.0    NaN\n1  B  2.0    NaN\n2  C  
3.0    NaN\n0  A  NaN   True\n1  B  NaN  False\n2  D  NaN   True\n```\n\nNote that `bind_rows()` does not reset the index for you!\n\n#### `bind_cols()`\n\nThe `bind_cols(other, join='outer', ignore_index=False)` function is likewise just a\ncall to `pandas.concat([df, other], join=join, ignore_index=ignore_index, axis=1)`,\njoining DataFrames \"horizontally\".\n\n```python\na \u003e\u003e bind_cols(b)\n\n  x1  x2 x1     x3\n0  A   1  A   True\n1  B   2  B  False\n2  C   3  D   True\n```\n\nNote that you may well end up with duplicate column labels after binding columns,\nas can be seen above.\n\n\n### Summarization\n\nThere are two summarization functions in `dfply` that match `dplyr`: `summarize` and\n`summarize_each` (though these functions use the 'z' spelling rather than 's').\n\n#### `summarize()`\n\n`summarize(**kwargs)` takes an arbitrary number of keyword arguments. Each key\nbecomes the label of a new column, and each value is a summary function of columns\nin the original DataFrame.\n\n```python\ndiamonds \u003e\u003e summarize(price_mean=X.price.mean(), price_std=X.price.std())\n\n    price_mean    price_std\n0  3932.799722  3989.439738\n```\n\n`summarize()` can of course be used with groupings as well.\n\n```python\ndiamonds \u003e\u003e group_by('cut') \u003e\u003e summarize(price_mean=X.price.mean(), price_std=X.price.std())\n\n         cut   price_mean    price_std\n0       Fair  4358.757764  3560.386612\n1       Good  3928.864452  3681.589584\n2      Ideal  3457.541970  3808.401172\n3    Premium  4584.257704  4349.204961\n4  Very Good  3981.759891  3935.862161\n```\n\n#### `summarize_each()`\n\nThe `summarize_each(function_list, *columns)` function is a more general summarization\nfunction. It takes a list of summary functions to apply as its first argument and\nthen a list of columns to apply the summary functions to. 
Columns can be specified\nwith either symbolic, string label, or integer position like in the selection\nfunctions for convenience.\n\n```python\ndiamonds \u003e\u003e summarize_each([np.mean, np.var], X.price, 'depth')\n\n    price_mean     price_var  depth_mean  depth_var\n0  3932.799722  1.591533e+07   61.749405   2.052366\n```\n\n`summarize_each()` works with groupings as well.\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize_each([np.mean, np.var], X.price, 4)\n\n         cut   price_mean     price_var  depth_mean  depth_var\n0       Fair  4358.757764  1.266848e+07   64.041677  13.266319\n1       Good  3928.864452  1.355134e+07   62.365879   4.705224\n2      Ideal  3457.541970  1.450325e+07   61.709401   0.516274\n3    Premium  4584.257704  1.891421e+07   61.264673   1.342755\n4  Very Good  3981.759891  1.548973e+07   61.818275   1.900466\n```\n\n\n## Embedded column functions\n\n**UNDER CONSTRUCTION: documentation not complete.**\n\nLike `dplyr`, the `dfply` package provides functions to perform various operations\non pandas Series. These are typically window functions and summarization\nfunctions, and wrap symbolic arguments in function calls.\n\n\n### Window functions\n\nWindow functions perform operations on vectors of values that return a vector\nof the same length.\n\n#### `lead()` and `lag()`\n\nThe `lead(series, n)` function pushes values in a vector upward, adding `NaN`\nvalues in the end positions. Likewise, the `lag(series, n)` function\npushes values downward, inserting `NaN` values in the initial positions. 
Both\ncall the pandas `Series.shift()` method under the hood.\n\n```python\n(diamonds \u003e\u003e mutate(price_lead=lead(X.price, 2), price_lag=lag(X.price, 2)) \u003e\u003e\n            select(X.price, -2, -1) \u003e\u003e\n            head(6))\n\n    price  price_lag  price_lead\n 0    326        NaN       327.0\n 1    326        NaN       334.0\n 2    327      326.0       335.0\n 3    334      326.0       336.0\n 4    335      327.0       336.0\n 5    336      334.0       337.0\n```\n\n#### `between()`\n\nThe `between(series, a, b, inclusive=False)` function checks to see if values are\nbetween two given bookend values.\n\n```python\ndiamonds \u003e\u003e select(X.price) \u003e\u003e mutate(price_btwn=between(X.price, 330, 340)) \u003e\u003e head(6)\n\n   price price_btwn\n0    326      False\n1    326      False\n2    327      False\n3    334       True\n4    335       True\n5    336       True\n```\n\n#### `dense_rank()`\n\nThe `dense_rank(series, ascending=True)` function is a wrapper around the `scipy`\nfunction for calculating dense rank.\n\n```python\ndiamonds \u003e\u003e select(X.price) \u003e\u003e mutate(price_drank=dense_rank(X.price)) \u003e\u003e head(6)\n\n   price  price_drank\n0    326          1.0\n1    326          1.0\n2    327          2.0\n3    334          3.0\n4    335          4.0\n5    336          5.0\n```\n\n#### `min_rank()`\n\nLikewise, `min_rank(series, ascending=True)` is a wrapper around the `scipy` ranking\nfunction with min rank specified.\n\n```python\ndiamonds \u003e\u003e select(X.price) \u003e\u003e mutate(price_mrank=min_rank(X.price)) \u003e\u003e head(6)\n\n   price  price_mrank\n0    326          1.0\n1    326          1.0\n2    327          3.0\n3    334          4.0\n4    335          5.0\n5    336          6.0\n```\n\n#### `cumsum()`\n\nThe `cumsum(series)` function calculates a cumulative sum of a column.\n\n```python\ndiamonds \u003e\u003e select(X.price) \u003e\u003e mutate(price_cumsum=cumsum(X.price)) 
\u003e\u003e head(6)\n\n   price  price_cumsum\n0    326           326\n1    326           652\n2    327           979\n3    334          1313\n4    335          1648\n5    336          1984\n```\n\n#### `cummean()`\n\nThe `cummean(series)` function calculates a cumulative mean of a column.\n\n```python\ndiamonds \u003e\u003e select(X.price) \u003e\u003e mutate(price_cummean=cummean(X.price)) \u003e\u003e head(6)\n\n   price  price_cummean\n0    326     326.000000\n1    326     326.000000\n2    327     326.333333\n3    334     328.250000\n4    335     329.600000\n5    336     330.666667\n```\n\n#### `cummax()`\n\nThe `cummax(series)` function calculates a cumulative maximum of a column.\n\n```python\ndiamonds \u003e\u003e select(X.price) \u003e\u003e mutate(price_cummax=cummax(X.price)) \u003e\u003e head(6)\n\n   price  price_cummax\n0    326         326.0\n1    326         326.0\n2    327         327.0\n3    334         334.0\n4    335         335.0\n5    336         336.0\n```\n\n#### `cummin()`\n\nThe `cummin(series)` function calculates a cumulative minimum of a column.\n\n```python\ndiamonds \u003e\u003e select(X.price) \u003e\u003e mutate(price_cummin=cummin(X.price)) \u003e\u003e head(6)\n\n   price  price_cummin\n0    326         326.0\n1    326         326.0\n2    327         326.0\n3    334         326.0\n4    335         326.0\n5    336         326.0\n```\n\n#### `cumprod()`\n\nThe `cumprod(series)` function calculates a cumulative product of a column.\n\n```python\ndiamonds \u003e\u003e select(X.price) \u003e\u003e mutate(price_cumprod=cumprod(X.price)) \u003e\u003e head(6)\n\n   price     price_cumprod\n0    326               326\n1    326            106276\n2    327          34752252\n3    334       11607252168\n4    335     3888429476280\n5    336  1306512304030080\n```\n\n\n### Summary functions\n\n#### `mean()`\n\n`mean(series)`\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_mean=mean(X.price))\n\n         cut   price_mean\n0       Fair  4358.757764\n1       Good  3928.864452\n2      Ideal  3457.541970\n3    Premium  4584.257704\n4  Very Good  3981.759891\n```\n\n#### `first()`\n\n`first(series, order_by=None)`\n\n```python\ndiamonds 
\u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_first=first(X.price))\n\n         cut  price_first\n0       Fair          337\n1       Good          327\n2      Ideal          326\n3    Premium          326\n4  Very Good          336\n```\n\n#### `last()`\n\n`last(series, order_by=None)`\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_last=last(X.price))\n\n         cut  price_last\n0       Fair        2747\n1       Good        2757\n2      Ideal        2757\n3    Premium        2757\n4  Very Good        2757\n```\n\n#### `nth()`\n\n`nth(series, n, order_by=None)`\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_penultimate=nth(X.price, -2))\n\n         cut  price_penultimate\n0       Fair               2745\n1       Good               2756\n2      Ideal               2757\n3    Premium               2757\n4  Very Good               2757\n```\n\n#### `n()`\n\n`n(series)`\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_n=n(X.price))\n\n         cut  price_n\n0       Fair     1610\n1       Good     4906\n2      Ideal    21551\n3    Premium    13791\n4  Very Good    12082\n```\n\n#### `n_distinct()`\n\n`n_distinct(series)`\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_ndistinct=n_distinct(X.price))\n\n         cut  price_ndistinct\n0       Fair             1267\n1       Good             3086\n2      Ideal             7281\n3    Premium             6014\n4  Very Good             5840\n```\n\n#### `IQR()`\n\n`IQR(series)`\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_iqr=IQR(X.price))\n\n         cut  price_iqr\n0       Fair    3155.25\n1       Good    3883.00\n2      Ideal    3800.50\n3    Premium    5250.00\n4  Very Good    4460.75\n```\n\n#### `colmin()`\n\n`colmin(series)`\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_min=colmin(X.price))\n\n         cut  
price_min\n0       Fair        337\n1       Good        327\n2      Ideal        326\n3    Premium        326\n4  Very Good        336\n```\n\n#### `colmax()`\n\n`colmax(series)`\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_max=colmax(X.price))\n\n         cut  price_max\n0       Fair      18574\n1       Good      18788\n2      Ideal      18806\n3    Premium      18823\n4  Very Good      18818\n```\n\n#### `median()`\n\n`median(series)`\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_median=median(X.price))\n\n         cut  price_median\n0       Fair        3282.0\n1       Good        3050.5\n2      Ideal        1810.0\n3    Premium        3185.0\n4  Very Good        2648.0\n```\n\n#### `var()`\n\n`var(series)`\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_var=var(X.price))\n\n         cut     price_var\n0       Fair  1.267635e+07\n1       Good  1.355410e+07\n2      Ideal  1.450392e+07\n3    Premium  1.891558e+07\n4  Very Good  1.549101e+07\n```\n\n#### `sd()`\n\n`sd(series)`\n\n```python\ndiamonds \u003e\u003e group_by(X.cut) \u003e\u003e summarize(price_sd=sd(X.price))\n\n         cut     price_sd\n0       Fair  3560.386612\n1       Good  3681.589584\n2      Ideal  3808.401172\n3    Premium  4349.204961\n4  Very Good  3935.862161\n```\n\n\n## Extending `dfply` with custom functions\n\nThere are a lot of built-in functions, but you are almost certainly going to\nreach a point where you want to use some of your own functions with the `dfply`\npiping syntax. 
Luckily, `dfply` comes with two handy decorators to make this\nas easy as possible.\n\n\u003e **For a more detailed walkthrough of these two cases, see the [\nbasics-extending-functionality.ipynb](./examples/basics-extending-functionality.ipynb)\njupyter notebook in the examples folder.**\n\n\n### Case 1: A custom \"pipe\" function with `@dfpipe`\n\nYou might want to make a custom function that can be a piece of the pipe chain.\nFor example, say we wanted to write a `dfply` wrapper around a simplified\nversion of `pd.crosstab`. For the most part, you'll only need to do two things\nto make this work:\n1. Make sure that your function's first argument will be the dataframe passed in\nimplicitly by the pipe.\n2. Decorate the function with the `@dfpipe` decorator.\n\nHere is an example of the `dfply`-enabled crosstab function:\n\n```python\n@dfpipe\ndef crosstab(df, index, columns):\n    return pd.crosstab(index, columns)\n```\n\nNormally you could use `pd.crosstab` like so:\n\n```python\npd.crosstab(diamonds.cut, diamonds.color)\n\ncolor         D     E     F     G     H     I    J\ncut                                               \nFair        163   224   312   314   303   175  119\nGood        662   933   909   871   702   522  307\nIdeal      2834  3903  3826  4884  3115  2093  896\nPremium    1603  2337  2331  2924  2360  1428  808\nVery Good  1513  2400  2164  2299  1824  1204  678\n```\n\nThe same result can be achieved now with the custom function in pipe syntax:\n\n```python\ndiamonds \u003e\u003e crosstab(X.cut, X.color)\n\ncolor         D     E     F     G     H     I    J\ncut                                               \nFair        163   224   312   314   303   175  119\nGood        662   933   909   871   702   522  307\nIdeal      2834  3903  3826  4884  3115  2093  896\nPremium    1603  2337  2331  2924  2360  1428  808\nVery Good  1513  2400  2164  2299  1824  1204  678\n```\n\n\n### Case 2: A function that works with symbolic objects using 
`@make_symbolic`\n\nMany tasks are simpler and do not require the capacity to work as a pipe function. The `dfply` window functions are common examples of this: functions that take a Series (or symbolic Series) and return a modified version.\n\n\nLet's say we had a dataframe with dates represented by strings that we wanted to convert to pandas datetime objects using the `pd.to_datetime` function. Below is a tiny example dataframe with this issue.\n\n```python\nsales = pd.DataFrame(dict(date=['7/10/17','7/11/17','7/12/17','7/13/17','7/14/17'],\n                          sales=[1220, 1592, 908, 1102, 1395]))\n\nsales\n\n      date  sales\n0  7/10/17   1220\n1  7/11/17   1592\n2  7/12/17    908\n3  7/13/17   1102\n4  7/14/17   1395\n```\n\nUsing the `pd.to_datetime` function inside a call to `mutate` will unfortunately\nbreak:\n\n```python\nsales \u003e\u003e mutate(pd_date=pd.to_datetime(X.date, infer_datetime_format=True))\n\n...\n\nTypeError: __index__ returned non-int (type Intention)\n```\n\n`dfply` functions are special in that they \"know\" to delay their evaluation until\nthe data is at that point in the chain. `pd.to_datetime` is not such a function,\nand will immediately try to evaluate `X.date`. 
With a symbolic `Intention` argument\npassed in (`X` is an `Intention` object), the function will fail.\n\n\nInstead, we will need to make a wrapper around `pd.to_datetime`\nthat can handle these symbolic arguments and delay evaluation until the right time.\n\nIt's quite simple: all you need to do is decorate a function with the `@make_symbolic` decorator:\n\n```python\n@make_symbolic\ndef to_datetime(series, infer_datetime_format=True):\n    return pd.to_datetime(series, infer_datetime_format=infer_datetime_format)\n```\n\nNow the function can be used with symbolic arguments:\n\n```python\nsales \u003e\u003e mutate(pd_date=to_datetime(X.date))\n\n      date  sales    pd_date\n0  7/10/17   1220 2017-07-10\n1  7/11/17   1592 2017-07-11\n2  7/12/17    908 2017-07-12\n3  7/13/17   1102 2017-07-13\n4  7/14/17   1395 2017-07-14\n```\n\n#### Without symbolic arguments, `@make_symbolic` functions work like normal functions!\n\nA particularly nice thing about functions decorated with `@make_symbolic` is that\nthey will operate normally if passed arguments that are not `Intention` symbolic\nobjects.\n\nFor example, you can pass in the series itself and it will return the new\nseries of converted dates:\n\n```python\nto_datetime(sales.date)\n\n0   2017-07-10\n1   2017-07-11\n2   2017-07-12\n3   2017-07-13\n4   2017-07-14\nName: date, dtype: datetime64[ns]\n```\n\n\n## Advanced: understanding base `dfply` decorators\n\nUnder the hood, `dfply` functions work using a collection of different decorators and\nspecial classes. The most important ones are detailed below. Understanding these\nis important if you are planning on making big additions or changes to the code.\n\n\n### The `Intention` class\n\nPython is not a lazily-evaluated language. 
Typically, something like this\nwould not work:\n\n```python\ndiamonds \u003e\u003e select(X.carat) \u003e\u003e head(2)\n```\n\nThe `X` is supposed to represent the current state of the data through the\npiping operator chain, and `X.carat` indicates \"select the carat column from\nthe current data at this point in the chain\". But Python will try to evaluate\nwhat `X` is, then what `X.carat` is, then what `select(X.carat)` is, all before\nthe diamonds dataset ever gets evaluated.\n\nThe solution to this is to delay the evaluation until the appropriate time. I will\nnot get into the granular details here (but feel free to check it out for yourself\nin `base.py`). The gist is that things to be delayed are represented by a\nspecial `Intention` class that \"waits\" until it is time to evaluate the stored\ncommands with a given dataframe. This is the core of how `dplyr` data manipulation\nsyntax is made possible in `dfply`.\n\n(Thanks to the creators of the `dplython` and `pandas-ply` for trailblazing a lot\nof this before I made this package.)\n\n\n### `@pipe`\n\nThe primary decorator that enables chaining functions with the `\u003e\u003e` operator\nis `@pipe`. For functions to work with the piping syntax they must be decorated\nwith `@pipe`.\n\nAny function decorated with `@pipe` implicitly receives a single first argument\nexpected to be a pandas DataFrame. This is the DataFrame being passed through\nthe pipe. 
For example, `mutate` and `select` have function specifications\n`mutate(df, **kwargs)` and `select(df, *args, **kwargs)`, but when used\ndo not require the user to insert the DataFrame as an argument.\n\n```python\n# the DataFrame is implicitly passed as the first argument\ndiamonds \u003e\u003e mutate(new_var=X.price + X.depth) \u003e\u003e select(X.new_var)\n```\n\nIf you create a new function decorated by `@pipe`, the function definition\nshould contain an initial argument that represents the DataFrame being passed\nthrough the piping operations.\n\n```python\n@pipe\ndef myfunc(df, *args, **kwargs):\n  # code\n```\n\n### `@group_delegation`\n\nIn order to delegate a function across specified groupings (assigned by the\n`group_by()` function), decorate the function with the `@group_delegation`\ndecorator. This decorator will query the DataFrame for assigned groupings and\napply the function to those groups individually.\n\nGroupings are assigned by `dfply` as an attribute `._grouped_by` to the DataFrame\nproceeding through the piped functions. `@group_delegation` checks for the\nattribute and applies the function by group if groups exist. Any hierarchical\nindexing is removed by the decorator as well.\n\nThe `@group_delegation` decorator should be placed below (internal to) the `@pipe`\ndecorator to function as intended.\n\n```python\n@pipe\n@group_delegation\ndef myfunc(df, *args, **kwargs):\n  # code\n```\n\n### `@symbolic_evaluation`\n\nEvaluation of any `Intention`-class symbolic object (such as `X`) is\nhandled by the `@symbolic_evaluation` decorator. 
For example, when calling\n`mutate(new_price = X.price * 2.5)` the `X.price` symbolic representation of\nthe price column in the DataFrame will be evaluated to the actual Series\nby this decorator.\n\nThe `@symbolic_evaluation` decorator can have functionality modified by\noptional keyword arguments:\n\n\n#### Controlling `@symbolic_evaluation` with the `eval_symbols` argument\n\n```python\n@symbolic_evaluation(eval_symbols=False)\ndef my_function(df, arg1, arg2):\n    ...\n```\n\nIf the `eval_symbols` argument is `True`, all symbolics will be evaluated\nwith the passed-in dataframe. If `False` or `None`, there will be no attempt\nto evaluate symbolics.\n\nA list can also be passed in. The list can contain a mix of positional integers\nand string keywords, which reference positional arguments and keyworded arguments\nrespectively. This targets which arguments or keyword arguments to try and\nevaluate specifically:\n\n\n```python\n# This indicates that arg1, arg2, and kw1 should be targeted for symbolic\n# evaluation, but not the other arguments.\n# Note that positional indexes reference arguments AFTER the passed-in dataframe.\n# For example, 0 refers to arg1, not df.\n@symbolic_evaluation(eval_symbols=[0,1,'kw1'])\ndef my_function(df, arg1, arg2, arg3, kw1=True, kw2=False):\n    ...\n```\n\nIn reality, you are unlikely to need this behavior unless you really want to\nprevent `dfply` from trying to evaluate symbolic arguments. Remember that if\nan argument is not symbolic it will be evaluated as normal, so there shouldn't\nbe much harm leaving it at default other than a little bit of computational overhead.\n\n\n### `@dfpipe`\n\nMost new or custom functions for dfply will be decorated with the pattern:\n\n```python\n@pipe\n@group_delegation\n@symbolic_evaluation\ndef myfunc(df, *args, **kwargs):\n  # code\n```\n\nBecause of this, the decorator `@dfpipe` is defined as exactly this combination\nof decorators for your convenience. 
The above decoration pattern for the function\ncan be simply written as:\n\n```python\n@dfpipe\ndef myfunc(df, *args, **kwargs):\n  # code\n```\n\nThis allows you to easily create new functions that can be chained together\nwith pipes, respect grouping, and evaluate symbolic DataFrames and Series\ncorrectly.\n\n\n### `@make_symbolic`\n\nSometimes, like in the window and summary functions that operate on series,\nit is necessary to defer the evaluation of a function. For example, in the\ncode below:\n\n```python\ndiamonds \u003e\u003e summarize(price_third=nth(X.price, 3))\n```\n\nThe `nth()` function would typically be evaluated before `summarize()` and the\nsymbolic argument would not be evaluated at the right time.\n\nThe `@make_symbolic` decorator can be placed above functions to convert them\ninto symbolic functions that will wait to evaluate. Again, this is used\nprimarily for functions that are embedded inside the function call within\nthe piping syntax.\n\nThe `nth()` code, for example, is below:\n\n\n```python\n@make_symbolic\ndef nth(series, n, order_by=None):\n    if order_by is not None:\n        series = order_series_by(series, order_by)\n    try:\n        return series.iloc[n]\n    except:\n        return np.nan\n```\n\nFunctions you write that you want to be able to embed as an argument\ncan use the `@make_symbolic` to wait until they have access to the DataFrame\nto evaluate.\n\n\n\n## Contributing\n\nBy all means please feel free to comment or contribute to the package. The more\npeople adding code the better. 
If you submit an issue, pull request, or ask for\nsomething to be added I will do my best to respond promptly.\n\nThe TODO list (now located in the \"Projects\" section of the repo) has an\nongoing list of things that still need to be resolved and features to be added.\n\nIf you submit a pull request with features or bugfixes, please target the\n\"develop\" branch rather than the \"master\" branch.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkieferk%2Fdfply","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkieferk%2Fdfply","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkieferk%2Fdfply/lists"}