https://github.com/machow/siuba

Python library for using dplyr like syntax with pandas and SQL

data-analysis dplyr pandas python sql

Host: GitHub
URL: https://github.com/machow/siuba
Owner: machow
License: mit
Created: 2019-02-09T18:24:10.000Z (over 6 years ago)
Default Branch: main
Last Pushed: 2023-09-19T21:04:22.000Z (over 1 year ago)
Last Synced: 2025-05-14T04:17:04.379Z (23 days ago)
Topics: data-analysis, dplyr, pandas, python, sql
Language: Python
Homepage: https://siuba.org
Size: 1.79 MB
Stars: 1,169
Watchers: 20
Forks: 50
Open Issues: 105
Metadata Files:
- Readme: README.md
- License: LICENSE
- Codeowners: CODEOWNERS

Awesome Lists containing this project

awesome-dataframes - siuba - Python library for using dplyr like syntax with pandas and SQL. (Libraries)
jimsghstars - machow/siuba - Python library for using dplyr like syntax with pandas and SQL (Python)

README

siuba
=====

*scrappy data analysis, with seamless support for pandas and SQL*

[![CI](https://github.com/machow/siuba/workflows/CI/badge.svg)](https://github.com/machow/siuba/actions?query=workflow%3ACI+branch%3Amain)

[![Documentation Status](https://img.shields.io/badge/docs-siuba.org-blue.svg)](https://siuba.org)

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/machow/siuba/master)



siuba ([小巴](http://www.cantonese.sheik.co.uk/dictionary/words/9139/)) is a port of [dplyr](https://github.com/tidyverse/dplyr) and other R libraries. It supports a tabular data analysis workflow centered on 5 common actions:

* `select()` - keep certain columns of data.
* `filter()` - keep certain rows of data.
* `mutate()` - create a new column of data, or modify an existing one.
* `summarize()` - reduce one or more columns down to a single number.
* `arrange()` - reorder the rows of data.

These actions can be preceded by a `group_by()`, which causes them to be applied separately to each group of rows. Moreover, many SQL concepts, such as `distinct()`, `count()`, and joins, are implemented.

Inputs to these functions can be a pandas `DataFrame` or SQL connection (currently postgres, redshift, or sqlite).

For more on the rationale behind tools like dplyr, see this [tidyverse paper](https://tidyverse.tidyverse.org/articles/paper.html). 

For examples of siuba in action, see the [siuba guide](https://siuba.org/guide).

Installation
------------

```
pip install siuba
```

Examples
--------

See the [siuba guide](https://siuba.org/guide) or this [live analysis](https://www.youtube.com/watch?v=eKuboGOoP08) for a full introduction.

### Basic use

The code below uses the example DataFrame `mtcars` to get the average horsepower (`hp`) for each number of cylinders (`cyl`).

```python
from siuba import group_by, summarize, _
from siuba.data import mtcars

(mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean())
  )
```

```
Out[1]: 
   cyl      avg_hp
0    4   82.636364
1    6  122.285714
2    8  209.214286
```

There are three key concepts in this example:

| concept | example | meaning |
| ------- | ------- | ------- |
| verb    | `group_by(...)` | a function that operates on a table, like a DataFrame or SQL table |
| siu expression | `_.hp.mean()` | an expression created with `siuba._`, that represents actions you want to perform |
| pipe | `mtcars >> group_by(...)` | a syntax that allows you to chain verbs with the `>>` operator |

See the [siuba guide overview](https://siuba.org/guide) for a full introduction.
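
To make the pipe concept concrete, here is a minimal sketch of how a `>>` pipe can be built in plain Python using the reflected right-shift method `__rrshift__`. The names `Verb`, `keep_rows`, and `count_rows` are hypothetical and exist only for this illustration; siuba's own implementation may work differently.

```python
# Toy illustration only: `Verb`, `keep_rows`, and `count_rows` are hypothetical
# names, not part of siuba.

class Verb:
    """Wrap a function of one table so it can be used as `table >> verb`."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, table):
        # Python falls back to `verb.__rrshift__(table)` for `table >> verb`
        # when the left operand does not handle `>>` itself.
        return self.func(table)


def keep_rows(predicate):
    """Verb factory: returns a Verb that keeps rows matching `predicate`."""
    return Verb(lambda rows: [row for row in rows if predicate(row)])


def count_rows():
    """Verb factory: returns a Verb that counts the rows it receives."""
    return Verb(len)


# A few illustrative rows standing in for a table.
rows = [{"cyl": 4, "hp": 93}, {"cyl": 6, "hp": 110}, {"cyl": 4, "hp": 62}]

# `>>` is left-associative, so each verb receives the previous verb's result.
n_four_cyl = rows >> keep_rows(lambda row: row["cyl"] == 4) >> count_rows()
print(n_four_cyl)  # 2
```

Because each verb call returns an object rather than acting immediately, the table always enters from the left of `>>`, which is what lets a chain read top to bottom.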

### What is a siu expression (e.g. `_.cyl == 4`)?

A siu expression is a way of specifying **what** action you want to perform.

This allows siuba verbs to decide **how** to execute the action, depending on whether your data is a local DataFrame or remote table.

```python
from siuba import _

_.cyl == 4
```

```
Out[2]:
█─==
├─█─.
│ ├─_
│ └─'cyl'
└─4
```
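
The split between **what** and **how** can be sketched with a toy deferred expression in plain Python. The comparison is recorded as data, and two separate "backends" then decide how to use it: one filters in-memory rows, the other renders a SQL condition. All names here (`Column`, `Equals`, `to_predicate`, `to_sql`) are hypothetical and are not siuba's internals.

```python
# Toy illustration only: `Column`, `Equals`, `to_predicate`, and `to_sql` are
# hypothetical names and do not reflect how siuba is implemented.
from dataclasses import dataclass


@dataclass(frozen=True)
class Equals:
    """Records 'column == value' without evaluating it."""
    column: str
    value: object


class Column:
    """Stand-in for `_.<name>`: comparing it builds an expression node."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Return a description of the comparison instead of True/False.
        return Equals(self.name, other)


def to_predicate(expr):
    """Backend 1: turn the expression into a function over in-memory rows."""
    return lambda row: row[expr.column] == expr.value


def to_sql(expr):
    """Backend 2: render the same expression as a SQL condition.

    Naive rendering for illustration; real code must quote values safely.
    """
    return f"{expr.column} = {expr.value!r}"


expr = Column("cyl") == 4  # nothing is computed yet

rows = [{"cyl": 4, "hp": 93}, {"cyl": 6, "hp": 110}]
local_result = [row for row in rows if to_predicate(expr)(row)]
sql_condition = to_sql(expr)

print(local_result)   # [{'cyl': 4, 'hp': 93}]
print(sql_condition)  # cyl = 4
```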

You can also think of siu expressions as a shorthand for a lambda function.

```python
from siuba import _
from siuba.data import mtcars

# lambda approach
mtcars[lambda _: _.cyl == 4]

# siu expression approach
mtcars[_.cyl == 4]
```

```
Out[3]: 
     mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
2   22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
7   24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
..   ...  ...    ...  ...   ...    ...    ...  ..  ..   ...   ...
27  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
31  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2
[11 rows x 11 columns]
```

See the [siuba guide](https://siuba.org/guide) or read more about [lazy expressions](https://siuba.org/guide/basics-lazy-expressions.html).

### Using with a SQL database

A killer feature of siuba is that the same analysis code can be run on either a local DataFrame or a SQL source.

In the code below, we set up an example database.

```python
# Setup example data ----
from sqlalchemy import create_engine
from siuba.data import mtcars

# copy pandas DataFrame to sqlite
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")
```

Next, we use the code from the first example, except now it is executed on a SQL table.

```python
# Demo SQL analysis with siuba ----
from siuba import _, tbl, group_by, summarize, filter

# connect with siuba
tbl_mtcars = tbl(engine, "mtcars")

(tbl_mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean())
  )
```

```
Out[4]: 
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
   cyl      avg_hp
0    4   82.636364
1    6  122.285714
2    8  209.214286
# .. may have more rows
```

See the [querying SQL introduction here](https://siuba.org/guide/basics-sql.html).
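
For intuition about what a grouped summary means on the database side, the sketch below runs a hand-written `GROUP BY` query with Python's built-in `sqlite3` module. The table name `cars`, its handful of rows, and the SQL text are all assumptions made for illustration; the query siuba actually generates may look different.

```python
# Sketch only: the SQL below is hand-written for illustration and is an
# assumption about the kind of query involved, not siuba's actual output.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (cyl INTEGER, hp INTEGER)")

# A few illustrative rows (not the full mtcars data).
conn.executemany(
    "INSERT INTO cars (cyl, hp) VALUES (?, ?)",
    [(4, 93), (4, 62), (6, 110), (6, 105), (8, 175)],
)

# Rough SQL counterpart of: group_by(_.cyl) >> summarize(avg_hp = _.hp.mean())
query = """
    SELECT cyl, AVG(hp) AS avg_hp
    FROM cars
    GROUP BY cyl
    ORDER BY cyl
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)  # [(4, 77.5), (6, 107.5), (8, 175.0)]
```

The aggregation runs inside the database, so only one row per group is returned to Python.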

### Example notebooks

Below are some examples I've kept as I've worked on siuba.

For the most up-to-date explanations, see the [siuba guide](https://siuba.org/guide).

* [siu expressions](examples/examples-siu.ipynb)
* [dplyr style pandas](examples/examples-dplyr-funcs.ipynb)
  - [select verb case study](examples/case-iris-select.ipynb)
* sql using dplyr style
  - [simple sql statements](examples/examples-sql.ipynb)
  - [the kitchen sink with postgres](examples/examples-postgres.ipynb)
* [tidytuesday examples](https://github.com/machow/tidytuesday-py)
  - tidytuesday is a weekly R data analysis project. In order to kick the tires
    on siuba, I've been using it to complete the assignments. More specifically,
    I've been porting Dave Robinson's [tidytuesday analyses](https://github.com/dgrtwo/data-screencasts)
    to use siuba.

Testing
-------

Tests are run with pytest. Start the postgres database first, then run the test suite:

```bash
# start postgres db (runs in the foreground, so use a second terminal for pytest)
docker-compose up

# run the tests
pytest siuba
```
