{"id":13578265,"url":"https://github.com/machow/siuba","last_synced_at":"2025-05-15T03:06:58.098Z","repository":{"id":34111912,"uuid":"169898535","full_name":"machow/siuba","owner":"machow","description":"Python library for using dplyr like syntax with pandas and SQL","archived":false,"fork":false,"pushed_at":"2023-09-19T21:04:22.000Z","size":1880,"stargazers_count":1169,"open_issues_count":105,"forks_count":50,"subscribers_count":20,"default_branch":"main","last_synced_at":"2025-05-14T04:17:04.379Z","etag":null,"topics":["data-analysis","dplyr","pandas","python","sql"],"latest_commit_sha":null,"homepage":"https://siuba.org","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/machow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null}},"created_at":"2019-02-09T18:24:10.000Z","updated_at":"2025-04-29T14:36:06.000Z","dependencies_parsed_at":"2023-01-15T04:41:42.129Z","dependency_job_id":"be0b2af9-e141-4bcb-a77b-1af8effc6f89","html_url":"https://github.com/machow/siuba","commit_stats":{"total_commits":660,"total_committers":12,"mean_commits":55.0,"dds":0.5121212121212122,"last_synced_commit":"df88cc01a64a2b8586e05854e30a6b25acefb0b0"},"previous_names":[],"tags_count":37,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machow%2Fsiuba","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machow%2Fsiuba/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machow%2Fsiuba/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machow%2Fsiuba/manifests","owner_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/machow","download_url":"https://codeload.github.com/machow/siuba/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254264766,"owners_count":22041793,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","dplyr","pandas","python","sql"],"created_at":"2024-08-01T15:01:28.922Z","updated_at":"2025-05-15T03:06:53.090Z","avatar_url":"https://github.com/machow.png","language":"Python","readme":"siuba\n=====\n\n*scrappy data analysis, with seamless support for pandas and SQL*\n\n[![CI](https://github.com/machow/siuba/workflows/CI/badge.svg)](https://github.com/machow/siuba/actions?query=workflow%3ACI+branch%3Amain)\n[![Documentation Status](https://img.shields.io/badge/docs-siuba.org-blue.svg)](https://siuba.org)\n[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/machow/siuba/master)\n\n\u003cimg width=\"30%\" align=\"right\" src=\"./docs/siuba_small.svg\"\u003e\n\nsiuba ([小巴](http://www.cantonese.sheik.co.uk/dictionary/words/9139/)) is a port of [dplyr](https://github.com/tidyverse/dplyr) and other R libraries. 
It supports a tabular data analysis workflow centered on 5 common actions:\n\n* `select()` - keep certain columns of data.\n* `filter()` - keep certain rows of data.\n* `mutate()` - create a new column of data or modify an existing one.\n* `summarize()` - reduce one or more columns down to a single number.\n* `arrange()` - reorder the rows of data.\n\nThese actions can be preceded by a `group_by()`, which causes them to be applied individually to grouped rows of data. Moreover, many SQL concepts, such as `distinct()`, `count()`, and joins, are implemented.\nInputs to these functions can be a pandas `DataFrame` or SQL connection (currently postgres, redshift, or sqlite).\n\nFor more on the rationale behind tools like dplyr, see this [tidyverse paper](https://tidyverse.tidyverse.org/articles/paper.html). \nFor examples of siuba in action, see the [siuba guide](https://siuba.org/guide).\n\nInstallation\n------------\n\n```\npip install siuba\n```\n\nExamples\n--------\n\nSee the [siuba guide](https://siuba.org/guide) or this [live analysis](https://www.youtube.com/watch?v=eKuboGOoP08) for a full introduction.\n\n### Basic use\n\nThe code below uses the example DataFrame `mtcars` to get the average horsepower (hp) for each cylinder count (cyl).\n\n```python\nfrom siuba import group_by, summarize, _\nfrom siuba.data import mtcars\n\n(mtcars\n  \u003e\u003e group_by(_.cyl)\n  \u003e\u003e summarize(avg_hp = _.hp.mean())\n  )\n```\n\n```\nOut[1]: \n   cyl      avg_hp\n0    4   82.636364\n1    6  122.285714\n2    8  209.214286\n```\n\nThere are three key concepts in this example:\n\n| concept | example | meaning |\n| ------- | ------- | ------- |\n| verb    | `group_by(...)` | a function that operates on a table, like a DataFrame or SQL table |\n| siu expression | `_.hp.mean()` | an expression created with `siuba._`, that represents actions you want to perform |\n| pipe | `mtcars \u003e\u003e group_by(...)` | a syntax that allows you to chain verbs with the `\u003e\u003e` operator |\n\n\nSee the 
[siuba guide overview](https://siuba.org/guide) for a full introduction.\n\n### What is a siu expression (e.g. `_.cyl == 4`)?\n\nA siu expression is a way of specifying **what** action you want to perform.\nThis allows siuba verbs to decide **how** to execute the action, depending on whether your data is a local DataFrame or remote table.\n\n```python\nfrom siuba import _\n\n_.cyl == 4\n```\n\n```\nOut[2]:\n█─==\n├─█─.\n│ ├─_\n│ └─'cyl'\n└─4\n```\n\nYou can also think of siu expressions as a shorthand for a lambda function.\n\n```python\nfrom siuba import _\nfrom siuba.data import mtcars\n\n# lambda approach\nmtcars[lambda _: _.cyl == 4]\n\n# siu expression approach\nmtcars[_.cyl == 4]\n```\n\n```\nOut[3]: \n     mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb\n2   22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1\n7   24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2\n..   ...  ...    ...  ...   ...    ...    ...  ..  ..   ...   ...\n27  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2\n31  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2\n\n[11 rows x 11 columns]\n```\n\nSee the [siuba guide](https://siuba.org/guide) or read more about [lazy expressions](https://siuba.org/guide/basics-lazy-expressions.html).\n\n### Using with a SQL database\n\nA killer feature of siuba is that the same analysis code can be run on a local DataFrame or a SQL source.\n\nIn the code below, we set up an example database.\n\n```python\n# Setup example data ----\nfrom sqlalchemy import create_engine\nfrom siuba.data import mtcars\n\n# copy pandas DataFrame to sqlite\nengine = create_engine(\"sqlite:///:memory:\")\nmtcars.to_sql(\"mtcars\", engine, if_exists = \"replace\")\n```\n\nNext, we use the code from the first example, except now executed against a SQL table.\n\n```python\n# Demo SQL analysis with siuba ----\nfrom siuba import _, tbl, group_by, summarize, filter\n\n# connect with siuba\ntbl_mtcars = tbl(engine, \"mtcars\")\n\n(tbl_mtcars\n  \u003e\u003e 
group_by(_.cyl)\n  \u003e\u003e summarize(avg_hp = _.hp.mean())\n  )\n```\n\n```\nOut[4]: \n# Source: lazy query\n# DB Conn: Engine(sqlite:///:memory:)\n# Preview:\n   cyl      avg_hp\n0    4   82.636364\n1    6  122.285714\n2    8  209.214286\n# .. may have more rows\n```\n\nSee the [querying SQL introduction here](https://siuba.org/guide/basics-sql.html).\n\n### Example notebooks\n\nBelow are some examples I've kept as I've worked on siuba.\nFor the most up-to-date explanations, see the [siuba guide](https://siuba.org/guide).\n\n* [siu expressions](examples/examples-siu.ipynb)\n* [dplyr style pandas](examples/examples-dplyr-funcs.ipynb)\n  - [select verb case study](examples/case-iris-select.ipynb)\n* sql using dplyr style\n  - [simple sql statements](examples/examples-sql.ipynb)\n  - [the kitchen sink with postgres](examples/examples-postgres.ipynb)\n* [tidytuesday examples](https://github.com/machow/tidytuesday-py)\n  - tidytuesday is a weekly R data analysis project. In order to kick the tires\n    on siuba, I've been using it to complete the assignments. More specifically,\n    I've been porting Dave Robinson's [tidytuesday analyses](https://github.com/dgrtwo/data-screencasts)\n    to use siuba.\n\nTesting\n-------\n\nTests are run using pytest.\nThey can be run with the following commands.\n\n```bash\n# start postgres db\ndocker-compose up\npytest siuba\n```\n","funding_links":[],"categories":["Python","Libraries"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmachow%2Fsiuba","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmachow%2Fsiuba","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmachow%2Fsiuba/lists"}