https://github.com/dexplo/pandas_cub
- Host: GitHub
- URL: https://github.com/dexplo/pandas_cub
- Owner: dexplo
- License: mit
- Created: 2019-02-08T05:17:13.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2019-02-09T23:11:45.000Z (over 7 years ago)
- Last Synced: 2025-07-06T00:53:27.589Z (11 months ago)
- Language: Python
- Size: 83 KB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.ipynb
- License: LICENSE
README
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to use pandas_cub\n",
"\n",
"The README.ipynb notebook will serve as the documentation and usage guide to pandas_cub.\n",
"\n",
"## Installation\n",
"\n",
"`pip install pandas-cub`\n",
"\n",
"## What is pandas_cub?\n",
"pandas_cub is a simple data analysis library that emulates the functionality of the pandas library. The library is not meant for serious work. It was built as an assignment for one of Ted Petrou's Python classes. If you would like to complete the assignment on your own, visit [this repository][1]. There are about 40 steps and 100 tests that you must pass in order to rebuild the library. It is a good challenge and teaches you the fundamentals of how to build your own data analysis library.\n",
"\n",
"## pandas_cub functionality\n",
"\n",
"pandas_cub has limited functionality but is still capable of a wide variety of data analysis tasks.\n",
"\n",
"* Subset selection with brackets\n",
"* Arithmetic and comparison operators (+, -, <, !=, etc.)\n",
"* Aggregation of columns with most of the common functions (min, max, mean, median, etc.)\n",
"* Grouping via pivot tables\n",
"* String-only methods for columns containing strings\n",
"* Reading in simple comma-separated value files\n",
"* Several other methods\n",
"\n",
"\n",
"## pandas_cub DataFrame\n",
"\n",
"pandas_cub has a single main object, the DataFrame, to hold all of the data. The DataFrame can hold four data types: booleans, integers, floats, and strings. All data is stored in NumPy arrays. Unlike pandas DataFrames, pandas_cub DataFrames have no index. Column names must be strings.\n",
"\n",
"### Missing value representation\n",
"Boolean and integer columns have no missing value representation. The NumPy NaN is used for float columns and the Python None is used for string columns.\n",
"\n",
"## Code Examples\n",
"\n",
"pandas_cub syntax is very similar to pandas, but it implements far fewer methods. The examples below cover just about all of the API.\n",
"\n",
"[1]: https://github.com/tdpetrou/pandas_cub"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading data with `read_csv`\n",
"\n",
"pandas_cub provides a single top-level function, `read_csv`, which takes a single parameter: the location of the file you would like to read in as a DataFrame. This function can only handle simple CSVs, and the delimiter must be a comma. A sample employee dataset is provided in the data directory. Notice that the visual output of the DataFrame is nearly identical to that of a pandas DataFrame. The `head` method returns the first 5 rows by default."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas_cub as pdc"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   dept                            race      gender  salary\n",
"0  Houston Police Department-HPD   White     Male    45279\n",
"1  Houston Fire Department (HFD)   White     Male    63166\n",
"2  Houston Police Department-HPD   Black     Male    66614\n",
"3  Public Works & Engineering-PWE  Asian     Male    71680\n",
"4  Houston Airport System (HAS)    White     Male    42390"
],
"text/plain": [
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pdc.read_csv('data/employee.csv')\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### DataFrame properties"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `shape` property returns a tuple of the number of rows and columns."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1535, 4)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `len` function returns just the number of rows."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1535"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `dtypes` property returns a DataFrame of the column names and their respective data type."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   Column Name  Data Type\n",
"0  dept         string\n",
"1  race         string\n",
"2  gender       string\n",
"3  salary       int"
],
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `columns` property returns a list of the columns."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['dept', 'race', 'gender', 'salary']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set new columns by assigning the `columns` property to a list."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary\n",
"0  Houston Police Department-HPD   White     Male    45279\n",
"1  Houston Fire Department (HFD)   White     Male    63166\n",
"2  Houston Police Department-HPD   Black     Male    66614\n",
"3  Public Works & Engineering-PWE  Asian     Male    71680\n",
"4  Houston Airport System (HAS)    White     Male    42390"
],
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns = ['department', 'race', 'gender', 'salary']\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `values` property returns a single NumPy array of all the data."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([['Houston Police Department-HPD', 'White', 'Male', 45279],\n",
" ['Houston Fire Department (HFD)', 'White', 'Male', 63166],\n",
" ['Houston Police Department-HPD', 'Black', 'Male', 66614],\n",
" ...,\n",
" ['Houston Police Department-HPD', 'White', 'Male', 43443],\n",
" ['Houston Police Department-HPD', 'Asian', 'Male', 55461],\n",
" ['Houston Fire Department (HFD)', 'Hispanic', 'Male', 51194]],\n",
" dtype=object)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Subset selection\n",
"\n",
"Subset selection is handled with the brackets. To select a single column, place that column name in the brackets."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   race\n",
"0  White\n",
"1  White\n",
"2  Black\n",
"3  Asian\n",
"4  White"
],
"text/plain": [
""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['race'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select multiple columns with a list of strings."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   race   salary\n",
"0  White  45279\n",
"1  White  63166\n",
"2  Black  66614\n",
"3  Asian  71680\n",
"4  White  42390"
],
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['race', 'salary']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Simultaneously select rows and columns by passing the brackets the row selection and the column selection, separated by a comma. Here we use integers for rows and strings for columns."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   salary  race\n",
"0  77076   Black\n",
"1  81239   White\n",
"2  81239   White"
],
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rows = [10, 50, 100]\n",
"cols = ['salary', 'race']\n",
"df[rows, cols]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use integers for the columns as well."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   gender  department\n",
"0  Male    Houston Police Department-HPD\n",
"1  Male    Houston Police Department-HPD\n",
"2  Male    Houston Police Department-HPD"
],
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rows = [10, 50, 100]\n",
"cols = [2, 0]\n",
"df[rows, cols]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use a single integer instead of a list."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   salary\n",
"0  66614"
],
"text/plain": [
""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[99, 3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or use a single string for the column."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   salary\n",
"0  66614"
],
"text/plain": [
""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[99, 'salary']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use a slice for the rows."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   race      gender\n",
"0  White     Male\n",
"1  White     Male\n",
"2  Hispanic  Male\n",
"3  White     Male\n",
"4  White     Male\n",
"5  Hispanic  Male\n",
"6  Hispanic  Male\n",
"7  Black     Female"
],
"text/plain": [
""
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[20:100:10, ['race', 'gender']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also slice the columns with either integers or strings."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race\n",
"0  Houston Police Department-HPD   White\n",
"1  Houston Fire Department (HFD)   White\n",
"2  Houston Police Department-HPD   Hispanic\n",
"3  Houston Police Department-HPD   White\n",
"4  Houston Fire Department (HFD)   White\n",
"5  Houston Police Department-HPD   Hispanic\n",
"6  Houston Fire Department (HFD)   Hispanic\n",
"7  Houston Police Department-HPD   Black"
],
"text/plain": [
""
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[20:100:10, :2]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender\n",
"0  Houston Police Department-HPD   White     Male\n",
"1  Houston Fire Department (HFD)   White     Male\n",
"2  Houston Police Department-HPD   Hispanic  Male\n",
"3  Houston Police Department-HPD   White     Male\n",
"4  Houston Fire Department (HFD)   White     Male\n",
"5  Houston Police Department-HPD   Hispanic  Male\n",
"6  Houston Fire Department (HFD)   Hispanic  Male\n",
"7  Houston Police Department-HPD   Black     Female"
],
"text/plain": [
""
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[20:100:10, 'department':'gender']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can do boolean selection if you pass the brackets a one-column boolean DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   salary\n",
"0  False\n",
"1  False\n",
"2  False\n",
"3  False\n",
"4  False"
],
"text/plain": [
""
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"filt = df['salary'] > 100000\n",
"filt.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary\n",
"0  Public Works & Engineering-PWE  White     Male    107962\n",
"1  Health & Human Services         Black     Male    180416\n",
"2  Houston Fire Department (HFD)   Hispanic  Male    165216\n",
"3  Health & Human Services         White     Female  100791\n",
"4  Houston Airport System (HAS)    White     Male    120916"
],
"text/plain": [
""
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[filt].head()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   race      salary\n",
"0  White     107962\n",
"1  Black     180416\n",
"2  Hispanic  165216\n",
"3  White     100791\n",
"4  White     120916"
],
"text/plain": [
""
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[filt, ['race', 'salary']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Assigning Columns\n",
"You can only assign an entire new column or overwrite an old one. You cannot assign a subset of the data. You can assign a new column with a single value like this:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus\n",
"0  Houston Police Department-HPD   White     Male    45279   1000\n",
"1  Houston Fire Department (HFD)   White     Male    63166   1000\n",
"2  Houston Police Department-HPD   Black     Male    66614   1000\n",
"3  Public Works & Engineering-PWE  Asian     Male    71680   1000\n",
"4  Houston Airport System (HAS)    White     Male    42390   1000"
],
"text/plain": [
""
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['bonus'] = 1000\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can assign a NumPy array that has the same length as the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus\n",
"0  Houston Police Department-HPD   White     Male    45279   3536\n",
"1  Houston Fire Department (HFD)   White     Male    63166   1296\n",
"2  Houston Police Department-HPD   Black     Male    66614   511\n",
"3  Public Works & Engineering-PWE  Asian     Male    71680   4267\n",
"4  Houston Airport System (HAS)    White     Male    42390   3766"
],
"text/plain": [
""
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"df['bonus'] = np.random.randint(100, 5000, len(df))\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can assign a new column with a one-column DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
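{
"data": {
"text/html": [
"      salary\n",
"0     48815\n",
"1     64462\n",
"2     67125\n",
"3     75947\n",
"4     46156\n",
"5     110001\n",
"6     53738\n",
"7     185348\n",
"8     32575\n",
"9     57918\n",
"...   ...\n",
"1525  32936\n",
"1526  49294\n",
"1527  34218\n",
"1528  82795\n",
"1529  104900\n",
"1530  46408\n",
"1531  67050\n",
"1532  47368\n",
"1533  60013\n",
"1534  52624"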
],
"text/plain": [
""
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['salary'] + df['bonus']"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Houston Police Department-HPD   White     Male    45279   3536   48815\n",
"1  Houston Fire Department (HFD)   White     Male    63166   1296   64462\n",
"2  Houston Police Department-HPD   Black     Male    66614   511    67125\n",
"3  Public Works & Engineering-PWE  Asian     Male    71680   4267   75947\n",
"4  Houston Airport System (HAS)    White     Male    42390   3766   46156"
],
"text/plain": [
""
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['total salary'] = df['salary'] + df['bonus']\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Arithmetic and comparison operators"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   salary  bonus\n",
"0  226395  17680\n",
"1  315830  6480\n",
"2  333070  2555\n",
"3  358400  21335\n",
"4  211950  18830"
],
"text/plain": [
""
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = df[['salary', 'bonus']] * 5\n",
"df1.head()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   salary  bonus\n",
"0  False   False\n",
"1  False   False\n",
"2  False   False\n",
"3  False   False\n",
"4  False   False"
],
"text/plain": [
""
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = df[['salary', 'bonus']] > 100000\n",
"df1.head()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   race\n",
"0  True\n",
"1  True\n",
"2  False\n",
"3  False\n",
"4  True"
],
"text/plain": [
""
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = df['race'] == 'White'\n",
"df1.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Aggregation\n",
"\n",
"Most of the common aggregation methods are available. They only work down the columns and not across the rows."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department               race   gender  salary  bonus  total salary\n",
"0  Health & Human Services  Asian  Female  24960   101    25913"
],
"text/plain": [
""
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Columns for which the aggregation does not work are dropped."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   salary     bonus     total salary\n",
"0  56278.746  2594.283  58873.029"
],
"text/plain": [
""
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.mean()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department  race  gender  salary  bonus  total salary\n",
"0  3           0     0       145     1516   145"
],
"text/plain": [
""
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.argmax()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   salary\n",
"0  347"
],
"text/plain": [
""
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['salary'].argmin()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check whether all salaries are greater than 20000."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   salary\n",
"0  True"
],
"text/plain": [
""
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = df['salary'] > 20000\n",
"df1.all()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the number of non-missing values."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department  race  gender  salary  bonus  total salary\n",
"0  1535        1535  1535    1535    1535   1535"
],
"text/plain": [
""
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the number of unique values."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department  race  gender  salary  bonus  total salary\n",
"0  6           5     2       548     1318   1524"
],
"text/plain": [
""
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.nunique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Non-Aggregating Methods\n",
"These are methods that do not return a single value."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the unique values of each column. The `unique` method returns a list of DataFrames containing the unique values for each column."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"dfs = df.unique()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department\n",
"0  Health & Human Services\n",
"1  Houston Airport System (HAS)\n",
"2  Houston Fire Department (HFD)\n",
"3  Houston Police Department-HPD\n",
"4  Parks & Recreation\n",
"5  Public Works & Engineering-PWE"
],
"text/plain": [
""
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs[0]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   race\n",
"0  Asian\n",
"1  Black\n",
"2  Hispanic\n",
"3  Native American\n",
"4  White"
],
"text/plain": [
""
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs[1]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   gender\n",
"0  Female\n",
"1  Male"
],
"text/plain": [
""
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rename columns with a dictionary."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   dept                            race      gender  salary  BONUS  total salary\n",
"0  Houston Police Department-HPD   White     Male    45279   3536   48815\n",
"1  Houston Fire Department (HFD)   White     Male    63166   1296   64462\n",
"2  Houston Police Department-HPD   Black     Male    66614   511    67125\n",
"3  Public Works & Engineering-PWE  Asian     Male    71680   4267   75947\n",
"4  Houston Airport System (HAS)    White     Male    42390   3766   46156"
],
"text/plain": [
""
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.rename({'department':'dept', 'bonus':'BONUS'}).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Drop columns with a string or list of strings."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      gender  salary  bonus  total salary\n",
"0  Houston Police Department-HPD   Male    45279   3536   48815\n",
"1  Houston Fire Department (HFD)   Male    63166   1296   64462\n",
"2  Houston Police Department-HPD   Male    66614   511    67125\n",
"3  Public Works & Engineering-PWE  Male    71680   4267   75947\n",
"4  Houston Airport System (HAS)    Male    42390   3766   46156"
],
"text/plain": [
""
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.drop('race').head()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      salary  bonus  total salary\n",
"0  Houston Police Department-HPD   45279   3536   48815\n",
"1  Houston Fire Department (HFD)   63166   1296   64462\n",
"2  Houston Police Department-HPD   66614   511    67125\n",
"3  Public Works & Engineering-PWE  71680   4267   75947\n",
"4  Houston Airport System (HAS)    42390   3766   46156"
],
"text/plain": [
""
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.drop(['race', 'gender']).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Non-aggregating methods that keep all columns\n",
"The next several methods are non-aggregating methods that return a DataFrame with the same exact shape as the original. They only work on boolean, integer and float columns and ignore string columns."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Absolute value"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Houston Police Department-HPD   White     Male    45279   3536   48815\n",
"1  Houston Fire Department (HFD)   White     Male    63166   1296   64462\n",
"2  Houston Police Department-HPD   Black     Male    66614   511    67125\n",
"3  Public Works & Engineering-PWE  Asian     Male    71680   4267   75947\n",
"4  Houston Airport System (HAS)    White     Male    42390   3766   46156"
],
"text/plain": [
""
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.abs().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cumulative min, max, and sum"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Houston Police Department-HPD   White     Male    45279   3536   48815\n",
"1  Houston Fire Department (HFD)   White     Male    63166   3536   64462\n",
"2  Houston Police Department-HPD   Black     Male    66614   3536   67125\n",
"3  Public Works & Engineering-PWE  Asian     Male    71680   4267   75947\n",
"4  Houston Airport System (HAS)    White     Male    71680   4267   75947"
],
"text/plain": [
""
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.cummax().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clip values to be within a range."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Houston Police Department-HPD   White     Male    45279   40000  48815\n",
"1  Houston Fire Department (HFD)   White     Male    60000   40000  60000\n",
"2  Houston Police Department-HPD   Black     Male    60000   40000  60000\n",
"3  Public Works & Engineering-PWE  Asian     Male    60000   40000  60000\n",
"4  Houston Airport System (HAS)    White     Male    42390   40000  46156"
],
"text/plain": [
""
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.clip(40000, 60000).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Round numeric columns"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Houston Police Department-HPD   White     Male    45000   4000   49000\n",
"1  Houston Fire Department (HFD)   White     Male    63000   1000   64000\n",
"2  Houston Police Department-HPD   Black     Male    67000   1000   67000\n",
"3  Public Works & Engineering-PWE  Asian     Male    72000   4000   76000\n",
"4  Houston Airport System (HAS)    White     Male    42000   4000   46000"
],
"text/plain": [
""
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.round(-3).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copy the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Houston Police Department-HPD   White     Male    45279   3536   48815\n",
"1  Houston Fire Department (HFD)   White     Male    63166   1296   64462\n",
"2  Houston Police Department-HPD   Black     Male    66614   511    67125\n",
"3  Public Works & Engineering-PWE  Asian     Male    71680   4267   75947\n",
"4  Houston Airport System (HAS)    White     Male    42390   3766   46156"
],
"text/plain": [
""
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.copy().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Take the nth difference."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary       bonus      total salary\n",
"0  Houston Police Department-HPD   White     Male    nan          nan        nan\n",
"1  Houston Fire Department (HFD)   White     Male    nan          nan        nan\n",
"2  Houston Police Department-HPD   Black     Male    21335.000    -3025.000  18310.000\n",
"3  Public Works & Engineering-PWE  Asian     Male    8514.000     2971.000   11485.000\n",
"4  Houston Airport System (HAS)    White     Male    -24224.000   3255.000   -20969.000\n",
"5  Public Works & Engineering-PWE  White     Male    36282.000    -2228.000  34054.000\n",
"6  Houston Fire Department (HFD)   Hispanic  Male    10254.000    -2672.000  7582.000\n",
"7  Health & Human Services         Black     Male    72454.000    2893.000   75347.000\n",
"8  Public Works & Engineering-PWE  Black     Male    -22297.000   1134.000   -21163.000\n",
"9  Health & Human Services         Black     Male    -125147.000  -2283.000  -127430.000"
],
"text/plain": [
""
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.diff(2).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the nth percentage change."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus   total salary\n",
"0  Houston Police Department-HPD   White     Male    nan     nan     nan\n",
"1  Houston Fire Department (HFD)   White     Male    nan     nan     nan\n",
"2  Houston Police Department-HPD   Black     Male    0.471   -0.855  0.375\n",
"3  Public Works & Engineering-PWE  Asian     Male    0.135   2.292   0.178\n",
"4  Houston Airport System (HAS)    White     Male    -0.364  6.370   -0.312\n",
"5  Public Works & Engineering-PWE  White     Male    0.506   -0.522  0.448\n",
"6  Houston Fire Department (HFD)   Hispanic  Male    0.242   -0.710  0.164\n",
"7  Health & Human Services         Black     Male    0.671   1.419   0.685\n",
"8  Public Works & Engineering-PWE  Black     Male    -0.424  1.037   -0.394\n",
"9  Health & Human Services         Black     Male    -0.694  -0.463  -0.688"
],
"text/plain": [
""
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pct_change(2).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort the DataFrame by one or more columns"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Houston Police Department-HPD   Black     Female  24960   953    25913\n",
"1  Public Works & Engineering-PWE  Hispanic  Male    26104   4258   30362\n",
"2  Public Works & Engineering-PWE  Black     Female  26125   3247   29372\n",
"3  Houston Airport System (HAS)    Hispanic  Female  26125   832    26957\n",
"4  Houston Airport System (HAS)    Black     Female  26125   2461   28586"
],
"text/plain": [
""
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values('salary').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort descending"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Houston Fire Department (HFD)   White     Male    210588  3724   214312\n",
"1  Houston Police Department-HPD   White     Male    199596  848    200444\n",
"2  Houston Airport System (HAS)    Black     Male    186192  1778   187970\n",
"3  Health & Human Services         Black     Male    180416  4932   185348\n",
"4  Public Works & Engineering-PWE  White     Female  178331  2124   180455"
],
"text/plain": [
""
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values('salary', asc=False).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort by multiple columns"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Houston Airport System (HAS)    Asian     Female  26125   4446   30571\n",
"1  Houston Police Department-HPD   Asian     Male    27914   2855   30769\n",
"2  Houston Police Department-HPD   Asian     Male    28169   2572   30741\n",
"3  Public Works & Engineering-PWE  Asian     Male    28995   2874   31869\n",
"4  Public Works & Engineering-PWE  Asian     Male    30347   4938   35285"
],
"text/plain": [
""
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values(['race', 'salary']).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Randomly sample the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Houston Fire Department (HFD)   White     Male    62540   2995   65535\n",
"1  Public Works & Engineering-PWE  White     Male    63336   1547   64883\n",
"2  Houston Police Department-HPD   White     Male    52514   1150   53664"
],
"text/plain": [
""
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(n=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Randomly sample a fraction"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Houston Police Department-HPD   Hispanic  Female  60347   1200   61547\n",
"1  Public Works & Engineering-PWE  Black     Male    49109   3598   52707\n",
"2  Health & Human Services         Black     Female  48984   4602   53586\n",
"3  Houston Police Department-HPD   White     Male    55461   2813   58274\n",
"4  Houston Airport System (HAS)    Black     Female  29286   1877   31163\n",
"5  Houston Police Department-HPD   Asian     Male    66614   4480   71094\n",
"6  Houston Fire Department (HFD)   White     Male    28024   4475   32499"
],
"text/plain": [
""
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(frac=.005)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sample with replacement"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"   department                      race      gender  salary  bonus  total salary\n",
"0  Parks & Recreation              Black     Female  31075   1665   32740\n",
"1  Public Works & Engineering-PWE  Hispanic  Male    67038   644    67682\n",
"2  Houston Police Department-HPD   Black     Male    37024   1532   38556\n",
"3  Health & Human Services         Black     Female  57433   3106   60539\n",
"4  Public Works & Engineering-PWE  Black     Male    53373   924    54297"
],
"text/plain": [
""
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(n=10000, replace=True).head()"
]
},
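{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three `sample` calls above differ only in how many rows are drawn and whether a row may be drawn more than once. The block below is a minimal sketch of that logic using only the standard library; it is not pandas_cub's implementation. The helper name `sample_rows`, the `seed` parameter, and the rule of truncating `frac * len(rows)` to an integer are assumptions made for illustration.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"\n",
"def sample_rows(rows, n=None, frac=None, replace=False, seed=None):\n",
"    # Hypothetical helper: return a new list of randomly chosen rows.\n",
"    if (n is None) == (frac is None):\n",
"        raise ValueError('pass exactly one of n or frac')\n",
"    if frac is not None:\n",
"        n = int(frac * len(rows))  # assumed rounding rule: truncate\n",
"    rng = random.Random(seed)\n",
"    if replace:\n",
"        return rng.choices(rows, k=n)  # a row may be picked repeatedly\n",
"    return rng.sample(rows, n)  # each row is picked at most once\n",
"\n",
"\n",
"rows = list(range(1535))  # stand-in row labels, not real data\n",
"print(len(sample_rows(rows, n=3)))\n",
"print(len(sample_rows(rows, frac=0.005)))\n",
"print(len(sample_rows(rows, n=10000, replace=True)))\n",
"```\n",
"\n",
"Sampling with replacement is the only mode in which `n` can exceed the number of rows, which is why the last call can ask for 10000 rows."
]
},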
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### String-only methods\n",
"\n",
"Use the `str` accessor to call methods that are available only for string columns. Pass the name of the string column as the first argument to each of these methods."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"department0 21 02 23 24 0"
],
"text/plain": [
""
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.str.count('department', 'P').head()"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"department0houston police department-hpd1houston fire department (hfd)2houston police department-hpd3public works & engineering-pwe4houston airport system (has)"
],
"text/plain": [
""
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.str.lower('department').head()"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"department0 01 02 03 -14 0"
],
"text/plain": [
""
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.str.find('department', 'Houston').head()"
]
},
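{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `count`, `lower`, and `find` calls above behave like the built-in `str` methods of the same names applied to every value in the column. The block below is a rough sketch of that idea on a plain list; the helper `str_method` is hypothetical and is not part of pandas_cub.\n",
"\n",
"```python\n",
"def str_method(values, method, *args):\n",
"    # Look up the built-in str method by name and apply it to each value.\n",
"    func = getattr(str, method)\n",
"    return [func(value, *args) for value in values]\n",
"\n",
"\n",
"department = [\n",
"    'Houston Police Department-HPD',\n",
"    'Houston Fire Department (HFD)',\n",
"    'Public Works & Engineering-PWE',\n",
"]\n",
"print(str_method(department, 'count', 'P'))       # [2, 0, 2]\n",
"print(str_method(department, 'find', 'Houston'))  # [0, 0, -1]\n",
"print(str_method(department, 'lower')[0])         # houston police department-hpd\n",
"```\n",
"\n",
"As with `str.find`, a result of -1 means the substring does not occur in that value."
]
},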
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Grouping\n",
"\n",
"pandas_cub provides the `value_counts` method for simple frequency counting of unique values and `pivot_table` for grouping and aggregating.\n",
"\n",
"The `value_counts` method returns a list of DataFrames, one for each column."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"dfs = df[['department', 'race', 'gender']].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"departmentcount 0Houston Police Department-HPD 5701Houston Fire Department (HFD) 3652Public Works & Engineering-PWE 3413Health & Human Services 1034Houston Airport System (HAS) 1035Parks & Recreation 53"
],
"text/plain": [
""
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs[0]"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"race count 0White 5421Black 5182Hispanic 3813Asian 874Native American 7"
],
"text/plain": [
""
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs[1]"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"gender count 0Male 11351Female 400"
],
"text/plain": [
""
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dfs[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If your DataFrame has a single column, `value_counts` returns a DataFrame rather than a list. You can also return relative frequencies by setting the `normalize` parameter to `True`."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"race count 0White 0.3531Black 0.3372Hispanic 0.2483Asian 0.0574Native American 0.005"
],
"text/plain": [
""
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['race'].value_counts(normalize=True)"
]
},
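{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the counting and `normalize` behaviour concrete, here is a small standard-library sketch of frequency counting on a plain list. It is an illustration only, not pandas_cub's code; the standalone `value_counts` function and the four-item `gender` list are made up for the example.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"\n",
"def value_counts(values, normalize=False):\n",
"    # Count each distinct value, most frequent first.\n",
"    counts = Counter(values).most_common()\n",
"    if normalize:\n",
"        total = len(values)\n",
"        # Divide by the number of values to get relative frequencies.\n",
"        return [(value, count / total) for value, count in counts]\n",
"    return counts\n",
"\n",
"\n",
"gender = ['Male', 'Female', 'Male', 'Male']  # toy data\n",
"print(value_counts(gender))                  # [('Male', 3), ('Female', 1)]\n",
"print(value_counts(gender, normalize=True))  # [('Male', 0.75), ('Female', 0.25)]\n",
"```\n",
"\n",
"With `normalize=True` the frequencies sum to 1, so each entry reads as that value's share of the rows."
]
},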
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `pivot_table` method lets you group by one or two columns and aggregate the values of another column. Let's find the average salary for each combination of race and gender. All parameters must be strings."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"race Female Male 0Asian 58304.222 60622.9571Black 48133.382 51853.0002Hispanic 44216.960 55493.0643Native American 58844.333 68850.5004White 66415.528 63439.196"
],
"text/plain": [
""
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_table(rows='race', columns='gender', values='salary', aggfunc='mean')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you omit `values` and `aggfunc`, `pivot_table` returns the frequency of each group (a contingency table) by default."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"race Female Male 0Asian 18 691Black 207 3112Hispanic 100 2813Native American 3 44White 72 470"
],
"text/plain": [
""
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_table(rows='race', columns='gender')"
]
},
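{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conceptually, a pivot table buckets rows by their (row label, column label) pair and then reduces each bucket to one number. The block below is a sketch of that idea with plain dictionaries, not pandas_cub's implementation. The `pivot` function, its dictionary return type, and the three toy records are assumptions made for illustration; only the parameter names mirror the calls above.\n",
"\n",
"```python\n",
"from collections import defaultdict\n",
"from statistics import mean\n",
"\n",
"\n",
"def pivot(records, rows, columns, values=None, aggfunc=None):\n",
"    # Bucket records by their (row label, column label) pair.\n",
"    groups = defaultdict(list)\n",
"    for record in records:\n",
"        key = (record[rows], record[columns])\n",
"        groups[key].append(record[values] if values else 1)\n",
"    # With no aggfunc, fall back to counting (a contingency table).\n",
"    agg = {'mean': mean, 'sum': sum, None: len}[aggfunc]\n",
"    return {key: agg(group) for key, group in sorted(groups.items())}\n",
"\n",
"\n",
"records = [  # toy data, not taken from the employee dataset\n",
"    {'race': 'White', 'gender': 'Male', 'salary': 60000},\n",
"    {'race': 'White', 'gender': 'Male', 'salary': 50000},\n",
"    {'race': 'Black', 'gender': 'Female', 'salary': 48000},\n",
"]\n",
"print(pivot(records, 'race', 'gender', 'salary', 'mean'))\n",
"print(pivot(records, 'race', 'gender'))\n",
"```\n",
"\n",
"The first call averages salaries within each bucket; the second omits `values` and `aggfunc` and therefore just counts the rows in each bucket."
]
},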
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also group by a single column by passing only `rows` or only `columns`."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"departmentmean 0Health & Human Services 51324.9811Houston Airport System (HAS) 53990.3692Houston Fire Department (HFD) 59960.4413Houston Police Department-HPD 60428.7464Parks & Recreation 39426.1515Public Works & Engineering-PWE 50207.806"
],
"text/plain": [
""
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_table(rows='department', values='salary', aggfunc='mean')"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Health & Human ServicesHouston Airport System (HAS)Houston Fire Department (HFD)Houston Police Department-HPDParks & RecreationPublic Works & Engineering-PWE0 51324.981 53990.369 59960.441 60428.746 39426.151 50207.806"
],
"text/plain": [
""
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot_table(columns='department', values='salary', aggfunc='mean')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}