{"id":22039438,"url":"https://github.com/mpolinowski/python-asserts-cheatsheet","last_synced_at":"2026-04-13T22:03:22.950Z","repository":{"id":234831437,"uuid":"642887992","full_name":"mpolinowski/python-asserts-cheatsheet","owner":"mpolinowski","description":"Python Asserts in Data Science Cheat Sheet","archived":false,"fork":false,"pushed_at":"2023-05-19T15:14:13.000Z","size":1145,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-23T13:14:27.239Z","etag":null,"topics":["assert","pandas-dataframe","python"],"latest_commit_sha":null,"homepage":"https://mpolinowski.github.io/docs/Development/Python/2023-05-18-python-asserts/2023-05-18","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mpolinowski.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-05-19T15:14:07.000Z","updated_at":"2023-05-22T12:25:15.000Z","dependencies_parsed_at":"2024-04-21T02:03:21.041Z","dependency_job_id":"d3e7ff90-8a83-4796-9f4a-e8bd8f9021c8","html_url":"https://github.com/mpolinowski/python-asserts-cheatsheet","commit_stats":null,"previous_names":["mpolinowski/python-asserts-cheatsheet"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mpolinowski/python-asserts-cheatsheet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mpolinowski%2Fpython-asserts-cheatsheet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mpolinowski%2Fpython-asserts-cheatsheet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/mpolinowski%2Fpython-asserts-cheatsheet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mpolinowski%2Fpython-asserts-cheatsheet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mpolinowski","download_url":"https://codeload.github.com/mpolinowski/python-asserts-cheatsheet/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mpolinowski%2Fpython-asserts-cheatsheet/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31772643,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T20:17:16.280Z","status":"ssl_error","status_checked_at":"2026-04-13T20:17:08.216Z","response_time":93,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["assert","pandas-dataframe","python"],"created_at":"2024-11-30T11:10:51.734Z","updated_at":"2026-04-13T22:03:22.936Z","avatar_url":"https://github.com/mpolinowski.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Python Asserts Cheat Sheet\n\n\n## Datasets\n\n__Validating and Verifying Data__\n\n\n* [TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)\n\nYellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate 
types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab \u0026 Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.\n\n\n\u003c!-- TOC --\u003e\n\n- [Python Asserts Cheat Sheet](#python-asserts-cheat-sheet)\n  - [Datasets](#datasets)\n  - [Introduction to Asserts](#introduction-to-asserts)\n    - [Asserts in Python](#asserts-in-python)\n    - [Asserts in Pandas](#asserts-in-pandas)\n      - [Indices](#indices)\n      - [Series](#series)\n      - [DataFrames](#dataframes)\n    - [Asserts in Numpy](#asserts-in-numpy)\n  - [Assert-based Testing](#assert-based-testing)\n    - [Quantitative Tests](#quantitative-tests)\n    - [Logical Tests](#logical-tests)\n\n\u003c!-- /TOC --\u003e\n\n```python\n!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -P dataset\n```\n\n```python\nimport csv\nimport math\nimport numpy as np\nimport numpy.testing as npt\nimport pandas as pd\nimport pandas.testing as pdt\n```\n\n```python\nyellow_tripdata_df = pd.read_parquet(\n    'dataset/yellow_tripdata_2023-01.parquet'\n)\n\nyellow_tripdata_df.to_csv('dataset/yellow_tripdata_2023-01.csv')\n```\n\n```python\nyellow_tripdata_df = pd.read_csv(\n    'dataset/yellow_tripdata_2023-01.csv',\n    parse_dates=['tpep_pickup_datetime','tpep_dropoff_datetime'],\n    nrows=1000\n)\nyellow_tripdata_df.head(5)\n```\n\n```python\n# https://www.kaggle.com/datasets/neomatrix369/nyc-taxi-trip-duration-extended\ntrip_ext_df = pd.read_csv('dataset/nyc_trip_duration_extended.csv')\ntrip_ext_df.head(5)\n```\n\n## Introduction to Asserts\n\n### Asserts in Python\n\n```python\n# simple assert\nx = 'five'\nassert x == 5\n# AssertionError\n```\n\n```python\nlist = [6,2,3,4,5]\nassert all(list[i] \u003c= 
list[i+1] for i in range(len(list)-1))\n# AssertionError\n```\n\n```python\ndef add(a,b):\n    return a + b\n\nassert add(2,3) \u003c 5\n# AssertionError\n```\n\n```python\ntrip_ext_df.columns\n\n# Index(['name', 'district', 'neighbourhood', 'latitude', 'longitude',\n#        'geonumber'],\n#       dtype='object')\n```\n\n```python\nwith open('dataset/nyc_trip_duration_extended.csv') as f:\n    reader = csv.DictReader(f)\n    \n    expected_columns = ['name', 'district', 'neighbourhood', 'latitude', 'longitude',\n       'geonumber', 'missing_column']\n    \n    assert reader.fieldnames == expected_columns, f\"Expected columns: {expected_columns}, but got {reader.fieldnames}\"\n    \n# AssertionError:\n# Expected columns: ['name', 'district', 'neighbourhood', 'latitude', 'longitude', 'geonumber', 'missing_column'],\n# but got ['name', 'district', 'neighbourhood', 'latitude', 'longitude', 'geonumber']\n\n```\n\n```python\nwith open('dataset/yellow_tripdata_2023-01.csv') as f:\n    reader = csv.DictReader(f)\n    \n    for row in reader:\n        # check passenger count is positive int\n        assert float(row['passenger_count']) \u003e 0., f\"ERROR :: Invalid Passenger Count: {row['passenger_count']}\"\n        \n# AssertionError: ERROR :: Invalid Passenger Count: 0.0\n```\n\n```python\n# how many trips were without passengers\ntrips_without_passengers = 0\n\nwith open('dataset/yellow_tripdata_2023-01.csv') as f:\n    next(f) # skip header\n    \n    for line in f:\n        values = line.strip().split(',')\n        trip_id = values[0]\n        passenger_count = values[4]\n        if passenger_count == '0.0':\n            trips_without_passengers += 1\n            \nprint(f'Trips without passengers: {trips_without_passengers}')\n# Trips without passengers: 51164\n```\n\n```python\nperct_zero_trips = len(zero_trips) * 100 / len(yellow_tripdata_df)\nprint(\"%.2f\" % perct_zero_trips + ' %')\n# 1.67 %\n```\n\n### Asserts in Pandas\n\n#### Indices\n\n```python\nindex1 = 
pd.Index([1,2,3])\nindex2 = pd.Index([1,2,'three'])\n\npdt.assert_index_equal(index1, index2)\n\n# Index classes are different\n# [left]:  Int64Index([1, 2, 3], dtype='int64')\n# [right]: Index([1, 2, 'three'], dtype='object')\n```\n\n```python\nindex1 = pd.Index([1,2,3])\nindex2 = pd.Index([3,2,1])\n\npdt.assert_index_equal(index1, index2, check_order=True)\n# Index values are different (66.66667 %)\n# [left]:  Int64Index([1, 2, 3], dtype='int64')\n# [right]: Int64Index([3, 2, 1], dtype='int64')\n```\n\n```python\nindex1 = pd.Index([1.0,2.0,3.0])\nindex2 = pd.Index([1.0,2.0,3.1])\n\npdt.assert_index_equal(index1, index2, check_exact=False, atol=0.1)\n\n# Index values are different (33.33333 %)\n# [left]:  Float64Index([1.0, 2.0, 3.0], dtype='float64')\n# [right]: Float64Index([1.0, 2.0, 3.1], dtype='float64')\n```\n\n```python\nindex1 = pd.Index(yellow_tripdata_df['tpep_pickup_datetime'].dt.date)\nindex1[:3]\n# Index([2023-01-01, 2023-01-01, 2023-01-01], dtype='object', name='tpep_pickup_datetime')\n```\n\n```python\nindex2 = pd.Index(yellow_tripdata_df['tpep_dropoff_datetime'].dt.date)\nindex2[:3]\n# Index([2023-01-01, 2023-01-01, 2023-01-01], dtype='object', name='tpep_dropoff_datetime')\n```\n\n```python\npdt.assert_index_equal(index1, index2, check_exact=True, check_names=False)\n\n# Index values are different (0.3 %)\n# [left]:  Index([2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01,\n#        2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01,\n#        ...\n#        2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01,\n#        2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01],\n#       dtype='object', name='tpep_pickup_datetime', length=1000)\n# [right]: Index([2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01,\n#        2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01,\n#        ...\n#        2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01,\n#        2023-01-01, 2023-01-01, 
2023-01-01, 2023-01-01],\n#       dtype='object', name='tpep_dropoff_datetime', length=1000)\n```\n\n```python\n# show difference\nindex_diff = yellow_tripdata_df[\n    index1 != index2\n][\n    ['tpep_pickup_datetime','tpep_dropoff_datetime']\n]\n\nindex_diff.head()\n\n#     tpep_pickup_datetime tpep_dropoff_datetime\n# 383  2023-01-01 00:36:07   2023-01-02 00:17:13\n# 567  2022-12-31 23:59:37   2023-01-01 00:07:28\n# 761  2022-12-31 23:58:27   2023-01-01 00:02:21\n```\n\n#### Series\n\n```python\ns1 = pd.Series([1,2,3], name='series1')\ns2 = pd.Series([1,2,3], name='series2')\n\npdt.assert_series_equal(s1,s2,check_names=True)\n\n# Attribute \"name\" are different\n# [left]:  series1\n# [right]: series2\n```\n\n```python\ns1 = pd.Series([1,2,3], name='series1')\ns2 = pd.Series(['1','2','3'], name='series2')\n\npdt.assert_series_equal(s1,s2,check_names=False, check_dtype=False)\n\n# Series values are different (100.0 %)\n# [index]: [0, 1, 2]\n# [left]:  [1, 2, 3]\n# [right]: [1, 2, 3]\n```\n\n```python\npickup_series = yellow_tripdata_df['tpep_pickup_datetime'].dt.date\npickup_series.head(2)\n\n# 0    2023-01-01\n# 1    2023-01-01\n# Name: tpep_pickup_datetime, dtype: object\n```\n\n```python\ndropoff_series = yellow_tripdata_df['tpep_dropoff_datetime'].dt.date\ndropoff_series.head(2)\n\n# 0    2023-01-01\n# 1    2023-01-01\n# Name: tpep_dropoff_datetime, dtype: object\n```\n\n```python\npdt.assert_series_equal(pickup_series, dropoff_series, check_exact=True, check_names=False)\n\n# AssertionError: Series are different\n# Series values are different (0.3 %)\n```\n\n```python\n# drop values that don't fit\nindex_diff = pickup_series.index[pickup_series != dropoff_series]\nyellow_tripdata_df_drop = yellow_tripdata_df.drop(index_diff).reset_index(drop=True)\n# rebuild series\npickup_series = yellow_tripdata_df_drop['tpep_pickup_datetime'].dt.date\ndropoff_series = yellow_tripdata_df_drop['tpep_dropoff_datetime'].dt.date\n# re-check - this time it 
works\npdt.assert_series_equal(pickup_series, dropoff_series, check_exact=True, check_names=False)\n```\n\n#### DataFrames\n\n```python\ndf1 = pd.DataFrame({'A': [1,2,3], 'B': [3,2,1]})\ndf2 = pd.DataFrame({'B': [3,2,1], 'A': [1,2,3]})\n\npdt.assert_frame_equal(df1,df2,check_like=False)\n\n# DataFrame.columns values are different (100.0 %)\n# [left]:  Index(['A', 'B'], dtype='object')\n# [right]: Index(['B', 'A'], dtype='object')\n```\n\n```python\npickup_df = yellow_tripdata_df.copy()\ndropoff_df = yellow_tripdata_df.copy()\n\npickup_df['date'] = yellow_tripdata_df['tpep_pickup_datetime'].dt.date\ndropoff_df['date'] = yellow_tripdata_df['tpep_dropoff_datetime'].dt.date\n\npdt.assert_frame_equal(\n    pickup_df[['date']],\n    dropoff_df[['date']],\n    check_exact=True,\n    check_names=False\n)\n\n# AssertionError: DataFrame.iloc[:, 0] (column name=\"date\") are different\n# DataFrame.iloc[:, 0] (column name=\"date\") values are different (0.3 %)\n```\n\n```python\n# get index of mismatched rows\nindex_diff = pickup_df.index[\n    pickup_df['date'].ne(dropoff_df['date'])\n]\n\nindex_diff\n# Int64Index([383, 567, 761], dtype='int64')\n```\n\n```python\n# drop rows at those indices\npickup_df_drop = pickup_df.drop(index_diff)\ndropoff_df_drop = dropoff_df.drop(index_diff)\n\n# verify that assert now works\npdt.assert_frame_equal(\n    pickup_df_drop[['date']],\n    dropoff_df_drop[['date']],\n    check_exact=True,\n    check_names=False\n)\n```\n\n### Asserts in Numpy\n\n```python\na = np.array([1,2,3])\nb = np.array([3,2,1])\n\nnpt.assert_array_equal(a,b)\n\n# AssertionError: Arrays are not equal\n# Mismatched elements: 2 / 3 (66.7%)\n# Max absolute difference: 2\n# Max relative difference: 2.\n#  x: array([1, 2, 3])\n#  y: array([3, 2, 1])\n```\n\n```python\nstring1 = 'string'\nstring2 = 'STRING'\n\nnpt.assert_string_equal(string1,string2)\n```\n\n```python\na = np.array([1,2,3])\nb = np.array([0.98,2.02,2.98])\n\nnpt.assert_allclose(a,b,atol=0.01)\n\n# 
AssertionError: Not equal to tolerance rtol=1e-07, atol=0.01\n# Mismatched elements: 3 / 3 (100%)\n# Max absolute difference: 0.02\n# Max relative difference: 0.02040816\n# x: array([1, 2, 3])\n# y: array([0.98, 2.02, 2.98])\n```\n\n```python\na = np.array([1,2,3])\nb = np.array([2,3,4])\n\nnpt.assert_array_less(a,b)\n```\n\n```python\nnpt.assert_array_less(b,a)\n\n# AssertionError: Arrays are not less-ordered\n# Mismatched elements: 3 / 3 (100%)\n# Max absolute difference: 1\n# Max relative difference: 1.\n#  x: array([2, 3, 4])\n#  y: array([1, 2, 3])\n```\n\n## Assert-based Testing\n\n### Quantitative Tests\n\n```python\ndef test_for_missing_data(df):\n    # count all missing values and assert number to be zero\n    assert df.isnull().sum().sum() == 0, 'ERROR :: DataFrame contains missing data!'\n    return True\n```\n\n```python\nassert test_for_missing_data(trip_ext_df)\n# True\n```\n\n```python\ndef test_non_numerical_data_types(df, columns):\n    for col in columns:\n        assert df[col].dtype == 'int64' or df[col].dtype =='float64', f'ERROR :: {col} has a non-numerical dType'\n    return True\n```\n\n```python\ntest_columns = ['neighbourhood','latitude','longitude','geonumber']\nassert test_non_numerical_data_types(trip_ext_df, trip_ext_df[test_columns])\n# AssertionError: ERROR :: neighbourhood has a non-numerical dType\n```\n\n```python\ndef test_for_out_of_range(df, columns):\n    for col in columns:\n        assert df[col].dtype == 'int64' or df[col].dtype == 'float64', f'ERROR :: {col} has a non-numerical dType'\n        assert df[col].max() \u003c= math.inf, f'ERROR :: {col} contains infinite values'\n        assert df[col].min() \u003e= -math.inf, f'ERROR :: {col} contains infinite values'\n        assert not np.isnan(df[col]).any(), f'ERROR :: {col} contains NaN values'\n        assert not np.isinf(df[col]).any(), f'ERROR :: {col} contains infinite values'\n    return True\n```\n\n```python\ntest_columns = 
['latitude','longitude','geonumber']\nassert test_for_out_of_range(trip_ext_df, trip_ext_df[test_columns])\n# True\n```\n\n### Logical Tests\n\n```python\ndef test_for_logical_errors(df):\n    # all dropoffs AFTER pickups\n    assert all(df['tpep_dropoff_datetime'] \u003e df['tpep_pickup_datetime']), 'ERROR :: Drop-off time before pickup'\n    # no negative trip distances\n    assert (df['trip_distance'] \u003e= 0).all(), 'ERROR :: Negative trip distances'\n    # no negative passenger count\n    assert (df['passenger_count'] \u003e= 0).all(), 'ERROR :: Negative passenger count'\n    \n    return True\n```\n\n```python\nassert test_for_logical_errors(yellow_tripdata_df)\n# True\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmpolinowski%2Fpython-asserts-cheatsheet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmpolinowski%2Fpython-asserts-cheatsheet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmpolinowski%2Fpython-asserts-cheatsheet/lists"}