{"id":20325428,"url":"https://github.com/getyourguide/typedpyspark","last_synced_at":"2026-03-01T20:31:58.533Z","repository":{"id":42431581,"uuid":"431867522","full_name":"getyourguide/TypedPyspark","owner":"getyourguide","description":"Type-annotate your spark dataframes and validate them","archived":false,"fork":false,"pushed_at":"2023-09-27T09:13:53.000Z","size":55,"stargazers_count":14,"open_issues_count":3,"forks_count":3,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-10T21:40:04.285Z","etag":null,"topics":["pyspark","python","spark","typing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/getyourguide.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-25T14:06:38.000Z","updated_at":"2024-03-15T14:37:29.000Z","dependencies_parsed_at":"2024-08-02T13:29:12.662Z","dependency_job_id":null,"html_url":"https://github.com/getyourguide/TypedPyspark","commit_stats":{"total_commits":45,"total_committers":3,"mean_commits":15.0,"dds":0.06666666666666665,"last_synced_commit":"d9ad544c07ff4a80b2d0349ed8d5a3b7b9fca0ea"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getyourguide%2FTypedPyspark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getyourguide%2FTypedPyspark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getyourguide%2FTypedPyspark/releases","manifests_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getyourguide%2FTypedPyspark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/getyourguide","download_url":"https://codeload.github.com/getyourguide/TypedPyspark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248468554,"owners_count":21108838,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pyspark","python","spark","typing"],"created_at":"2024-11-14T19:39:48.047Z","updated_at":"2026-03-01T20:31:58.492Z","avatar_url":"https://github.com/getyourguide.png","language":"Python","readme":"# TypedPyspark\n\nA set of abstractions for typing dataframes in PySpark.\n\nIt allows you to:\n\n- Define dataframe schemas and use them as annotations on variables, functions, and classes\n- Easily create mock data for tests based on these schemas\n\n# Rationale\n\nIn projects that use Spark and follow software engineering best practices, it is common to see\nfunctions defined with type annotations like this:\n\n```py\nfrom pyspark.sql import DataFrame\n\ndef get_name_from_id(dt: DataFrame) -\u003e DataFrame:\n    ...\n```\n\nBut this type annotation only guarantees that a DataFrame instance is passed in.\nIt says nothing about what the dataframe looks like:\n\n1. If the columns needed are there\n2. If they have the correct types\n3. 
If there is far more data than needed\n\nThis library tries to address exactly these problems.\nBy running it you get type errors when the annotations don't match reality.\nYou also get self-documenting code in the form of expressive annotations.\n\n\n# How to use it\n\n```py\nimport datetime\nfrom datetime import date\n\nfrom typed_pyspark import Dataframe\n\nReviewTable = Dataframe(\n    default_values={\n        \"original_review_id\": 0,\n        \"is_original_language\": True,\n        \"review_text_id\": 1,\n        \"rating\": 2.4,\n    },\n    schema={\n        \"date_of_review\": \"Timestamp\",\n        \"review_id\": \"Integer\",\n        \"tour_id\": \"Integer\",\n        \"rating\": \"Double\",\n        \"original_review_id\": \"Integer\",\n        \"is_original_language\": \"Boolean\",\n        \"review_text_id\": \"Integer\",\n    },\n)\n\nReviewTableType = ReviewTable.type_annotation()\n\nDaily_Reviews = Dataframe(\n    schema={\n        \"date\": \"Date\",\n        \"tour_id\": \"Integer\",\n        \"num_reviews\": \"Integer\",\n        \"avg_star_rating\": \"Double\",\n    },\n    default_values={},\n)\n\nDaily_ReviewsType = Daily_Reviews.type_annotation()\n\n\n# defining type annotations\ndef calculate_daily_review_data(\n    date_begin: date, date_end: date, reviews: ReviewTableType\n) -\u003e Daily_ReviewsType:\n    ...\n\n# writing tests\ndef test_dates_are_filtered():\n    reviews_df = ReviewTable.create_df(\n        [\n            {\n                \"review_id\": 1,\n                \"tour_id\": 2,\n            },\n            {\n                \"review_id\": 2,\n                \"tour_id\": 2,\n            },\n        ],\n    )\n\n    result_df: RawReviewsType = get_raw_reviews(datetime.date(2021, 12, 15), reviews_df)\n    expected_df = RawReviews.create_df(\n        [\n            {\n                \"review_id\": 1,\n                \"tour_id\": 2,\n            },\n        ],\n    )\n\n\n```\n\n# Install\n\n```sh\npip install 'typed-pyspark~=0.0.4'\n```\n\n## Acknowledgements\n\nInspired 
by [dataenforce](https://github.com/CedricFR/dataenforce) which provides similar functionality for pandas.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetyourguide%2Ftypedpyspark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgetyourguide%2Ftypedpyspark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetyourguide%2Ftypedpyspark/lists"}