{"id":14982362,"url":"https://github.com/mrpowers-io/quinn","last_synced_at":"2026-06-09T10:00:51.144Z","repository":{"id":25290911,"uuid":"103657756","full_name":"mrpowers-io/quinn","owner":"mrpowers-io","description":"pyspark methods to enhance developer productivity 📣 👯 🎉","archived":false,"fork":false,"pushed_at":"2025-03-06T03:34:31.000Z","size":2073,"stargazers_count":687,"open_issues_count":25,"forks_count":95,"subscribers_count":18,"default_branch":"main","last_synced_at":"2026-05-23T05:05:16.724Z","etag":null,"topics":["apache-spark","pyspark"],"latest_commit_sha":null,"homepage":"https://mrpowers-io.github.io/quinn/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mrpowers-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-09-15T13:02:42.000Z","updated_at":"2026-04-06T19:16:03.000Z","dependencies_parsed_at":"2024-01-12T01:13:15.015Z","dependency_job_id":"0f2af64d-392a-49fa-aae9-a8aac46d3460","html_url":"https://github.com/mrpowers-io/quinn","commit_stats":{"total_commits":317,"total_committers":31,"mean_commits":"10.225806451612904","dds":0.555205047318612,"last_synced_commit":"20156582034c5d25a52223b3c4ca992d37c656fa"},"previous_names":["mrpowers-io/quinn","mrpowers/quinn"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/mrpowers-io/quinn","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Fquinn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Fquinn/tags","releases_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Fquinn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Fquinn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mrpowers-io","download_url":"https://codeload.github.com/mrpowers-io/quinn/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Fquinn/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34101070,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","pyspark"],"created_at":"2024-09-24T14:05:16.401Z","updated_at":"2026-06-09T10:00:50.836Z","avatar_url":"https://github.com/mrpowers-io.png","language":"Python","funding_links":[],"categories":["Python","Packages"],"sub_categories":["General Purpose Libraries"],"readme":"# Quinn\n\n![![image](https://github.com/MrPowers/quinn/workflows/build/badge.svg)](https://github.com/MrPowers/quinn/actions/workflows/ci.yml/badge.svg)\n![![image](https://github.com/MrPowers/quinn/workflows/build/badge.svg)](https://github.com/MrPowers/quinn/actions/workflows/lint.yaml/badge.svg)\n![PyPI - Downloads](https://img.shields.io/pypi/dm/quinn)\n[![PyPI 
version](https://badge.fury.io/py/quinn.svg)](https://badge.fury.io/py/quinn)\n\nPySpark helper methods to maximize developer productivity.\n\nQuinn provides DataFrame validation functions, useful column functions / DataFrame transformations, and performant helper functions.\n\n![quinn](https://github.com/MrPowers/quinn/raw/master/quinn.png)\n\n## Documentation\n\nYou can find official documentation [here](https://mrpowers.github.io/quinn/).\n\n## Setup\n\nQuinn is [uploaded to PyPI](https://pypi.org/project/quinn/) and can be installed with this command:\n\n```\npip install quinn\n```\n\n## Quinn Helper Functions\n\n```python\nimport quinn\n```\n\n### DataFrame Validations\n\n**validate_presence_of_columns()**\n\nRaises an exception unless `source_df` contains the `name`, `age`, and `fun` columns.\n\n```python\nquinn.validate_presence_of_columns(source_df, [\"name\", \"age\", \"fun\"])\n```\n\n**validate_schema()**\n\nRaises an exception unless `source_df` contains all the `StructFields` defined in the `required_schema`.\n\n```python\nquinn.validate_schema(source_df, required_schema)\n```\n\n**validate_absence_of_columns()**\n\nRaises an exception if `source_df` contains `age` or `cool` columns.\n\n```python\nquinn.validate_absence_of_columns(source_df, [\"age\", \"cool\"])\n```\n\n### Functions\n\n**single_space()**\n\nReplaces all runs of multiple spaces with single spaces (e.g. changes `\"this has   some\"` to `\"this has some\"`).\n\n```python\nactual_df = source_df.withColumn(\n    \"words_single_spaced\",\n    quinn.single_space(col(\"words\"))\n)\n```\n\n**remove_all_whitespace()**\n\nRemoves all whitespace in a string (e.g. changes `\"this has some\"` to `\"thishassome\"`).\n\n```python\nactual_df = source_df.withColumn(\n    \"words_without_whitespace\",\n    quinn.remove_all_whitespace(col(\"words\"))\n)\n```\n\n**anti_trim()**\n\nRemoves all inner whitespace, but doesn't delete leading or trailing whitespace (e.g. 
changes `\" this has some \"` to `\" thishassome \"`).\n\n```python\nactual_df = source_df.withColumn(\n    \"words_anti_trimmed\",\n    quinn.anti_trim(col(\"words\"))\n)\n```\n\n**remove_non_word_characters()**\n\nRemoves all non-word characters from a string (e.g. changes `\"si%$#@!#$!@#mpsons\"` to `\"simpsons\"`).\n\n```python\nactual_df = source_df.withColumn(\n    \"words_without_nonword_chars\",\n    quinn.remove_non_word_characters(col(\"words\"))\n)\n```\n\n**multi_equals()**\n\n`multi_equals` returns true if `s1` and `s2` are both equal to `\"cat\"`.\n\n```python\nsource_df.withColumn(\n    \"are_s1_and_s2_cat\",\n    quinn.multi_equals(\"cat\")(col(\"s1\"), col(\"s2\"))\n)\n```\n\n**approx_equal()**\n\nTakes three arguments, two PySpark Columns and a numeric threshold, and returns a Boolean column that indicates whether the two columns are equal within the threshold.\n\n```\nlet the columns be\ncol1 = [1.2, 2.5, 3.1, 4.0, 5.5]\ncol2 = [1.3, 2.3, 3.0, 3.9, 5.6]\nthreshold = 0.2\n\nresult = approx_equal(col(\"col1\"), col(\"col2\"), threshold)\nresult.show()\n\n+-----+\n|value|\n+-----+\n| true|\n|false|\n| true|\n| true|\n| true|\n+-----+\n```\n\n**array_choice()**\n\nTakes a Column as a parameter and returns a PySpark column that contains a random value from the input column.\n\n```\ndf = spark.createDataFrame([(1,), (2,), (3,), (4,), (5,)], [\"values\"])\nresult = df.select(array_choice(col(\"values\")))\n\nThe output is :=\n+--------------+\n|array_choice()|\n+--------------+\n|             2|\n+--------------+\n\n```\n\n**regexp_extract_all()**\n\n`regexp_extract_all` takes two parameters: a string `s` and `regexp`, a regular expression. 
This function finds all the matches in the string that satisfy the regular expression.\n\n```\nprint(regexp_extract_all(\"this is a example text message for testing application\",r\"\\b\\w*a\\w*\\b\"))\n\nThe output is :=\n['a', 'example', 'message', 'application']\n\n```\n\nThe pattern `r\"\\b\\w*a\\w*\\b\"` matches words containing the letter `a`.\n\n**week_start_date()**\n\nTakes two parameters, `column` and `week_start_day`, and returns a Spark DataFrame column that contains the start date of the week. By default, `week_start_day` is set to \"Sun\".\n\nFor input `[\"2023-03-05\", \"2023-03-06\", \"2023-03-07\", \"2023-03-08\"]` the output is:\n\n```\nresult = df.select(\"date\", week_start_date(col(\"date\"), \"Sun\"))\nresult.show()\n+----------+----------------+\n|      date|week_start_date |\n+----------+----------------+\n|2023-03-05|      2023-03-05|\n|2023-03-07|      2023-03-05|\n|2023-03-08|      2023-03-05|\n+----------+----------------+\n```\n\n**week_end_date()**\n\nTakes two parameters, `column` and `week_end_day`, and returns a DataFrame column that contains the end date of the week. 
By default, `week_end_day` is set to \"sat\".\n\n```\n+----------+-------------+\n|      date|week_end_date|\n+----------+-------------+\n|2023-03-05|   2023-03-05|\n|2023-03-07|   2023-03-12|\n|2023-03-08|   2023-03-12|\n+----------+-------------+\n\n```\n\n**uuid5()**\n\nGenerates a UUIDv5 string from the passed column, with an optional namespace and an optional extra salt.\nBy default, the namespace is the NAMESPACE_DNS UUID and no extra string is used to reduce hash collisions.\n\n```\n\ndf = spark.createDataFrame([(\"lorem\",), (\"ipsum\",)], [\"values\"])\nresult = df.select(quinn.uuid5(F.col(\"values\")).alias(\"uuid5\"))\nresult.show(truncate=False)\n\nThe output is :=\n+------------------------------------+\n|uuid5                               |\n+------------------------------------+\n|35482fda-c10a-5076-8da2-dc7bf22d6be4|\n|51b79c1d-d06c-5b30-a5c6-1fadcd3b2103|\n+------------------------------------+\n\n```\n\n### Transformations\n\n**snake_case_col_names()**\n\nConverts all the column names in a DataFrame to snake_case. It's annoying to write SQL queries when columns aren't snake cased.\n\n```python\nquinn.snake_case_col_names(source_df)\n```\n\n**sort_columns()**\n\nSorts the DataFrame columns in alphabetical order, including nested columns if `sort_nested` is set to `True`. 
Wide DataFrames are easier to navigate when they're sorted alphabetically.\n\n```python\nquinn.sort_columns(df=source_df, sort_order=\"asc\", sort_nested=True)\n```\n\n### DataFrame Helpers\n\n**with_columns_renamed()**\n\nRenames all or multiple columns in a DataFrame by applying a common renaming function to the column names.\n\nSuppose you have the following two DataFrames for orders coming from source A and source B:\n\n```\norder_a_df.show()\n\n+--------+---------+--------+\n|order_id|order_qty|store_id|\n+--------+---------+--------+\n|     001|       23|    45AB|\n|     045|        2|    98HX|\n|     021|      142|    09AA|\n+--------+---------+--------+\n\norder_b_df.show()\n\n+--------+---------+--------+\n|order_id|order_qty|store_id|\n+--------+---------+--------+\n|     001|       23|    47AB|\n|     985|        2|    54XX|\n|    0112|       12|    09AA|\n+--------+---------+--------+\n```\n\nNow suppose you need to join these two DataFrames. In Spark, joining two DataFrames with identical column names can cause ambiguous column name errors, because the resulting DataFrame has multiple columns with the same name. It's therefore good practice to rename these columns to reflect which DataFrame they originate from:\n\n```python\ndef add_suffix(s):\n    return s + '_a'\n\norder_a_df_renamed = quinn.with_columns_renamed(add_suffix)(order_a_df)\n\norder_a_df_renamed.show()\n```\n```\n+----------+-----------+----------+\n|order_id_a|order_qty_a|store_id_a|\n+----------+-----------+----------+\n|       001|         23|      45AB|\n|       045|          2|      98HX|\n|       021|        142|      09AA|\n+----------+-----------+----------+\n```\n\n**column_to_list()**\n\nConverts a column in a DataFrame to a list of values.\n\n```python\nquinn.column_to_list(source_df, \"name\")\n```\n\n**two_columns_to_dictionary()**\n\nConverts two columns of a DataFrame into a dictionary. 
In this example, `name` is the key and `age` is the value.\n\n```python\nquinn.two_columns_to_dictionary(source_df, \"name\", \"age\")\n```\n\n**to_list_of_dictionaries()**\n\nConverts an entire DataFrame into a list of dictionaries.\n\n```python\nquinn.to_list_of_dictionaries(source_df)\n```\n\n**show_output_to_df()**\n\n```python\nquinn.show_output_to_df(output_str, spark)\n```\n\nParses a Spark DataFrame output string into a Spark DataFrame. Useful for quickly pulling data from a log into a DataFrame. In this example, `output_str` is a string of the form:\n\n```\n+----+---+-----------+------+\n|name|age|     stuff1|stuff2|\n+----+---+-----------+------+\n|jose|  1|nice person|  yoyo|\n|  li|  2|nice person|  yoyo|\n| liz|  3|nice person|  yoyo|\n+----+---+-----------+------+\n```\n\n### Schema Helpers\n\n**schema_from_csv()**\n\nConverts a CSV file into a PySpark schema (aka `StructType`). The CSV must contain the column name and type. The nullable and metadata columns are optional.\n\n```python\nquinn.schema_from_csv(spark, \"schema.csv\")\n```\n\nHere's an example CSV file:\n\n```\nname,type\nperson,string\naddress,string\nphoneNumber,string\nage,int\n```\n\nHere's how to convert that CSV file to a PySpark schema using schema_from_csv():\n\n```python\nschema = schema_from_csv(spark, \"some_file.csv\")\n\nStructType([\n    StructField(\"person\", StringType(), True),\n    StructField(\"address\", StringType(), True),\n    StructField(\"phoneNumber\", StringType(), True),\n    StructField(\"age\", IntegerType(), True),\n])\n```\n\nHere's a more complex CSV file:\n\n```\nname,type,nullable,metadata\nperson,string,false,{\"description\":\"The person's name\"}\naddress,string\nphoneNumber,string,TRUE,{\"description\":\"The person's phone number\"}\nage,int,False\n```\n\nHere's how to read this CSV file into a PySpark schema:\n\n```python\nanother_schema = schema_from_csv(spark, \"some_file.csv\")\n\nStructType([\n    StructField(\"person\", StringType(), False, 
{\"description\": \"The person's name\"}),\n    StructField(\"address\", StringType(), True),\n    StructField(\"phoneNumber\", StringType(), True, {\"description\": \"The person's phone number\"}),\n    StructField(\"age\", IntegerType(), False),\n])\n```\n\n**print_schema_as_code()**\n\nConverts a Spark `DataType` to a string of Python code that can be evaluated as code using eval(). If the `DataType` is a `StructType`, this can be used to print an existing schema in a format that can be copy-pasted into a Python script, log to a file, etc. \n\nFor example:\n\n```python\n# Consider the below schema for fields\nfields = [\n    StructField(\"simple_int\", IntegerType()),\n    StructField(\"decimal_with_nums\", DecimalType(19, 8)),\n    StructField(\"array\", ArrayType(FloatType()))\n]\nschema = StructType(fields)\n\nprintable_schema: str = quinn.print_schema_as_code(schema)\nprint(printable_schema)\n```\n\n```\nStructType(\n\tfields=[\n\t\tStructField(\"simple_int\", IntegerType(), True),\n\t\tStructField(\"decimal_with_nums\", DecimalType(19, 8), True),\n\t\tStructField(\n\t\t\t\"array\",\n\t\t\tArrayType(FloatType()),\n\t\t\tTrue,\n\t\t),\n\t]\n)\n```\n\nOnce evaluated, the printable schema is a valid schema that can be used in dataframe creation, validation, etc.\n\n```python\nfrom chispa.schema_comparer import assert_basic_schema_equality\n\nparsed_schema = eval(printable_schema)\nassert_basic_schema_equality(parsed_schema, schema) # passes\n```\n\n`print_schema_as_code()` can also be used to print other `DataType` objects.\n\n `ArrayType`\n```python\narray_type = ArrayType(FloatType())\nprintable_type: str = quinn.print_schema_as_code(array_type)\nprint(printable_type)\n ```\n\n ```\nArrayType(FloatType())\n ```\n\n`MapType`\n```python\nmap_type = MapType(StringType(), FloatType())\nprintable_type: str = quinn.print_schema_as_code(map_type)\nprint(printable_type)\n ```\n\n ```\nMapType(\n        StringType(),\n        FloatType(),\n        True,\n)\n 
```\n\n`IntegerType`, `StringType` etc.\n```python\ninteger_type = IntegerType()\nprintable_type: str = quinn.print_schema_as_code(integer_type)\nprint(printable_type)\n ```\n\n ```\nIntegerType()\n ```\n\n## Pyspark Core Class Extensions\n\n```\nimport pyspark.sql.functions as F\nimport quinn\n```\n\n### Column Extensions\n\n**is_falsy()**\n\nReturns a Column indicating whether all values in the Column are False or NULL: `True` if `has_stuff` is `None` or `False`.\n\n```python\nsource_df.withColumn(\"is_stuff_falsy\", quinn.is_falsy(F.col(\"has_stuff\")))\n```\n\n**is_truthy()**\n\nCalculates a boolean expression that is the opposite of is_falsy for the given Column: `True` unless `has_stuff` is `None` or `False`.\n\n```python\nsource_df.withColumn(\"is_stuff_truthy\", quinn.is_truthy(F.col(\"has_stuff\")))\n```\n\n**is_null_or_blank()**\n\nReturns a Boolean value which expresses whether a given column is NULL or contains only blank characters: `True` if `blah` is `null` or blank (the empty string or a string that only contains whitespace).\n\n```python\nsource_df.withColumn(\"is_blah_null_or_blank\", quinn.is_null_or_blank(F.col(\"blah\")))\n```\n\n**is_not_in()**\n\nTo see if a value is not in a list of values: `True` if `fun_thing` is not included in the `bobs_hobbies` list.\n\n```python\nsource_df.withColumn(\"is_not_bobs_hobby\", quinn.is_not_in(F.col(\"fun_thing\")))\n```\n\n**null_between()**\n\nTo see if a value is between two values in a null friendly way: `True` if `age` is between `lower_age` and `upper_age`. If `lower_age` is populated and `upper_age` is `null`, it will return `True` if `age` is greater than or equal to `lower_age`. 
If `lower_age` is `null` and `upper_age` is populated, it will return `True` if `age` is lower than or equal to `upper_age`.\n\n```python\nsource_df.withColumn(\"is_between\", quinn.null_between(F.col(\"age\"), F.col(\"lower_age\"), F.col(\"upper_age\")))\n```\n\n## Contributing\n\nWe are actively looking for feature requests, pull requests, and bug fixes.\n\nAny developer that demonstrates excellence will be invited to be a maintainer of the project.\n\n### Code Style\n\nWe use the [PySpark code style](https://github.com/MrPowers/spark-style-guide/blob/main/PYSPARK_STYLE_GUIDE.md) and the `sphinx` docstring format. For more details about the `sphinx` format, see [this tutorial](https://sphinx-rtd-tutorial.readthedocs.io/en/latest/docstrings.html). A short example of a `sphinx`-formatted docstring follows:\n\n```python\n\"\"\"[Summary]\n\n:param [ParamName]: [ParamDescription], defaults to [DefaultParamVal]\n:type [ParamName]: [ParamType](, optional)\n...\n:raises [ErrorType]: [ErrorDescription]\n...\n:return: [ReturnDescription]\n:rtype: [ReturnType]\n\"\"\"\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrpowers-io%2Fquinn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmrpowers-io%2Fquinn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrpowers-io%2Fquinn/lists"}