{"id":43373410,"url":"https://github.com/ketgo/marshmallow-pyspark","last_synced_at":"2026-02-02T06:10:58.688Z","repository":{"id":52217332,"uuid":"235133347","full_name":"ketgo/marshmallow-pyspark","owner":"ketgo","description":"Marshmallow serializer integration with pyspark","archived":false,"fork":false,"pushed_at":"2023-12-29T19:40:02.000Z","size":65,"stargazers_count":12,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-25T11:41:57.549Z","etag":null,"topics":["data-cleaning","data-engineering","data-engineering-pipeline","data-pipelines","data-schemas","marshmallow","pyspark","schema","spark"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ketgo.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-20T15:27:57.000Z","updated_at":"2023-09-14T12:05:40.000Z","dependencies_parsed_at":"2024-11-20T10:55:29.275Z","dependency_job_id":"a953a14d-8de7-471e-8ddf-3d9b7d1c4772","html_url":"https://github.com/ketgo/marshmallow-pyspark","commit_stats":{"total_commits":40,"total_committers":3,"mean_commits":"13.333333333333334","dds":"0.050000000000000044","last_synced_commit":"03718cfb0059a7533c630d4077fbb3d62c27a5b5"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/ketgo/marshmallow-pyspark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ketgo%2Fmarshmallow-pyspark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/ketgo%2Fmarshmallow-pyspark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ketgo%2Fmarshmallow-pyspark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ketgo%2Fmarshmallow-pyspark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ketgo","download_url":"https://codeload.github.com/ketgo/marshmallow-pyspark/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ketgo%2Fmarshmallow-pyspark/sbom","scorecard":{"id":556536,"data":{"date":"2025-08-11","repo":{"name":"github.com/ketgo/marshmallow-pyspark","commit":"f9443aeca5202273db9ebe89f6b19ca49079c95c"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.5,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":1,"reason":"Found 2/15 approved changesets -- score normalized to 1","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Token-Permissions","score":0,"reason":"detected 
GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/ci.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/ci.yml:19: update your workflow using https://app.stepsecurity.io/secureworkflow/ketgo/marshmallow-pyspark/ci.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/ci.yml:21: update your workflow using https://app.stepsecurity.io/secureworkflow/ketgo/marshmallow-pyspark/ci.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/ci.yml:26: update your workflow using https://app.stepsecurity.io/secureworkflow/ketgo/marshmallow-pyspark/ci.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/ci.yml:37: update your workflow using 
https://app.stepsecurity.io/secureworkflow/ketgo/marshmallow-pyspark/ci.yml/master?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/ci.yml:31","Info:   0 out of   3 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   1 third-party GitHubAction dependencies pinned","Info:   0 out of   1 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses 
fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 20 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-20T12:30:17.963Z","repository_id":52217332,"created_at":"2025-08-20T12:30:17.963Z","updated_at":"2025-08-20T12:30:17.963Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29006788,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-02T04:25:24.522Z","status":"ssl_error","status_checked_at":"2026-02-02T04:24:51.069Z","response_time":58,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-cleaning","data-engineering","data-engineering-pipeline","data-pipelines","data-schemas","marshmallow","pyspark","schema","spark"],"created_at":"2026-02-02T06:10:58.208Z","updated_at":"2026-02-02T06:10:58.675Z","avatar_url":"https://github.com/ketgo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# marshmallow-pyspark\n\n[![Build Status](https://travis-ci.com/ketgo/marshmallow-pyspark.svg?token=oCVxhfjJAa2zDdszGjoy\u0026branch=master)](https://travis-ci.com/ketgo/marshmallow-pyspark)\n[![codecov.io](https://codecov.io/gh/ketgo/marshmallow-pyspark/coverage.svg?branch=master)](https://codecov.io/gh/ketgo/marshmallow-pyspark/coverage.svg?branch=master)\n[![Apache 2.0 
licensed](https://img.shields.io/badge/License-Apache%202.0-yellow.svg)](https://raw.githubusercontent.com/ketgo/marshmallow-pyspark/master/LICENSE)\n\n[Marshmallow](https://marshmallow.readthedocs.io/en/stable/) is a popular package used for data serialization and validation. \nOne defines data schemas in marshmallow containing rules on how input data should be marshalled. Similar to marshmallow, \n[pyspark](https://spark.apache.org/docs/latest/api/python/index.html) also comes with its own schema definitions used to \nprocess data frames. This package enables users to utilize marshmallow schemas and their powerful data validation capabilities \nin pyspark applications. Such capabilities can be utilized in data-pipeline ETL jobs where data consistency and quality \nare important.\n\n## Install\n\nThe package can be installed using `pip`:\n```bash\n$ pip install marshmallow-pyspark\n```\n\n## Usage\n\nData schemas can be defined the same way as you would using marshmallow. A quick example is shown below:\n```python\nfrom marshmallow_pyspark import Schema\nfrom marshmallow import fields\n\n# Create data schema.\nclass AlbumSchema(Schema):\n    title = fields.Str()\n    release_date = fields.Date()\n\n# Input data frame to validate.\ndf = spark.createDataFrame([\n    {\"title\": \"valid_1\", \"release_date\": \"2020-1-10\"},\n    {\"title\": \"valid_2\", \"release_date\": \"2020-1-11\"},\n    {\"title\": \"invalid_1\", \"release_date\": \"2020-31-11\"},\n    {\"title\": \"invalid_2\", \"release_date\": \"2020-1-51\"},\n])\n\n# Get data frames with valid rows and error prone rows \n# from input data frame by validating using the schema.\nvalid_df, errors_df = AlbumSchema().validate_df(df)\n\n# Output of valid data frame\nvalid_df.show()\n#    +-------+------------+\n#    |  title|release_date|\n#    +-------+------------+\n#    |valid_1|  2020-01-10|\n#    |valid_2|  2020-01-11|\n#    +-------+------------+\n\n# Output of errors data frame\nerrors_df.show()\n#    
+--------------------+\n#    |             _errors|\n#    +--------------------+\n#    |{\"row\": {\"release...|\n#    |{\"row\": {\"release...|\n#    +--------------------+\n```\n\n### More Options\n\nIn addition to the options supported by marshmallow, the `Schema` class comes with two additional initialization arguments:\n\n- `error_column_name`: name of the column to store validation errors. Default value is `_errors`.\n\n- `split_errors`: split rows with validation errors into a separate data frame from valid rows. When set to `False` the \n   rows with errors are returned together with valid rows as a single data frame. The field values of all error rows are \n   set to `null`. For user convenience the original field values can be found in the `row` attribute of the error JSON. \n   Default value is `True`. \n\nAn example is shown below:\n```python\nfrom marshmallow import EXCLUDE\n\nschema = AlbumSchema(\n    error_column_name=\"custom_errors\",     # Use 'custom_errors' as name for errors column\n    split_errors=False,                     # Don't split the input data frame into valid and errors\n    unknown=EXCLUDE                         # Marshmallow option to exclude fields not present in schema\n)\n\n# Input data frame to validate.\ndf = spark.createDataFrame([\n    {\"title\": \"valid_1\", \"release_date\": \"2020-1-10\", \"garbage\": \"wdacfa\"},\n    {\"title\": \"valid_2\", \"release_date\": \"2020-1-11\", \"garbage\": \"5wacfa\"},\n    {\"title\": \"invalid_1\", \"release_date\": \"2020-31-11\", \"garbage\": \"3aqf\"},\n    {\"title\": \"invalid_2\", \"release_date\": \"2020-1-51\", \"garbage\": \"vda\"},\n])\n\nvalid_df, errors_df = schema.validate_df(df)\n\n# Output of valid data frame. 
Contains rows with errors as\n# the option 'split_errors' was set to False.\nvalid_df.show()\n#    +-------+------------+--------------------+\n#    |  title|release_date|       custom_errors|\n#    +-------+------------+--------------------+\n#    |valid_1|  2020-01-10|                    |\n#    |valid_2|  2020-01-11|                    |\n#    |       |            |{\"row\": {\"release...|\n#    |       |            |{\"row\": {\"release...|\n#    +-------+------------+--------------------+\n\n# The errors data frame will be set to None\nassert errors_df is None        # True\n```\n\nLastly, in addition to passing marshmallow-specific options to the schema, you can also pass them to the `validate_df` method.\nThese options are forwarded to marshmallow's `load` method:\n```python\nschema = AlbumSchema(\n    error_column_name=\"custom_errors\",     # Use 'custom_errors' as name for errors column\n    split_errors=False,                     # Don't split the input data frame into valid and errors\n)\n\nvalid_df, errors_df = schema.validate_df(df, unknown=EXCLUDE)\n```\n\n### Duplicates\n\nMarshmallow-pyspark comes with the ability to validate one or more schema fields for duplicate values. 
This is achieved\nby adding the field names to the `UNIQUE` attribute of the schema as shown:\n```python\nclass AlbumSchema(Schema):\n    # Unique valued field \"title\" in the schema\n    UNIQUE = [\"title\"]\n\n    title = fields.Str()\n    release_date = fields.Date()\n\n# Input data frame to validate.\ndf = spark.createDataFrame([\n        {\"title\": \"title_1\", \"release_date\": \"2020-1-10\"},\n        {\"title\": \"title_2\", \"release_date\": \"2020-1-11\"},\n        {\"title\": \"title_2\", \"release_date\": \"2020-3-11\"},  # duplicate title\n        {\"title\": \"title_3\", \"release_date\": \"2020-1-51\"},\n    ])\n\n# Validate data frame\nvalid_df, errors_df = AlbumSchema().validate_df(df)\n    \n# List of valid rows\nvalid_rows = [row.asDict(recursive=True) for row in valid_df.collect()]\n#\n#   [\n#        {'title': 'title_1', 'release_date': datetime.date(2020, 1, 10)},\n#        {'title': 'title_2', 'release_date': datetime.date(2020, 1, 11)}\n#   ]\n#\n\n# Rows with errors\nerror_rows = [row.asDict(recursive=True) for row in errors_df.collect()]\n# \n#   [\n#        {'_errors': '{\"row\": {\"release_date\": \"2020-3-11\", \"title\": \"title_2\", \"__count__title\": 2}, '\n#                    '\"errors\": [\"duplicate row\"]}'},\n#        {'_errors': '{\"row\": {\"release_date\": \"2020-1-51\", \"title\": \"title_3\", \"__count__title\": 1}, '\n#                    '\"errors\": {\"release_date\": [\"Not a valid date.\"]}}'}\n#    ]\n#\n``` \nThe technique to drop duplicates but keep first is discussed in this [link](https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first).\nIn case there are multiple unique fields in the schema just add them to the `UNIQUE`, e.g. `UNIQUE=[\"title\", \"release_date\"]`. 
\nYou can even specify uniqueness for a combination of fields by grouping them in a list:\n```python\nclass AlbumSchema(Schema):\n    # Combined values of \"title\" and \"release_date\" should be unique\n    UNIQUE = [[\"title\", \"release_date\"]]\n\n    title = fields.Str()\n    release_date = fields.Date()\n\n# Input data frame to validate.\ndf = spark.createDataFrame([\n        {\"title\": \"title_1\", \"release_date\": \"2020-1-10\"},\n        {\"title\": \"title_2\", \"release_date\": \"2020-1-11\"},\n        {\"title\": \"title_2\", \"release_date\": \"2020-3-11\"},\n        {\"title\": \"title_3\", \"release_date\": \"2020-1-21\"},\n        {\"title\": \"title_3\", \"release_date\": \"2020-1-21\"},\n        {\"title\": \"title_4\", \"release_date\": \"2020-1-51\"},\n    ])\n\n# Validate data frame\nvalid_df, errors_df = AlbumSchema().validate_df(df)\n    \n# List of valid rows\nvalid_rows = [row.asDict(recursive=True) for row in valid_df.collect()]\n#\n#   [\n#        {'title': 'title_1', 'release_date': datetime.date(2020, 1, 10)},\n#        {'title': 'title_2', 'release_date': datetime.date(2020, 1, 11)},\n#        {'title': 'title_3', 'release_date': datetime.date(2020, 1, 21)}\n#   ]\n#\n\n# Rows with errors\nerror_rows = [row.asDict(recursive=True) for row in errors_df.collect()]\n# \n#   [\n#        {'_errors': '{\"row\": {\"release_date\": \"2020-1-21\", \"title\": \"title_3\", '\n#                    '\"__count__title\": 2, \"__count__release_date\": 2}, '\n#                    '\"errors\": [\"duplicate row\"]}'},\n#        {'_errors': '{\"row\": {\"release_date\": \"2020-1-51\", \"title\": \"title_4\", '\n#                    '\"__count__title\": 1, \"__count__release_date\": 1}, '\n#                    '\"errors\": {\"release_date\": [\"Not a valid date.\"]}}'},\n#        {'_errors': '{\"row\": {\"release_date\": \"2020-3-11\", \"title\": \"title_2\", '\n#                    '\"__count__title\": 2, \"__count__release_date\": 1}, '\n#                 
   '\"errors\": [\"duplicate row\"]}'}\n#    ]\n#\n```\n**WARNING**: The duplicate check requires a data shuffle per unique field. Having a large number of unique fields will affect \nSpark job performance. By default `UNIQUE` is set to an empty list, which disables duplicate checks. \n\n### Fields\n\nMarshmallow comes with a variety of different fields that can be used to define schemas. Internally marshmallow-pyspark \nconverts these fields into pyspark SQL data types. The following table lists the supported marshmallow fields and their \nequivalent spark SQL data types:\n\n\n| Marshmallow | PySpark |\n| --- | --- |\n| `Raw` | user specified |\n| `String` | `StringType` |\n| `DateTime` | `TimestampType` |\n| `Date` | `DateType` |\n| `Boolean` | `BooleanType` |\n| `Integer` | `IntegerType` |\n| `Float` | `FloatType` |\n| `Number` | `DoubleType` |\n| `List` | `ArrayType` |\n| `Dict` | `MapType` |\n| `Nested` | `StructType` |\n\nBy default the `StringType` data type is used for marshmallow fields not in the above table. The `spark_schema` property\nof your defined schema can be used to check the converted spark SQL schema:\n```python\n# Gets the spark schema for the Album schema\nAlbumSchema().spark_schema\n# StructType(List(StructField(title,StringType,true),StructField(release_date,DateType,true),StructField(_errors,StringType,true)))\n```\n\n#### Custom Fields\n\nMarshmallow-pyspark comes with support for an additional `Raw` field. The `Raw` field does not perform any formatting\nand requires the user to specify the spark data type associated with the field. 
See the following example:\n```python\nfrom marshmallow_pyspark import Schema\nfrom marshmallow_pyspark.fields import Raw\nfrom marshmallow import fields\nfrom pyspark.sql.types import DateType\nfrom datetime import date\n\n\nclass AlbumSchema(Schema):\n    title = fields.Str()\n    # Takes python datetime.date objects and treats them as pyspark DateType\n    release_date = Raw(spark_type=DateType())\n\n# Input data frame to validate.\ndf = spark.createDataFrame([\n        {\"title\": \"title_1\", \"release_date\": date(2020, 1, 10)},\n        {\"title\": \"title_2\", \"release_date\": date(2020, 1, 11)},\n        {\"title\": \"title_3\", \"release_date\": date(2020, 3, 10)},\n    ])\n\n# Validate data frame\nvalid_df, errors_df = AlbumSchema().validate_df(df)\n    \n# List of valid rows\nvalid_rows = [row.asDict(recursive=True) for row in valid_df.collect()]\n#\n#   [\n#        {'title': 'title_1', 'release_date': datetime.date(2020, 1, 10)},\n#        {'title': 'title_2', 'release_date': datetime.date(2020, 1, 11)},\n#        {'title': 'title_3', 'release_date': datetime.date(2020, 3, 10)}\n#   ]\n#\n\n# Rows with errors\nerror_rows = [row.asDict(recursive=True) for row in errors_df.collect()]\n# \n#   []\n#\n```\n\nIt is also possible to add support for custom marshmallow fields, or those missing in the above table. To do so, \nyou need to create a converter for the custom field. The converter can be built using the `ConverterABC` interface:\n```python\nfrom marshmallow_pyspark import ConverterABC\nfrom pyspark.sql.types import StringType\n\n\nclass EmailConverter(ConverterABC):\n    \"\"\"\n        Converter to convert marshmallow's Email field to a pyspark \n        SQL data type.\n    \"\"\"\n\n    def convert(self, ma_field):\n        return StringType()\n```  \nThe `ma_field` argument in the `convert` method is provided to handle nested fields. For an example, check out \n`NestedConverter`. 
The final step is to add the converter to the `CONVERTER_MAP` attribute of your schema:\n```python\nfrom marshmallow_pyspark import Schema\nfrom marshmallow import fields\n\n\nclass User(Schema):\n    name = fields.String(required=True)\n    email = fields.Email(required=True)\n\n# Adding email converter to schema.\nUser.CONVERTER_MAP[fields.Email] = EmailConverter\n\n# You can now use your schema to validate the input data frame.\nvalid_df, errors_df = User().validate_df(input_df)\n```\n\n## Development\n\nTo hack on marshmallow-pyspark locally, run:\n\n```bash\n$ pip install -e .[dev]\t\t\t# to install all dependencies\n$ pytest --cov-config .coveragerc --cov=./\t\t\t# to get coverage report\n$ pylint marshmallow_pyspark\t\t\t# to check code quality with PyLint\n```\n\nOptionally you can use `make` to perform development tasks.\n\n## License\n\nThe source code is licensed under the Apache License, Version 2.0.\n\n## Contributions\n\nPull requests are always welcome! :)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fketgo%2Fmarshmallow-pyspark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fketgo%2Fmarshmallow-pyspark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fketgo%2Fmarshmallow-pyspark/lists"}