{"id":22477055,"url":"https://github.com/spratiher9/valido","last_synced_at":"2026-05-02T09:32:27.832Z","repository":{"id":106304945,"uuid":"438754476","full_name":"Spratiher9/Valido","owner":"Spratiher9","description":"PySpark ⚡ dataframe workflow ⚒ validator","archived":false,"fork":false,"pushed_at":"2021-12-19T14:51:18.000Z","size":16,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-01T20:46:13.934Z","etag":null,"topics":["apache","apache-spark","bigdata","databricks","decorators","pyspark","python3","spark","testing"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/valido/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Spratiher9.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-15T19:51:02.000Z","updated_at":"2021-12-27T07:12:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"16bac3fe-f80d-4b59-9f6b-fe5275912ebb","html_url":"https://github.com/Spratiher9/Valido","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Spratiher9%2FValido","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Spratiher9%2FValido/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Spratiher9%2FValido/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Spratiher9%2FValido/manifests","owner_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/owners/Spratiher9","download_url":"https://codeload.github.com/Spratiher9/Valido/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245897270,"owners_count":20690448,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","apache-spark","bigdata","databricks","decorators","pyspark","python3","spark","testing"],"created_at":"2024-12-06T14:09:24.500Z","updated_at":"2026-05-02T09:32:27.793Z","avatar_url":"https://github.com/Spratiher9.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ca href=\"https://ibb.co/gZdDQ7S\"\u003e\u003cimg src=\"https://i.ibb.co/d4tQXcP/Valido.png\" alt=\"Valido\" border=\"0\"\u003e\u003c/a\u003e\n\n# Valido 🌩  \n\nPySpark ⚡ dataframe workflow ⚒ validator. \n\n![PyPI](https://img.shields.io/pypi/v/valido)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/valido)\n![test](https://github.com/Spratiher9/Valido/workflows/Valido/badge.svg)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n`Hit that ⭐ if you like the project and follow for more such powerful utilities`\n\n## Description\n\nIn projects using PySpark, it's very common to have functions that take Spark DataFrames as input or produce them as\noutput. It's hard to figure out quickly what these DataFrames contain. 
This library offers simple decorators to annotate\nyour functions so that they document themselves, and that documentation is kept up to date by validating the input and\noutput at runtime.\n\nFor example,\n\n```python\n@df_in(columns=[\"Brand\", \"Price\"])  # the function expects a DataFrame as input parameter with columns Brand and Price\n@df_out(columns=[\"Brand\", \"Price\"])  # the function will return a DataFrame with columns Brand and Price\ndef filter_cars(car_df):\n    # before this code is executed, the input DataFrame is validated according to the above decorator\n    # filter some cars..\n    return filtered_cars_df\n```\n\n## Table of Contents\n\n* [Installation](#installation)\n* [Usage](#usage)\n* [Contributing](#contributing)\n* [Development](#development)\n* [License](#license)\n\n## Installation\n\nInstall with your favorite Python dependency manager, such as _pip_:\n\n```sh\npip install valido\n```\n\n## Usage\n\nStart by importing the needed decorators:\n\n```python\nfrom valido import df_in, df_out\n```\n\nTo check a DataFrame input to a function, annotate the function with `@df_in`. 
For example, the following function\nexpects to get a DataFrame with columns `Brand` and `Price`:\n\n```python\n@df_in(columns=[\"Brand\", \"Price\"])\ndef process_cars(car_df):\n    # do stuff with cars\n    ...\n```\n\nIf your function takes multiple arguments, specify the argument to be checked with its `name`:\n\n```python\n@df_in(name=\"car_df\", columns=[\"Brand\", \"Price\"])\ndef process_cars(year, style, car_df):\n    # do stuff with cars\n    ...\n```\n\n_Note:_\nSince the check is evaluated at runtime, use named arguments in the function call, like this:\n```python\nprocess_cars(year=2021, style=\"Mazda\", car_df=mydf)\n```\n\nTo check that a function returns a DataFrame with specific columns, use the `@df_out` decorator:\n\n```python\n@df_out(columns=[\"Brand\", \"Price\"])\ndef get_all_cars():\n    # get those cars\n    return all_cars_df\n```\n\nIf one of the listed columns is missing from the DataFrame, a helpful assertion error is raised:\n\n```shell\nAssertionError(\"Column Price missing from DataFrame. Got columns: ['Brand']\")\n```\n\nTo check both input and output, use both decorators on the same function:\n\n```python\n@df_in(columns=[\"Brand\", \"Price\"])\n@df_out(columns=[\"Brand\", \"Price\"])\ndef filter_cars(car_df):\n    # filter some cars\n    return filtered_cars_df\n```\n\nIf you also want to check the data type of each column, you can replace the column list:\n\n```python\ncolumns = [\"Brand\", \"Price\"]\n```\n\nwith a dict:\n\n```python\ncolumns = {\"Brand\": \"string\", \"Price\": \"int\"}\n```\n\nThis checks not only that the specified columns are found in the DataFrame but also that their `dtype` is the\nexpected one. In case of a wrong `dtype`, an error message similar to the following explains the mismatch:\n\n```shell\nAssertionError(\"Column Price has wrong dtype. Was int, expected double\")\n```\n\nYou can enable strict mode for both `@df_in` and `@df_out`. 
This will raise an error if the DataFrame contains columns\nnot defined in the annotation:\n\n```python\n@df_in(columns=[\"Brand\"], strict=True)\ndef process_cars(car_df):\n    # do stuff with cars\n    ...\n```\n\nwill raise an error when `car_df` contains columns `[\"Brand\", \"Price\"]`:\n\n```shell\nAssertionError: DataFrame contained unexpected column(s): Price\n```\n\nTo quickly check what the incoming and outgoing DataFrames contain, you can add a `@df_log` annotation to the function.\nFor example, adding `@df_log` to the above `filter_cars` function will produce log lines:\n\n```shell\nFunction filter_cars parameters contained a DataFrame: columns: ['Brand', 'Price']\nFunction filter_cars returned a DataFrame: columns: ['Brand', 'Price']\n```\n\nor with `@df_log(include_dtypes=True)` you get:\n\n```shell\nFunction filter_cars parameters contained a DataFrame: columns: ['Brand', 'Price'] with dtypes ['string', 'int']\nFunction filter_cars returned a DataFrame: columns: ['Brand', 'Price'] with dtypes ['string', 'int']\n```\n\n_Note_:\n`@df_log(include_dtypes=True)` also takes the `name` parameter, like `df_in`, for validating functions with multiple parameters.\n\n## Contributing\n\nContributions are accepted. Include tests in PRs.\n\n## Development\n\nTo run the tests, clone the repository, install dependencies with _pip_ and run tests with _pytest_:\n\n```shell\npython -m pytest --import-mode=append tests/\n```\n\n## License\n\nBSD 3-Clause License\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspratiher9%2Fvalido","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspratiher9%2Fvalido","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspratiher9%2Fvalido/lists"}