{"id":20956406,"url":"https://github.com/bluebrain/data-validation-framework","last_synced_at":"2025-05-14T05:31:44.699Z","repository":{"id":38444664,"uuid":"436995864","full_name":"BlueBrain/data-validation-framework","owner":"BlueBrain","description":"Simple framework to create data validation workflows.","archived":false,"fork":false,"pushed_at":"2024-11-18T11:41:48.000Z","size":513,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-11-18T12:52:31.768Z","etag":null,"topics":["data","data-analysis","python","validation","validation-tool"],"latest_commit_sha":null,"homepage":"https://data-validation-framework.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BlueBrain.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-10T13:52:43.000Z","updated_at":"2024-11-18T11:41:44.000Z","dependencies_parsed_at":"2023-01-30T01:15:32.042Z","dependency_job_id":"93d799e2-9d6c-4d98-b7c2-c41a94b5ce88","html_url":"https://github.com/BlueBrain/data-validation-framework","commit_stats":null,"previous_names":[],"tags_count":36,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlueBrain%2Fdata-validation-framework","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlueBrain%2Fdata-validation-framework/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlueBrain%2Fdata-validation-
framework/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlueBrain%2Fdata-validation-framework/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BlueBrain","download_url":"https://codeload.github.com/BlueBrain/data-validation-framework/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225277022,"owners_count":17448607,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-analysis","python","validation","validation-tool"],"created_at":"2024-11-19T01:25:51.412Z","updated_at":"2024-11-19T01:25:52.004Z","avatar_url":"https://github.com/BlueBrain.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Version](https://img.shields.io/pypi/v/data-validation-framework)](https://github.com/BlueBrain/data-validation-framework/releases)\n[![Build status](https://github.com/BlueBrain/data-validation-framework/actions/workflows/run-tox.yml/badge.svg?branch=main)](https://github.com/BlueBrain/data-validation-framework/actions)\n[![Coverage](https://codecov.io/github/BlueBrain/data-validation-framework/coverage.svg?branch=main)](https://codecov.io/github/BlueBrain/data-validation-framework?branch=main)\n[![License](https://img.shields.io/badge/License-Apache%202-blue)](https://github.com/BlueBrain/data-validation-framework/blob/main/LICENSE.txt)\n[![Documentation status](https://readthedocs.org/projects/data-validation-framework/badge/?version=latest)](https://data-validation-framework.readthedocs.io/)\n\n\n# Data 
Validation Framework\n\nThis project provides simple tools to create data validation workflows.\nThe workflows are based on the [luigi](https://luigi.readthedocs.io/en/stable) library.\n\nThe main objective of this framework is to keep in one place both the specifications that the\ndata must follow and the code that actually tests the data. This avoids having multiple documents\nto store the specifications and a separate repository to store the code.\n\n\n## Installation\n\nThis package should be installed using pip:\n\n```bash\npip install data-validation-framework\n```\n\n## Usage\n\n### Building a workflow\n\nBuilding a new workflow is simple, as you can see in the following example:\n\n```python\nimport luigi\nimport data_validation_framework as dvf\n\n\nclass ValidationTask1(dvf.task.ElementValidationTask):\n    \"\"\"Use the class docstring to describe the specifications of the ValidationTask1.\"\"\"\n\n    output_columns = {\"col_name\": None}\n\n    @staticmethod\n    def validation_function(row, output_path, *args, **kwargs):\n        # Return the validation result for one row of the dataset\n        if row[\"col_name\"] \u003c= 10:\n            return dvf.result.ValidationResult(is_valid=True)\n        else:\n            return dvf.result.ValidationResult(\n                is_valid=False,\n                ret_code=1,\n                comment=\"The value should always be \u003c= 10\"\n            )\n\n\ndef external_validation_function(df, output_path, *args, **kwargs):\n    # Update the dataset in place here by setting values in the 'is_valid' column.\n    # The 'ret_code' and 'comment' values are optional; they are added to the report\n    # to help the user understand why the dataset did not pass the validation.\n\n    # We can use the value from kwargs[\"param_value\"] here.\n    if int(kwargs[\"param_value\"]) \u003c= 10:\n        df[\"is_valid\"] = True\n    else:\n        df[\"is_valid\"] = False\n        df[\"ret_code\"] = 1\n        
df[\"comment\"] = \"The value should always be \u003c= 10\"\n\n\nclass ValidationTask2(dvf.task.SetValidationTask):\n    \"\"\"In some cases you might want to keep the docstring to describe what a developer\n    needs to know, not the end-user. In this case, you can use the ``__specifications__``\n    attribute to store the specifications.\"\"\"\n\n    a_parameter = luigi.Parameter()\n\n    __specifications__ = \"\"\"Use the __specifications__ to describe the specifications of the\n    ValidationTask2.\"\"\"\n\n    def inputs(self):\n        return {ValidationTask1: {\"col_name\": \"new_col_name_in_current_task\"}}\n\n    def kwargs(self):\n        return {\"param_value\": self.a_parameter}\n\n    validation_function = staticmethod(external_validation_function)\n\n\nclass ValidationWorkflow(dvf.task.ValidationWorkflow):\n    \"\"\"Use the global workflow specifications to give general context to the end-user.\"\"\"\n\n    def inputs(self):\n        return {\n            ValidationTask1: {},\n            ValidationTask2: {},\n        }\n```\n\nHere, the `ValidationWorkflow` class only defines the sub-tasks that should be called for the\nvalidation. The sub-tasks can be either a `dvf.task.ElementValidationTask` or a\n`dvf.task.SetValidationTask`. In both cases, you can define relations between these sub-tasks,\nsince one may need the result of another to run properly. This is done in two steps:\n\n1. In the required task, an `output_columns` attribute should be defined so that the next tasks\n   know what data is available, as shown in the previous example for `ValidationTask1`.\n2. In the task that requires another task, an `inputs` method should be defined, as shown in the\n   previous example for `ValidationTask2`.\n\nThe sub-classes of `dvf.task.ElementValidationTask` should return a\n`dvf.result.ValidationResult` object. 
The sub-classes of `dvf.task.SetValidationTask` should\nreturn a `pandas.DataFrame` object with at least the following columns\n`[\"is_valid\", \"ret_code\", \"comment\", \"exception\"]` and with the same index as the input dataset.\n\n## Generate the specifications of a workflow\n\nThe specifications that the data should follow can be generated with the following luigi command:\n\n```bash\nluigi --module test_validation ValidationWorkflow --log-level INFO --local-scheduler --result-path out --ValidationTask2-a-parameter 15 --specifications-only\n```\n\n## Running a workflow\n\nThe workflow can be run with the following luigi command (note that the module `test_validation`\nmust be available in your `sys.path`):\n\n\n```bash\nluigi --module test_validation ValidationWorkflow --log-level INFO --local-scheduler --dataset-df dataset.csv --result-path out --ValidationTask2-a-parameter 15\n```\n\nThis workflow will generate the following files:\n\n* `out/report_ValidationWorkflow.pdf`: The PDF validation report.\n* `out/ValidationTask1/report.csv`: The CSV containing the validity values of the task\n  `ValidationTask1`.\n* `out/ValidationTask2/report.csv`: The CSV containing the validity values of the task\n  `ValidationTask2`.\n* `out/ValidationWorkflow/report.csv`: The CSV containing the validity values of the complete\n  workflow.\n\nNote: as with any [luigi](https://luigi.readthedocs.io/en/stable) workflow, the values can be stored\nin a `luigi.cfg` file instead of being passed on the command line.\n\n## Advanced features\n\n### Require a regular Luigi task\n\nIn some cases, one wants to run a regular Luigi task in a validation workflow. In this case, it\nis possible to use the `extra_requires()` method to pass these extra requirements. 
In the\nvalidation task it is then possible to get the targets of these extra requirements using the\n`extra_input()` method.\n\n```python\nimport luigi\nfrom data_validation_framework import target, task\n\n\nclass TestTaskA(luigi.Task):\n\n    def run(self):\n        # Do something and write the 'target.file'\n        pass\n\n    def output(self):\n        return target.OutputLocalTarget(\"target.file\")\n\n\nclass TestTaskB(task.SetValidationTask):\n\n    output_columns = {\"extra_target_path\": None}\n\n    def kwargs(self):\n        return {\"extra_task_target_path\": self.extra_input().path}\n\n    def extra_requires(self):\n        return TestTaskA()\n\n    @staticmethod\n    def validation_function(df, output_path, *args, **kwargs):\n        df[\"is_valid\"] = True\n        df[\"extra_target_path\"] = kwargs[\"extra_task_target_path\"]\n```\n\n## Funding \u0026 Acknowledgment\n\nThe development of this software was supported by funding to the Blue Brain Project, a research\ncenter of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH\nBoard of the Swiss Federal Institutes of Technology.\n\nFor license and authors, see `LICENSE.txt` and `AUTHORS.md` respectively.\n\nCopyright © 2022-2023 Blue Brain Project/EPFL\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbluebrain%2Fdata-validation-framework","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbluebrain%2Fdata-validation-framework","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbluebrain%2Fdata-validation-framework/lists"}