{"id":35023748,"url":"https://github.com/avrtt/paysage","last_synced_at":"2026-05-14T16:11:20.757Z","repository":{"id":285373460,"uuid":"957481645","full_name":"avrtt/paysage","owner":"avrtt","description":"Pandas add-on library: find data quality issues and clean/improve dataframes in one line using scikit-learn transformer","archived":false,"fork":false,"pushed_at":"2025-03-31T12:44:13.000Z","size":54,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-28T19:59:53.633Z","etag":null,"topics":["data-analysis","data-cleaning","data-compression","data-profiling","data-quality","data-quality-checks","data-reporting","pandas","pandas-dataframe","schema-validation","scikit-learn","scikit-learn-transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/avrtt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-30T13:39:34.000Z","updated_at":"2025-03-31T12:44:16.000Z","dependencies_parsed_at":"2025-03-31T11:41:01.582Z","dependency_job_id":null,"html_url":"https://github.com/avrtt/paysage","commit_stats":null,"previous_names":["avrtt/paysage"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/avrtt/paysage","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avrtt%2Fpaysage","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avrtt%2Fpaysage/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avrtt%2
Fpaysage/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avrtt%2Fpaysage/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/avrtt","download_url":"https://codeload.github.com/avrtt/paysage/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avrtt%2Fpaysage/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33032638,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-13T13:14:54.681Z","status":"online","status_checked_at":"2026-05-14T02:00:06.663Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-cleaning","data-compression","data-profiling","data-quality","data-quality-checks","data-reporting","pandas","pandas-dataframe","schema-validation","scikit-learn","scikit-learn-transformer"],"created_at":"2025-12-27T06:08:07.682Z","updated_at":"2026-05-14T16:11:20.751Z","avatar_url":"https://github.com/avrtt.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"ver. 1.02 (beta) - \u003cu\u003eDocumentation\u003c/u\u003e **(WIP)**: **[🇺🇸 EN]()** | **[🇷🇺 RU]()**\n\n\u003e This library is planned to be merged with [Gnomych](https://github.com/avrtt/gnomych), my other data analysis toolkit. 
Gnomych classes will become part of the main branch starting with version 1.2.\n\n\u003cbr/\u003e\n\n**Paysage** is a minimalistic toolkit of useful Pandas extras for ensuring data quality in your DataFrames. It simplifies cleaning, profiling and improving your datasets so that you can focus on delivering more reliable data insights. \n\n- [Introduction](#introduction)\n- [Key components](#key-components)\n- [Use cases](#use-cases)\n  - [🔎 Data quality report](#1-data-quality-report-dq_report)\n  - [⚖️ Data comparison report](#2-data-comparison-report-dc_report)\n  - [🔧 Data cleaning](#3-data-cleaning-with-fix_dq)\n  - [📋 Schema validation](#4-schema-validation-with-dataschemachecker)\n- [Installation](#installation)\n- [Usage](#usage)\n- [API](#api-overview)\n- [To do](#to-do)\n- [Contributing](#contributing)\n- [License](#license)\n\n## Introduction\n**Paysage** is a Python library that helps with data quality analysis and enhancement. Designed for speed and scalability, it integrates with `pandas` and `scikit-learn` to:\n- quickly clean and enhance your DataFrames\n- provide data quality reports\n- assist in data preprocessing before modeling\n- integrate with existing `sklearn` pipelines\n\nWith paysage, you can easily assess issues such as missing values, outliers and data inconsistencies while also benefiting from data profiling and transformation tools.\n\n## Key components\n```mermaid\ngraph TD\n    A[Data] --\u003e B[dq_report]\n    B --\u003e|uses| C[classify_columns]\n    C --\u003e|uses| D[left_subtract]\n    C --\u003e|uses| E[EDA_find_remove_columns_with_infinity]\n    B --\u003e F[Generate DQ Report]\n    F --\u003e|output| G[HTML Report via write_to_html]\n    B --\u003e H[Identify Issues]\n    H --\u003e|input| I[Fix_DQ Transformer]\n    I --\u003e|transforms| J[Clean Data]\n    K[DataSchemaChecker] --\u003e|validates| L[Data Schema]\n    K --\u003e|transforms| M[Adjusted Data]\n    N[dc_report] --\u003e|uses| O[dq_report Train]\n    N --\u003e|uses| 
P[dq_report Test]\n    O --\u003e Q[Compare Distributions]\n    P --\u003e Q\n    Q --\u003e R[DC Report]\n    R --\u003e|output| G\n\n    classDef report fill:#141d47,stroke:#007bff;\n    classDef transformer fill:#6e0711,stroke:#28a745;\n    classDef function fill:#690700,stroke:#ffc107;\n    classDef helper fill:#085252,stroke:#dc3545;\n    class B,N function;\n    class C,D,E helper;\n    class I,K transformer;\n    class G,R report;\n```\n\n`paysage` is organized into several core modules:\n\n- 🔎 **dq_report**  \n  Generates a detailed data quality report (inline or HTML) that inspects your dataset for issues such as missing values, outliers, duplicates, correlation anomalies and potential data leakage.\n- ⚖️ **dc_report**  \n  Compares two DataFrames (commonly train vs. test) and highlights their differences. It includes statistical tests like the KS Test to compare distributions and examines discrepancies in missing/unique values.\n- 🔧 **Fix_DQ**  \n  A `scikit-learn` compatible transformer that automatically detects and corrects data quality issues. It handles a variety of problems — from ID columns and zero-variance features to outliers and imbalanced classes — in just one line of code.\n- 📋 **DataSchemaChecker**  \n  Validates whether your DataFrame adheres to a specified schema. It checks data types, reports mismatches and attempts to coerce columns to the desired types.\n\n## Use cases\n\n### 1. 
Data quality report (`dq_report`)\n\nThe `dq_report` function generates a comprehensive report (either inline or as an HTML file) by scanning your dataset for:\n- ID columns, zero-variance features and rare categories (less than 5% representation)\n- Infinite values and mixed data types\n- Outliers (using the interquartile range, IQR)\n- High cardinality and highly correlated features (correlation \u003e 0.8)\n- Duplicate rows/columns and skewed distributions (skew \u003e 1.0)\n- Imbalanced classes and potential feature leakage\n\nFor very large datasets, a sample of 100K rows is used by default. If you prefer full-data analysis, load your data into a DataFrame before passing it to the function.\n\n### 2. Data comparison report (`dc_report`)\n\nThe `dc_report` tool accepts two DataFrames (e.g., train and test) and generates a comparison report by:\n- Running `dq_report` on each DataFrame to compare their data quality metrics\n- Calculating the Kolmogorov-Smirnov test statistic for numeric features\n- Comparing missing and unique value percentages, with special notes when discrepancies arise\n- Allowing target columns to be excluded from the comparison\n\n\u003e Tip: for large datasets, consider sampling your data before using this report tool.\n\n### 3. Data cleaning with `Fix_DQ`\n\nThe `Fix_DQ` class is a one-stop solution to clean your data before modeling. It's implemented as a `scikit-learn` transformer and can be integrated into your ML pipelines. 
During the `fit` process, it:\n- Removes ID and zero-variance columns\n- Groups rare categories into a \"Rare\" label\n- Handles infinite values, mixed data types and outliers\n- Identifies high cardinality and highly correlated features (dropping one of each pair)\n- Removes duplicate rows/columns and applies transformations to skewed distributions\n\n\u003e **Integrate with GridSearchCV:** Use `Fix_DQ` in your hyperparameter tuning pipelines (e.g., with GridSearchCV) to select the best data cleaning strategy alongside your model.\n\n### 4. Schema validation with `DataSchemaChecker`\n\nThe `DataSchemaChecker` transformer ensures that your data conforms to a predefined schema. You simply:\n1. Define a schema (a dictionary mapping column names to expected data types)\n2. Use the `fit` method to identify discrepancies between your data and the schema\n3. Call `transform` to automatically adjust column data types, skipping those that cannot be coerced and reporting errors\n\nExample schema:\n\n```python\nschema = {\n    'name': 'string',\n    'age': 'float32',\n    'gender': 'object',\n    'income': 'float64',\n    'date': 'date',\n    'target': 'integer'\n}\n```\n\n## Installation\n \n`paysage` requires only `pandas`, `numpy` and `scikit-learn` — all of which are commonly included in Python3 Anaconda distributions.\n\nClone and navigate:\n```\ngit clone git@github.com:avrtt/paysage.git \u0026\u0026 cd paysage\n```\n\nCreate and activate a virtual environment (**recommended**):\n```\npython -m venv venv\nsource venv/bin/activate\n```\n\nIf you're using newer Python versions (3.12+), install `setuptools` ([here's why](https://docs.python.org/3.12/whatsnew/3.12.html)):\n```\npip install setuptools\n```\n\nFinally, install the library:\n```\npip install -r requirements.txt\n```\n\n## Usage\n\n### Quick data profiling\n\nTo generate a data quality report:\n\n```python\nfrom paysage import dq_report\n\ndqr = dq_report(data, target='your_target_column', html=False, 
csv_engine=\"pandas\", verbose=1)\n```\n\nThis will display the report inline (or generate an HTML file, if configured).\n\n### Comparing two datasets\n\nTo compare train and test DataFrames:\n\n```python\nfrom paysage import dc_report\n\ndc_report(train, test, exclude=[], html=True, verbose=1)\n```\n\nThe function returns a DataFrame outlining differences between your datasets.\n\n### Data cleaning with `Fix_DQ`\n\nClean your training and test datasets with a single transformer:\n\n```python\nfrom paysage import Fix_DQ\n\n# initialize transformer with default parameters\nfdq = Fix_DQ()\n\n# clean the training data\nX_train_transformed = fdq.fit_transform(X_train)\n\n# apply the same transformation to test data\nX_test_transformed = fdq.transform(X_test)\n```\n\n### Schema validation\n\nValidate and adjust your DataFrame to match a specific schema:\n\n```python\nfrom paysage import DataSchemaChecker\n\n# define your schema\nschema = {\n    'name': 'string',\n    'age': 'float32',\n    'gender': 'object',\n    'income': 'float64',\n    'date': 'date',\n    'target': 'integer'\n}\n\n# validate training data and then transform test data\nds = DataSchemaChecker(schema=schema)\nds.fit_transform(X_train)\nX_test_transformed = ds.transform(X_test)\n```\n\n## API overview\n\n`paysage` is built with a simple API designed to uncover and fix data quality issues quickly.\n\n### 🔎 dq_report\n\n**Inputs:**\n- `data`: File path (string) or a pandas DataFrame.\n- `target`: (Optional) Column name as a string to focus on target-related issues.\n- `html`: (Boolean) Set to `True` for HTML output.\n- `csv_engine`: (String) Choose between `pandas`, `arrow`, or `parquet` for CSV reading.\n- `verbose`: (Integer) Use `0` for a summary report and `1` for a detailed report.\n\n**Output:**  \nA DataFrame highlighting data quality issues.\n\n### ⚖️ dc_report\n\n**Inputs:**\n- `train`: Training DataFrame.\n- `test`: Test DataFrame.\n- `exclude`: List of columns to exclude from the 
comparison.\n- `html`: (Boolean) Toggle HTML output.\n- `verbose`: (Integer) Toggle between summary and detailed reports.\n\n**Output:**  \nA DataFrame that outlines the differences between the two datasets.\n\n### 🔧 Fix_DQ\n\nA `scikit-learn` transformer that cleans your data by addressing issues such as:\n- ID and zero-variance columns removal.\n- Rare category grouping.\n- Infinite value replacement.\n- Mixed data type handling.\n- Outlier detection.\n- Duplicate row/column removal.\n- Skewed distribution transformations.\n\n**Additional Parameters:**\n- `quantile`: Threshold for IQR-based outlier detection.\n- `cat_fill_value`: Default fill value (or a dictionary) for missing categorical data.\n- `num_fill_value`: Default fill value (or a dictionary) for missing numerical data.\n- `rare_threshold`: Threshold to identify rare categories.\n- `correlation_threshold`: Correlation limit (default is 0.8) for dropping one of two highly correlated features.\n\n### 📋 DataSchemaChecker\n\n**Inputs:**\n- `schema`: A dictionary mapping column names to expected data types.\n\n**Methods:**\n- `fit`: Checks and reports discrepancies between the DataFrame and the schema.\n- `transform`: Attempts to coerce columns to match the schema, reporting any conversion errors.\n\n## To do\n- (!) Documentation\n- Split code by modules, create utils.py for helper functions\n- Implement more classes for better logic\n- Implement more advanced methods\n- Add `jsonschema\u003e=3.0.0`\n- Implement unit tests\n\n## Contributing\nContributions are welcome. Feel free to open PRs and issues.\n\n## License\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favrtt%2Fpaysage","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Favrtt%2Fpaysage","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favrtt%2Fpaysage/lists"}