{"id":40216279,"url":"https://github.com/maximtrp/scikit-na","last_synced_at":"2026-01-19T21:37:05.643Z","repository":{"id":57464424,"uuid":"368288746","full_name":"maximtrp/scikit-na","owner":"maximtrp","description":"Missing Data Analysis in Python","archived":false,"fork":false,"pushed_at":"2025-09-14T19:52:32.000Z","size":1460,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-10T02:46:23.772Z","etag":null,"topics":["analysis","data-analysis","data-science","data-visualization","missing-data","missing-values","pandas","python","statistics","visualization"],"latest_commit_sha":null,"homepage":"https://scikit-na.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maximtrp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-05-17T18:42:57.000Z","updated_at":"2025-09-14T19:52:35.000Z","dependencies_parsed_at":"2024-04-18T17:11:07.745Z","dependency_job_id":"be4cdc0e-c6a1-4819-b4fa-250701045035","html_url":"https://github.com/maximtrp/scikit-na","commit_stats":{"total_commits":47,"total_committers":2,"mean_commits":23.5,"dds":"0.021276595744680882","last_synced_commit":"60dbb5e4edd92f5d5b3a06df848388d2749ca812"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/maximtrp/scikit-na","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maximtrp%2Fscikit-na",
"tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maximtrp%2Fscikit-na/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maximtrp%2Fscikit-na/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maximtrp%2Fscikit-na/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maximtrp","download_url":"https://codeload.github.com/maximtrp/scikit-na/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maximtrp%2Fscikit-na/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28585594,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-19T20:45:59.482Z","status":"ssl_error","status_checked_at":"2026-01-19T20:45:41.500Z","response_time":67,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","data-analysis","data-science","data-visualization","missing-data","missing-values","pandas","python","statistics","visualization"],"created_at":"2026-01-19T21:37:04.026Z","updated_at":"2026-01-19T21:37:05.637Z","avatar_url":"https://github.com/maximtrp.png","language":"Python","funding_links":["https://www.buymeacoffee.com/maximtrp"],"categories":[],"sub_categories":[],"readme":"![scikit-na 
logo](https://raw.githubusercontent.com/maximtrp/scikit-na/main/img/logo.png)\n\n---\n[![Test](https://github.com/maximtrp/scikit-na/actions/workflows/python-test.yml/badge.svg)](https://github.com/maximtrp/scikit-na/actions/workflows/python-test.yml)\n[![Coverage](https://app.codacy.com/project/badge/Coverage/122fd9ccc0da40a4a6cfce8eac592fd2)](https://app.codacy.com/gh/maximtrp/scikit-na/dashboard)\n[![Documentation](https://readthedocs.org/projects/scikit-na/badge/?version=latest)](https://readthedocs.org/projects/scikit-na/builds/)\n[![Downloads](https://static.pepy.tech/badge/scikit-na)](https://pepy.tech/project/scikit-na)\n[![PyPI](https://img.shields.io/pypi/v/scikit-na)](https://pypi.org/project/scikit-na/)\n\n**scikit-na** is a comprehensive Python package for missing data (NA) analysis and exploration. It provides statistical functions, interactive visualizations, and export capabilities to help data scientists understand and handle missing values in their datasets.\n\n## Why scikit-na?\n\n- **Comprehensive Analysis**: Get detailed statistics on missing data patterns\n- **Interactive Reports**: Generate widget-based reports for Jupyter notebooks  \n- **Multiple Export Formats**: Share results as CSV, JSON, HTML, or Excel files\n- **Statistical Modeling**: Build logistic regression models to understand missingness\n- **Rich Visualizations**: Create heatmaps, correlation plots, and distribution charts\n- **Hypothesis Testing**: Test for missing completely at random (MCAR) patterns\n\n![Visualizations](https://raw.githubusercontent.com/maximtrp/scikit-na/main/img/titanic_vis.png)\n\n## Features\n\n- Interactive report (based on [ipywidgets](https://ipywidgets.readthedocs.io/))\n- Export functionality (CSV, JSON, HTML, XLSX formats)\n- Descriptive statistics  \n- Regression modeling\n- Hypotheses tests\n- Data visualization\n\n## Donate\n\nIf you find this package useful, please consider donating any amount of money.\nThis will help me spend more time on 
supporting open-source software.\n\n\u003ca href=\"https://www.buymeacoffee.com/maximtrp\" target=\"_blank\"\u003e\u003cimg src=\"https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png\" alt=\"Buy Me A Coffee\" style=\"height: 60px !important;width: 217px !important;\" \u003e\u003c/a\u003e\n\n## Installation\n\n### Basic installation\n\n```bash\npip install scikit-na\n```\n\n### With optional dependencies\n\n```bash\n# For export functionality (Excel support)\npip install scikit-na[export]\n\n# For development\npip install scikit-na[dev]\n\n# Install from source\npip install git+https://github.com/maximtrp/scikit-na.git\n```\n\n## Quick Start\n\n```python\nimport scikit_na as na\nimport pandas as pd\n\n# Load your data\ndata = pd.read_csv('your_dataset.csv')\n\n# Get missing data summary\nsummary = na.summary(data)\nprint(summary)\n\n# Create interactive report\nreport = na.report(data)\n\n# Export results\nna.export_summary(data, 'missing_data_analysis.csv', format='csv')\n```\n\n## Examples\n\nThe following examples use the Titanic dataset (from Kaggle) that contains NA values in three columns: Age, Cabin, and Embarked.\n\n### Core Functions\n\n| Function | Description |\n|----------|-------------|\n| `na.summary()` | Comprehensive missing data statistics |\n| `na.correlate()` | Correlations between missing values |\n| `na.describe()` | Descriptive stats grouped by missingness |\n| `na.model()` | Logistic regression for missing patterns |\n| `na.test_hypothesis()` | Statistical tests for MCAR |\n| `na.report()` | Interactive widget-based report |\n| `na.export_summary()` | Export analysis to files |\n| `na.export_report()` | Export interactive reports |\n\n### Summary\n\n#### Per each column\n\nBy default, `summary()` function returns the results for each column.\n\n```python\nimport scikit_na as na\nimport pandas as pd\n\ndata = pd.read_csv('titanic_dataset.csv')\n\n# Excluding three columns without NA to fit the table here\nna.summary(data, 
columns=data.columns.difference(['SibSp', 'Parch', 'Ticket']))\n```\n\n|                       |   Age | Cabin | Embarked | Fare | Name | PassengerId | Pclass | Sex | Survived |\n| :-------------------- | ----: | ----: | -------: | ---: | ---: | ----------: | -----: | --: | -------: |\n| na_count              |   177 |   687 |        2 |    0 |    0 |           0 |      0 |   0 |        0 |\n| na_pct_per_col        | 19.87 |  77.1 |     0.22 |    0 |    0 |           0 |      0 |   0 |        0 |\n| na_pct_total          | 20.44 | 79.33 |     0.23 |    0 |    0 |           0 |      0 |   0 |        0 |\n| na_unique_per_col     |    19 |   529 |        2 |    0 |    0 |           0 |      0 |   0 |        0 |\n| na_unique_pct_per_col | 10.73 |    77 |      100 |    0 |    0 |           0 |      0 |   0 |        0 |\n| rows_after_dropna     |   714 |   204 |      889 |  891 |  891 |         891 |    891 | 891 |      891 |\n| rows_after_dropna_pct | 80.13 |  22.9 |    99.78 |  100 |  100 |         100 |    100 | 100 |      100 |\n\n_NA unique_ is the number of NA values per each column that are unique for it,\ni.e. 
do not intersect with NA values in the other columns (or that will remain\nin the dataset if we drop NA values in the other columns).\n\n#### Whole dataset\n\nWe can also get a summary of missing data for the whole dataset:\n\n```python\nna.summary(data, per_column=False)\n```\n\n|                  | dataset |\n| :--------------- | ------: |\n| total_columns    |      12 |\n| total_rows       |     891 |\n| na_rows          |     708 |\n| non_na_rows      |     183 |\n| total_cells      |   10692 |\n| na_cells         |     866 |\n| na_cells_pct     |     8.1 |\n| non_na_cells     |    9826 |\n| non_na_cells_pct |    91.9 |\n\n### Correlations\n\nTo calculate correlations between columns in terms of missing data, call the\n`correlate()` function with your DataFrame as the first argument:\n\n```python\nna.correlate(data, method=\"spearman\").round(3)\n```\n\n|          | Embarked |    Age |  Cabin |\n| :------- | -------: | -----: | -----: |\n| Embarked |        1 | -0.024 | -0.087 |\n| Age      |   -0.024 |      1 |  0.144 |\n| Cabin    |   -0.087 |  0.144 |      1 |\n\nThis method can be used to uncover hidden patterns in missing data across many\ncolumns in a dataset. Columns with no missing data are automatically excluded.\n\nThere is a function to visualize correlations with a heatmap:\n\n```python\nna.altair\\\n    .plot_corr(data, corr_kws={'method': 'spearman'})\\\n    .properties(width=150, height=150)\n```\n\n![NA correlations](https://raw.githubusercontent.com/maximtrp/scikit-na/main/img/titanic_correlations.svg)\n\n### Visualization\n\n#### Heatmap\n\nNow, let's visualize NA values on a heatmap. 
We will be using\n[Altair](https://altair-viz.github.io/) + [Vega](https://vega.github.io/vega-lite/)\nbackend:\n\n```python\nna.altair.plot_heatmap(data)\n```\n\n![NA heatmap](https://raw.githubusercontent.com/maximtrp/scikit-na/main/img/titanic_na_heatmap.svg)\n\nDroppables are those values that will be dropped if we simply use\n`pandas.DataFrame.dropna()` on the _entire dataset_.\n\n#### Stairs plot\n\nStairs plot is one more useful visualization of dataset shrinkage on applying\n`pandas.Series.dropna()` method to each column sequentially (sorted by the\nnumber of NA values, by default):\n\n```python\nna.altair.plot_stairs(data)\n```\n\n![NA stairsplot](https://raw.githubusercontent.com/maximtrp/scikit-na/main/img/titanic_na_stairsplot.svg)\n\nAfter dropping all NAs in `Cabin` column, we are left with 21 more NAs (in `Age`\nand `Embarked` columns). This plot also shows tooltips with exact numbers of NA\nvalues that are dropped per each column.\n\n#### Histogram\n\nYou may need to adjust some parameters before a histogram starts looking as you expect:\n\n```python\nchart = na.altair.plot_hist(data, col='Pclass', col_na='Age')\\\n    .properties(width=200, height=200)\nchart.configure_axisX(labelAngle = 0)\n```\n\n![NA histogram](https://raw.githubusercontent.com/maximtrp/scikit-na/main/img/titanic_hist.svg)\n\n### Regression model\n\nWe can build a logistic regression model with `Age` as a dependent variable and\n`Fare`, `Parch`, `Pclass`, `SibSp`, `Survived` as independent variables.\nInternally, `pandas.Series.isna()` method is called on `Age` column, and the\nresulting boolean values are converted to integers (`True`/`False` becomes\n`1`/`0`). 
Finally, fitting a logistic model is done by\n[statsmodels](https://www.statsmodels.org) package:\n\n```python\n# Selecting columns with numeric data\n# Dropping \"PassengerId\" column\nsubset = data.loc[:, data.dtypes != object].drop(columns=['PassengerId'])\nmodel = na.model(subset, col_na='Age')\nmodel.summary()\n```\n\n```\nOptimization terminated successfully.\nCurrent function value: 0.467801\nIterations 7\n                        Logit Regression Results\n===============================================================================\nDep. Variable:                    Age   No. Observations:                   891\nModel:                          Logit   Df Residuals:                       885\nMethod:                           MLE   Df Model:                             5\nDate:                Sat, 05 Jun 2021   Pseudo R-squ.:                  0.06164\nTime:                        17:51:31   Log-Likelihood:                 -416.81\nconverged:                       True   LL-Null:                        -444.19\nCovariance Type:            nonrobust   LLR p-value:                  1.463e-10\n===============================================================================\n                coef    std err          z      P\u003e|z|      [0.025      0.975]\n-------------------------------------------------------------------------------\n(intercept)    -2.7294      0.429     -6.369      0.000      -3.569      -1.890\nFare            0.0010      0.003      0.376      0.707      -0.004       0.006\nParch          -0.8874      0.223     -3.984      0.000      -1.324      -0.451\nPclass          0.5953      0.147      4.046      0.000       0.307       0.884\nSibSp           0.2548      0.095      2.684      0.007       0.069       0.441\nSurvived       -0.1026      0.198     -0.519      0.604      -0.490       0.285\n===============================================================================\n```\n\n### Interactive report\n\nUse `scikit_na.report()` function to 
show interactive report interface:\n\n```python\nna.report(data)\n```\n\n![Report](https://raw.githubusercontent.com/maximtrp/scikit-na/main/img/report_summary.png)\n\n### Export functionality\n\nExport your analysis results to various formats for sharing and further processing:\n\n#### Export summary statistics\n\n```python\n# Export to CSV\nna.export_summary(data, filename='missing_data_summary.csv', format='csv')\n\n# Export to JSON\nna.export_summary(data, filename='summary.json', format='json')\n\n# Export to Excel\nna.export_summary(data, filename='analysis.xlsx', format='xlsx')\n```\n\n#### Export interactive reports\n\n```python\n# Export complete report to HTML\nna.export_report(data, filename='missing_data_report.html', format='html')\n\n# Export with custom columns\nna.export_report(\n    data, \n    columns=['Age', 'Cabin', 'Embarked'],\n    filename='focused_analysis.html', \n    format='html'\n)\n```\n\nThe export functionality supports:\n- **CSV**: Summary statistics in tabular format\n- **JSON**: Structured data for programmatic access  \n- **HTML**: Interactive reports for web viewing\n- **XLSX**: Excel-compatible spreadsheets\n\n## API Reference\n\n### Statistical Functions\n- `summary(data, columns=None, per_column=True, round_dec=2)` - Missing data statistics\n- `correlate(data, columns=None, drop=True, **kwargs)` - Correlation analysis  \n- `describe(data, col_na, columns=None, na_mapping=None)` - Grouped descriptive stats\n- `model(data, col_na, columns=None, intercept=True, **kwargs)` - Logistic regression\n- `test_hypothesis(data, col_na, test_fn, columns=None, **kwargs)` - Hypothesis testing\n- `stairs(data, columns=None, **kwargs)` - Dataset shrinkage analysis\n\n### Visualization Functions\n- `altair.plot_heatmap(data, **kwargs)` - Missing data heatmap\n- `altair.plot_corr(data, **kwargs)` - Correlation heatmap  \n- `altair.plot_stairs(data, **kwargs)` - Stairs plot\n- `altair.plot_hist(data, col, col_na, **kwargs)` - Missing data 
histogram\n\n### Export Functions  \n- `export_summary(data, filename, format, **kwargs)` - Export summary statistics\n- `export_report(data, filename, format, **kwargs)` - Export interactive reports\n\n### Interactive Reports\n- `report(data, columns=None, **kwargs)` - Generate interactive widget-based report\n\n## Contribution\n\nAny contribution is highly appreciated: pull requests, suggestions, or bug reports.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaximtrp%2Fscikit-na","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaximtrp%2Fscikit-na","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaximtrp%2Fscikit-na/lists"}