{"id":15026090,"url":"https://github.com/residentmario/missingno","last_synced_at":"2025-05-13T19:07:01.473Z","repository":{"id":37694771,"uuid":"54834492","full_name":"ResidentMario/missingno","owner":"ResidentMario","description":"Missing data visualization module for Python.","archived":false,"fork":false,"pushed_at":"2024-05-14T18:30:13.000Z","size":10860,"stargazers_count":4088,"open_issues_count":15,"forks_count":525,"subscribers_count":76,"default_branch":"master","last_synced_at":"2025-04-28T00:38:28.988Z","etag":null,"topics":["data-analysis","data-visualization","missing-data","pandas","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ResidentMario.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-03-27T15:18:50.000Z","updated_at":"2025-04-26T17:16:46.000Z","dependencies_parsed_at":"2022-07-12T22:30:34.351Z","dependency_job_id":"6fd1abf5-7cfe-45e6-8b41-6602f3661a1e","html_url":"https://github.com/ResidentMario/missingno","commit_stats":{"total_commits":177,"total_committers":19,"mean_commits":9.31578947368421,"dds":"0.29378531073446323","last_synced_commit":"570fa089ba6338e02342ed990bbc1b0bedc54314"},"previous_names":[],"tags_count":26,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ResidentMario%2Fmissingno","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ResidentMario%2Fmissingno/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/ResidentMario%2Fmissingno/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ResidentMario%2Fmissingno/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ResidentMario","download_url":"https://codeload.github.com/ResidentMario/missingno/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254010810,"owners_count":21998993,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-visualization","missing-data","pandas","python"],"created_at":"2024-09-24T20:03:43.920Z","updated_at":"2025-05-13T19:07:01.427Z","avatar_url":"https://github.com/ResidentMario.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# missingno [![PyPi version](https://img.shields.io/pypi/v/missingno.svg)](https://pypi.python.org/pypi/missingno/) [![](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/) ![t](https://img.shields.io/badge/status-maintained-yellow.svg) [![](https://img.shields.io/github/license/ResidentMario/missingno.svg)](https://github.com/ResidentMario/missingno/blob/master/LICENSE.md) [![](https://img.shields.io/badge/doi-10.21105/joss.00547+-blue.svg)](https://joss.theoj.org/papers/10.21105/joss.00547)\n\nMessy datasets? Missing values? `missingno` provides a small toolset of flexible and easy-to-use missing data\nvisualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. 
Just `pip install missingno` to get started.\n\n## quickstart\n\nThis quickstart uses a sample of the [NYPD Motor Vehicle Collisions Dataset](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95).\n\n```python\nimport pandas as pd\ncollisions = pd.read_csv(\"https://raw.githubusercontent.com/ResidentMario/missingno-data/master/nyc_collision_factors.csv\")\n```\n\n### `matrix`\n\nThe `msno.matrix` nullity matrix is a data-dense display which lets you quickly spot patterns in\ndata completion.\n\n```python\nimport missingno as msno\n%matplotlib inline\nmsno.matrix(collisions.sample(250))\n```\n\n![alt text][two_hundred_fifty]\n\n[two_hundred_fifty]: https://i.imgur.com/gWuXKEr.png\n\nAt a glance, date, time, the distribution of injuries, and the contributing factor of the first vehicle appear to be completely populated, while geographic information seems mostly complete, but spottier.\n\nThe sparkline at right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.\n\nThis visualization will comfortably accommodate up to 50 labeled variables. 
Past that range, labels begin to overlap or become unreadable, and by default large displays omit them.\n\nIf you are working with time-series data, you can [specify a periodicity](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases)\nusing the `freq` keyword parameter:\n\n```python\nimport numpy as np\nnull_pattern = (np.random.random(1000).reshape((50, 20)) \u003e 0.5).astype(bool)\nnull_pattern = pd.DataFrame(null_pattern).replace({False: None})\nmsno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ')\n```\n\n![alt text][ts_matrix]\n\n[ts_matrix]: https://i.imgur.com/VLvWpsV.png\n\n### `bar`\n\n`msno.bar` is a simple visualization of nullity by column:\n\n```python\nmsno.bar(collisions.sample(1000))\n```\n\n![alt text][bar]\n\n[bar]: https://i.imgur.com/2BxEfOr.png\n\nYou can switch to a logarithmic scale by specifying `log=True`. `bar` provides the same information as `matrix`, but in a simpler format.\n\n### `heatmap`\n\nThe `missingno` correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:\n\n```python\nmsno.heatmap(collisions)\n```\n\n![alt text][heatmap]\n\n[heatmap]: https://i.imgur.com/JalSKyE.png\n\nIn this example, it seems that reports which are filed with an `OFF STREET NAME` variable are less likely to have complete geographic data.\n\nNullity correlation ranges from `-1` (if one variable appears the other definitely does not) to `0` (variables appearing or not appearing have no effect on one another) to `1` (if one variable appears the other definitely also does).\n\nThe exact algorithm used is:\n\n```python\nimport numpy as np\n\n# df is a pandas.DataFrame instance\ndf = df.iloc[:, [i for i, n in enumerate(np.var(df.isnull(), axis='rows')) if n \u003e 0]]\ncorr_mat = df.isnull().corr()\n```\n\nVariables that are always full or always empty have no meaningful correlation, and so are silently removed from the 
visualization\u0026mdash;in this case, for instance, the datetime and injury number columns, which are completely filled, are not included.\n\nEntries marked `\u003c1` or `\u003e-1` have a correlation that is close to being exactly negative or positive, but is still not quite perfectly so. This points to a small number of records in the dataset which are erroneous. For example, in this dataset the correlation between `VEHICLE TYPE CODE 3` and `CONTRIBUTING FACTOR VEHICLE 3` is `\u003c1`, indicating that, contrary to our expectation, there are a few records which have one or the other, but not both. These cases will require special attention.\n\nThe heatmap works great for picking out data completeness relationships between variable pairs, but its explanatory power is limited when it comes to larger relationships and it has no particular support for extremely large datasets.\n\n### `dendrogram`\n\nThe dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap:\n\n```python\nmsno.dendrogram(collisions)\n```\n\n![alt text][dendrogram]\n\n[dendrogram]: https://i.imgur.com/oIiR4ct.png\n\nThe dendrogram uses a [hierarchical clustering algorithm](http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html)\n(courtesy of `scipy`) to bin variables against one another by their nullity correlation (measured in terms of\nbinary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. 
The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.\n\nThe exact algorithm used is:\n\n```python\nfrom scipy.cluster import hierarchy\nimport numpy as np\n\n# df is a pandas.DataFrame instance\n# method is the name of a scipy linkage method, e.g. 'average'\nx = np.transpose(df.isnull().astype(int).values)\nz = hierarchy.linkage(x, method)\n```\n\nTo interpret this graph, read it from a top-down perspective. Cluster leaves which link together at a distance of zero fully predict one another's presence\u0026mdash;one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the variables which are required and therefore present in every record.\n\nCluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If your own interpretation of the dataset is that these columns actually *do* or *ought to* match each other in nullity (for example, as `CONTRIBUTING FACTOR VEHICLE 2` and `VEHICLE TYPE CODE 2` ought to), then the height of the cluster leaf tells you, in absolute terms, how often the records are \"mismatched\" or incorrectly filed\u0026mdash;that is, how many values you would have to fill in or drop, if you are so inclined.\n\nAs with `matrix`, only up to 50 labeled columns will comfortably display in this configuration. However, the\n`dendrogram` more elegantly handles extremely large datasets by simply flipping to a horizontal configuration.\n\n## configuration\n\nFor more advanced configuration details for your plots, refer to the `CONFIGURATION.md` file in this repository.\n\n## contributing\n\nFor thoughts on features or bug reports see [Issues](https://github.com/ResidentMario/missingno/issues). If you're interested in contributing to this library, see details on doing so in the `CONTRIBUTING.md` file in this repository. 
If doing so, keep in mind that `missingno` is currently in a maintenance state, so while bugfixes are welcome, I am unlikely to review or land any new major library features.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fresidentmario%2Fmissingno","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fresidentmario%2Fmissingno","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fresidentmario%2Fmissingno/lists"}