{"id":20285414,"url":"https://github.com/mramshaw/ml_with_missing_data","last_synced_at":"2026-04-07T14:01:21.209Z","repository":{"id":43879866,"uuid":"163690116","full_name":"mramshaw/ML_with_Missing_Data","owner":"mramshaw","description":"How to handle missing or incomplete data","archived":false,"fork":false,"pushed_at":"2024-07-17T06:14:15.000Z","size":490,"stargazers_count":1,"open_issues_count":26,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-12-28T01:50:46.071Z","etag":null,"topics":["incomplete-data","machine-learning","matplotlib","ml","numpy","pandas","python","python3","scikit-learn","seaborn","sklearn"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mramshaw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-31T18:31:57.000Z","updated_at":"2023-11-01T03:03:41.000Z","dependencies_parsed_at":"2024-01-21T18:27:44.847Z","dependency_job_id":"ae47dc49-480f-48a7-8b0e-916b3ab942ce","html_url":"https://github.com/mramshaw/ML_with_Missing_Data","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mramshaw/ML_with_Missing_Data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FML_with_Missing_Data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FML_with_Missing_Data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FML_with_Missing_Data/releases","manifests
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FML_with_Missing_Data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mramshaw","download_url":"https://codeload.github.com/mramshaw/ML_with_Missing_Data/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FML_with_Missing_Data/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31515151,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-07T03:10:19.677Z","status":"ssl_error","status_checked_at":"2026-04-07T03:10:13.982Z","response_time":105,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["incomplete-data","machine-learning","matplotlib","ml","numpy","pandas","python","python3","scikit-learn","seaborn","sklearn"],"created_at":"2024-11-14T14:26:31.003Z","updated_at":"2026-04-07T14:01:21.183Z","avatar_url":"https://github.com/mramshaw.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ML with Missing Data\n\n[![Known Vulnerabilities](http://snyk.io/test/github/mramshaw/ML_with_Missing_Data/badge.svg?style=plastic\u0026targetFile=requirements.txt)](http://snyk.io/test/github/mramshaw/ML_with_Missing_Data?style=plastic\u0026targetFile=requirements.txt)\n\nHow to handle missing or incomplete 
data\n\n## Motivation\n\nOne subject that often crops up is how to handle missing or incomplete data.\n\nI decided to try this tutorial to get some background on the issue. The\ngeneral approach will be as follows:\n\n1. Describe the data\n2. Check for missing values\n3. Fill in any missing values\n4. Compare the filled-in values with the original values\n\nFollowing on from my [ML with SciPy](http://github.com/mramshaw/ML_with_SciPy)\nexercise, I make sure to carefully examine the structure of the data first!\n\n## Table of Contents\n\nThe table of contents is as follows:\n\n* [Missing Data](#missing-data)\n* [Data](#data)\n* [Summarize the dataset](#summarize-the-dataset)\n* [Reference](#reference)\n    * [cross_val_score](#cross_val_score)\n    * [distplot](#distplot)\n    * [dropna](#dropna)\n    * [fillna](#fillna)\n    * [isnull](#isnull)\n    * [mean](#mean)\n    * [median](#median)\n    * [mode](#mode)\n    * [replace](#replace)\n* [More on processing missing data](#more-on-processing-missing-data)\n* [To Do](#to-do)\n* [Credits](#credits)\n\n## Missing Data\n\nThis is a long-standing issue. If a sensitive or troublesome field is left as\noptional, it will tend to be either: left blank, or else populated with values\nsuch as __N/A__ (meaning possibly \"Not Applicable\" or \"Not Available\"). So, using\nSICs (Sales Industry Codes - which are generally three digits) as an example,\nif this field is made mandatory - and validated for being numeric - the easy data\nentry options will tend to be either \"000\" or \"999\" (although other options for\n\"unknown\" Sales Industry Codes are of course possible). 
But none of these values\nmake for good data analysis.\n\n[The essential problem is that data entry personnel generally lack both\n the training and the data to correctly determine the missing fields.\n Plus they are generally paid by volume, so it is not really in their\n best interests to spend a lot of time on their data-entry problems.]\n\n## Data\n\nWe will use the [Pima Indians Diabetes dataset](http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes).\n\nAs it no longer seems to be available, we will use the tutorial author's\n[version](http://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv).\n\nThis data is known to have missing values. It consists of:\n\n1. Number of times pregnant\n2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test\n3. Diastolic blood pressure (mm Hg)\n4. Triceps skinfold thickness (mm)\n5. 2-Hour serum insulin (mu U/ml)\n6. Body mass index (weight in kg/(height in m)^2)\n7. Diabetes pedigree function\n8. Age (years)\n9. 
Class variable (0 or 1)\n\n## Summarize the dataset\n\nThis looks as follows:\n\n```bash\n$ python missing_data.py \nRows, columns = (768, 9)\n\nThe first 20 observations\n-------------------------\n     0    1   2   3    4     5      6   7  8\n0    6  148  72  35    0  33.6  0.627  50  1\n1    1   85  66  29    0  26.6  0.351  31  0\n2    8  183  64   0    0  23.3  0.672  32  1\n3    1   89  66  23   94  28.1  0.167  21  0\n4    0  137  40  35  168  43.1  2.288  33  1\n5    5  116  74   0    0  25.6  0.201  30  0\n6    3   78  50  32   88  31.0  0.248  26  1\n7   10  115   0   0    0  35.3  0.134  29  0\n8    2  197  70  45  543  30.5  0.158  53  1\n9    8  125  96   0    0   0.0  0.232  54  1\n10   4  110  92   0    0  37.6  0.191  30  0\n11  10  168  74   0    0  38.0  0.537  34  1\n12  10  139  80   0    0  27.1  1.441  57  0\n13   1  189  60  23  846  30.1  0.398  59  1\n14   5  166  72  19  175  25.8  0.587  51  1\n15   7  100   0   0    0  30.0  0.484  32  1\n16   0  118  84  47  230  45.8  0.551  31  1\n17   7  107  74   0    0  29.6  0.254  31  1\n18   1  103  30  38   83  43.3  0.183  33  0\n19   1  115  70  30   96  34.6  0.529  32  1\n```\n\nExamining the first 20 observations, we can see zeroes\n(but no troublesome \"99\" or \"999\" values - perhaps medical\npersonnel are closer to the data) in a number of columns.\nIt is only reasonable that there should be zeroes in the\nfirst and last columns. 
So we will check for zeroes in all\nof the other columns:\n\n```bash\nNumber of zero values\n---------------------\n1      5\n2     35\n3    227\n4    374\n5     11\n6      0\n7      0\ndtype: int64\n```\n\nIt looks like the only problem areas are columns 1,\n2, 3, 4 and 5.\n\nAccording to the tutorial, it is standard practice in Python (specifically Pandas,\nNumPy and Scikit-Learn) to mark missing values as NaN.\n\nFirstly, check for missing values using the Pandas [isnull](#isnull) function before\ndoing any data munging:\n\n```bash\nNumber of missing fields (original)\n-----------------------------------\n0    0\n1    0\n2    0\n3    0\n4    0\n5    0\n6    0\n7    0\n8    0\ndtype: int64\n\nStatistics (original)\n---------------------\n                0           1           2           3           4           5  \\\ncount  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000   \nmean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578   \nstd      3.369578   31.972618   19.355807   15.952218  115.244002    7.884160   \nmin      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   \n25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000   \n50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000   \n75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000   \nmax     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000   \n\n                6           7           8  \ncount  768.000000  768.000000  768.000000  \nmean     0.471876   33.240885    0.348958  \nstd      0.331329   11.760232    0.476951  \nmin      0.078000   21.000000    0.000000  \n25%      0.243750   24.000000    0.000000  \n50%      0.372500   29.000000    0.000000  \n75%      0.626250   41.000000    1.000000  \nmax      2.420000   81.000000    1.000000  \n```\n\nNow we will use the Pandas [replace](#replace) function to replace our troublesome zero values 
with __NaN__.\n\nAnd check again for zero (missing) values:\n\n```bash\nNumber of missing fields (zero fields flagged as NaN)\n-----------------------------------------------------\n0      0\n1      5\n2     35\n3    227\n4    374\n5     11\n6      0\n7      0\n8      0\ndtype: int64\n```\n\nAnd columns 1, 2, 3, 4 and 5 have missing values.\n\nLet's get the stats for the columns we will be filling:\n\n```bash\nStatistics (pre-fill)\n---------------------\n                1           2           3           4           5\ncount  763.000000  733.000000  541.000000  394.000000  757.000000\nmean   121.686763   72.405184   29.153420  155.548223   32.457464\nstd     30.535641   12.382158   10.476982  118.775855    6.924988\nmin     44.000000   24.000000    7.000000   14.000000   18.200000\n25%     99.000000   64.000000   22.000000   76.250000   27.500000\n50%    117.000000   72.000000   29.000000  125.000000   32.300000\n75%    141.000000   80.000000   36.000000  190.000000   36.600000\nmax    199.000000  122.000000   99.000000  846.000000   67.100000\n```\n\nNote that the counts for our troublesome columns have changed as the\n(probably) missing fields are ignored - plus the means and standard\ndeviations have changed.\n\nLet's fill in the missing values with the average (mean) value for that feature.\n\nAnd check again for missing values (there shouldn't be any):\n\n```bash\nNumber of missing fields (post-fill)\n------------------------------------\n0    0\n1    0\n2    0\n3    0\n4    0\n5    0\n6    0\n7    0\n8    0\ndtype: int64\n```\n\nNow let's get the stats for the columns we filled in:\n\n```bash\nStatistics (post-fill)\n----------------------\n                1           2           3           4           5\ncount  768.000000  768.000000  768.000000  768.000000  768.000000\nmean   121.681605   72.254807   26.606479  118.660163   32.450805\nstd     30.436016   12.115932    9.631241   93.080358    6.875374\nmin     44.000000   24.000000    7.000000   14.000000   
18.200000\n25%     99.750000   64.000000   20.536458   79.799479   27.500000\n50%    117.000000   72.000000   23.000000   79.799479   32.000000\n75%    140.250000   80.000000   32.000000  127.250000   36.600000\nmax    199.000000  122.000000   99.000000  846.000000   67.100000\n```\n\nThe means for columns 3 and 4 are different (in both of these columns\nzero was actually the __mode__ - or most common value), but otherwise\nit's mainly the distributions that have shifted as the zero values\nhave been adjusted:\n\n![Column 1](images/Column_1.png)\n\n![Column 2](images/Column_2.png)\n\n![Column 3](images/Column_3.png)\n\n![Column 4](images/Column_4.png)\n\n![Column 5](images/Column_5.png)\n\n[Column 5 only had 11 missing values. As it is fairly normally-distributed,\n the mode, median and mean distributions seem to be almost identical.]\n\nNote that we cannot use a dataset with NaN values for k-fold cross validation:\n\n```bash\nAccuracy (with NaN values)\n--------------------------\n/home/owner/.local/lib/python2.7/site-packages/sklearn/model_selection/_validation.py:542: FutureWarning: From version 0.22, errors during fit will result in a cross validation score of NaN by default. 
Use error_score='raise' if you want an exception raised or error_score=np.nan to adopt the behavior from version 0.22.\n  FutureWarning)\n\nInput contains NaN, infinity or a value too large for dtype('float64').\n```\n\n[Throws a __ValueError__, the message of which is shown.]\n\nNow we will use the Pandas [dropna](#dropna) function to drop any entries that contain __NaN__ values.\n\n```bash\nRows, columns (NaN values dropped) = (392, 9)\n\nStatistics (NaN values dropped)\n-------------------------------\n                1           2           3           4           5\ncount  392.000000  392.000000  392.000000  392.000000  392.000000\nmean   122.627551   70.663265   29.145408  156.056122   33.086224\nstd     30.860781   12.496092   10.516424  118.841690    7.027659\nmin     56.000000   24.000000    7.000000   14.000000   18.200000\n25%     99.000000   62.000000   21.000000   76.750000   28.400000\n50%    119.000000   70.000000   29.000000  125.500000   33.200000\n75%    143.000000   78.000000   37.000000  190.000000   37.100000\nmax    198.000000  110.000000   63.000000  846.000000   67.100000\n```\n\nAnd almost half of our entries have now been dropped.\n\nLet's compare our __k-fold cross validation__ with dropped and filled values:\n\n```bash\nAccuracy (with NaN values dropped)\n----------------------------------\n0.78582892934\n\nAccuracy (with NaN values filled)\n---------------------------------\n0.766927083333\n```\n\n[These are exactly the same as the tutorial's published values.]\n\nAnd finally let's use `seaborn` to graph our original values versus dropped values versus filled values:\n\n![Column 1 dropped](images/Column_1_dropped.png)\n\n![Column 2 dropped](images/Column_2_dropped.png)\n\n![Column 3 dropped](images/Column_3_dropped.png)\n\n![Column 4 dropped](images/Column_4_dropped.png)\n\n![Column 5 dropped](images/Column_5_dropped.png)\n\n## Reference\n\nVarious useful links (and comments) are listed below.\n\n#### cross_val_score\n\n    
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html\n\nWill throw a `ValueError` for missing data:\n\n    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').\n\n#### distplot\n\n    http://seaborn.pydata.org/generated/seaborn.distplot.html\n\nWill throw a `ValueError` for missing data:\n\n    ValueError: array must not contain infs or NaNs\n\n#### dropna\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html\n\nDefault behavior is to drop entries where ___Any___ field is NaN.\n\n#### fillna\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html\n\nFill in NA / NaN values.\n\n#### isnull\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html\n\nDetects missing values - such as `NaN` in numeric arrays, `None` or `NaN` in object arrays, `NaT` in datetimelike.\n\n#### mean\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html\n\nNote that the default value for __skipna__ is ___True___, which means invalid data\nwill be ignored when calculating the column mean.\n\n#### median\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html\n\nNote that the default value for __skipna__ is ___True___, which means invalid data\nwill be ignored when calculating the column median.\n\n#### mode\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html\n\nNote that multiple values may be returned for the selected axis.\nAlso that the default value for __numeric\\_only__ is ___False___.\n\n#### replace\n\n    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html\n\nNote that the value to be replaced can also be specified by a regex.\nAlso that the default value for __inplace__ is ___False___.\n\n## More on processing missing data\n\nmissing data with `pandas`:\n\n    
http://pandas.pydata.org/pandas-docs/stable/missing_data.html\n\nmissing data with `sklearn`:\n\n    http://scikit-learn.org/stable/modules/impute.html#impute\n\n## To Do\n\n- [x] Add a Snyk.io vulnerability scan badge\n- [x] Graph before and after (mean, median and mode) values\n- [x] Conform code to `pylint`, `pycodestyle` and `pydocstyle` standards\n- [ ] Fix annoying `sklearn` __FutureWarning__ warnings\n- [ ] Generate a [Monte Carlo](http://en.wikipedia.org/wiki/Monte_Carlo_method) style missing-data dataset\n      and evaluate how it performs (in comparison to its non-missing-data original)\n- [ ] Finish tutorial\n\n## Credits\n\nI (mainly) followed this excellent tutorial:\n\n    http://machinelearningmastery.com/handle-missing-data-python/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmramshaw%2Fml_with_missing_data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmramshaw%2Fml_with_missing_data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmramshaw%2Fml_with_missing_data/lists"}