{"id":22420289,"url":"https://github.com/ehsan-ashik/missing-value-imputation-comparison","last_synced_at":"2025-08-12T21:17:59.991Z","repository":{"id":258088163,"uuid":"873359936","full_name":"ehsan-ashik/missing-value-imputation-comparison","owner":"ehsan-ashik","description":"R project for comparing different Missing Value Imputation (MVI)* approaches across three datasets.","archived":false,"fork":false,"pushed_at":"2024-10-17T04:18:27.000Z","size":436,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-01T10:11:51.759Z","etag":null,"topics":["expectation-maximization","k-nn","mice-algorithm","missing-value-imputation"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ehsan-ashik.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-16T03:16:43.000Z","updated_at":"2024-10-17T04:18:30.000Z","dependencies_parsed_at":"2024-10-19T03:33:56.082Z","dependency_job_id":null,"html_url":"https://github.com/ehsan-ashik/missing-value-imputation-comparison","commit_stats":null,"previous_names":["ehsan-ashik/missing-value-imputation-comparison"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ehsan-ashik%2Fmissing-value-imputation-comparison","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ehsan-ashik%2Fmissing-value-imputation-comparison/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/ehsan-ashik%2Fmissing-value-imputation-comparison/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ehsan-ashik%2Fmissing-value-imputation-comparison/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ehsan-ashik","download_url":"https://codeload.github.com/ehsan-ashik/missing-value-imputation-comparison/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245785715,"owners_count":20671631,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["expectation-maximization","k-nn","mice-algorithm","missing-value-imputation"],"created_at":"2024-12-05T16:18:31.259Z","updated_at":"2025-03-27T04:42:29.372Z","avatar_url":"https://github.com/ehsan-ashik.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Comparison of Different Missing Value Imputation (MVI) Techniques\r\n\r\nThis is an R project for *CSE 5717 - Big Data Analytics* in which different **Missing Value Imputation (MVI)** approaches are compared across three datasets.\r\n\r\n## Data cleanup for Big Data Analytics\r\n\r\n* A fundamental challenge for any big data analytics or data mining task is to ensure data quality.\r\n\r\n* Raw data is often incomplete, inconsistent, and inaccurate, and may contain:\r\n  1. Noise\r\n  2. Outliers\r\n  3. Inconsistencies\r\n  4. 
Missing values\r\n\r\n* Raw data must be processed and shaped into quality data through preprocessing and cleanup, a crucial step in data mining.\r\n\r\n## Missing Values\r\n\r\n* Missing values introduce incompleteness into the raw data.\r\n* They differ from empty data:\r\n  * Empty: no value can be assigned\r\n  * Missing: a value exists but is absent from the data\r\n\r\n* Many statistical techniques are not robust to missing data.\r\n* Identifying the root cause of missingness and choosing an appropriate response is often challenging.\r\n* Handling missing values is a key part of data cleanup and preprocessing before mining begins.\r\n\r\n## How to deal with Missing Values?\r\n\r\n**Strategy 1: Ignoring Missing Values**\r\n* Delete data with missing values before mining\r\n* Two deletion strategies:\r\n  * Listwise deletion: a record is deleted if any of its attributes is missing\r\n  * Pairwise deletion: a record is dropped only from analyses of the variables of interest in which it has missing values\r\n* Several limitations:\r\n  * If the missing rate is high, deletion significantly reduces the size of the data\r\n  * It may weaken the model and lead to inaccurate conclusions\r\n  * Even when the rate is small, a small amount of data often carries valuable information.\r\n\r\n**Strategy 2: Missing Value Imputation (MVI)**\r\n* Calculates plausible values using different strategies and replaces the missing values with them\r\n* When the missing rate is high, deletion is often not a feasible solution\r\n* Leverages information in the observed portion of the data, and its variability, to deal with missing values\r\n* A powerful tool for ensuring data quality in mining tasks\r\n* However, selecting the best imputation approach can be challenging: several factors must be considered, and a wrong approach can mislead and distort mining results\r\n\r\n\r\n## Missingness Mechanism\r\n\r\n**Why did missingness occur in the dataset?**\r\n\r\n* Three mechanisms explain missingness in the data:\r\n 
 * Missing Completely At Random (MCAR) - Assumes that missingness occurs completely at random and depends on neither the observed nor the unobserved portion of the data.\r\n  * Missing At Random (MAR) - Assumes that missingness depends on the observed portion of the data but not on the unobserved portion.\r\n  * Missing Not At Random (MNAR) - Assumes that the probability of a value being missing depends on unobserved data and on external factors that were not considered.\r\n\r\n* Identifying the mechanism helps in planning proper methods and tools to tackle missing values\r\n* It also helps in deciding whether deleting missing data is a viable option\r\n\r\n\r\n## Missing Rate\r\n\r\n* Indicates the ratio of missing values to the observed data\r\n* Helps identify the appropriate strategy for dealing with missing values\r\n* If the rate is low (5-10%), discarding records often does not significantly affect the accuracy of mining results.\r\n\r\n## MVI Techniques\r\n\r\n**Technique 1: Constant Value Imputation**\r\n\r\n* Replaces missing values with a constant (e.g., a global missing flag).\r\n\r\n* Works comparatively better for categorical data.\r\n\r\n* Disadvantage: Does not consider variability in the data (not “smart”).\r\n\r\n\r\n**Technique 2: Mean Imputation**\r\n\r\n* Replaces the missing values with the sample mean of the observed values\r\n\r\n* Works better with continuous data that are approximately normally distributed\r\n\r\n* Disadvantage: Does not consider correlations across variables\r\n\r\n**Technique 3: Median Imputation**\r\n\r\n* Replaces the missing values with the sample median of the observed values\r\n\r\n* Works better when the distribution of the variable with missing values is skewed\r\n\r\n* Disadvantage: Same! 
Does not consider correlations across variables\r\n\r\n**Technique 4: Expectation Maximization (EM) Imputation**\r\n\r\n* An iterative approach that computes maximum-likelihood estimates from the incomplete data\r\n* Two steps:\r\n  * *E-Step*: Estimates the missing data given the current parameter estimates.\r\n  * *M-Step*: Optimizes the parameters to best explain the data.\r\n\r\n* Alternates between the two steps until the parameter estimates converge\r\n\r\n\r\n**Technique 5: k-NN Imputation**\r\n* Uses the k-Nearest Neighbor algorithm to predict the missing data\r\n* Pseudocode:\r\n  1. Start with a suitable value for k, the number of nearest neighbors\r\n  2. Compute the similarity between the observed records and the record with the missing value using a distance function, e.g., Euclidean distance\r\n  3. Choose the k rows with the smallest distances as the k nearest neighbors of the record with the missing value\r\n  4. Calculate weights for the k nearest neighbors and estimate the missing value as the weighted average of their values\r\n\r\n* Disadvantage: Time consuming as the size of the data grows.\r\n* Disadvantage: Finding the optimal k value is often difficult.\r\n\r\n**Technique 6: Multiple Imputation by Chained Equations (MICE)**\r\n* Substitutes each missing value with multiple imputed values rather than a single one\r\n* 3 steps:\r\n  * *Generation*: Using an iterative approach, a total of m imputed datasets are created\r\n  * *Analysis*: each of the m datasets is analyzed, and the parameter of interest is estimated\r\n  * *Combination*: the final result is obtained by combining the results from the m datasets\r\n\r\n* Advantage: Less biased than the other methods\r\n* Disadvantage: Can be time consuming\r\n\r\n\r\n## MVI method comparison\r\n\r\n* Comparing five MVI approaches:\r\n  1. Mean Imputation\r\n  2. Median Imputation\r\n  3. EM Imputation\r\n  4. k-NN Imputation\r\n  5. 
MICE Imputation\r\n\r\n* Using three *numeric multivariate datasets from the UCI Machine Learning Repository*\r\n\r\n* Considering two levels of missing rate: *5%* and *55%*\r\n\r\n* Using *Normalized Root Mean Square Error (NRMSE)* as the evaluation metric: lower is better.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fehsan-ashik%2Fmissing-value-imputation-comparison","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fehsan-ashik%2Fmissing-value-imputation-comparison","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fehsan-ashik%2Fmissing-value-imputation-comparison/lists"}