{"id":19278246,"url":"https://github.com/tslu1s/mlimputer","last_synced_at":"2025-04-22T00:31:46.102Z","repository":{"id":65742777,"uuid":"598216115","full_name":"TsLu1s/mlimputer","owner":"TsLu1s","description":"MLimputer: Missing Data Imputation Framework for Machine Learning","archived":false,"fork":false,"pushed_at":"2025-01-30T15:24:40.000Z","size":4423,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-12T01:54:21.600Z","etag":null,"topics":["automated-machine-learning","data-science","imputation-algorithm","imputation-methods","imputation-optimizer","machine-learning","missing-data","missing-data-handling","missing-data-imputation","null-imputation","predictive-imputation","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TsLu1s.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-02-06T16:40:27.000Z","updated_at":"2025-03-03T18:54:36.000Z","dependencies_parsed_at":"2024-01-07T16:35:37.073Z","dependency_job_id":"75dde2b6-1401-4a5a-999b-f8f5bdded078","html_url":"https://github.com/TsLu1s/mlimputer","commit_stats":{"total_commits":64,"total_committers":2,"mean_commits":32.0,"dds":0.109375,"last_synced_commit":"39d38ba2566c012f071a7696be0762ed4c3d6142"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TsLu1s%2Fmlimputer","tags_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/TsLu1s%2Fmlimputer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TsLu1s%2Fmlimputer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TsLu1s%2Fmlimputer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TsLu1s","download_url":"https://codeload.github.com/TsLu1s/mlimputer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250157823,"owners_count":21384331,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automated-machine-learning","data-science","imputation-algorithm","imputation-methods","imputation-optimizer","machine-learning","missing-data","missing-data-handling","missing-data-imputation","null-imputation","predictive-imputation","python"],"created_at":"2024-11-09T21:08:54.396Z","updated_at":"2025-04-22T00:31:46.081Z","avatar_url":"https://github.com/TsLu1s.png","language":"Python","readme":"[![LinkedIn][linkedin-shield]][linkedin-url]\n[![Contributors][contributors-shield]][contributors-url]\n[![Stargazers][stars-shield]][stars-url]\n[![MIT License][license-shield]][license-url]\n[![Downloads][downloads-shield]][downloads-url]\n[![Month Downloads][downloads-month-shield]][downloads-month-url]\n\n[contributors-shield]: https://img.shields.io/github/contributors/TsLu1s/MLimputer.svg?style=for-the-badge\u0026logo=github\u0026logoColor=white\n[contributors-url]: https://github.com/TsLu1s/MLimputer/graphs/contributors\n[stars-shield]: 
https://img.shields.io/github/stars/TsLu1s/MLimputer.svg?style=for-the-badge\u0026logo=github\u0026logoColor=white\n[stars-url]: https://github.com/TsLu1s/MLimputer/stargazers\n[license-shield]: https://img.shields.io/github/license/TsLu1s/MLimputer.svg?style=for-the-badge\u0026logo=opensource\u0026logoColor=white\n[license-url]: https://github.com/TsLu1s/MLimputer/blob/main/LICENSE\n[linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=for-the-badge\u0026logo=linkedin\u0026colorB=555\n[linkedin-url]: https://www.linkedin.com/in/luísfssantos/\n[downloads-shield]: https://static.pepy.tech/personalized-badge/mlimputer?period=total\u0026units=international_system\u0026left_color=grey\u0026right_color=blue\u0026left_text=Total%20Downloads\n[downloads-url]: https://pepy.tech/project/mlimputer\n[downloads-month-shield]: https://static.pepy.tech/personalized-badge/mlimputer?period=month\u0026units=international_system\u0026left_color=grey\u0026right_color=blue\u0026left_text=Month%20Downloads\n[downloads-month-url]: https://pepy.tech/project/mlimputer\n\n\u003cbr\u003e\n\u003cp align=\"center\"\u003e\n  \u003ch2 align=\"center\"\u003e MLimputer: Missing Data Imputation Framework for Machine Learning\n  \u003cbr\u003e\n  \n## Framework Contextualization \u003ca name = \"ta\"\u003e\u003c/a\u003e\n\nThe `MLimputer` project is a complete and integrated pipeline that automates the handling of missing values in datasets through regression prediction. It aims to reduce bias and increase the precision of imputation results compared to more classic imputation methods.\nThis package provides multiple algorithm options to impute your data: every observed data column with missing values is fitted with a robust preprocessing approach and subsequently predicted.\n\nThe architecture design includes three main sections: missing data analysis, data preprocessing, and supervised model imputation, which are organized in a 
customizable pipeline structure.\n\nThis project aims to provide the following application capabilities:\n\n* General applicability on tabular datasets: The developed imputation procedures apply to any data table in any supervised ML scope, based on the columns with missing data to be imputed.\n    \n* Robustness and improvement of predictive results: The MLimputer preprocessing aims to improve predictive performance through customization and optimization of missing value imputation in the dataset input columns. \n   \n#### Main Development Tools \u003ca name = \"pre1\"\u003e\u003c/a\u003e\n\nMajor frameworks used to build this project: \n\n* [Pandas](https://pandas.pydata.org/)\n* [Sklearn](https://scikit-learn.org/stable/)\n* [CatBoost](https://catboost.ai/)\n    \n## Where to get it \u003ca name = \"ta\"\u003e\u003c/a\u003e\n    \nBinary installer for the latest released version is available at the Python Package Index [(PyPI)](https://pypi.org/project/mlimputer/).   \n\n## Installation  \n\nTo install this package from the PyPI repository, run the following command:\n\n```\npip install mlimputer\n```\n\n# MLImputer - Usage Examples\n    \nThe first step after importing the package is to load a dataset, split it, and define your chosen imputation model.\nThe imputation model options for handling the missing data in your dataset are the following:\n* `RandomForest`\n* `ExtraTrees`\n* `GBR`\n* `KNN`\n* `XGBoost`\n* `Lightgbm`\n* `Catboost`\n\nAfter creating a `MLimputer` object with your selected imputation model, you can fit it to the missing data through the `fit_imput` method. From there you can impute future datasets (validation, test, ...) that share the same data properties with `transform_imput`. 
Note that, as shown in the example below, you can also customize the imputer model's parameters by changing its configuration and passing it through the `imputer_configs` parameter.\n\nThrough the `cross_validation` function you can also compare the predictive performance of multiple imputations, allowing you to validate which imputation model best fits your future predictions.\n\n```py\n\nfrom mlimputer.imputation import MLimputer\nimport mlimputer.model_selection as ms\nfrom mlimputer.parameters import imputer_parameters\nimport pandas as pd\nimport numpy as np\nfrom sklearn.model_selection import train_test_split\nimport warnings\nwarnings.filterwarnings(\"ignore\", category=Warning) #-\u003e For a clean console\n\ndata = pd.read_csv('csv_directory_path') # Dataframe Loading Example\n# Important note: If Classification, target should be categorical.  -\u003e data[target]=data[target].astype('object')\n\ntrain,test = train_test_split(data, train_size=0.8)\ntrain,test = train.reset_index(drop=True), test.reset_index(drop=True) # \u003c- Required\n\n# All model imputation options -\u003e  \"RandomForest\",\"ExtraTrees\",\"GBR\",\"KNN\",\"XGBoost\",\"Lightgbm\",\"Catboost\"\n\n# Customizing Hyperparameters Example\n\nhparameters = imputer_parameters()\nprint(hparameters)\nhparameters[\"KNN\"][\"n_neighbors\"] = 5\nhparameters[\"RandomForest\"][\"n_estimators\"] = 30\n    \n# Imputation Example 1 : KNN\n\nmli_knn = MLimputer(imput_model = \"KNN\", imputer_configs = hparameters)\nmli_knn.fit_imput(X = train)\ntrain_knn = mli_knn.transform_imput(X = train)\ntest_knn = mli_knn.transform_imput(X = test)\n\n# Imputation Example 2 : RandomForest\n\nmli_rf = MLimputer(imput_model = \"RandomForest\", imputer_configs = hparameters)\nmli_rf.fit_imput(X = train)\ntrain_rf = mli_rf.transform_imput(X = train)\ntest_rf = mli_rf.transform_imput(X = test)\n    \n#(...)\n\n## Export Imputation Metadata\nimport pickle \noutput = open(\"imputer_rf.pkl\", 
'wb')\npickle.dump(mli_rf, output)\noutput.close()  # Close the file so the pickle is flushed to disk\n\n```\n\n## Performance Evaluation\nThe MLimputer framework includes an evaluation module that enables users to assess and compare the performance of different imputation strategies. This evaluation process is crucial for selecting the most effective imputation approach for your specific dataset and use case.\n\n### Evaluation Process Overview\nThe framework implements a two-stage evaluation approach:\n1. Cross-Validation Assessment: Evaluates multiple imputation models using k-fold cross-validation to ensure robust performance metrics.\n2. Test Set Validation: Validates the selected imputation strategy on a separate test set to confirm generalization capability.\n\n### Implementation Example:\nThe following example demonstrates how to evaluate imputation models and select the best-performing approach for your data:\n\n```py\n# Assumes `Evaluator` is a class exposed by the mlimputer.evaluation module\nfrom mlimputer.evaluation import Evaluator\nfrom sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\nfrom sklearn.tree import DecisionTreeClassifier\nfrom xgboost import XGBRegressor\n\n# Placeholder: replace with the name of your target column\ntarget = \"target_column_name\"\n\n# Define evaluation parameters\nimputation_models = [\"RandomForest\", \"ExtraTrees\", \"GBR\", \"KNN\",]\n                    #\"XGBoost\", \"Lightgbm\", \"Catboost\"]   # List of imputation models to evaluate\nn_splits = 3  # Number of splits for cross-validation\n\n# Selected models for classification and regression\nif train[target].dtypes == \"object\":\n    models = [RandomForestClassifier(), DecisionTreeClassifier()]\nelse:\n    models = [XGBRegressor(), RandomForestRegressor()]\n\n# Initialize the evaluator\nevaluator = Evaluator(\n    imputation_models = imputation_models,  \n    train = train,\n    target = target,\n    n_splits = n_splits,     \n    hparameters = hparameters)\n\n# Perform evaluations\ncv_results = evaluator.evaluate_imputation_models(\n    models = models)\n\nbest_imputer = 
evaluator.get_best_imputer()  # Get best-performing imputation model\n\ntest_results = evaluator.evaluate_test_set(\n    test = test,\n    imput_model = best_imputer,\n    models = models)\n\n```\n    \n## License\n\nDistributed under the MIT License. See [LICENSE](https://github.com/TsLu1s/MLimputer/blob/main/LICENSE) for more information.\n\n## Contact \n \nLuis Santos - [LinkedIn](https://www.linkedin.com/in/lu%C3%ADsfssantos/)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftslu1s%2Fmlimputer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftslu1s%2Fmlimputer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftslu1s%2Fmlimputer/lists"}