{"id":51081442,"url":"https://github.com/royxlead/multi-objective-feature-selection","last_synced_at":"2026-06-23T18:32:39.738Z","repository":{"id":353014568,"uuid":"1217614373","full_name":"royxlead/multi-objective-feature-selection","owner":"royxlead","description":"NSGA-II multi-objective feature selection on medical tabular data. 9 of 30 features at 94.74% accuracy - matching full-feature baselines with 70% feature reduction.","archived":false,"fork":false,"pushed_at":"2026-06-15T15:54:13.000Z","size":1030,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-15T17:25:14.606Z","etag":null,"topics":["deap","evolutionary-algorithms","feature-selection","interpretable-ml","medical-ml","multi-objective-optimization","nsga2","pareto-front","random-forest","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/royxlead.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-22T03:57:56.000Z","updated_at":"2026-06-15T15:57:54.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/royxlead/multi-objective-feature-selection","commit_stats":null,"previous_names":["royxlead/multi-objective-evolutionary-feature-selection-python","royxlead/multi-objective-feature-selection"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/royxlead/mult
i-objective-feature-selection","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/royxlead%2Fmulti-objective-feature-selection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/royxlead%2Fmulti-objective-feature-selection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/royxlead%2Fmulti-objective-feature-selection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/royxlead%2Fmulti-objective-feature-selection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/royxlead","download_url":"https://codeload.github.com/royxlead/multi-objective-feature-selection/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/royxlead%2Fmulti-objective-feature-selection/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34702913,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-23T02:00:07.161Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deap","evolutionary-algorithms","feature-selection","interpretable-ml","medical-ml","multi-objective-optimization","nsga2","pareto-front","random-forest","scikit-learn"],"created_at":"2026-06-23T18:32:38.569Z","updated_at":"2026-06-23T18:32:39.727Z","avatar_url":"https://github.com/royxlead.png","l
anguage":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multi-Objective Feature Selection\n### NSGA-II for Medical Tabular Data\n\n\u003cp align=\"left\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Python-3.10%2B-3776AB?style=flat-square\u0026logo=python\u0026logoColor=white\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Algorithm-NSGA--II-14b8a6?style=flat-square\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Domain-Medical%20Tabular-f59e0b?style=flat-square\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Scikit--learn-F7931E?style=flat-square\u0026logo=scikitlearn\u0026logoColor=white\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/License-MIT-6366f1?style=flat-square\" /\u003e\n\u003c/p\u003e\n\n\u003e Applying NSGA-II multi-objective evolutionary optimization to feature selection on medical tabular data. Simultaneously optimizes two competing objectives - maximizing classifier accuracy and minimizing feature count - surfacing the full Pareto-optimal trade-off curve rather than collapsing it into a single metric.\n\n---\n\n## Table of Contents\n\n- [The Problem](#the-problem)\n- [Key Result](#key-result)\n- [Why Multi-Objective](#why-multi-objective)\n- [Algorithm: NSGA-II](#algorithm-nsga-ii)\n- [Experimental Setup](#experimental-setup)\n- [Results](#results)\n- [Baseline Comparison](#baseline-comparison)\n- [Repository Structure](#repository-structure)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Research Context](#research-context)\n- [Citation](#citation)\n\n---\n\n## The Problem\n\nIn medical ML, feature selection is not a single-objective problem. Fewer features means lower data collection cost, reduced patient burden, more interpretable models, and lower overfitting risk. More features means higher potential accuracy. 
The optimal trade-off is dataset-dependent and cannot be determined by a single scalar metric.\n\nStandard approaches (RFE, importance-based selection, PCA) collapse this trade-off into a single number and return one solution. They do not tell you what accuracy you give up by removing two more features, or what you gain by adding three.\n\n**NSGA-II surfaces the entire trade-off curve. Practitioners choose any point on the Pareto front based on their deployment constraints.**\n\n---\n\n## Key Result\n\n\u003e **9 features out of 30 (70% reduction) at 94.74% test accuracy** - matching the best full-feature baseline (Grid Search RandomForest, 30 features, 94.74%) while discarding 21 features entirely.\n\n---\n\n## Why Multi-Objective\n\nSingle-objective feature selection methods used as baselines:\n\n| Method | Test Accuracy | Features Used | Trade-off Visible |\n|---|---|---|---|\n| Grid Search RF | 94.74% | 30 | No |\n| RF Importance + RF | 94.74% | 30 | No |\n| RFE + RandomForest | 94.74% | 7 | No |\n| PCA + RandomForest | 93.86% | 10 components | No |\n| Random Search RF | 91.23% | 30 | No |\n| **NSGA-II (this work)** | **94.74%** | **9** | **Yes - full Pareto front** |\n\nRFE reaches 7 features at the same accuracy, but returns a single solution with no information about what the curve looks like between 7 and 30 features. NSGA-II returns the full front - every Pareto-optimal (accuracy, feature count) pair discovered during evolution.\n\n---\n\n## Algorithm: NSGA-II\n\nNon-dominated Sorting Genetic Algorithm II maintains a population of candidate (feature subset, classifier hyperparameter) pairs and evolves them across generations toward the Pareto front using four mechanisms:\n\n**Non-dominated sorting** ranks solutions by Pareto dominance. A solution A dominates B if A is no worse than B on all objectives and strictly better on at least one. 
Solutions are sorted into fronts: Front 1 contains all non-dominated solutions, Front 2 contains solutions dominated only by Front 1, and so on.\n\n**Crowding distance** preserves diversity along each front by measuring how isolated a solution is from its neighbors in objective space. Solutions in sparse regions are preferred during selection, preventing the population from collapsing to a single point on the front.\n\n**Tournament selection** selects parents for the next generation. Candidates are compared first by front rank, then by crowding distance.\n\n**Elitism** combines parents and offspring into a pool of size 2N, then selects the top N by rank and crowding distance. Best solutions are never lost.\n\n```\nInitialize population of (feature mask, RF hyperparameters)\nPopulation size: 64, Generations: 30\n         |\n         v\n+---------------------+\n|  Evaluate           |   CV training accuracy + feature count\n|  (two objectives)   |   RandomForest with evolved hyperparameters\n+----------+----------+\n           |\n           v\n+---------------------+\n|  Non-dominated Sort |   Assign Pareto front rank to each individual\n|  + Crowding Distance|   Measure isolation within each front\n+----------+----------+\n           |\n           v\n+---------------------+\n|  Tournament Select  |   Rank first, crowding distance as tiebreaker\n|  Crossover + Mutate |   Uniform mask crossover (p=0.9)\n|                     |   Bit-flip mutation (p=0.02 per feature)\n|                     |   Blend crossover on hyperparams (alpha=0.3)\n|                     |   Gaussian hyperparameter mutation (sigma=0.1)\n+----------+----------+\n           |\n     Next Generation\n           |\n     (30 generations)\n           |\n           v\n    Pareto Front of (feature subset, classifier) pairs\n```\n\n### Evolved Hyperparameters\n\nNSGA-II jointly evolves feature masks and RandomForest hyperparameters:\n\n| Hyperparameter | Search Range |\n|---|---|\n| n_estimators | 60 - 400 
|\n| max_depth | 2 - 24 |\n| min_samples_split | 2 - 20 |\n| min_samples_leaf | 1 - 10 |\n| max_features | 0.15 - 1.0 (fraction) |\n\nThe framework also supports SVM (RBF/linear, C and gamma evolved in log space) and optionally XGBoost. The Pareto-optimal solution in this run converged on RandomForest.\n\n---\n\n## Experimental Setup\n\n**Dataset:** Wisconsin Breast Cancer (scikit-learn built-in, UCI origin)\n\n| Property | Value |\n|---|---|\n| Samples | 569 |\n| Features | 30 (continuous) |\n| Task | Binary classification: malignant vs. benign |\n| Class balance | Balanced weighting (class_weight=\"balanced\") |\n\n**NSGA-II Configuration:**\n\n| Parameter | Value |\n|---|---|\n| Population size | 64 |\n| Generations | 30 |\n| Crossover probability | 0.9 |\n| Mutation probability (individual) | 0.3 |\n| Bit-flip probability (per feature) | 0.02 |\n| Hyperparameter mutation | Gaussian perturbation, sigma=0.1 |\n| Crossover type | Uniform (feature mask) + Blend alpha=0.3 (hyperparameters) |\n| Selection | NSGA-II via DEAP selNSGA2 |\n\n---\n\n## Results\n\n### Pareto-Optimal Solution\n\nThe Pareto front converged to a solution that matches full-feature baselines with 70% of the feature space removed:\n\n| Metric | Value |\n|---|---|\n| Test Accuracy | **94.74%** |\n| Features Selected | **9 / 30** |\n| Feature Reduction | **70%** |\n| F1-Score | 0.958 |\n| Precision | 0.971 |\n| Recall | 0.944 |\n| CV Training Accuracy | 96.05% |\n\n**Evolved classifier configuration:**\n- Algorithm: RandomForestClassifier\n- n_estimators: 132\n- max_depth: 24\n- min_samples_split: 3\n- min_samples_leaf: 4\n- max_features: 0.57\n\n### What the Pareto Front Shows\n\nThe front maps every discovered (feature count, accuracy) trade-off point from the minimal viable subset to the full feature set. A practitioner deploying to a resource-constrained environment can choose a point further left on the front. A practitioner where accuracy is critical can choose a point further right. 
Both choices are informed by the same optimization run.\n\n---\n\n## Baseline Comparison\n\n| Method | Test Accuracy | Features | Notes |\n|---|---|---|---|\n| Grid Search RF | 94.74% | 30 | Exhaustive hyperparameter search, all features |\n| RF Importance + RF | 94.74% | 30 | Importance-ranked, no feature reduction |\n| RFE + RandomForest | 94.74% | 7 | Single solution, no trade-off curve |\n| PCA + RandomForest | 93.86% | 10 components | Components, not original features - loses interpretability |\n| Random Search RF | 91.23% | 30 | Suboptimal hyperparameters |\n| **NSGA-II (this work)** | **94.74%** | **9** | **Full Pareto front, joint feature + hyperparameter optimization** |\n\nNSGA-II matches the best baselines on accuracy while returning not a single solution but a complete trade-off curve - something no single-objective method can produce.\n\n---\n\n## Repository Structure\n\n```\nmulti-objective-feature-selection/\n|\n+-- run_experiment.py   # Main NSGA-II implementation + experiment runner\n+-- requirements.txt\n+-- LICENSE\n+-- README.md\n```\n\n---\n\n## Installation\n\n```bash\ngit clone https://github.com/royxlead/multi-objective-feature-selection.git\ncd multi-objective-feature-selection\n\npip install -r requirements.txt\n```\n\n**Core dependencies:** scikit-learn · DEAP · NumPy · Matplotlib\n\n---\n\n## Usage\n\n```bash\npython run_experiment.py\n```\n\nThe script loads the Wisconsin Breast Cancer dataset, runs NSGA-II for 30 generations across a population of 64, evaluates all Pareto-optimal solutions on the held-out test set, and plots the final Pareto front in objective space (accuracy vs. feature count).\n\nTo swap the dataset, replace the data loading block with any binary classification dataset in sklearn-compatible format (X, y arrays). 
The optimizer is dataset-agnostic.\n\n---\n\n## Research Context\n\nMulti-objective feature selection is an active research area in medical ML, where interpretability requirements structurally conflict with accuracy maximization. Regulatory frameworks (EU AI Act, FDA guidance on clinical decision support) increasingly require explainability, which directly rewards feature economy.\n\nNSGA-II (Deb et al., 2002) is one of the most cited multi-objective evolutionary algorithms. It drives convergence toward the Pareto front through elitist non-dominated sorting and preserves diversity along the front via crowding distance. This implementation extends the standard NSGA-II formulation by jointly evolving feature masks and classifier hyperparameters in a single optimization loop - avoiding the two-stage bias introduced by selecting features first and tuning hyperparameters second.\n\n**Connection to related work:** The unsupervised confidence metric in [Unsupervised Confidence Estimation](https://github.com/royxlead/unsupervised-confidence-estimation) and the drift monitoring in [Production Drift Detection](https://github.com/royxlead/production-drift-detection) both operate on model outputs. Feature selection determines what goes into the model. 
These three projects form a coherent pipeline: select features carefully, monitor for drift, and quantify uncertainty in deployment.\n\n---\n\n## Citation\n\n```bibtex\n@software{roy2025nsga2featureselection,\n  author = {Roy, Sourav},\n  title  = {multi-objective-feature-selection: NSGA-II for Medical Tabular Data},\n  year   = {2026},\n  url    = {https://github.com/royxlead/multi-objective-feature-selection}\n}\n```\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003csub\u003eBuilt by \u003ca href=\"https://github.com/royxlead\"\u003eSourav Roy\u003c/a\u003e · Founding AI/ML Engineer · Yuga AI\u003c/sub\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froyxlead%2Fmulti-objective-feature-selection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Froyxlead%2Fmulti-objective-feature-selection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Froyxlead%2Fmulti-objective-feature-selection/lists"}