{"id":15912331,"url":"https://github.com/ulf1/maxjoshua","last_synced_at":"2025-07-09T23:33:07.227Z","repository":{"id":34988278,"uuid":"194404179","full_name":"ulf1/maxjoshua","owner":"ulf1","description":"Feature selection for hard voting classifier and NN sparse weight initialization.","archived":false,"fork":false,"pushed_at":"2023-08-12T19:48:45.000Z","size":841,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-12T11:06:53.566Z","etag":null,"topics":["binary-classification","bootstrapping","ensemble-learning","feature-selection","hard-voting-classifier","neural-network-initialization","python","sparse-neural-network"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ulf1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["ulf1"]}},"created_at":"2019-06-29T12:37:47.000Z","updated_at":"2024-09-18T09:33:01.000Z","dependencies_parsed_at":"2024-10-28T14:19:51.820Z","dependency_job_id":null,"html_url":"https://github.com/ulf1/maxjoshua","commit_stats":{"total_commits":53,"total_committers":1,"mean_commits":53.0,"dds":0.0,"last_synced_commit":"5c48dee98cce2d9b966ac6d7e0d18913698305b9"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/ulf1/maxjoshua","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ulf1%2Fmaxjoshua","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ulf1%2Fmaxj
oshua/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ulf1%2Fmaxjoshua/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ulf1%2Fmaxjoshua/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ulf1","download_url":"https://codeload.github.com/ulf1/maxjoshua/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ulf1%2Fmaxjoshua/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264505262,"owners_count":23618911,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["binary-classification","bootstrapping","ensemble-learning","feature-selection","hard-voting-classifier","neural-network-initialization","python","sparse-neural-network"],"created_at":"2024-10-06T16:04:03.939Z","updated_at":"2025-07-09T23:33:07.204Z","avatar_url":"https://github.com/ulf1.png","language":"Python","readme":"[![PyPI version](https://badge.fury.io/py/maxjoshua.svg)](https://badge.fury.io/py/maxjoshua)\n\n# maxjoshua\nFeature selection for hard voting classifier and NN sparse weight initialization.\n\n## Preface\nI am naming this software package in memory of my late nephew Max Joshua Hamster (* 2005 to † June 18, 2022).\n\n## Usage\n\n### Forward Selection for Hard Voting Classifier\nLoad toy data set and convert features to binary.\n```py\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.preprocessing import scale\nX = scale(load_breast_cancer().data, axis=0) \u003e 0  # convert to binary features\ny = 
load_breast_cancer().target\n```\n\nSelect binary features. Each row in the `results` list contains the `n_select` column indices of `X`, a flag indicating whether the binary features were negated, and the sum of absolute MCC correlation coefficients between the selected features.\n```py\nimport maxjoshua as mh\nidx, neg, rho, results = mh.binsel(\n    X, y, preselect=0.8, oob_score=True, subsample=0.5, \n    n_select=5, unique=True, n_draws=100, random_state=42)\n```\n\n**Algorithm**. \nThe task is to select, e.g., `n_select` features from a pool of many features.\nThese features might be the predictions of binary classifiers. \nThe selected features are then combined into one hard-voting classifier.\n\nA voting classifier should have the following properties:\n\n* each voter (a binary feature) should be highly correlated with the target variable\n* the selected features should be uncorrelated.\n\nThe algorithm works as follows:\n\n1. Generate multiple correlation matrices by bootstrapping. This includes computing `corr(X_i, X_j)` as well as `corr(Y, X_i)`. Also store the OOB samples for evaluation.\n2. For each correlation matrix do ...\n    a. Preselect the `i*` with the highest `abs(corr(Y, X_i))` estimates (e.g. pick the `n_pre=?` highest absolute correlations)\n    b. Slice a correlation matrix `corr(X_i*, X_j*)` and find the least correlated combination of `n_select` features (see [`korr.mincorr`](https://github.com/kmedian/korr/blob/master/korr/mincorr.py))\n    c. Compute the out-of-bag (OOB) performance (see step 1) of the hard voter with the selected `n_select=?` features\n3. Select the feature combination with the best OOB performance as the final model.\n\n\n### Forward Selection for Linear Regression\nLoad a toy dataset.\n```py\nfrom sklearn.preprocessing import scale\nfrom sklearn.datasets import fetch_california_housing\nhousing = fetch_california_housing()\nX = scale(housing[\"data\"], axis=0)\ny = scale(housing[\"target\"])\n```\n\nSelect real-numbered features. 
Each row in the `results` list contains the `n_select` column indices of `X`, the ridge regression coefficients `beta`, and the RMSE `loss`.\nWarning! Please note that the features `X` and the target `y` must be scaled because `mh.fltsel` uses an L2 penalty on the `beta` coefficients and doesn't use an intercept term to shift `y`.\n```py\nimport maxjoshua as mh\nfrom sklearn.preprocessing import scale\n\nidx, beta, loss, results = mh.fltsel(\n    scale(X), scale(y), preselect=0.8, oob_score=True, subsample=0.5, \n    n_select=5, unique=True, n_draws=100, random_state=42, l2=0.01)\n```\n\n\n### Initialize Sparse NN Layer\nThe idea is to run `mh.fltsel` to generate an ensemble of linear models and combine them into a sparse linear neural network layer, i.e., the number of output neurons is the ensemble size.\nIn the case of small datasets, the sparse NN layer is non-trainable because each submodel was already estimated and selected with two-way data splits in `mh.fltsel` (see `oob_scores` and `subsample`). 
\nThe sparse NN layer basically produces submodel predictions for the meta model in the next layer, i.e., a simple dense linear layer.\nThe inputs of the sparse NN layer must be normalized, for which a layer normalization layer is trained.\n\n```py\nimport maxjoshua as mh\nimport tensorflow as tf\nimport sklearn.preprocessing\n\n# create toy dataset\nimport sklearn.datasets\nX, y = sklearn.datasets.make_regression(\n    n_samples=1000, n_features=100, n_informative=20, n_targets=3)\n\n# feature selection\n# - always scale the inputs and targets -\nindices, values, num_in, num_out = mh.pretrain_submodels(\n    sklearn.preprocessing.scale(X), \n    sklearn.preprocessing.scale(y), \n    num_out=64, n_select=3)\n\n# specify the model\nmodel = tf.keras.models.Sequential([\n    # sub-models\n    mh.SparseLayerAsEnsemble(\n        num_in=num_in, \n        num_out=num_out, \n        sp_indices=indices, \n        sp_values=values,\n        sp_trainable=False,\n        norm_trainable=True,\n    ),\n    # meta model\n    tf.keras.layers.Dense(\n        units=3, use_bias=False,\n        # kernel_constraint=tf.keras.constraints.NonNeg()\n    ),\n    # scale up\n    mh.InverseTransformer(\n        units=3,\n        init_bias=y.mean(), \n        init_scale=y.std()\n    )\n])\nmodel.compile(\n    optimizer=tf.keras.optimizers.Adam(\n        learning_rate=3e-4, beta_1=.9, beta_2=.999, epsilon=1e-7, amsgrad=True),\n    loss='mean_squared_error'\n)\n\n# train\nhistory = model.fit(X, y, epochs=3)\n```\n\n\n## Appendix\n\n### Installation\nThe `maxjoshua` [git repo](http://github.com/ulf1/maxjoshua) is available as a [PyPI package](https://pypi.org/project/maxjoshua)\n\n```sh\npip install maxjoshua\n```\n\n### Install a virtual environment\n\n```sh\npython3.7 -m venv .venv\nsource .venv/bin/activate\npip install --upgrade pip\npip install -r requirements.txt\npip install -r requirements-dev.txt\npip install -r requirements-demo.txt\n```\n\n(If your git repo is stored in a folder with 
whitespace, then don't use the subfolder `.venv`. Use an absolute path without whitespace.)\n\n### Python commands\n\n* Jupyter for the examples: `jupyter lab`\n* Check syntax: `flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')`\n* Run unit tests: `pytest`\n\nPublish\n\n```sh\npandoc README.md --from markdown --to rst -s -o README.rst\npython setup.py sdist \ntwine upload -r pypi dist/*\n```\n\n### Clean up\n\n```sh\nfind . -type f -name \"*.pyc\" | xargs rm\nfind . -type d -name \"__pycache__\" | xargs rm -r\nrm -r .venv\n```\n\n## Support\nPlease [open an issue](https://github.com/ulf1/maxjoshua/issues/new) for support.\n\n\n## Contributing\nPlease contribute using [GitHub Flow](https://guides.github.com/introduction/flow/). Create a branch, add commits, and [open a pull request](https://github.com/ulf1/maxjoshua/compare/).\n","funding_links":["https://github.com/sponsors/ulf1"],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fulf1%2Fmaxjoshua","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fulf1%2Fmaxjoshua","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fulf1%2Fmaxjoshua/lists"}