Integrating multimodal data through heterogeneous ensembles
https://github.com/gauravpandeylab/ensemble_integration
- Host: GitHub
- URL: https://github.com/gauravpandeylab/ensemble_integration
- Owner: GauravPandeyLab
- License: other
- Created: 2019-07-19T05:24:23.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2023-12-27T21:26:20.000Z (over 1 year ago)
- Last Synced: 2023-12-27T22:33:39.569Z (over 1 year ago)
- Topics: dataintegration, model-interpretation, multimodal, protein-function-prediction
- Language: Python
- Homepage:
- Size: 44.3 MB
- Stars: 0
- Watchers: 2
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: license.md
# Ensemble Integration (EI): Integrating multimodal data through interpretable heterogeneous ensembles
The latest version of EI, written entirely in Python, is implemented [here](https://github.com/GauravPandeyLab/eipy); you can also install it via `pip install ensemble-integration` (see the [full documentation](https://eipy.readthedocs.io/en/latest/)).
Ensemble Integration (EI) is a customizable pipeline for generating diverse ensembles of heterogeneous classifiers, along with the accompanying metadata needed by ensemble learning approaches that exploit ensemble diversity for improved performance. It also fairly evaluates the performance of several ensemble learning methods, including ensemble selection [Caruana2004] and stacked generalization (stacking) [Wolpert1992]. Though other tools exist, we are unaware of a similarly modular, scalable pipeline designed for large-scale ensemble learning. EI was developed to support research by Yan Chak Li, Linhua Wang, and Gaurav Pandey.
EI is designed for generating extremely large ensembles (which can take days or weeks to build) and thus consists of an initial data generation phase tuned for multicore and distributed computing environments. The output of this phase is a set of compressed CSV files containing the class distribution produced by each classifier, which serves as input to the subsequent ensemble learning phase.
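To illustrate how the two phases are decoupled, here is a minimal sketch that inspects one such intermediate prediction file with pandas. The file name and column names (`classifier`, `prediction`) are assumptions made for illustration, not EI's actual output schema.

```python
# Minimal sketch: peek at a hypothetical compressed prediction file produced by
# the data generation phase. File name and column names are assumptions.
import pandas as pd

preds = pd.read_csv("predictions-fold1.csv.gz")  # pandas decompresses .gz transparently

# Each row is assumed to hold one base classifier's class probability for one sample.
print(preds.head())
print(preds.groupby("classifier")["prediction"].describe())
```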
More details of EI can be found in our [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2020.05.29.123497v3):
Full citation:
Yan Chak Li, Linhua Wang, Jeffrey N Law, T M Murali, Gaurav Pandey, Integrating multimodal data through interpretable heterogeneous ensembles, Bioinformatics Advances, Volume 2, Issue 1, 2022, vbac065, https://doi.org/10.1093/bioadv/vbac065
This repository is protected by [CC BY-NC 4.0](https://github.com/GauravPandeyLab/ensemble_integration/blob/master/license.md).
## Configurations
### Install Java and Groovy
This can be done using sdkman (https://sdkman.io/).
### Install Python libraries:
python==3.7.4
scikit-learn==0.22
xgboost==1.2.0
numpy==1.19.5
pandas==0.25.3
argparse==1.1
scipy==1.3.1
### Download weka.jar from GitHub or the link below:
curl -O -L https://prdownloads.sourceforge.net/weka/weka-3-8-5-azul-zulu-linux.zip
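If you prefer to stay in Python rather than unzip manually, here is a minimal sketch that extracts `weka.jar` from the downloaded archive; the jar's location inside the zip is an assumption, so the code searches for it instead of hard-coding a path.

```python
# Minimal sketch: locate and extract weka.jar from the downloaded Weka archive.
# The internal layout of the zip is an assumption, so we search for the jar.
import zipfile
from pathlib import Path

with zipfile.ZipFile("weka-3-8-5-azul-zulu-linux.zip") as zf:
    jars = [m for m in zf.namelist() if m.endswith("weka.jar")]
    if not jars:
        raise FileNotFoundError("weka.jar not found inside the archive")
    zf.extract(jars[0], path=".")

# Move the jar to ./weka.jar, the default expected by train_base.py (--classpath).
Path(jars[0]).rename("weka.jar")
```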
## Data
Under the data path, 2 files and a list of feature folders are expected:
1. classifiers.txt
This file specifies the list of base classifiers. See sample_data/classifiers.txt for an example.
2. weka.properties
This file specifies the list of Weka properties that are passed to the training/testing of the base classifiers. See sample_data/weka.properties for an example.
3. Folders for feature sets
These are folders under the main data path. Each of them initially contains only one file, named data.arff. The .arff files are the input feature matrices and labels for training/testing the Weka base classifiers. The indices and labels of the .arff files should be aligned across all feature sets. `sample_folder` in this repository is provided as a reference example (a quick layout check is sketched below).
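For convenience, here is a minimal sketch that checks whether a data directory follows the layout described above; the folder name `sample_data` in the usage line is an assumption, so point it at your own data path.

```python
# Minimal sketch: verify that a data directory follows the expected EI layout
# (classifiers.txt, weka.properties, and one data.arff per feature-set folder).
# This is a convenience check only, not part of the EI pipeline itself.
from pathlib import Path

def check_data_path(data_path):
    data_path = Path(data_path)
    for required in ("classifiers.txt", "weka.properties"):
        if not (data_path / required).is_file():
            raise FileNotFoundError(f"Missing {required} in {data_path}")
    feature_dirs = [d for d in data_path.iterdir() if d.is_dir()]
    if not feature_dirs:
        raise FileNotFoundError(f"No feature-set folders found in {data_path}")
    for d in feature_dirs:
        if not (d / "data.arff").is_file():
            raise FileNotFoundError(f"Missing data.arff in {d}")
    print(f"{data_path}: {len(feature_dirs)} feature set(s) look OK")

check_data_path("sample_data")  # folder name is an assumption; use your data path
```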
## Sample data
We uploaded the sample data used in the paper to [Zenodo](https://doi.org/10.5281/zenodo.6972512).
The compressed zip file `PFP.zip` contains the input data used for EI.
For PFP, since the raw data is very large (around 2139 * 2 GB), we uploaded five sample GO terms that have already been transformed into the EI input format. The remaining terms can be generated from the STRING DB files (`PFP/STRING_csv`) and the GO annotation file (`GO_annotation.tsv`) using `generate_data.py`.
For example, you may generate the input data for predicting `GO:0000166` by the following command:
python processing_scripts/generate_data.py --outcome GO:0000166
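If you need inputs for many GO terms, here is a minimal sketch that loops over a list of term IDs and invokes the same script; the term list is a placeholder, and only the `--outcome` argument shown above is assumed.

```python
# Minimal sketch: generate EI input data for several GO terms by calling
# processing_scripts/generate_data.py once per term. The term list below is a
# placeholder; only the --outcome argument shown above is assumed.
import subprocess

go_terms = ["GO:0000166", "GO:0003677"]  # replace with the terms you need
for term in go_terms:
    subprocess.run(
        ["python", "processing_scripts/generate_data.py", "--outcome", term],
        check=True,
    )
```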
Due to IRB constraints, we are currently unable to publicly share the COVID-19 EHR dataset used in our study. However, we have shared the model built on that dataset in `covid19-model-built.zip`, which can be loaded using `load_models.py` [(more details here)](#saving-and-loading-ei-models).
## Evaluation and model selection of EI models by nested CV
### Train base classifiers
Arguments of train_base.py:
--path, -P: Path of the multimodal data
--queue, -Q: LSF queue to submit the job
--node, -N: number of nodes requested from the HPC
--time, -T: number of hours requested from the HPC
--memory, -M: memory requested from the HPC, in MB
--classpath, -CP: path of 'weka.jar' (default: './weka.jar')
--hpc: whether to use an HPC cluster
--fold, -F: number of cross-validation folds
Option 1: Without access to Minerva, EI can be run sequentially:
python train_base.py --path [data path] --hpc False
Option 2: Run the pipeline in parallel on Minerva HPC
python train_base.py --path [data path] --node [#node] --queue [queue] --time [hour:min] --memory [memory]
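For example, a sequential run on the sample data shipped with this repository could look like the following; the folder name and the fold count are illustrative only, so adjust them to your setup:
python train_base.py --path sample_data --hpc False --fold 5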
### Train and evaluate EI
Arguments of ensemble.py:
--path, -P: Path of the multimodal data
--fold, -F: number of cross-validation folds
Run the following command:
python ensemble.py --path [data path]
F-max scores of these models will be printed and written to the `performance.csv` file, which is saved in the `analysis` folder under the data path.
The prediction scores of the ensemble methods will be saved in the `predictions.csv` file in the same `analysis` folder.
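Here is a minimal sketch for picking the best-performing ensemble from this output; the column names (`method`, `fmax`) and the data path are assumptions, so adjust them to the actual header of `performance.csv`.

```python
# Minimal sketch: load the nested-CV results written by ensemble.py and pick
# the ensemble with the highest F-max. Column names ("method", "fmax") and the
# data path are assumptions; adjust them to the actual performance.csv header.
import pandas as pd

data_path = "sample_data"  # illustrative; use your own data path
perf = pd.read_csv(f"{data_path}/analysis/performance.csv")

print(perf.sort_values("fmax", ascending=False).head())
best = perf.loc[perf["fmax"].idxmax(), "method"]
print(f"Best ensemble by F-max: {best}")  # e.g. S.LR, CES, Mean
```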
## Model interpretation by EI
Similar to the step above, we run `train_base.py` and `ensemble.py` again, this time with the option `--rank True`, to train EI on the whole dataset. All of these results are written to the `path/model_built` folder.
We first generate the local feature ranks (LFR) as follows:
python train_base.py --path [path] --rank True
This step generates a new folder, `feature_rank`, under the data path, which contains the dataset merged with a pseudo test set used only for interpretation purposes.
From the `path/analysis/performance.csv` file generated earlier (with `--rank=False`), we can determine the performance of the ensembles under the nested-CV setup. We suggest using the best-performing ensemble for EI, e.g. `S.LR`, `CES`, `Mean`, etc. We then generate the local model ranks (LMR) as follows:
python ensemble.py --path [path] --rank True --ens [ensemble algorithm]
After these two steps, which compute the LFR and LMR, we run the ensemble feature ranking as follows:
python ensemble_ranking.py --path [path] --ens [ensemble algorithm]
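For intuition only, the sketch below shows one simple, rank-product-style way that local feature ranks (LFR) and local model ranks (LMR) could be combined into an overall feature ranking. It is an illustration of the general idea with made-up toy tables and column names; `ensemble_ranking.py` implements the actual procedure.

```python
# Illustration only: a toy, rank-product-style combination of local feature
# ranks (LFR) and local model ranks (LMR). The DataFrames and column names are
# made up; ensemble_ranking.py implements the actual EI procedure.
import pandas as pd

# LFR: rank of each feature within each local model (smaller = more important)
lfr = pd.DataFrame({
    "model": ["m1", "m1", "m2", "m2"],
    "feature": ["f1", "f2", "f1", "f3"],
    "feature_rank": [1, 2, 2, 1],
})
# LMR: rank of each local model within the chosen ensemble
lmr = pd.DataFrame({"model": ["m1", "m2"], "model_rank": [1, 2]})

merged = lfr.merge(lmr, on="model")
merged["combined"] = merged["feature_rank"] * merged["model_rank"]
ranking = merged.groupby("feature")["combined"].min().sort_values()
print(ranking)  # smaller = more important under this toy aggregation
```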
## Saving and loading EI models
### Saving local & ensemble models of EI
We may save both the local models and the EI models for further inference by setting `--writeModel True` for both `train_base.py` and `ensemble.py`.
Local models are saved by:
python train_base.py --path [path] --writeModel True
By default, the following command saves all the ensemble models of EI. We may save only a specific ensemble model (e.g. the best-performing ensemble for EI) by specifying the `--ens` option:
python ensemble.py --path [path] --writeModel True --ens [ensemble algorithm, default: all ensemble algorithms]
To load the local models and make base predictions on a new dataset (here `model_path` should be `path/model_built`):
python load_models.py --data_path [new dataset path] --model_path [model path] --local_predictor True
We suggest using the best-performing ensemble for EI (e.g. `S.LR`, `CES`, `Mean`, etc.), as determined from the nested-CV setup. After obtaining the base predictions for the new dataset, we can use the saved ensemble model to perform integrative prediction:
python load_models.py --data_path [new dataset path] --model_path [model path] --ens [ensemble model]
After this step, a `prediction_scores.csv` file containing predictions for the new dataset is generated in the `data_path/analysis` folder.
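Here is a minimal sketch for turning these scores into class labels; the score column name (`prediction`) and the 0.5 cutoff are assumptions, so adjust them to the actual file and your application.

```python
# Minimal sketch: read the scores written by load_models.py and threshold them
# into class labels. The column name ("prediction") and the 0.5 cutoff are
# assumptions; adjust them to the actual prediction_scores.csv and your needs.
import pandas as pd

scores = pd.read_csv("new_data/analysis/prediction_scores.csv")  # illustrative path
scores["predicted_label"] = (scores["prediction"] >= 0.5).astype(int)
print(scores.head())
```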
## More information about the implementation of EI
We used 10 standard binary classification algorithms, such as support vector machine (SVM), random forest (RF) and logistic regression (LR), as implemented in Weka, to derive local predictive models from each individual data modality. The base classifiers included in `classifiers.txt` and used in `train_base.py` are listed below.
| Base Classifier Name | Weka Class Name |
|-----------------|-----------------|
|AdaBoost | weka.classifiers.meta.AdaBoostM1 |
| Decision Tree | weka.classifiers.trees.J48 |
| Gradient Boosting | weka.classifiers.meta.LogitBoost |
| K-nearest Neighbors | weka.classifiers.lazy.IBk |
| Logistic Regression | weka.classifiers.functions.Logistic -M 100 |
| Voted Perceptron | weka.classifiers.functions.VotedPerceptron |
| Naive Bayes | weka.classifiers.bayes.NaiveBayes |
| Random Forest | weka.classifiers.trees.RandomForest |
| Support Vector Machine | weka.classifiers.functions.SMO -C 1.0E-3 |
| Rule-based classification | weka.classifiers.rules.PART |

We then applied mean aggregation, the ensemble selection method, and stacking to these local models to generate the final EI model.
Here are the meta-classifiers used for stacking in `ensemble.py`.
| Meta-classifier Name |Python Class Name|Short Name|
|----------------------|-----------------|----------|
| AdaBoost | sklearn.ensemble.AdaBoostClassifier | S.AB |
| Decision Tree | sklearn.tree.DecisionTreeClassifier | S.DT |
| Gradient Boosting | sklearn.ensemble.GradientBoostingClassifier | S.GB |
| K-nearest Neighbors | sklearn.neighbors.KNeighborsClassifier| S.KNN |
| Logistic Regression | sklearn.linear_model.LogisticRegression | S.LR |
| Naive Bayes | sklearn.naive_bayes.GaussianNB| S.NB |
| Random Forest | sklearn.ensemble.RandomForestClassifier | S.RF|
| Support Vector Machine | sklearn.svm.SVC(kernel='linear')| S.SVM |
| XGBoost | xgboost.XGBClassifier | S.XGB |
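To make the ensemble step concrete, here is a minimal, self-contained sketch of mean aggregation and stacking with logistic regression (the `S.LR` meta-classifier above) over base-classifier predictions. The synthetic data and the scikit-learn base models stand in for EI's Weka-based local models, so this illustrates the idea rather than reproducing the actual pipeline.

```python
# Illustration only: mean aggregation and stacking over base-classifier
# predictions, with logistic regression ("S.LR") as the meta-classifier.
# Synthetic data and scikit-learn base models stand in for EI's Weka models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = {
    "RF": RandomForestClassifier(random_state=0),
    "NB": GaussianNB(),
}

# Out-of-fold predictions on the training set become the meta-level features,
# so the meta-classifier is never trained on predictions that saw their labels.
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models.values()
])
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models.values()
])

# Mean aggregation: average the base-classifier probabilities.
print("Mean aggregation AUC:", roc_auc_score(y_te, meta_test.mean(axis=1)))

# Stacking: train the S.LR meta-classifier on the out-of-fold base predictions.
stacker = LogisticRegression().fit(meta_train, y_tr)
print("Stacking (S.LR) AUC:", roc_auc_score(y_te, stacker.predict_proba(meta_test)[:, 1]))
```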