Sparse High-order Interaction Model with Rejection option
https://github.com/tsudalab/shimr
## SHIMR (Sparse High-order Interaction Model with Rejection option)
SHIMR (https://peerj.com/articles/6543/) is a forward feature-selection method with simultaneous sample reduction: it iteratively searches the power set of features for higher-order feature interactions that maximize the classification gain.

Sample reduction is achieved by incorporating the notion of "classification with rejection option", which minimizes classification uncertainty, especially on noisy data. One potential application of this method is clinical diagnosis (or prognosis), where it can serve as a highly reliable computer-assisted diagnosis (CAD) model. Below, one can see that SHIMR identifies ambiguous, low-confidence zones (close to the decision boundary) and refrains from taking any decision (R: reject) for those data points (encircled). A high rejection rate (rr) corresponds to a high prediction probability for the classified samples and hence more reliable predictions.
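The rejection idea can be sketched with a generic probabilistic classifier: predictions whose confidence falls below a threshold are rejected rather than risked. This is an illustration only (a plain logistic regression on the scikit-learn breast-cancer data with an arbitrary 0.8 cutoff), not SHIMR's actual LP-based formulation:

```python
# Illustrative "classification with rejection": reject low-confidence
# predictions instead of risking them. Classifier and threshold are
# stand-ins, not SHIMR's actual method.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
conf = clf.predict_proba(X_te).max(axis=1)  # confidence of predicted class

threshold = 0.8                 # illustrative confidence cutoff
reject = conf < threshold       # low-confidence zone -> reject (R)
pred = clf.predict(X_te)

rr = reject.mean()              # rejection rate
acc_kept = (pred[~reject] == y_te[~reject]).mean()
print(f"rejection rate: {rr:.2f}, accuracy on accepted samples: {acc_kept:.2f}")
```

Raising the threshold rejects more borderline samples and typically raises the accuracy on the accepted ones, which is the trade-off the rr parameter captures above.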

Our visualization module complements SHIMR by generating a simple and easily comprehensible visual representation of the model SHIMR produces. For more details, please refer to our paper published in PeerJ (https://peerj.com/articles/6543/).

Below is a visualization of SHIMR applied to the "Breast Cancer Wisconsin (Diagnostic) Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). Our visualization module clearly represents the weighted combination of simple rules that makes up the classification model generated by SHIMR.

### Comparing SHIMR with CORELS on the ProPublica dataset
SHIMR results on ProPublica data (without rejection):

```
======= Training Results =======
d=0.5
No of rules selected = 8
correctly_classified: 4375, misclassified: 2114, rejected: 0
TP: 1618, TN: 2757, FP: 826, FN: 1288, SN/RC: 0.56, PR: 0.66, SP: 0.77
roc_auc: 0.68
area_pr: 0.59
accuracy: 0.67
rejection rate: 0.0

======= Testing Results =======
d=0.5
No of rules selected = 8
correctly_classified: 499, misclassified: 222, rejected: 0
TP: 210, TN: 289, FP: 88, FN: 134, SN/RC: 0.61, PR: 0.7, SP: 0.77
roc_auc: 0.71
area_pr: 0.63
accuracy: 0.69
rejection rate: 0.0
```

SHIMR results on ProPublica data (with rejection):

```
======= Training Results =======
d=0.45
No of rules selected = 17
correctly_classified: 3973, misclassified: 1742, rejected: 774
TP: 1448, TN: 2525, FP: 667, FN: 1075, SN/RC: 0.57, PR: 0.68, SP: 0.79
roc_auc: 0.68
area_pr: 0.58
accuracy: 0.7
rejection rate: 0.12

======= Testing Results =======
d=0.45
No of rules selected = 17
correctly_classified: 458, misclassified: 172, rejected: 91
TP: 187, TN: 271, FP: 68, FN: 104, SN/RC: 0.64, PR: 0.73, SP: 0.8
roc_auc: 0.72
area_pr: 0.64
accuracy: 0.73
rejection rate: 0.13
```

CORELS results on ProPublica data:

```
SN_tr = 0.49, SP_tr = 0.78, accuracy_tr = 0.65
SN_te = 0.55, SP_te = 0.8, accuracy_te = 0.68
```
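As a sanity check, the SN/RC, PR, SP, and accuracy figures in the training block above follow directly from the reported TP/TN/FP/FN counts:

```python
# Recompute the no-rejection training metrics from the confusion-matrix
# counts above: SN/RC is sensitivity (recall), PR precision, SP specificity.
TP, TN, FP, FN = 1618, 2757, 826, 1288

SN = TP / (TP + FN)                        # sensitivity / recall
PR = TP / (TP + FP)                        # precision
SP = TN / (TN + FP)                        # specificity
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(round(SN, 2), round(PR, 2), round(SP, 2), round(accuracy, 2))
# -> 0.56 0.66 0.77 0.67, matching the training block above
```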

To reproduce the SHIMR results on the ProPublica dataset, run the "main_ProPublica.py" script:

```
$ python main_ProPublica.py
```

To reproduce the CORELS results on the ProPublica dataset, first install CORELS locally as instructed at https://github.com/nlarusstone/corels. Then copy the "get_Test_Scores_CORELS.py" script into the "src" folder of CORELS and run the following command from inside the "src" folder:

```
$ python get_Test_Scores_CORELS.py
```

## Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
## Prerequisites
"SHIMR" has the following two dependencies:

1) CPLEX Optimizer

2) Linear Time Closed Itemset Miner (LCM v5.3)

    Coded by Takeaki Uno, e-mail: [email protected],
    homepage: http://research.nii.ac.jp/~uno/code/lcm.html

Apart from that, the current implementation is in Python and has been tested with the following setup:

1) Python 3.4.5

2) scikit-learn==0.19.1

3) scipy==1.0.0

4) numpy==1.14.1

5) pandas==0.22.0

6) matplotlib==2.0.0
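If your checkout is missing the repository's requirements.txt (used in the installation steps below), the pinned versions above translate to a file along these lines (this mirrors the list above rather than reproducing the repository's file):

```
scikit-learn==0.19.1
scipy==1.0.0
numpy==1.14.1
pandas==0.22.0
matplotlib==2.0.0
```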

### Download

"IBM ILOG CPLEX Optimization Studio" from https://www-01.ibm.com/software/websphere/products/optimization/cplex-studio-community-edition/

"LCM ver. 5.3" from http://research.nii.ac.jp/~uno/codes.htm

## Installing
Step-by-step instructions to get a working copy of "SHIMR" in your own development environment.

A. Create a virtual environment

Download "anaconda" from https://www.continuum.io/downloads

1) Install Anaconda

```
$ bash Anaconda-latest-Linux-x86_64.sh (Linux) or
$ bash Anaconda-latest-MacOSX-x86_64.sh (Mac)
```

2) Activate anaconda environment

```
$ source anaconda/bin/activate anaconda/
```

3) Create a new environment and activate it

```
$ conda create -n r_boost python=3.4.5
$ source activate r_boost
$ pip install -r requirements.txt
```

B. Install "IBM ILOG CPLEX Optimization Studio"

1) Download the "cplex_studioXXX.linux-x86.bin" (Linux) or "cplex_studioXXX.osx.bin" (Mac) file

Make sure the .bin file is executable. If necessary, change its permission using the chmod command from the directory where the .bin is located:

```
$ chmod +x cplex_studioXXX.linux-x86.bin
```

2) Enter the following command to start the installation process:

```
$ ./cplex_studioXXX.linux-x86.bin
```

3) Provide the following installation path:

```
/home/user/ibm/ILOG/CPLEX_StudioXXX
```
4) Change directory to the CPLEX installation path

```
$ cd /home/username/ibm2/ILOG/CPLEX_StudioXXX/cplex/python/3.4/x86-64_linux (Linux) or
$ cd /Users/username/Applications/IBM/ILOG/CPLEX_StudioXXX/cplex/python/3.4/x86-64_osx/ (Mac)
```

5) Install the Python version of CPLEX
```
$ python setup.py install
```

C. Install "LCM ver. 5.3"
```
$ unzip lcm53.zip
$ cd lcm53
$ make
```

## Running the tests
To test SHIMR we included the "Breast Cancer Wisconsin (Diagnostic) Data Set" from the UCI Machine Learning Repository under the Data folder. Please run 'code/main_WDBC.ipynb' interactively to see the sparse high-order feature interactions generated by our visualization module. SHIMR can also be tested from the command line by running 'main.py'. Run it with the help flag [-h] to check the arguments SHIMR requires:

```
python main.py -h
```

```
usage: main.py [-h] [-d D] [-n_bins N_BINS] [-c_pos C_POS] [-c_neg C_NEG]
               [-size_u SIZE_U] [-r] [-v] [-pd] [-pa]
               f_data

Usage of SHIMR

positional arguments:
  f_data          File path of input data to SHIMR. File format should be
                  ".npy". The file should contain data in the format
                  "[data_train, data_test, Feature_dict, class_labels_dict]".
                  Feature_dict is an ordered dictionary
                  (collections.OrderedDict()) that provides a short name of a
                  feature (key) if it has a long name (value). A typical
                  example is wdbc_dict["Rad_M"] = "Radius Mean".
                  class_labels_dict is a class labels dictionary. A typical
                  example is class_labels_dict = {-1: "Benign",
                  +1: "Malignant", 0: "Rejected"}.

optional arguments:
  -h, --help      show this help message and exit
  -d D            Set rejection cost
  -n_bins N_BINS  Set number of bins
  -c_pos C_POS    Set regularization parameter value for positive class
  -c_neg C_NEG    Set regularization parameter value for negative class
  -size_u SIZE_U  Set the order of feature interaction
  -r              Apply the rejection option
  -v              Generate visualization
  -pd             Display the plot (default: file saved)
  -pa             Generate visualization for all subjects
```
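The ".npy" payload described in the help text can be assembled along these lines. This is a sketch: the file name, feature names, and the internal layout of data_train/data_test are illustrative assumptions, so consult 'code/main_WDBC.ipynb' for the authoritative format:

```python
# Sketch of assembling the ".npy" input for main.py. The internal layout of
# data_train/data_test below is an assumption for illustration only.
import collections
import numpy as np

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 2)), rng.choice([-1, 1], size=100)
X_test, y_test = rng.normal(size=(30, 2)), rng.choice([-1, 1], size=30)
data_train = [X_train, y_train]   # assumed layout: (features, labels)
data_test = [X_test, y_test]

# Short feature name (key) -> long name (value), as an OrderedDict.
Feature_dict = collections.OrderedDict(
    [("Rad_M", "Radius Mean"), ("Tex_M", "Texture Mean")])
class_labels_dict = {-1: "Benign", +1: "Malignant", 0: "Rejected"}

# Pack the four pieces into a 1-D object array and save it.
payload = np.empty(4, dtype=object)
for i, item in enumerate(
        [data_train, data_test, Feature_dict, class_labels_dict]):
    payload[i] = item
np.save("my_data.npy", payload)   # then: python main.py my_data.npy
```

Note that reading such a file back requires np.load(..., allow_pickle=True), since the array stores Python objects.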

## Visualization module
Our visualization module was inspired by "UpSet: Visualizing Intersecting Sets" (http://caleydo.org/tools/upset/) and its Python implementation (https://github.com/ImSoErgodic/py-upset).

## Published Article
"An interpretable machine learning model for diagnosis of Alzheimer's disease" (https://peerj.com/articles/6543/).