https://github.com/qinhanmin2014/sklearn-svm-guide
Rapidly obtain acceptable results using SVM (based on scikit-learn)
- Host: GitHub
- URL: https://github.com/qinhanmin2014/sklearn-svm-guide
- Owner: qinhanmin2014
- License: mit
- Created: 2019-07-10T13:32:31.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-07-23T10:15:17.000Z (over 5 years ago)
- Last Synced: 2024-10-28T17:43:55.400Z (3 months ago)
- Topics: machine-learning, scikit-learn, svm
- Language: Terra
- Size: 33.7 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# sklearn-svm-guide
Rapidly obtain acceptable results using SVM (based on scikit-learn)

## common procedure
- Conduct simple scaling on the data
  * sklearn.preprocessing.MinMaxScaler/StandardScaler
- Consider the RBF kernel
  * sklearn.svm.SVC default
- Use cross-validation to find the best parameter C and gamma
  * sklearn.model_selection.GridSearchCV

## common example
```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_train, y_train = load_digits(return_X_y=True)

# Step 1: scale every feature to [-1, 1]
sc = MinMaxScaler(feature_range=(-1, 1))
Xt_train = sc.fit_transform(X_train)

# Steps 2 and 3: RBF kernel (the SVC default), grid search over C and gamma
params = {"C": np.logspace(-5, 15, num=11, base=2),
          "gamma": np.logspace(3, -15, num=10, base=2)}
clf = GridSearchCV(SVC(), params, n_jobs=-1)

# Nested cross-validation: the outer folds score the tuned model
scores = cross_val_score(clf, Xt_train, y_train)
print(np.mean(scores), "+/-", np.std(scores))
```

## experiment A: Examples of the Proposed Procedure
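Each dataset in this experiment is scored under the same handful of configurations. The sketch below shows one way such a comparison could be set up. It assumes `load_wine` as the data, 5-fold cross-validation, and `gamma="auto"` as the stand-in for the libsvm default and the old scikit-learn default. The `configs` and `results` names are illustrative, and the notebooks may organize this differently.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Assumption: gamma="auto" (1 / n_features) mirrors the libsvm default and the
# old scikit-learn default; gamma="scale" is the newer scikit-learn default.
configs = {
    "old default": SVC(gamma="auto"),
    "new default": SVC(gamma="scale"),
    "MinMaxScaler": make_pipeline(MinMaxScaler(feature_range=(-1, 1)),
                                  SVC(gamma="auto")),
    "StandardScaler": make_pipeline(StandardScaler(), SVC(gamma="auto")),
}

# Putting the scaler inside the pipeline refits it on each training fold
results = {}
for name, model in configs.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.4f}")
```

Because the scaler is refitted per fold here, the printed accuracies need not match the figures listed for the Wine dataset exactly.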
- Dataset 1: Astroparticle (from the reference)
  * [jupyter notebook](https://nbviewer.jupyter.org/github/qinhanmin2014/sklearn-svm-guide/blob/master/A1_Astroparticle_Physics.ipynb)
  * evaluate using test set accuracy
  * default in libsvm and old default in scikit-learn: 66.93% (66.93% in the reference)
  * new default in scikit-learn: 96.25%
  * scale with MinMaxScaler: 96.15% (96.15% in the reference)
  * **scale with MinMaxScaler & tune the parameters: 96.93% (96.87% in the reference)**
  * scale with StandardScaler: 96.80%
  * scale with StandardScaler & tune the parameters: 96.68%
- Dataset 2: Bioinformatics (from the reference)
  * [jupyter notebook](https://nbviewer.jupyter.org/github/qinhanmin2014/sklearn-svm-guide/blob/master/A2_Bioinformatics.ipynb)
  * evaluate using cross validation accuracy
  * default in libsvm and old default in scikit-learn: 56.53% (56.52% in the reference)
  * new default in scikit-learn: 81.87%
  * scale with MinMaxScaler: 78.27% (78.52% in the reference)
  * **scale with MinMaxScaler & tune the parameters: 84.71% (85.17% in the reference)**
  * scale with StandardScaler: 56.53%
  * scale with StandardScaler & tune the parameters: 84.15%
- Dataset 3: Vehicle (from the reference)
  * [jupyter notebook](https://nbviewer.jupyter.org/github/qinhanmin2014/sklearn-svm-guide/blob/master/A3_Vehicle.ipynb)
  * evaluate using test set accuracy
  * default in libsvm and old default in scikit-learn: 2.44% (2.44% in the reference)
  * new default in scikit-learn: 36.59%
  * scale with MinMaxScaler: 12.20% (12.20% in the reference)
  * **scale with MinMaxScaler & tune the parameters: 80.49% (87.80% in the reference)**
  * scale with StandardScaler: 65.85%
  * scale with StandardScaler & tune the parameters: 78.05%
- Dataset 4: Breast Cancer (from sklearn.datasets.load_breast_cancer)
  * [jupyter notebook](https://nbviewer.jupyter.org/github/qinhanmin2014/sklearn-svm-guide/blob/master/AX_Breast_Cancer.ipynb)
  * evaluate using cross validation accuracy
  * default in libsvm and old default in scikit-learn: 62.74%
  * new default in scikit-learn: 91.24%
  * scale with MinMaxScaler: 96.13%
  * scale with MinMaxScaler & tune the parameters: 97.54%
  * **scale with StandardScaler: 97.54%**
  * scale with StandardScaler & tune the parameters: 96.66%
- Dataset 5: Digits (from sklearn.datasets.load_digits)
  * [jupyter notebook](https://nbviewer.jupyter.org/github/qinhanmin2014/sklearn-svm-guide/blob/master/AX_Digits.ipynb)
  * evaluate using cross validation accuracy
  * default in libsvm and old default in scikit-learn: 44.88%
  * new default in scikit-learn: 96.38%
  * scale with MinMaxScaler: 95.72%
  * **scale with MinMaxScaler & tune the parameters: 97.33%**
  * scale with StandardScaler: 94.88%
  * scale with StandardScaler & tune the parameters: 94.77%
- Dataset 6: Wine (from sklearn.datasets.load_wine)
  * [jupyter notebook](https://nbviewer.jupyter.org/github/qinhanmin2014/sklearn-svm-guide/blob/master/AX_Wine.ipynb)
  * evaluate using cross validation accuracy
  * default in libsvm and old default in scikit-learn: 42.77%
  * new default in scikit-learn: 66.39%
  * scale with MinMaxScaler: 96.68%
  * scale with MinMaxScaler & tune the parameters: 96.68%
  * **scale with StandardScaler: 98.33%**
  * scale with StandardScaler & tune the parameters: 97.76%

## experiment B: Common Mistakes in Scaling Training and Testing Data
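The mistake studied here is fitting a second scaler on the test set instead of reusing the one fitted on the training set. The sketch below contrasts the two. It assumes `load_breast_cancer` with a 70/30 split as stand-in data (not the dataset used in the notebook), and all variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Assumption: stand-in data and split, chosen only for illustration
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# The scaler is fitted on the training set only
train_scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
clf = SVC().fit(train_scaler.transform(X_train), y_train)

# Right way: reuse the training-set scaler on the test set
Xt_right = train_scaler.transform(X_test)
acc_right = clf.score(Xt_right, y_test)

# Wrong way: fit a separate scaler on the test set, so the same raw value
# maps to a different scaled value than it did during training
Xt_wrong = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_test)
acc_wrong = clf.score(Xt_wrong, y_test)

print(f"right way: {acc_right:.4f}, wrong way: {acc_wrong:.4f}")
```

How large the gap is depends on how much the feature ranges of the two sets differ, so it will vary by dataset and split.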
* [jupyter notebook](https://nbviewer.jupyter.org/github/qinhanmin2014/sklearn-svm-guide/blob/master/B_Common_Mistakes_in_Scaling.ipynb)
* evaluate using test set accuracy
* wrong way: use different scalers for the training and testing sets (MinMaxScaler): 69.23% (69.23% in the reference)
* wrong way: use different scalers for the training and testing sets (StandardScaler): 78.21%
* right way: use the same scaler for the training and testing sets (MinMaxScaler): 87.50% (89.42% in the reference)
* **right way: use the same scaler for the training and testing sets (StandardScaler): 89.42%**

## experiment C: When to Use Linear but not RBF Kernel
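This experiment compares kernels on both accuracy and wall time. The sketch below shows how such a comparison might be timed. It assumes synthetic data from `make_classification` with many more instances than features, 3-fold cross-validation, and an explicit `dual` setting on `LinearSVC`. The `candidates` and `report` names are illustrative and nothing here reproduces the notebooks' datasets.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Assumption: synthetic data with instances >> features
X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           n_clusters_per_class=1, random_state=0)

candidates = {
    "RBF kernel": SVC(kernel="rbf"),
    "linear, dual=True": LinearSVC(dual=True, max_iter=5000, random_state=0),
    "linear, dual=False": LinearSVC(dual=False, random_state=0),
}

report = {}
for name, model in candidates.items():
    pipeline = make_pipeline(StandardScaler(), model)
    start = time.perf_counter()
    scores = cross_val_score(pipeline, X, y, cv=3)
    elapsed = time.perf_counter() - start
    report[name] = (scores.mean(), elapsed)
    print(f"{name}: accuracy {scores.mean():.4f}, wall time {elapsed:.2f}s")
```

On a dataset this small the timing differences may be slight; the gaps reported in this experiment come from much larger data.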
- Number of instances << number of features
  * suggestion: use linear kernel
  * [jupyter notebook](https://nbviewer.jupyter.org/github/qinhanmin2014/sklearn-svm-guide/blob/master/C1_Linear_not_RBF_Kernel.ipynb)
  * RBF kernel, cross validation accuracy 92.85% (97.22% in the reference)
  * linear kernel, cross validation accuracy 92.85% (98.61% in the reference)
- Numbers of instances and features are both large
  * suggestion: use linear kernel
  * [jupyter notebook](https://nbviewer.jupyter.org/github/qinhanmin2014/sklearn-svm-guide/blob/master/C2_Linear_not_RBF_Kernel.ipynb)
  * RBF kernel, cross validation accuracy 97.17% (96.81% in the reference), wall time 15min 17s
  * linear kernel, cross validation accuracy 96.63% (97.01% in the reference), wall time <1s
- Number of instances >> number of features
  * suggestion: if linear kernel, set dual=False (default dual=True)
  * [jupyter notebook](https://nbviewer.jupyter.org/github/qinhanmin2014/sklearn-svm-guide/blob/master/C3_Linear_not_RBF_Kernel.ipynb)
  * dual=False, cross validation accuracy 68.51% (75.67% in the reference), wall time 35s
  * dual=True, cross validation accuracy 68.51% (75.67% in the reference), wall time 10min 31s

## experiment D: LIBLINEAR (LinearSVC)
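Instance-wise normalization rescales each row to unit length, whereas feature-wise scaling works column by column. The sketch below contrasts the two. It assumes `sklearn.preprocessing.Normalizer` as the instance-wise step and a tiny hypothetical term-count matrix as the data.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical term-count matrix: 3 documents (rows) x 4 terms (columns)
counts = np.array([[3.0, 0.0, 1.0, 0.0],
                   [0.0, 2.0, 0.0, 2.0],
                   [10.0, 0.0, 5.0, 1.0]])

# Instance-wise: every row is divided by its own L2 norm
unit_rows = Normalizer(norm="l2").fit_transform(counts)
row_norms = np.linalg.norm(unit_rows, axis=1)

# Feature-wise, for contrast: every column is standardized across documents
scaled_cols = StandardScaler().fit_transform(counts)
col_means = scaled_cols.mean(axis=0)

print("row norms after Normalizer:", row_norms)
print("column means after StandardScaler:", col_means)
```

`Normalizer` keeps zero entries at zero, so a sparse document matrix stays sparse, which is one reason it suits document data.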
- In classification, large values in data may cause the following problems:
  (1) Features in larger numeric ranges may dominate those in smaller ranges;
  (2) Optimization methods for training may take longer.
  The typical remedy is to scale data feature-wise.
  However, for document data, a simple instance-wise normalization is often enough.
  Each instance becomes a unit vector.
- Solvers in LIBLINEAR are not very sensitive to C. Once C is larger than a certain value, the obtained models have similar performance.

## reference
- A Practical Guide to Support Vector Classification, Chih-Wei Hsu et al.
- LIBLINEAR: A Library for Large Linear Classification, Rong-En Fan et al.
- LIBSVM: A Library for Support Vector Machines, Chih-Chung Chang et al.

## LIBLINEAR
- https://www.csie.ntu.edu.tw/~cjlin/liblinear/
- https://github.com/cjlin1/liblinear

## LIBSVM
- https://www.csie.ntu.edu.tw/~cjlin/libsvm/
- https://github.com/cjlin1/libsvm