https://github.com/sanghyukchun/sc_si
- Host: GitHub
- URL: https://github.com/sanghyukchun/sc_si
- Owner: SanghyukChun
- Created: 2017-12-28T04:11:32.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2020-06-27T04:31:23.000Z (about 6 years ago)
- Last Synced: 2025-02-28T04:55:40.740Z (over 1 year ago)
- Topics: robust
- Language: Python
- Size: 15.6 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# SC_SI
Python implementation of SC_SI (Subspace Clustering with Scalable and Iterative Algorithm).
SC_SI performs clustering by solving multiple R-PCA (Robust PCA) problems.
The algorithm was proposed in my master's thesis,
`Scalable Iterative Algorithm for Robust Subspace Clustering: Convergence and Initialization`
The full paper is available at the following links:
- https://sanghyukchun.github.io/home/media/papers/chun2016scsi.pdf
- http://library.kaist.ac.kr/mobile/book/view.do?bibCtrlNo=649637
## Usage
The interface is almost the same as scikit-learn's [K-means](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
```python
import numpy as np
from sc_si.clustering.sc_si import SC_SI, MiniBatchSC_SI

# Toy training data: n_data samples with dim_data features each.
n_data = 10000
dim_data = 500
X = np.random.random((n_data, dim_data))

n_clusters = 10
n_components = 20

# If you have a large dataset, use `MiniBatchSC_SI` instead.
model = SC_SI(n_clusters=n_clusters, n_components=n_components, alpha=1.0,
              init='sc_si', n_init=3, max_iter=100, verbose=True)

# Fit the model and get a cluster label for each training sample.
labels = model.fit_predict(X)

# Assign previously unseen samples to the learned clusters.
n_data2 = 1000
X_unseen = np.random.random((n_data2, dim_data))
labels_unseen = model.predict(X_unseen)
```
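The uniform random matrix in the usage example above has no subspace structure, so its labels are not very informative. A minimal sketch for generating test data drawn from a union of low-dimensional subspaces, with a small fraction of outliers, might look like the following. The helper `make_union_of_subspaces` and its parameters are hypothetical and are not part of this package; only NumPy is used.

```python
import numpy as np


def make_union_of_subspaces(n_clusters, n_components, dim_data, n_per_cluster,
                            outlier_fraction=0.05, seed=0):
    """Sample points from `n_clusters` random linear subspaces, plus outliers.

    Hypothetical helper for illustration; it is not part of the sc_si package.
    """
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for k in range(n_clusters):
        # Orthonormal basis of a random n_components-dimensional subspace.
        basis, _ = np.linalg.qr(rng.standard_normal((dim_data, n_components)))
        # Random coordinates inside that subspace.
        coeffs = rng.standard_normal((n_per_cluster, n_components))
        X_parts.append(coeffs @ basis.T)
        y_parts.append(np.full(n_per_cluster, k))
    X = np.vstack(X_parts)
    y = np.concatenate(y_parts)

    # Replace a random fraction of rows with dense noise to act as outliers.
    n_outliers = int(outlier_fraction * len(X))
    outlier_idx = rng.choice(len(X), size=n_outliers, replace=False)
    X[outlier_idx] = rng.standard_normal((n_outliers, dim_data))
    y[outlier_idx] = -1  # -1 marks an outlier
    return X, y


X, y_true = make_union_of_subspaces(n_clusters=3, n_components=5,
                                    dim_data=50, n_per_cluster=200)
print(X.shape, np.unique(y_true))
```

The resulting `X` can be passed to `fit_predict` in place of the uniform random data, and `y_true` gives a reference labeling to compare against.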
## Optimization Hints
1. `alpha` controls the robustness of the objective function: the smaller `alpha` is, the more robust the objective.
2. Use `alpha=1.0`. In theory, `alpha` can be any number in (0, 2], but in practice I recommend `alpha=1`.
3. Use the default initialization, named SC_IN, if the dataset is not too large. Otherwise, use 'random' initialization with a large `n_init`, or run the initialization on a sample of the dataset.
4. Use the default `svd_algorithm` (subspace iteration). It is much faster and uses less memory than exact SVD.
5. For datasets with few outliers, use a large `beta` (e.g. 10); otherwise, set `beta = alpha`.
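To give a feel for the subspace iteration mentioned in the hints, here is a minimal NumPy sketch of the general technique for approximating the leading singular vectors of a matrix. It is not the code used in this repository; the function name `subspace_iteration_svd` and the `n_iter` and `seed` parameters are assumptions made for illustration. Each step needs only products between the data matrix and a thin matrix with `n_components` columns, followed by a QR factorization of that thin matrix.

```python
import numpy as np


def subspace_iteration_svd(A, n_components, n_iter=50, seed=0):
    """Approximate the top `n_components` singular triplets of A.

    Hypothetical helper; it illustrates the general technique only.
    """
    rng = np.random.default_rng(seed)
    n_features = A.shape[1]
    # Random orthonormal starting basis for the right singular subspace.
    V, _ = np.linalg.qr(rng.standard_normal((n_features, n_components)))
    for _ in range(n_iter):
        # One step of orthogonal iteration on A^T A, without forming A^T A.
        U, _ = np.linalg.qr(A @ V)
        V, _ = np.linalg.qr(A.T @ U)
    # Rayleigh-Ritz step: a small exact SVD restricted to the found subspace.
    B = A @ V  # shape (n_samples, n_components)
    U_small, s, Wt = np.linalg.svd(B, full_matrices=False)
    return U_small, s, Wt @ V.T


# Nearly rank-8 test matrix: a low-rank product plus small dense noise.
rng = np.random.default_rng(1)
A = rng.standard_normal((300, 8)) @ rng.standard_normal((8, 60))
A += 0.01 * rng.standard_normal(A.shape)

U, s, Vt = subspace_iteration_svd(A, n_components=8)
s_exact = np.linalg.svd(A, compute_uv=False)[:8]
print(np.max(np.abs(s - s_exact) / s_exact))  # relative error vs. exact SVD
```

The iteration converges quickly when there is a clear gap between the retained and the discarded singular values, as in this test matrix; with a small gap, more iterations would be needed.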