https://github.com/sanghyukchun/sc_si

Host: GitHub
URL: https://github.com/sanghyukchun/sc_si
Owner: SanghyukChun
Created: 2017-12-28T04:11:32.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2020-06-27T04:31:23.000Z (about 6 years ago)
Last Synced: 2025-02-28T04:55:40.740Z (over 1 year ago)
Topics: robust
Language: Python
Size: 15.6 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# SC_SI

Python implementation of SC_SI (Subspace Clustering with Scalable and Iterative Algorithm).

SC_SI performs clustering by solving multiple R-PCA (Robust PCA) problems.

The algorithm was proposed in my master's thesis:

`Scalable Iterative Algorithm for Robust Subspace Clustering: Convergence and Initialization`

The full paper is available at the following links:

- https://sanghyukchun.github.io/home/media/papers/chun2016scsi.pdf

- http://library.kaist.ac.kr/mobile/book/view.do?bibCtrlNo=649637
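To make the idea of subspace clustering concrete, here is a minimal NumPy sketch of the generic "assign each point to its closest subspace" step. It is not taken from this repository and does not reproduce the robust objective of SC_SI; the function name `assign_to_subspaces` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def assign_to_subspaces(X, bases):
    """Label each row of X with the index of the closest linear subspace.

    X     : (n_samples, dim) data matrix
    bases : list of (dim, n_components) matrices with orthonormal columns
    (Hypothetical helper, not part of the sc_si package.)
    """
    residuals = []
    for U in bases:
        projection = X @ U @ U.T  # orthogonal projection onto span(U)
        residuals.append(np.linalg.norm(X - projection, axis=1))
    # Pick, for every sample, the subspace with the smallest residual norm.
    return np.argmin(np.stack(residuals, axis=1), axis=1)

# Toy data: two 1-D subspaces (the x-axis and the y-axis) in 3-D space.
bases = [np.array([[1.0], [0.0], [0.0]]), np.array([[0.0], [1.0], [0.0]])]
X = np.array([[2.0, 0.1, 0.0], [0.1, -3.0, 0.0], [5.0, 0.0, 0.2]])
labels = assign_to_subspaces(X, bases)
```

A full algorithm would alternate this assignment step with re-estimating each subspace from its assigned points; per the description above, SC_SI does the re-estimation with Robust PCA.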

## Usage

The interface is almost the same as scikit-learn's [K-means](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

```python
import numpy as np
from sc_si.clustering.sc_si import SC_SI, MiniBatchSC_SI

# Random training data: n_data samples of dimension dim_data.
n_data = 10000
dim_data = 500
X = np.random.random((n_data, dim_data))

n_clusters = 10
n_components = 20

# If you have a large dataset, use `MiniBatchSC_SI` instead.
model = SC_SI(n_clusters=n_clusters, n_components=n_components, alpha=1.0,
              init='sc_si', n_init=3, max_iter=100, verbose=True)

# Fit the model and get a cluster label for each training sample.
labels = model.fit_predict(X)

# Assign previously unseen samples to the learned clusters.
n_data2 = 1000
X_unseen = np.random.random((n_data2, dim_data))
labels_unseen = model.predict(X_unseen)
```
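The example above uses uniformly random data, which has no subspace structure. For a more meaningful trial run, one option is to draw points near a few random low-dimensional subspaces and corrupt a small fraction of them with outliers. The sketch below uses only NumPy; the helper `make_union_of_subspaces` and all of its parameters are hypothetical and not part of this package.

```python
import numpy as np

def make_union_of_subspaces(n_per_cluster, dim, n_clusters, n_components,
                            noise=0.01, outlier_fraction=0.05, seed=0):
    """Hypothetical helper: sample points near random low-dimensional subspaces."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for k in range(n_clusters):
        # Random orthonormal basis of the k-th subspace (reduced QR).
        basis, _ = np.linalg.qr(rng.standard_normal((dim, n_components)))
        coeffs = rng.standard_normal((n_per_cluster, n_components))
        X_parts.append(coeffs @ basis.T)
        y_parts.append(np.full(n_per_cluster, k))
    X = np.vstack(X_parts)
    X += noise * rng.standard_normal(X.shape)  # small dense noise
    y = np.concatenate(y_parts)
    # Overwrite a small fraction of rows with large-magnitude outliers.
    n_outliers = int(outlier_fraction * len(X))
    outlier_idx = rng.choice(len(X), size=n_outliers, replace=False)
    X[outlier_idx] = 5.0 * rng.standard_normal((n_outliers, dim))
    return X, y, outlier_idx

X, y_true, outlier_idx = make_union_of_subspaces(
    n_per_cluster=100, dim=50, n_clusters=3, n_components=5)
```

The resulting `X` can be passed to `model.fit_predict(X)` as in the usage example, with `n_clusters=3` and `n_components=5`, and the predicted labels compared against `y_true` on the non-outlier rows.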

## Optimization Hints

1. `alpha` controls the robustness of the objective function: the smaller `alpha` is, the more robust the objective.
2. Use `alpha=1.0`. In theory `alpha` can be any value in (0, 2], but in practice I recommend `alpha = 1`.
3. Use the default initialization, SC_IN, if the dataset is not too large. Otherwise, use `'random'` initialization with a large `n_init`, or run the initialization on a sample of the dataset.
4. Use the default `svd_algorithm` (subspace iteration). It is much faster and uses less memory than exact SVD.
5. For datasets with few outliers, use a large `beta` (e.g. 10); otherwise, set `beta = alpha`.
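To illustrate why subspace iteration is cheaper than an exact SVD, here is a generic orthogonal (subspace) iteration in NumPy. It is a sketch of the textbook technique, not the code used by this package; the function name `subspace_iteration` and its parameters are assumptions. Each step only multiplies the data by a matrix with `n_components` columns, so the full decomposition is never computed.

```python
import numpy as np

def subspace_iteration(A, n_components, n_iter=20, seed=0):
    """Approximate the top right-singular subspace of A by orthogonal iteration.

    Returns a (dim, n_components) matrix with orthonormal columns.
    (Generic sketch; hypothetical name, not the sc_si implementation.)
    """
    rng = np.random.default_rng(seed)
    dim = A.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((dim, n_components)))
    for _ in range(n_iter):
        # Apply A^T A without forming the (dim x dim) matrix explicitly.
        Z = A.T @ (A @ Q)
        # Re-orthonormalize so the columns do not collapse onto one direction.
        Q, _ = np.linalg.qr(Z)
    return Q

# Compare with the exact SVD on a nearly rank-5 matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 40))
A += 1e-3 * rng.standard_normal(A.shape)
Q = subspace_iteration(A, n_components=5)
_, _, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:5].T
# The projectors onto the two subspaces should nearly coincide.
gap = np.linalg.norm(Q @ Q.T - V @ V.T)
```

The convergence rate depends on the gap between the retained and discarded singular values, so more iterations may be needed when the spectrum decays slowly.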
