https://github.com/ginberg/spark-sklearn

Host: GitHub
URL: https://github.com/ginberg/spark-sklearn
Owner: ginberg
Created: 2016-10-29T04:11:26.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2016-10-29T04:31:30.000Z (over 9 years ago)
Last Synced: 2025-07-09T07:02:44.404Z (12 months ago)
Language: Python
Size: 5.86 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.MD

## Parallelized GridSearchCV in Apache Spark with StratifiedShuffleSplit

I ran into an issue when using [spark-sklearn](https://github.com/databricks/spark-sklearn) with a `StratifiedShuffleSplit` cross-validator, so I created this class (`GridSearchCVSSS`) to work around it.

### Use case

This class focuses on problems that involve a small amount of data and that can be run in parallel.

- For small datasets, it distributes the search for estimator parameters (`GridSearchCV` in scikit-learn) using Spark.

- For datasets that do not fit in memory, I recommend using the [distributed implementation in Spark ML](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html) instead.

- `StratifiedShuffleSplit` is used as the cross-validator, so every randomized split preserves the percentage of samples of each class.
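To make the last point concrete, here is a minimal standard-library sketch of what a stratified shuffle split does. It is an illustration only, not scikit-learn's implementation: the function name `stratified_shuffle_split`, its parameters, and the toy labels are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_shuffle_split(labels, n_splits=3, test_size=0.5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions.

    Hypothetical helper for illustration; not scikit-learn's algorithm.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    for _ in range(n_splits):
        train, test = [], []
        # Shuffle each class separately and cut it at the same ratio,
        # so every class contributes the same fraction to the test set.
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            n_test = round(len(shuffled) * test_size)
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        yield sorted(train), sorted(test)


# Toy data: class "a" is twice as common as class "b".
labels = ["a"] * 8 + ["b"] * 4
splits = list(stratified_shuffle_split(labels, n_splits=3, test_size=0.5))
for train_idx, test_idx in splits:
    print(Counter(labels[i] for i in test_idx))
```

Each split is drawn independently, so a sample can land in the test set of several splits, but the 2:1 class ratio of the toy data holds in every one of them.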

### Example

```python
from pyspark import SparkContext
from sklearn import svm, datasets
from sklearn.model_selection import StratifiedShuffleSplit
from spark_gridsearch import GridSearchCVSSS

# The first argument (sc) is the SparkContext. In a pyspark shell or a Jupyter
# notebook started through pyspark it already exists; getOrCreate() reuses it
# there and creates one otherwise.
sc = SparkContext.getOrCreate()

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = svm.SVC()
# 10 randomized splits, each holding out half of the samples
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.5)

clf = GridSearchCVSSS(sc, svr, parameters, cv=sss)
clf.fit(iris.data, iris.target)
```
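
To get a feel for how much independent work the example above contains, the following sketch enumerates the parameter combinations and splits with the standard library only. The names `combinations` and `tasks` are made up for illustration, and whether `GridSearchCVSSS` schedules its Spark tasks exactly this way is an assumption; the sketch only counts the fits.

```python
from itertools import product

# Same grid and number of splits as in the example above.
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
n_splits = 10

# Expand the grid into one dict per parameter combination.
names = sorted(parameters)
combinations = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

# Each (combination, split) pair is one fit that does not depend on the others.
tasks = [(params, split) for params in combinations for split in range(n_splits)]

print(len(combinations), len(tasks))  # 4 40
```

Because none of these 40 fits depends on another, they are a natural unit of work to spread across Spark executors.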

## License

This package is released under the Apache 2.0 license.
