https://github.com/ginberg/spark-sklearn
- Host: GitHub
- URL: https://github.com/ginberg/spark-sklearn
- Owner: ginberg
- Created: 2016-10-29T04:11:26.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-10-29T04:31:30.000Z (over 9 years ago)
- Last Synced: 2025-07-09T07:02:44.404Z (12 months ago)
- Language: Python
- Size: 5.86 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.MD
## Parallelized GridSearchCV in Apache Spark with StratifiedShuffleSplit
I ran into an issue using https://github.com/databricks/spark-sklearn with a `StratifiedShuffleSplit` cross-validator, so I created this class as a workaround.
### Use case
This class targets problems where the dataset is small and the parameter search can be run in parallel:
- for small datasets, it distributes the search for estimator parameters (`GridSearchCV` in scikit-learn) using Spark,
- for datasets that do not fit in memory, I recommend using the [distributed implementation in Spark ML](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html) instead,
- `StratifiedShuffleSplit` is used as the cross-validator, which ensures that every split preserves the percentage of samples for each class and that the splits are randomized.
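To make the last point concrete, here is a minimal pure-Python sketch of what a stratified shuffle split does. It is an illustration only, not scikit-learn's implementation (which handles rounding and edge cases differently); the function name `stratified_shuffle_split` and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_shuffle_split(labels, test_size=0.5, seed=0):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # randomize within each class
        n_test = round(len(indices) * test_size)  # take the same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: three balanced classes of 50 samples each (hypothetical data).
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_shuffle_split(labels, test_size=0.5, seed=42)

test_counts = Counter(labels[i] for i in test_idx)
print(dict(test_counts))  # every class contributes 25 samples to the test part
```

Calling the function again with a different `seed` gives a different random split with the same class proportions, which is the behaviour the cross-validator relies on.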
### Example
```python
from sklearn import svm, datasets
from sklearn.model_selection import StratifiedShuffleSplit
from spark_gridsearch import GridSearchCVSSS
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.5)
# The first argument (sc) is the SparkContext and must already exist; you might need to create or import it yourself.
# I used this in a Jupyter notebook started with pyspark, where sc is already defined in the notebook.
clf = GridSearchCVSSS(sc, svr, parameters, cv=sss)
clf.fit(iris.data, iris.target)
```
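To get a feel for how much work the example above generates, the sketch below enumerates the parameter grid with the standard library. It assumes that each (parameter combination, split) pair is fitted as an independent task, which is the natural unit of work for a parallel grid search; how `GridSearchCVSSS` actually schedules work on Spark is not described here, and the names `candidates` and `tasks` are made up for this illustration.

```python
from itertools import product

# Same grid and number of splits as in the example above.
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
n_splits = 10

# Expand the grid into one dict per parameter combination.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

# One independent fit per (combination, split) pair.
tasks = [(params, split) for params in candidates for split in range(n_splits)]

print(len(candidates))  # 4 parameter combinations
print(len(tasks))       # 40 fits that could run in parallel
```

Because the fits do not depend on each other, a small dataset with a large grid is the case where distributing the search pays off most.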
## License
This package is released under the Apache 2.0 license.