Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/databricks/spark-sklearn
(Deprecated) Scikit-learn integration package for Apache Spark
apache-spark grid-search machine-learning parameter-tuning scikit-learn
- Host: GitHub
- URL: https://github.com/databricks/spark-sklearn
- Owner: databricks
- License: apache-2.0
- Archived: true
- Created: 2015-09-02T18:44:51.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2019-12-03T18:37:45.000Z (about 5 years ago)
- Last Synced: 2025-01-12T05:34:28.498Z (13 days ago)
- Topics: apache-spark, grid-search, machine-learning, parameter-tuning, scikit-learn
- Language: Python
- Homepage:
- Size: 782 KB
- Stars: 1,075
- Watchers: 94
- Forks: 228
- Open Issues: 15
Metadata Files:
- Readme: README.rst
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
Deprecation
===========

This project is deprecated.

We now recommend using scikit-learn and `Joblib Apache Spark Backend `_
to distribute scikit-learn hyperparameter tuning tasks on a Spark cluster.

You need ``pyspark>=2.4.4`` and ``scikit-learn>=0.21`` to use Joblib Apache Spark Backend, which can be installed using ``pip``:

.. code:: bash

    pip install joblibspark

The following example shows how to distribute ``GridSearchCV`` on a Spark cluster using ``joblibspark``.
The same applies to ``RandomizedSearchCV``.

.. code:: python

    from sklearn import svm, datasets
    from sklearn.model_selection import GridSearchCV
    from joblibspark import register_spark
    # Imported from joblib directly, since recent scikit-learn releases
    # no longer expose parallel_backend in sklearn.utils.
    from joblib import parallel_backend

    register_spark()  # register spark backend

    iris = datasets.load_iris()
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svr = svm.SVC(gamma='auto')

    clf = GridSearchCV(svr, parameters, cv=5)

    # Run the cross-validation fits through the Spark backend
    with parallel_backend('spark', n_jobs=3):
        clf.fit(iris.data, iris.target)

Scikit-learn integration package for Apache Spark
=================================================

This package contains some tools to integrate the `Spark computing framework `_
with the popular `scikit-learn machine learning library `_. Among other things, it can:

- train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the
  `multicore implementation `_ included by default in ``scikit-learn``
- convert Spark's DataFrames seamlessly into NumPy ``ndarray`` or sparse matrices
- (experimental) distribute SciPy's sparse matrices as a dataset of sparse vectors

It focuses on problems that have a small amount of data and that can be run in parallel.
For small datasets, it distributes the search for estimator parameters (``GridSearchCV`` in scikit-learn)
using Spark. For datasets that do not fit in memory, we recommend using the `distributed implementation in
Spark MLlib `_.

This package distributes simple tasks like grid-search cross-validation.
It does not distribute individual learning algorithms (unlike Spark MLlib).

Installation
------------

This package is available on PyPI:

::

    pip install spark-sklearn

This project is also available as a `Spark package `_.

The developer version has the following requirements:

- scikit-learn 0.18 or 0.19. Later versions may work, but the tests are currently incompatible with 0.20.
- Spark >= 2.1.1. Spark may be downloaded from the `Spark website `_.
  In order to use this package, you need to use the pyspark interpreter or another Spark-compliant Python
  interpreter. See the `Spark guide `_
  for more details.
- `nose `_ (testing dependency only)
- pandas, if using the pandas integration or testing. pandas==0.18 has been tested.

If you want to use a developer version, you just need to make sure the ``python/`` subdirectory is in the
``PYTHONPATH`` when launching the pyspark interpreter:

::

    PYTHONPATH=$PYTHONPATH:./python $SPARK_HOME/bin/pyspark

You can directly run tests:

::

    cd python && ./run-tests.sh

This requires the environment variable ``SPARK_HOME`` to point to your local copy of Spark.
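
Before launching the interpreter or the test script, it can help to confirm that both environment variables are in place. The following is a minimal sketch rather than part of the package: the helper name ``missing_dev_prerequisites`` is hypothetical, and it assumes that a usable Spark copy contains ``bin/pyspark`` and that the check is run from the repository root.

```python
import os
from pathlib import Path


def missing_dev_prerequisites(environ, repo_root="."):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration; it is not shipped with spark-sklearn.
    """
    problems = []

    # SPARK_HOME must point at a Spark copy that ships bin/pyspark.
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not (Path(spark_home) / "bin" / "pyspark").exists():
        problems.append("SPARK_HOME does not contain bin/pyspark")

    # The repository's python/ subdirectory must be on PYTHONPATH.
    wanted = (Path(repo_root) / "python").resolve()
    entries = [e for e in environ.get("PYTHONPATH", "").split(os.pathsep) if e]
    if not any(Path(e).resolve() == wanted for e in entries):
        problems.append("python/ subdirectory is not on PYTHONPATH")

    return problems


if __name__ == "__main__":
    # Report anything that is missing in the current shell environment.
    for problem in missing_dev_prerequisites(os.environ):
        print("missing:", problem)
```

Passing the environment as an argument, instead of reading ``os.environ`` inside the function, keeps the check easy to try against a plain dictionary.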

Example
-------

Here is a simple example that runs a grid search with Spark. See the `Installation <#installation>`_ section
on how to install the package.

.. code:: python

    from sklearn import svm, datasets
    from spark_sklearn import GridSearchCV

    iris = datasets.load_iris()
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svr = svm.SVC(gamma='auto')
    # sc is the SparkContext that the pyspark shell creates for you
    clf = GridSearchCV(sc, svr, parameters)
    clf.fit(iris.data, iris.target)

This classifier can be used as a drop-in replacement for any scikit-learn classifier, with the same API.
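
To make the unit of distributed work concrete: the parameter grid in the example expands to four combinations, and each combination on each cross-validation fold is an independent fit. The sketch below illustrates that fan-out using only the standard library. It is not how the package is implemented: a thread pool stands in for Spark executors, ``score_candidate`` is a hypothetical toy function replacing a real model fit, and the fold count of 3 is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of parameter values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def score_candidate(task):
    """Hypothetical stand-in for fitting and scoring one model on one fold."""
    params, fold = task
    # Toy score so the sketch runs without scikit-learn or Spark.
    bonus = 0.1 if params["kernel"] == "rbf" else 0.0
    return params, fold, 1.0 / (1 + params["C"]) + bonus


param_grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}
n_folds = 3  # assumed fold count, for illustration only

# Every (parameter combination, fold) pair is an independent task.
tasks = [(params, fold)
         for params in expand_grid(param_grid)
         for fold in range(n_folds)]

# The thread pool stands in for Spark executors.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(score_candidate, tasks))

# Collect the per-fold scores of each candidate, then keep the best mean.
scores_by_candidate = {}
for params, _fold, score in results:
    key = tuple(sorted(params.items()))
    scores_by_candidate.setdefault(key, []).append(score)

best_key = max(
    scores_by_candidate,
    key=lambda k: sum(scores_by_candidate[k]) / len(scores_by_candidate[k]),
)
best_params = dict(best_key)
print(len(tasks), "tasks; best parameters:", best_params)
```

Each task needs the full training data, which is why this pattern suits small datasets: the data is copied to every worker and only the search itself is spread out.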

Documentation
-------------

`API documentation `_ is currently hosted on GitHub Pages. To
build the docs yourself, see the instructions in ``docs/``.

.. image:: https://travis-ci.org/databricks/spark-sklearn.svg?branch=master
    :target: https://travis-ci.org/databricks/spark-sklearn