https://github.com/databricks/spark-sklearn

(Deprecated) Scikit-learn integration package for Apache Spark

apache-spark grid-search machine-learning parameter-tuning scikit-learn

Last synced: 6 months ago

Host: GitHub
URL: https://github.com/databricks/spark-sklearn
Owner: databricks
License: apache-2.0
Archived: true
Created: 2015-09-02T18:44:51.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2019-12-03T18:37:45.000Z (over 5 years ago)
Last Synced: 2025-01-12T05:34:28.498Z (6 months ago)
Topics: apache-spark, grid-search, machine-learning, parameter-tuning, scikit-learn
Language: Python
Homepage:
Size: 782 KB
Stars: 1,075
Watchers: 94
Forks: 228
Open Issues: 15
Metadata Files:
- Readme: README.rst
- Contributing: CONTRIBUTING.md
- License: LICENSE

README

Deprecation
===========

This project is deprecated.

We now recommend using scikit-learn and the Joblib Apache Spark Backend (the ``joblibspark`` package) to distribute scikit-learn hyperparameter tuning tasks on a Spark cluster.

You need ``pyspark>=2.4.4`` and ``scikit-learn>=0.21`` to use Joblib Apache Spark Backend, which can be installed using ``pip``:

.. code:: bash

    pip install joblibspark
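
To confirm that an environment meets the version requirements above, a small check along the following lines may help. This is a minimal sketch using only the standard library (Python 3.8 or later); the helper names ``release_tuple``, ``meets_minimum`` and ``check_requirements`` are made up for illustration and are not part of ``joblibspark``. It compares only the numeric release components, so pre-release suffixes such as ``rc1`` are ignored.

.. code:: python

    import re
    from importlib.metadata import PackageNotFoundError, version

    # Minimum versions quoted in the text above.
    MINIMUMS = {"pyspark": "2.4.4", "scikit-learn": "0.21"}


    def release_tuple(version_string):
        """Return the leading numeric components, e.g. '2.4.4' -> (2, 4, 4)."""
        match = re.match(r"\d+(?:\.\d+)*", version_string)
        if match is None:
            raise ValueError(f"unrecognized version: {version_string!r}")
        return tuple(int(part) for part in match.group().split("."))


    def meets_minimum(installed, minimum):
        """Compare two version strings on their numeric release components."""
        a, b = release_tuple(installed), release_tuple(minimum)
        # Pad with zeros so that '0.21' and '0.21.0' compare as equal.
        width = max(len(a), len(b))
        a += (0,) * (width - len(a))
        b += (0,) * (width - len(b))
        return a >= b


    def check_requirements(minimums=MINIMUMS):
        """Map each package name to 'ok', 'too old' or 'missing'."""
        report = {}
        for name, minimum in minimums.items():
            try:
                installed = version(name)
            except PackageNotFoundError:
                report[name] = "missing"
                continue
            report[name] = "ok" if meets_minimum(installed, minimum) else "too old"
        return report


    if __name__ == "__main__":
        for name, status in check_requirements().items():
            print(f"{name}: {status}")

Any package reported as ``missing`` or ``too old`` would need to be installed or upgraded before the backend can be used.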

The following example shows how to distribute ``GridSearchCV`` on a Spark cluster using ``joblibspark``. The same applies to ``RandomizedSearchCV``.

.. code:: python

    # parallel_backend comes from joblib; recent scikit-learn releases
    # no longer re-export it from sklearn.utils.
    from joblib import parallel_backend
    from joblibspark import register_spark
    from sklearn import svm, datasets
    from sklearn.model_selection import GridSearchCV

    register_spark()  # register the Spark backend with joblib

    iris = datasets.load_iris()
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svr = svm.SVC(gamma='auto')
    clf = GridSearchCV(svr, parameters, cv=5)

    # Run the cross-validation fits on Spark, at most three at a time.
    with parallel_backend('spark', n_jobs=3):
        clf.fit(iris.data, iris.target)
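
For ``RandomizedSearchCV`` the pattern is the same. The sketch below is illustrative only: the function name ``run_randomized_search``, the sampling range for ``C`` and the value of ``n_iter`` are assumptions chosen for the example, and ``scipy.stats.loguniform`` assumes SciPy 1.4 or later. Running it requires a working Spark installation together with ``joblibspark``.

.. code:: python

    def run_randomized_search(n_iter=8, n_jobs=3, random_state=0):
        """Fit a RandomizedSearchCV on iris, running candidate fits on Spark."""
        # Imports live inside the function so that merely defining it does
        # not require pyspark, joblibspark, scikit-learn or SciPy.
        from joblib import parallel_backend
        from joblibspark import register_spark
        from scipy.stats import loguniform
        from sklearn import datasets, svm
        from sklearn.model_selection import RandomizedSearchCV

        register_spark()  # make the 'spark' joblib backend available

        iris = datasets.load_iris()
        # Lists are sampled uniformly; distributions are sampled via rvs().
        # The range for C is illustrative, not a recommendation.
        param_distributions = {
            "kernel": ["linear", "rbf"],
            "C": loguniform(1e-1, 1e2),
        }
        search = RandomizedSearchCV(
            svm.SVC(gamma="auto"),
            param_distributions,
            n_iter=n_iter,
            cv=5,
            random_state=random_state,
        )
        # Candidate fits are dispatched to Spark, at most n_jobs at a time.
        with parallel_backend("spark", n_jobs=n_jobs):
            search.fit(iris.data, iris.target)
        return search.best_params_, search.best_score_

Calling ``run_randomized_search()`` on a machine with Spark configured returns the best sampled parameters and their mean cross-validation score.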

Scikit-learn integration package for Apache Spark
=================================================

This package contains some tools to integrate the Spark computing framework with the popular scikit-learn machine learning library. Among other things, it can:

- train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the
  multicore implementation included by default in ``scikit-learn``
- convert Spark's DataFrames seamlessly into NumPy ``ndarray`` or sparse matrices
- (experimental) distribute SciPy's sparse matrices as a dataset of sparse vectors

It focuses on problems that have a small amount of data and that can be run in parallel. For small datasets, it distributes the search for estimator parameters (``GridSearchCV`` in scikit-learn) using Spark. For datasets that do not fit in memory, we recommend using the distributed implementation in Spark MLlib.

This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).
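
To illustrate why grid-search cross-validation is a "simple task" to distribute, the following standard-library sketch expands a parameter grid into independent (parameters, fold) tasks and maps them in parallel. It is not this package's implementation: the function names are hypothetical, ``fit_and_score`` returns a made-up number instead of training a model, and a thread pool stands in for a Spark cluster.

.. code:: python

    from concurrent.futures import ThreadPoolExecutor
    from itertools import product


    def expand_grid(param_grid):
        """Yield one dict per combination of parameter values."""
        names = sorted(param_grid)
        for values in product(*(param_grid[name] for name in names)):
            yield dict(zip(names, values))


    def build_tasks(param_grid, n_folds):
        """Pair every parameter combination with every fold index."""
        return [(params, fold)
                for params in expand_grid(param_grid)
                for fold in range(n_folds)]


    def fit_and_score(task):
        """Hypothetical stand-in for fitting one model on one fold."""
        params, fold = task
        # A real task would train an estimator and return a validation
        # score; this placeholder returns a deterministic made-up number.
        score = 1.0 / (1 + params["C"]) + 0.01 * fold
        return params, fold, score


    def run_search(param_grid, n_folds, max_workers=3):
        """Evaluate every task; results come back in task order."""
        tasks = build_tasks(param_grid, n_folds)
        # No task depends on another, so any parallel map works here.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(fit_and_score, tasks))


    if __name__ == "__main__":
        grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}
        results = run_search(grid, n_folds=5)
        # 2 kernels x 2 values of C x 5 folds = 20 independent tasks
        print(len(results))

Each individual fit still runs on a single worker with the full dataset in memory, which is why this approach suits small datasets and many candidate models.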

Installation
------------

This package is available on PyPI:

::

    pip install spark-sklearn

This project is also available as a Spark package.

The developer version has the following requirements:

- scikit-learn 0.18 or 0.19. Later versions may work, but the tests are currently incompatible with 0.20.
- Spark >= 2.1.1. Spark may be downloaded from the Spark website.
  In order to use this package, you need to use the pyspark interpreter or another Spark-compliant Python
  interpreter. See the Spark guide for more details.
- nose (testing dependency only)
- pandas, if using the pandas integration or testing. pandas==0.18 has been tested.

If you want to use a developer version, you just need to make sure the ``python/`` subdirectory is in the
``PYTHONPATH`` when launching the pyspark interpreter:

::

    PYTHONPATH=$PYTHONPATH:./python $SPARK_HOME/bin/pyspark

You can directly run tests:

::

    cd python && ./run-tests.sh

This requires the environment variable ``SPARK_HOME`` to point to your local copy of Spark.

Example
-------

Here is a simple example that runs a grid search with Spark. See the `Installation <#installation>`_ section for how to install the package.

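.. code:: python

    from pyspark import SparkContext
    from sklearn import svm, datasets
    from spark_sklearn import GridSearchCV

    # In the pyspark shell ``sc`` already exists; getOrCreate() reuses it.
    sc = SparkContext.getOrCreate()

    iris = datasets.load_iris()
    parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
    svr = svm.SVC(gamma='auto')
    clf = GridSearchCV(sc, svr, parameters)
    clf.fit(iris.data, iris.target)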

This classifier can be used as a drop-in replacement for any scikit-learn classifier, with the same API.

Documentation
-------------

API documentation is currently hosted on GitHub Pages. To build the docs yourself, see the instructions in ``docs/``.

.. image:: https://travis-ci.org/databricks/spark-sklearn.svg?branch=master
    :target: https://travis-ci.org/databricks/spark-sklearn
