https://github.com/joblib/joblib-spark

Joblib Apache Spark Backend

Host: GitHub
URL: https://github.com/joblib/joblib-spark
Owner: joblib
License: apache-2.0
Created: 2019-11-20T19:02:44.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2024-08-14T11:13:34.000Z (10 months ago)
Last Synced: 2025-03-27T11:15:55.704Z (3 months ago)
Language: Python
Homepage:
Size: 98.6 KB
Stars: 245
Watchers: 7
Forks: 26
Open Issues: 20
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.rst
- License: LICENSE

Awesome Lists containing this project

awesome-list - Joblib Apache Spark Backend - Provides Apache Spark backend for joblib to distribute tasks on a Spark cluster. (Data Management & Processing / Database & Cloud Management)
awesome-spark - Joblib Apache Spark Backend - [`joblib`](https://github.com/joblib/joblib) backend for running tasks on Spark clusters. (Packages / General Purpose Libraries)

README

# Joblib Apache Spark Backend

This library provides an Apache Spark backend for joblib to distribute tasks on a Spark cluster.

## Installation

`joblibspark` requires Python 3.6+, `joblib>=0.14` and `pyspark>=2.4` to run.

To install `joblibspark`, run:

```bash
pip install joblibspark
```

The installation does not install PySpark because for most users, PySpark is already installed.

If you do not have PySpark installed, you can install `pyspark` together with `joblibspark`:

```bash
# Quote the requirement so the shell does not treat ">=" as an output redirection.
pip install "pyspark>=3.0.0" joblibspark
```

If you want to use `joblibspark` with `scikit-learn`, please install `scikit-learn>=0.21`.
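To confirm that an environment meets the minimum versions listed above, a small check along the following lines may help. This is only a sketch: `parse_version` and `check_requirements` are hypothetical helper names and are not part of `joblibspark`. It relies on `importlib.metadata`, which needs Python 3.8+, and it compares only the leading numeric parts of a version string.

```python
import sys
from importlib import metadata

# Minimum versions quoted in the Installation section.
MINIMUMS = {"joblib": (0, 14), "pyspark": (2, 4)}


def parse_version(text):
    """Turn '3.5.1' into (3, 5, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # e.g. '0rc1': keep the 0, ignore the suffix
            break
    return tuple(parts)


def check_requirements(minimums=MINIMUMS):
    """Return a {package: status} report for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse_version(installed) >= minimum
        report[name] = "ok" if ok else "too old ({})".format(installed)
    return report


print("python ok:", sys.version_info >= (3, 6))
print(check_requirements())
```

A package reported as `missing` or `too old` can then be installed or upgraded with `pip` as described above.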

## Examples

Run the following example code in the `pyspark` shell:

```python
# parallel_backend is imported from joblib itself;
# sklearn.utils.parallel_backend is deprecated in newer scikit-learn releases.
from joblib import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark

register_spark()  # register the Spark backend with joblib

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Run the 5 cross-validation fits as Spark tasks, at most 3 at a time.
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)
```
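Because the Spark backend is registered with joblib, plain `joblib.Parallel` calls can be placed inside the same `parallel_backend('spark', ...)` context. The sketch below illustrates this with a hypothetical `square` function. The fallback to the `threading` backend is an assumption added only so the snippet also runs where `joblibspark` or PySpark is not installed; it is not something the library does itself.

```python
from joblib import Parallel, delayed, parallel_backend


def square(x):
    """Hypothetical task; any picklable function can be used instead."""
    return x * x


try:
    from joblibspark import register_spark

    register_spark()  # makes the 'spark' backend name known to joblib
    backend = "spark"
except ImportError:
    # Assumption for this sketch only: use threads when joblibspark
    # (or PySpark) is not available.
    backend = "threading"

# Parallel() picks up the backend and n_jobs from the enclosing context.
with parallel_backend(backend, n_jobs=3):
    squares = Parallel()(delayed(square)(i) for i in range(10))

print(squares)
```

Results come back in the same order as the submitted tasks, whichever backend runs them.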

## Limitations

`joblibspark` does not generally support running model inference and feature engineering in parallel.

For example:

```python
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=10)
with parallel_backend('spark', n_jobs=3):
    # This does not run in parallel on Spark; it still runs locally.
    h.transform(...)

from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
with parallel_backend('spark', n_jobs=3):
    # This does not run in parallel on Spark; it still runs locally.
    regr.predict(diabetes_X_test)
```

Note: `sklearn.ensemble.RandomForestClassifier` has an `n_jobs` parameter, which means the algorithm supports model training/inference in parallel. However, its inference implementation binds the backend to built-in backends, so the Spark backend does not work in this case.
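One possible way around this limitation, which is an assumption and not something taken from the project's documentation, is to split the input into chunks and dispatch each chunk explicitly through `joblib.Parallel`. In the sketch below, `predict_in_chunks` and `toy_predict` are hypothetical names, and the `threading` backend is used so that the snippet runs without a Spark installation.

```python
from joblib import Parallel, delayed, parallel_backend


def predict_in_chunks(predict, rows, n_chunks):
    """Split rows into about n_chunks slices, run predict on each via joblib,
    and concatenate the per-chunk results in order."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    parts = Parallel()(delayed(predict)(chunk) for chunk in chunks)
    return [value for part in parts for value in part]


def toy_predict(rows):
    """Hypothetical stand-in for a fitted model's predict method."""
    return [2 * x + 1 for x in rows]


# Swap "threading" for 'spark' after calling register_spark().
with parallel_backend("threading", n_jobs=3):
    predictions = predict_in_chunks(toy_predict, list(range(10)), n_chunks=3)

print(predictions)
```

Whether this pays off depends on how expensive each chunk is compared with the cost of shipping the model and data to the workers.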
