Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Joblib Apache Spark Backend
https://github.com/joblib/joblib-spark
- Host: GitHub
- URL: https://github.com/joblib/joblib-spark
- Owner: joblib
- License: apache-2.0
- Created: 2019-11-20T19:02:44.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-07-18T20:59:11.000Z (4 months ago)
- Last Synced: 2024-07-19T01:21:41.301Z (4 months ago)
- Language: Python
- Homepage:
- Size: 95.7 KB
- Stars: 241
- Watchers: 9
- Forks: 26
- Open Issues: 15
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.rst
- License: LICENSE
Awesome Lists containing this project
- awesome-list - Joblib Apache Spark Backend - Provides Apache Spark backend for joblib to distribute tasks on a Spark cluster. (Data Management & Processing / Database & Cloud Management)
- awesome-spark - Joblib Apache Spark Backend - [`joblib`](https://github.com/joblib/joblib) backend for running tasks on Spark clusters. (Packages / General Purpose Libraries)
README
# Joblib Apache Spark Backend
This library provides an Apache Spark backend for joblib to distribute tasks on a Spark cluster.
## Installation
`joblibspark` requires Python 3.6+, `joblib>=0.14` and `pyspark>=2.4` to run.
To install `joblibspark`, run:

```bash
pip install joblibspark
```

The installation does not install PySpark, because for most users PySpark is already installed.
If you do not have PySpark installed, you can install `pyspark` together with `joblibspark`:

```bash
pip install "pyspark>=3.0.0" joblibspark
```

If you want to use `joblibspark` with `scikit-learn`, please install `scikit-learn>=0.21`.
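For example, the scikit-learn requirement can be installed in the same command (quoting the version specifier so the shell does not interpret `>=` as a redirection):

```bash
pip install "scikit-learn>=0.21" joblibspark
```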
## Examples
Run the following example code in a `pyspark` shell:
```python
from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark

register_spark()  # register the Spark backend

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)
```
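Since `joblibspark` is a regular joblib backend, it should also work with plain joblib `Parallel` loops, not only scikit-learn. A minimal sketch (the `inc` function and `n_jobs` value are illustrative, not part of the library's API):

```python
from joblib import Parallel, delayed, parallel_backend
from joblibspark import register_spark

register_spark()  # register the Spark backend under the name 'spark'

def inc(x):
    # trivial task; each call may be scheduled as a Spark task
    return x + 1

with parallel_backend('spark', n_jobs=2):
    results = Parallel()(delayed(inc)(i) for i in range(10))

print(results)  # [1, 2, ..., 10]
```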
## Limitations

`joblibspark` does not generally support running model inference and feature engineering in parallel.
For example:

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn import linear_model
from sklearn.utils import parallel_backend

h = FeatureHasher(n_features=10)
with parallel_backend('spark', n_jobs=3):
    # This won't run in parallel on Spark; it will still run locally.
    h.transform(...)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

with parallel_backend('spark', n_jobs=3):
    # This won't run in parallel on Spark; it will still run locally.
    regr.predict(diabetes_X_test)
```

Note: `sklearn.ensemble.RandomForestClassifier` has an `n_jobs` parameter, which means the algorithm supports parallel model training and inference. However, its inference implementation binds to joblib's built-in backends, so the Spark backend does not take effect in that case.
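To make the asymmetry concrete, here is a sketch of the behavior described above, assuming training respects the active joblib backend while prediction pins a shared-memory backend (the estimator parameters are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import parallel_backend
from sklearn import datasets
from joblibspark import register_spark

register_spark()

iris = datasets.load_iris()
clf = RandomForestClassifier(n_estimators=50, n_jobs=3)

with parallel_backend('spark', n_jobs=3):
    # Training honors the active joblib backend, so individual
    # trees can be fit as Spark tasks.
    clf.fit(iris.data, iris.target)

    # Prediction does not: predict() requires a shared-memory
    # backend internally, so it runs locally despite the context.
    preds = clf.predict(iris.data)
```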