Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lensacom/sparkit-learn
PySpark + Scikit-learn = Sparkit-learn
apache-spark distributed-computing machine-learning python scikit-learn
- Host: GitHub
- URL: https://github.com/lensacom/sparkit-learn
- Owner: lensacom
- License: apache-2.0
- Created: 2014-10-15T14:01:10.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2020-12-31T01:56:49.000Z (about 4 years ago)
- Last Synced: 2025-01-10T19:11:13.581Z (13 days ago)
- Topics: apache-spark, distributed-computing, machine-learning, python, scikit-learn
- Language: Python
- Homepage:
- Size: 444 KB
- Stars: 1,153
- Watchers: 90
- Forks: 255
- Open Issues: 35
Metadata Files:
- Readme: README.rst
- License: LICENSE
Awesome Lists containing this project
- awesome-datascience - Sparkit-learn
README
Sparkit-learn
=============

|Build Status| |PyPi| |Gitter| |Gitential|

**PySpark + Scikit-learn = Sparkit-learn**
GitHub: https://github.com/lensacom/sparkit-learn

About
=====

Sparkit-learn aims to provide scikit-learn functionality and API on
PySpark. The main goal of the library is to create an API that stays
close to sklearn's.

The driving principle was to *"Think locally, execute distributively."*
To accommodate this concept, the basic data block is always an array or a
(sparse) matrix, and the operations are executed at the block level.

Requirements
============

- **Python 2.7.x or 3.4.x**
- **Spark[>=1.3.0]**
- NumPy[>=1.9.0]
- SciPy[>=0.14.0]
- Scikit-learn[>=0.16]

Run IPython from notebooks directory
====================================

.. code:: bash

    PYTHONPATH=${PYTHONPATH}:.. IPYTHON_OPTS="notebook" ${SPARK_HOME}/bin/pyspark --master local\[4\] --driver-memory 2G

Run tests with
==============

.. code:: bash

    ./runtests.sh
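
Before launching the notebooks or the test suite, it can help to confirm that the installed packages meet the minimum versions listed under Requirements. The following is a minimal, standalone sketch and not part of sparkit-learn: ``MINIMUMS``, ``version_tuple`` and ``meets_minimum`` are hypothetical names introduced here for illustration, and the comparison only considers the leading numeric parts of each version string.

.. code:: python

    import re

    # Minimum versions copied from the Requirements list above,
    # keyed by import name (hypothetical helper table).
    MINIMUMS = {
        "numpy": "1.9.0",
        "scipy": "0.14.0",
        "sklearn": "0.16",
    }


    def version_tuple(version):
        """Turn '1.9.0' or '0.16.dev0' into a tuple of its leading integer parts."""
        parts = []
        for piece in version.split("."):
            match = re.match(r"\d+", piece)
            if match is None:
                break  # stop at the first non-numeric component, e.g. 'dev0'
            parts.append(int(match.group()))
        return tuple(parts)


    def meets_minimum(installed, minimum):
        """Return True if the installed version is at least the minimum."""
        have, need = version_tuple(installed), version_tuple(minimum)
        # Pad the shorter tuple with zeros so (0, 16) compares like (0, 16, 0).
        width = max(len(have), len(need))
        have += (0,) * (width - len(have))
        need += (0,) * (width - len(need))
        return have >= need

The installed versions would typically come from each package's ``__version__`` attribute, for example ``meets_minimum(numpy.__version__, MINIMUMS["numpy"])``. Comparing integer tuples rather than raw strings matters because ``"1.10.0"`` sorts before ``"1.9.0"`` as a string.
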

Quick start
===========

Sparkit-learn introduces three important distributed data formats:

- **ArrayRDD:**

  A *numpy.array*-like distributed array

  .. code:: python

      from splearn.rdd import ArrayRDD

      data = range(20)
      # PySpark RDD with 2 partitions (sc is an existing SparkContext)
      rdd = sc.parallelize(data, 2) # each partition with 10 elements
      # ArrayRDD
      # each partition will contain blocks with 5 elements
      X = ArrayRDD(rdd, bsize=5) # 4 blocks, 2 in each partition

  Basic operations:

  .. code:: python

      len(X) # 20 - number of elements in the whole dataset
      X.blocks # 4 - number of blocks
      X.shape # (20,) - the shape of the whole dataset

      X # returns an ArrayRDD
      # from PythonRDD...

      X.dtype # returns the type of the blocks
      # numpy.ndarray

      X.collect() # get the dataset
      # [array([0, 1, 2, 3, 4]),
      #  array([5, 6, 7, 8, 9]),
      #  array([10, 11, 12, 13, 14]),
      #  array([15, 16, 17, 18, 19])]

      X[1].collect() # indexing
      # [array([5, 6, 7, 8, 9])]

      X[1] # also returns an ArrayRDD!

      X[1::2].collect() # slicing
      # [array([5, 6, 7, 8, 9]),
      #  array([15, 16, 17, 18, 19])]

      X[1::2] # returns an ArrayRDD as well

      X.tolist() # returns the dataset as a list
      # [0, 1, 2, ... 17, 18, 19]
      X.toarray() # returns the dataset as a numpy.array
      # array([ 0, 1, 2, ... 17, 18, 19])

      # pyspark.rdd operations will still work
      X.getNumPartitions() # 2 - number of partitions

- **SparseRDD:**

  The sparse counterpart of the *ArrayRDD*; the main difference is that the
  blocks are sparse matrices. The reason behind this split is to follow the
  distinction between *numpy.ndarray* arrays and *scipy.sparse* matrices.
  Usually the *SparseRDD* is created by *splearn*'s transformers, but one can
  instantiate it manually too.

  .. code:: python

      # generate a SparseRDD from a text using SparkCountVectorizer
      from splearn.rdd import ArrayRDD, SparseRDD
      from splearn.feature_extraction.text import SparkCountVectorizer
      from sklearn.feature_extraction.tests.test_text import ALL_FOOD_DOCS

      ALL_FOOD_DOCS
      #(u'the pizza pizza beer copyright',
      # u'the pizza burger beer copyright',
      # u'the the pizza beer beer copyright',
      # u'the burger beer beer copyright',
      # u'the coke burger coke copyright',
      # u'the coke burger burger',
      # u'the salad celeri copyright',
      # u'the salad salad sparkling water copyright',
      # u'the the celeri celeri copyright',
      # u'the tomato tomato salad water',
      # u'the tomato salad water copyright')

      # ArrayRDD created from the raw data
      X = ArrayRDD(sc.parallelize(ALL_FOOD_DOCS, 4), 2)
      X.collect()
      # [array([u'the pizza pizza beer copyright',
      #         u'the pizza burger beer copyright'], dtype=...),
      #  ...]

      # assumed step: vectorize the ArrayRDD with SparkCountVectorizer
      # to obtain the SparseRDD used in the rest of this example
      X = SparkCountVectorizer().fit_transform(X)
      X
      # from PythonRDD...

      # its type is scipy.sparse's general parent class
      X.dtype
      # scipy.sparse.base.spmatrix

      # slicing works just like in ArrayRDDs
      X[2:4].collect()
      # [<2x11 sparse matrix of type ''
      #   with 7 stored elements in Compressed Sparse Row format>,
      #  <2x11 sparse matrix of type ''
      #   with 9 stored elements in Compressed Sparse Row format>]

      # general mathematical operations are available
      X.sum(), X.mean(), X.max(), X.min()
      # (55, 0.45454545454545453, 2, 0)

      # even with axis parameters provided
      X.sum(axis=1)
      # matrix([[5],
      #         [5],
      #         [6],
      #         [5],
      #         [5],
      #         [4],
      #         [4],
      #         [6],
      #         [5],
      #         [5],
      #         [5]])

      # It can be transformed to dense ArrayRDD
      X.todense()
      # from PythonRDD...
      X.todense().collect()
      # [array([[1, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0],
      #         [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0]]),
      #  array([[2, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0],
      #         [2, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]]),
      #  array([[0, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0],
      #         [0, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0]]),
      #  array([[0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
      #         [0, 0, 0, 0, 1, 0, 2, 1, 1, 0, 1]]),
      #  array([[0, 0, 2, 0, 1, 0, 0, 0, 2, 0, 0],
      #         [0, 0, 0, 0, 0, 0, 1, 0, 1, 2, 1]]),
      #  array([[0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1]])]

      # One can instantiate SparseRDD manually too:
      import numpy as np
      import scipy.sparse as sp

      sparse = sc.parallelize(np.array([sp.eye(2).tocsr()]*20), 2)
      sparse = SparseRDD(sparse, bsize=5)
      sparse
      # from PythonRDD...

      sparse.collect()
      # [<10x2 sparse matrix of type ''
      #   with 10 stored elements in Compressed Sparse Row format>,
      #  <10x2 sparse matrix of type ''
      #   with 10 stored elements in Compressed Sparse Row format>,
      #  <10x2 sparse matrix of type ''
      #   with 10 stored elements in Compressed Sparse Row format>,
      #  <10x2 sparse matrix of type ''
      #   with 10 stored elements in Compressed Sparse Row format>]

- **DictRDD:**

  A column-based data format, each column with its own type.

  .. code:: python

      import numpy as np
      from splearn.rdd import DictRDD

      X = range(20)
      y = list(range(2)) * 10
      # PySpark RDD with 2 partitions
      X_rdd = sc.parallelize(X, 2) # each partition with 10 elements
      y_rdd = sc.parallelize(y, 2) # each partition with 10 elements
      # DictRDD
      # each partition will contain blocks with 5 elements
      Z = DictRDD((X_rdd, y_rdd),
                  columns=('X', 'y'),
                  bsize=5,
                  dtype=[np.ndarray, np.ndarray]) # 4 blocks, 2/partition
      # if no dtype is provided, the type of the blocks will be determined
      # automatically

      # or:
      data = np.array([range(20), list(range(2))*10]).T
      rdd = sc.parallelize(data, 2)
      Z = DictRDD(rdd,
                  columns=('X', 'y'),
                  bsize=5,
                  dtype=[np.ndarray, np.ndarray])

  Basic operations:

  .. code:: python

      len(Z) # 8 - number of blocks
      Z.columns # returns ('X', 'y')
      Z.dtype # returns the types in correct order
      # [numpy.ndarray, numpy.ndarray]

      Z # returns a DictRDD
      # from PythonRDD...

      Z.collect()
      # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),
      #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),
      #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0])),
      #  (array([15, 16, 17, 18, 19]), array([1, 0, 1, 0, 1]))]

      Z[:, 'y'] # column select - returns an ArrayRDD
      Z[:, 'y'].collect()
      # [array([0, 1, 0, 1, 0]),
      #  array([1, 0, 1, 0, 1]),
      #  array([0, 1, 0, 1, 0]),
      #  array([1, 0, 1, 0, 1])]

      Z[:-1, ['X', 'y']] # slicing - DictRDD
      Z[:-1, ['X', 'y']].collect()
      # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),
      #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),
      #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0]))]

Basic workflow
--------------

With the use of the described data structures, the basic workflow is
almost identical to sklearn's.

Distributed vectorizing of texts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SparkCountVectorizer
^^^^^^^^^^^^^^^^^^^^

.. code:: python

    from splearn.rdd import ArrayRDD
    from splearn.feature_extraction.text import SparkCountVectorizer
    from sklearn.feature_extraction.text import CountVectorizer

    X = [...] # list of texts
    X_rdd = ArrayRDD(sc.parallelize(X, 4)) # sc is SparkContext

    local = CountVectorizer()
    dist = SparkCountVectorizer()

    result_local = local.fit_transform(X)
    result_dist = dist.fit_transform(X_rdd) # SparseRDD

SparkHashingVectorizer
^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    from splearn.rdd import ArrayRDD
    from splearn.feature_extraction.text import SparkHashingVectorizer
    from sklearn.feature_extraction.text import HashingVectorizer

    X = [...] # list of texts
    X_rdd = ArrayRDD(sc.parallelize(X, 4)) # sc is SparkContext

    local = HashingVectorizer()
    dist = SparkHashingVectorizer()

    result_local = local.fit_transform(X)
    result_dist = dist.fit_transform(X_rdd) # SparseRDD

SparkTfidfTransformer
^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    from splearn.rdd import ArrayRDD
    from splearn.feature_extraction.text import SparkHashingVectorizer
    from splearn.feature_extraction.text import SparkTfidfTransformer
    from splearn.pipeline import SparkPipeline

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.pipeline import Pipeline

    X = [...] # list of texts
    X_rdd = ArrayRDD(sc.parallelize(X, 4)) # sc is SparkContext

    local_pipeline = Pipeline((
        ('vect', HashingVectorizer()),
        ('tfidf', TfidfTransformer())
    ))
    dist_pipeline = SparkPipeline((
        ('vect', SparkHashingVectorizer()),
        ('tfidf', SparkTfidfTransformer())
    ))

    result_local = local_pipeline.fit_transform(X)
    result_dist = dist_pipeline.fit_transform(X_rdd) # SparseRDD

Distributed Classifiers
~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    import numpy as np

    from splearn.rdd import DictRDD
    from splearn.feature_extraction.text import SparkHashingVectorizer
    from splearn.feature_extraction.text import SparkTfidfTransformer
    from splearn.svm import SparkLinearSVC
    from splearn.pipeline import SparkPipeline

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import Pipeline

    X = [...] # list of texts
    y = [...] # list of labels
    X_rdd = sc.parallelize(X, 4)
    y_rdd = sc.parallelize(y, 4)
    Z = DictRDD((X_rdd, y_rdd),
                columns=('X', 'y'),
                dtype=[np.ndarray, np.ndarray])

    local_pipeline = Pipeline((
        ('vect', HashingVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', LinearSVC())
    ))
    dist_pipeline = SparkPipeline((
        ('vect', SparkHashingVectorizer()),
        ('tfidf', SparkTfidfTransformer()),
        ('clf', SparkLinearSVC())
    ))

    local_pipeline.fit(X, y)
    dist_pipeline.fit(Z, clf__classes=np.unique(y))

    y_pred_local = local_pipeline.predict(X)
    y_pred_dist = dist_pipeline.predict(Z[:, 'X'])

Distributed Model Selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    import numpy as np

    from splearn.rdd import DictRDD
    from splearn.grid_search import SparkGridSearchCV
    from splearn.naive_bayes import SparkMultinomialNB

    # sklearn.grid_search was removed in scikit-learn 0.20;
    # newer releases provide GridSearchCV in sklearn.model_selection
    from sklearn.grid_search import GridSearchCV
    from sklearn.naive_bayes import MultinomialNB

    X = [...]
    y = [...]
    X_rdd = sc.parallelize(X, 4)
    y_rdd = sc.parallelize(y, 4)
    Z = DictRDD((X_rdd, y_rdd),
                columns=('X', 'y'),
                dtype=[np.ndarray, np.ndarray])

    parameters = {'alpha': [0.1, 1, 10]}
    fit_params = {'classes': np.unique(y)}

    local_estimator = MultinomialNB()
    local_grid = GridSearchCV(estimator=local_estimator,
                              param_grid=parameters)

    estimator = SparkMultinomialNB()
    grid = SparkGridSearchCV(estimator=estimator,
                             param_grid=parameters,
                             fit_params=fit_params)

    local_grid.fit(X, y)
    grid.fit(Z)

ROADMAP
=======

- [ ] Transparent API to support plain numpy and scipy objects (partially done in the transparent_api branch)
- [ ] Update all dependencies
- [ ] Use the MLlib and ML packages more extensively (as they become more mature)
- [ ] Support Spark DataFrames

Special thanks
==============

- scikit-learn community
- spylearn community
- pyspark community

Similar Projects
================

- Thunder
- Bolt

.. |Build Status| image:: https://travis-ci.org/lensacom/sparkit-learn.png?branch=master
   :target: https://travis-ci.org/lensacom/sparkit-learn
.. |PyPi| image:: https://img.shields.io/pypi/v/sparkit-learn.svg
   :target: https://pypi.python.org/pypi/sparkit-learn
.. |Gitter| image:: https://badges.gitter.im/Join%20Chat.svg
   :alt: Join the chat at https://gitter.im/lensacom/sparkit-learn
   :target: https://gitter.im/lensacom/sparkit-learn?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge
.. |Gitential| image:: https://api.gitential.com/accounts/6/projects/75/badges/coding-hours.svg
   :alt: Gitential Coding Hours
   :target: https://gitential.com/accounts/6/projects/75/share?uuid=095e15c5-46b9-4534-a1d4-3b0bf1f33100&utm_source=shield&utm_medium=shield&utm_campaign=75