{"id":13487028,"url":"https://github.com/lensacom/sparkit-learn","last_synced_at":"2025-05-15T10:06:09.568Z","repository":{"id":21932658,"uuid":"25257051","full_name":"lensacom/sparkit-learn","owner":"lensacom","description":"PySpark + Scikit-learn = Sparkit-learn","archived":false,"fork":false,"pushed_at":"2020-12-31T01:56:49.000Z","size":455,"stargazers_count":1154,"open_issues_count":35,"forks_count":256,"subscribers_count":88,"default_branch":"master","last_synced_at":"2025-04-14T16:53:57.996Z","etag":null,"topics":["apache-spark","distributed-computing","machine-learning","python","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lensacom.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-10-15T14:01:10.000Z","updated_at":"2025-03-21T14:39:14.000Z","dependencies_parsed_at":"2022-07-17T11:46:26.832Z","dependency_job_id":null,"html_url":"https://github.com/lensacom/sparkit-learn","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lensacom%2Fsparkit-learn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lensacom%2Fsparkit-learn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lensacom%2Fsparkit-learn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lensacom%2Fsparkit-learn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lensacom","download_url":"https://codeload.github.com/lensacom/sparkit-learn/tar
.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254319718,"owners_count":22051072,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","distributed-computing","machine-learning","python","scikit-learn"],"created_at":"2024-07-31T18:00:54.534Z","updated_at":"2025-05-15T10:06:09.501Z","avatar_url":"https://github.com/lensacom.png","language":"Python","funding_links":[],"categories":["The Data Science Toolbox","Python","Machine Learning"],"sub_categories":["General Machine Learning Packages","General Purpose Machine Learning","Automatic Plotting"],"readme":"Sparkit-learn\n=============\n\n|Build Status| |PyPi| |Gitter| |Gitential|\n\n**PySpark + Scikit-learn = Sparkit-learn**\n\nGitHub: https://github.com/lensacom/sparkit-learn\n\nAbout\n=====\n\nSparkit-learn aims to provide scikit-learn functionality and API on\nPySpark. The main goal of the library is to create an API that stays\nclose to sklearn's.\n\nThe driving principle was to *\"Think locally, execute distributively.\"*\nTo accommodate this concept, the basic data block is always an array or a\n(sparse) matrix and the operations are executed at the block level.\n\n\nRequirements\n============\n\n-  **Python 2.7.x or 3.4.x**\n-  **Spark[\u003e=1.3.0]**\n-  NumPy[\u003e=1.9.0]\n-  SciPy[\u003e=0.14.0]\n-  Scikit-learn[\u003e=0.16]\n\n\n\nRun IPython from notebooks directory\n====================================\n\n.. code:: bash\n\n    PYTHONPATH=${PYTHONPATH}:.. 
IPYTHON_OPTS=\"notebook\" ${SPARK_HOME}/bin/pyspark --master local\\[4\\] --driver-memory 2G\n\n\nRun tests with\n==============\n\n.. code:: bash\n\n    ./runtests.sh\n\n\nQuick start\n===========\n\nSparkit-learn introduces three important distributed data formats:\n\n-  **ArrayRDD:**\n\n   A *numpy.array*-like distributed array\n\n   .. code:: python\n\n       from splearn.rdd import ArrayRDD\n\n       data = range(20)\n       # PySpark RDD with 2 partitions\n       rdd = sc.parallelize(data, 2) # each partition with 10 elements\n       # ArrayRDD\n       # each partition will contain blocks with 5 elements\n       X = ArrayRDD(rdd, bsize=5) # 4 blocks, 2 in each partition\n\n   Basic operations:\n\n   .. code:: python\n\n       len(X) # 20 - number of elements in the whole dataset\n       X.blocks # 4 - number of blocks\n       X.shape # (20,) - the shape of the whole dataset\n\n       X # returns an ArrayRDD\n       # \u003cclass 'splearn.rdd.ArrayRDD'\u003e from PythonRDD...\n\n       X.dtype # returns the type of the blocks\n       # numpy.ndarray\n\n       X.collect() # get the dataset\n       # [array([0, 1, 2, 3, 4]),\n       #  array([5, 6, 7, 8, 9]),\n       #  array([10, 11, 12, 13, 14]),\n       #  array([15, 16, 17, 18, 19])]\n\n       X[1].collect() # indexing\n       # [array([5, 6, 7, 8, 9])]\n\n       X[1] # also returns an ArrayRDD!\n\n       X[1::2].collect() # slicing\n       # [array([5, 6, 7, 8, 9]),\n       #  array([15, 16, 17, 18, 19])]\n\n       X[1::2] # returns an ArrayRDD as well\n\n       X.tolist() # returns the dataset as a list\n       # [0, 1, 2, ... 17, 18, 19]\n       X.toarray() # returns the dataset as a numpy.array\n       # array([ 0,  1,  2, ... 17, 18, 19])\n\n       # pyspark.rdd operations will still work\n       X.getNumPartitions() # 2 - number of partitions\n\n\n- **SparseRDD:**\n\n  The sparse counterpart of the *ArrayRDD*; the main difference is that the\n  blocks are sparse matrices. 
The reason behind this split is to follow the\n  distinction between *numpy.ndarray* arrays and *scipy.sparse* matrices.\n  Usually a *SparseRDD* is created by *splearn*'s transformers, but one can\n  also instantiate it directly.\n\n  .. code:: python\n\n       # generate a SparseRDD from text using SparkCountVectorizer\n       from splearn.rdd import ArrayRDD, SparseRDD\n       from sklearn.feature_extraction.tests.test_text import ALL_FOOD_DOCS\n       ALL_FOOD_DOCS\n       #(u'the pizza pizza beer copyright',\n       # u'the pizza burger beer copyright',\n       # u'the the pizza beer beer copyright',\n       # u'the burger beer beer copyright',\n       # u'the coke burger coke copyright',\n       # u'the coke burger burger',\n       # u'the salad celeri copyright',\n       # u'the salad salad sparkling water copyright',\n       # u'the the celeri celeri copyright',\n       # u'the tomato tomato salad water',\n       # u'the tomato salad water copyright')\n\n       # ArrayRDD created from the raw data\n       X = ArrayRDD(sc.parallelize(ALL_FOOD_DOCS, 4), 2)\n       X.collect()\n       # [array([u'the pizza pizza beer copyright',\n       #         u'the pizza burger beer copyright'], dtype='\u003cU31'),\n       #  array([u'the the pizza beer beer copyright',\n       #         u'the burger beer beer copyright'], dtype='\u003cU33'),\n       #  array([u'the coke burger coke copyright',\n       #         u'the coke burger burger'], dtype='\u003cU30'),\n       #  array([u'the salad celeri copyright',\n       #         u'the salad salad sparkling water copyright'], dtype='\u003cU41'),\n       #  array([u'the the celeri celeri copyright',\n       #         u'the tomato tomato salad water'], dtype='\u003cU31'),\n       #  array([u'the tomato salad water copyright'], dtype='\u003cU32')]\n\n       # Feature extraction executed\n       from splearn.feature_extraction.text import SparkCountVectorizer\n       vect = SparkCountVectorizer()\n       X = vect.fit_transform(X)\n       # and we have a 
SparseRDD\n       X\n       # \u003cclass 'splearn.rdd.SparseRDD'\u003e from PythonRDD...\n\n       # its type is scipy.sparse's general parent class\n       X.dtype\n       # scipy.sparse.base.spmatrix\n\n       # slicing works just like in ArrayRDDs\n       X[2:4].collect()\n       # [\u003c2x11 sparse matrix of type '\u003ctype 'numpy.int64'\u003e'\n       #   with 7 stored elements in Compressed Sparse Row format\u003e,\n       #  \u003c2x11 sparse matrix of type '\u003ctype 'numpy.int64'\u003e'\n       #   with 9 stored elements in Compressed Sparse Row format\u003e]\n\n       # general mathematical operations are available\n       X.sum(), X.mean(), X.max(), X.min()\n       # (55, 0.45454545454545453, 2, 0)\n\n       # even with axis parameters provided\n       X.sum(axis=1)\n       # matrix([[5],\n       #         [5],\n       #         [6],\n       #         [5],\n       #         [5],\n       #         [4],\n       #         [4],\n       #         [6],\n       #         [5],\n       #         [5],\n       #         [5]])\n\n       # It can be transformed to a dense ArrayRDD\n       X.todense()\n       # \u003cclass 'splearn.rdd.ArrayRDD'\u003e from PythonRDD...\n       X.todense().collect()\n       # [array([[1, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0],\n       #         [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0]]),\n       #  array([[2, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0],\n       #         [2, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]]),\n       #  array([[0, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0],\n       #         [0, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0]]),\n       #  array([[0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0],\n       #         [0, 0, 0, 0, 1, 0, 2, 1, 1, 0, 1]]),\n       #  array([[0, 0, 2, 0, 1, 0, 0, 0, 2, 0, 0],\n       #         [0, 0, 0, 0, 0, 0, 1, 0, 1, 2, 1]]),\n       #  array([[0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1]])]\n\n       # One can instantiate SparseRDD manually too:\n       import numpy as np\n       import scipy.sparse as sp\n       sparse = sc.parallelize(np.array([sp.eye(2).tocsr()]*20), 2)\n       sparse = SparseRDD(sparse, bsize=5)\n       sparse\n    
   # \u003cclass 'splearn.rdd.SparseRDD'\u003e from PythonRDD...\n\n       sparse.collect()\n       # [\u003c10x2 sparse matrix of type '\u003ctype 'numpy.float64'\u003e'\n       #   with 10 stored elements in Compressed Sparse Row format\u003e,\n       #  \u003c10x2 sparse matrix of type '\u003ctype 'numpy.float64'\u003e'\n       #   with 10 stored elements in Compressed Sparse Row format\u003e,\n       #  \u003c10x2 sparse matrix of type '\u003ctype 'numpy.float64'\u003e'\n       #   with 10 stored elements in Compressed Sparse Row format\u003e,\n       #  \u003c10x2 sparse matrix of type '\u003ctype 'numpy.float64'\u003e'\n       #   with 10 stored elements in Compressed Sparse Row format\u003e]\n\n\n-  **DictRDD:**\n\n   A column-based data format; each column has its own type.\n\n   .. code:: python\n\n       import numpy as np\n       from splearn.rdd import DictRDD\n\n       X = range(20)\n       y = list(range(2)) * 10\n       # PySpark RDD with 2 partitions\n       X_rdd = sc.parallelize(X, 2) # each partition with 10 elements\n       y_rdd = sc.parallelize(y, 2) # each partition with 10 elements\n       # DictRDD\n       # each partition will contain blocks with 5 elements\n       Z = DictRDD((X_rdd, y_rdd),\n                   columns=('X', 'y'),\n                   bsize=5,\n                   dtype=[np.ndarray, np.ndarray]) # 4 blocks, 2/partition\n       # if no dtype is provided, the type of the blocks will be determined\n       # automatically\n\n       # or:\n       data = np.array([range(20), list(range(2))*10]).T\n       rdd = sc.parallelize(data, 2)\n       Z = DictRDD(rdd,\n                   columns=('X', 'y'),\n                   bsize=5,\n                   dtype=[np.ndarray, np.ndarray])\n\n   Basic operations:\n\n   .. 
code:: python\n\n       len(Z) # 8 - number of blocks\n       Z.columns # returns ('X', 'y')\n       Z.dtype # returns the types in correct order\n       # [numpy.ndarray, numpy.ndarray]\n\n       Z # returns a DictRDD\n       #\u003cclass 'splearn.rdd.DictRDD'\u003e from PythonRDD...\n\n       Z.collect()\n       # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),\n       #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),\n       #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0])),\n       #  (array([15, 16, 17, 18, 19]), array([1, 0, 1, 0, 1]))]\n\n       Z[:, 'y'] # column select - returns an ArrayRDD\n       Z[:, 'y'].collect()\n       # [array([0, 1, 0, 1, 0]),\n       #  array([1, 0, 1, 0, 1]),\n       #  array([0, 1, 0, 1, 0]),\n       #  array([1, 0, 1, 0, 1])]\n\n       Z[:-1, ['X', 'y']] # slicing - DictRDD\n       Z[:-1, ['X', 'y']].collect()\n       # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),\n       #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),\n       #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0]))]\n\n\nBasic workflow\n--------------\n\nWith the use of the described data structures, the basic workflow is\nalmost identical to sklearn's.\n\nDistributed vectorizing of texts\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nSparkCountVectorizer\n^^^^^^^^^^^^^^^^^^^^\n\n.. code:: python\n\n    from splearn.rdd import ArrayRDD\n    from splearn.feature_extraction.text import SparkCountVectorizer\n    from sklearn.feature_extraction.text import CountVectorizer\n\n    X = [...]  # list of texts\n    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext\n\n    local = CountVectorizer()\n    dist = SparkCountVectorizer()\n\n    result_local = local.fit_transform(X)\n    result_dist = dist.fit_transform(X_rdd)  # SparseRDD\n\n\nSparkHashingVectorizer\n^^^^^^^^^^^^^^^^^^^^^^\n\n.. 
code:: python\n\n    from splearn.rdd import ArrayRDD\n    from splearn.feature_extraction.text import SparkHashingVectorizer\n    from sklearn.feature_extraction.text import HashingVectorizer\n\n    X = [...]  # list of texts\n    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext\n\n    local = HashingVectorizer()\n    dist = SparkHashingVectorizer()\n\n    result_local = local.fit_transform(X)\n    result_dist = dist.fit_transform(X_rdd)  # SparseRDD\n\n\nSparkTfidfTransformer\n^^^^^^^^^^^^^^^^^^^^^\n\n.. code:: python\n\n    from splearn.rdd import ArrayRDD\n    from splearn.feature_extraction.text import SparkHashingVectorizer\n    from splearn.feature_extraction.text import SparkTfidfTransformer\n    from splearn.pipeline import SparkPipeline\n\n    from sklearn.feature_extraction.text import HashingVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.pipeline import Pipeline\n\n    X = [...]  # list of texts\n    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext\n\n    local_pipeline = Pipeline((\n        ('vect', HashingVectorizer()),\n        ('tfidf', TfidfTransformer())\n    ))\n    dist_pipeline = SparkPipeline((\n        ('vect', SparkHashingVectorizer()),\n        ('tfidf', SparkTfidfTransformer())\n    ))\n\n    result_local = local_pipeline.fit_transform(X)\n    result_dist = dist_pipeline.fit_transform(X_rdd)  # SparseRDD\n\n\nDistributed Classifiers\n~~~~~~~~~~~~~~~~~~~~~~~\n\n.. 
code:: python\n\n    import numpy as np\n\n    from splearn.rdd import DictRDD\n    from splearn.feature_extraction.text import SparkHashingVectorizer\n    from splearn.feature_extraction.text import SparkTfidfTransformer\n    from splearn.svm import SparkLinearSVC\n    from splearn.pipeline import SparkPipeline\n\n    from sklearn.feature_extraction.text import HashingVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.svm import LinearSVC\n    from sklearn.pipeline import Pipeline\n\n    X = [...]  # list of texts\n    y = [...]  # list of labels\n    X_rdd = sc.parallelize(X, 4)\n    y_rdd = sc.parallelize(y, 4)\n    Z = DictRDD((X_rdd, y_rdd),\n                columns=('X', 'y'),\n                dtype=[np.ndarray, np.ndarray])\n\n    local_pipeline = Pipeline((\n        ('vect', HashingVectorizer()),\n        ('tfidf', TfidfTransformer()),\n        ('clf', LinearSVC())\n    ))\n    dist_pipeline = SparkPipeline((\n        ('vect', SparkHashingVectorizer()),\n        ('tfidf', SparkTfidfTransformer()),\n        ('clf', SparkLinearSVC())\n    ))\n\n    local_pipeline.fit(X, y)\n    dist_pipeline.fit(Z, clf__classes=np.unique(y))\n\n    y_pred_local = local_pipeline.predict(X)\n    y_pred_dist = dist_pipeline.predict(Z[:, 'X'])\n\n\nDistributed Model Selection\n~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. 
code:: python\n\n    import numpy as np\n\n    from splearn.rdd import DictRDD\n    from splearn.grid_search import SparkGridSearchCV\n    from splearn.naive_bayes import SparkMultinomialNB\n\n    from sklearn.grid_search import GridSearchCV\n    from sklearn.naive_bayes import MultinomialNB\n\n    X = [...]\n    y = [...]\n    X_rdd = sc.parallelize(X, 4)\n    y_rdd = sc.parallelize(y, 4)\n    Z = DictRDD((X_rdd, y_rdd),\n                columns=('X', 'y'),\n                dtype=[np.ndarray, np.ndarray])\n\n    parameters = {'alpha': [0.1, 1, 10]}\n    fit_params = {'classes': np.unique(y)}\n\n    local_estimator = MultinomialNB()\n    local_grid = GridSearchCV(estimator=local_estimator,\n                              param_grid=parameters)\n\n    estimator = SparkMultinomialNB()\n    grid = SparkGridSearchCV(estimator=estimator,\n                             param_grid=parameters,\n                             fit_params=fit_params)\n\n    local_grid.fit(X, y)\n    grid.fit(Z)\n\n\nROADMAP\n=======\n\n- [ ] Transparent API to support plain numpy and scipy objects (partially done in the transparent_api branch)\n- [ ] Update all dependencies\n- [ ] Use MLlib and ML packages more extensively (now that they have become more mature)\n- [ ] Support Spark DataFrames\n\n\nSpecial thanks\n==============\n\n- scikit-learn community\n- spylearn community\n- pyspark community\n\nSimilar Projects\n================\n\n- `Thunder \u003chttps://github.com/thunder-project/thunder\u003e`_\n- `Bolt \u003chttps://github.com/bolt-project/bolt\u003e`_\n\n.. |Build Status| image:: https://travis-ci.org/lensacom/sparkit-learn.png?branch=master\n   :target: https://travis-ci.org/lensacom/sparkit-learn\n.. |PyPi| image:: https://img.shields.io/pypi/v/sparkit-learn.svg\n   :target: https://pypi.python.org/pypi/sparkit-learn\n.. 
|Gitter| image:: https://badges.gitter.im/Join%20Chat.svg\n   :alt: Join the chat at https://gitter.im/lensacom/sparkit-learn\n   :target: https://gitter.im/lensacom/sparkit-learn?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge\n.. |Gitential| image:: https://api.gitential.com/accounts/6/projects/75/badges/coding-hours.svg\n   :alt: Gitential Coding Hours\n   :target: https://gitential.com/accounts/6/projects/75/share?uuid=095e15c5-46b9-4534-a1d4-3b0bf1f33100\u0026utm_source=shield\u0026utm_medium=shield\u0026utm_campaign=75\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flensacom%2Fsparkit-learn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flensacom%2Fsparkit-learn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flensacom%2Fsparkit-learn/lists"}