{"id":13713388,"url":"https://github.com/joblib/joblib-spark","last_synced_at":"2025-05-16T16:02:42.365Z","repository":{"id":39536334,"uuid":"223007271","full_name":"joblib/joblib-spark","owner":"joblib","description":"Joblib Apache Spark Backend","archived":false,"fork":false,"pushed_at":"2024-08-14T11:13:34.000Z","size":101,"stargazers_count":245,"open_issues_count":20,"forks_count":26,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-27T11:15:55.704Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joblib.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-11-20T19:02:44.000Z","updated_at":"2025-01-31T13:26:33.000Z","dependencies_parsed_at":"2024-06-18T20:08:57.098Z","dependency_job_id":"164416be-a2b9-49d4-987d-367d2f3e75dd","html_url":"https://github.com/joblib/joblib-spark","commit_stats":{"total_commits":42,"total_committers":5,"mean_commits":8.4,"dds":"0.11904761904761907","last_synced_commit":"7d52abbe2a8dea8f4d056aa7ae031014b2068b68"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joblib%2Fjoblib-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joblib%2Fjoblib-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joblib%2Fjoblib-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joblib%2Fjoblib
-spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/joblib","download_url":"https://codeload.github.com/joblib/joblib-spark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246998217,"owners_count":20866696,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T23:01:34.637Z","updated_at":"2025-04-03T12:12:14.546Z","avatar_url":"https://github.com/joblib.png","language":"Python","funding_links":[],"categories":["Data Management \u0026 Processing","Packages"],"sub_categories":["Database \u0026 Cloud Management","General Purpose Libraries"],"readme":"# Joblib Apache Spark Backend\n\nThis library provides an Apache Spark backend for joblib, which distributes tasks on a Spark cluster.\n\n## Installation\n\n`joblibspark` requires Python 3.6+, `joblib\u003e=0.14` and `pyspark\u003e=2.4` to run.\nTo install `joblibspark`, run:\n\n```bash\npip install joblibspark\n```\n\nInstalling `joblibspark` does not install PySpark, because most users already have it installed.\nIf you do not have PySpark installed, you can install `pyspark` together with `joblibspark`:\n\n```bash\n# Quote the requirement so the shell does not treat \u003e as a redirection.\npip install 'pyspark\u003e=3.0.0' joblibspark\n```\n\nIf you want to use `joblibspark` with `scikit-learn`, please install `scikit-learn\u003e=0.21`.\n\n## Examples\n\nRun the following example code in the `pyspark` shell:\n\n```python\n# parallel_backend is imported from joblib; recent scikit-learn releases no longer export it from sklearn.utils.\nfrom joblib import parallel_backend\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn import datasets\nfrom sklearn import svm\nfrom joblibspark import 
register_spark\n\nregister_spark()  # register the Spark backend\n\niris = datasets.load_iris()\nclf = svm.SVC(kernel='linear', C=1)\nwith parallel_backend('spark', n_jobs=3):\n    scores = cross_val_score(clf, iris.data, iris.target, cv=5)\n\nprint(scores)\n```\n\n## Limitations\n\n`joblibspark` does not generally support running model inference or feature engineering in parallel.\nFor example:\n\n```python\nfrom sklearn.feature_extraction import FeatureHasher\nh = FeatureHasher(n_features=10)\nwith parallel_backend('spark', n_jobs=3):\n    # This does not run in parallel on Spark; it still runs locally.\n    h.transform(...)\n\nfrom sklearn import linear_model\nregr = linear_model.LinearRegression()\nregr.fit(X_train, y_train)\n\nwith parallel_backend('spark', n_jobs=3):\n    # This does not run in parallel on Spark; it still runs locally.\n    regr.predict(X_test)\n```\n\nNote: `sklearn.ensemble.RandomForestClassifier` has an `n_jobs` parameter,\nwhich means the algorithm supports parallel model training and inference.\nHowever, its inference implementation binds the backend to the built-in backends,\nso the Spark backend does not work in this case.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoblib%2Fjoblib-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoblib%2Fjoblib-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoblib%2Fjoblib-spark/lists"}