{"id":25452771,"url":"https://github.com/joblib/joblib-hadoop","last_synced_at":"2025-09-05T06:43:19.563Z","repository":{"id":66083200,"uuid":"89694177","full_name":"joblib/joblib-hadoop","owner":"joblib","description":"Use Joblib in an Hadoop Cluster","archived":false,"fork":false,"pushed_at":"2025-02-26T15:08:20.000Z","size":122,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-09-03T00:41:53.664Z","etag":null,"topics":["cloud","computing","hadoop","parallel"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joblib.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.rst","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2017-04-28T10:06:49.000Z","updated_at":"2025-02-26T15:08:24.000Z","dependencies_parsed_at":null,"dependency_job_id":"ef0ae8b9-87cc-4dc4-b212-aa0bc81a53d9","html_url":"https://github.com/joblib/joblib-hadoop","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/joblib/joblib-hadoop","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joblib%2Fjoblib-hadoop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joblib%2Fjoblib-hadoop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joblib%2Fjoblib-hadoop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joblib%2Fjoblib-hadoop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/joblib","download_url":"https://codeload.github.com/joblib/joblib-hadoop/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joblib%2Fjoblib-hadoop/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273723203,"owners_count":25156303,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-05T02:00:09.113Z","response_time":402,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloud","computing","hadoop","parallel"],"created_at":"2025-02-17T23:41:33.392Z","updated_at":"2025-09-05T06:43:19.551Z","avatar_url":"https://github.com/joblib.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"============\nUNMAINTAINED\n============\n\nThis repository has been unmaintained for several years. Making the tests pass\nagain requires significant work to update the configuration to use more recent\nversions of the base docker images and dependencies.\n\nFurthermore, it depends on https://github.com/dask/hdfs3 which is also\nunmaintained and would therefore require a significant code rewrite to switch to\npyarrow.\n\nSince there was very little adoption of this project, it was decided to mark it as\nofficially archived on 2025-02-26.\n\nJoblib-hadoop\n=============\n\n|Travis| |Codecov|\n\n.. 
|Travis| image:: https://travis-ci.org/joblib/joblib-hadoop.svg?branch=master\n    :target: https://travis-ci.org/joblib/joblib-hadoop\n\n.. |Codecov| image:: https://codecov.io/gh/joblib/joblib-hadoop/branch/master/graph/badge.svg\n    :target: https://codecov.io/gh/joblib/joblib-hadoop\n\nThis package provides parallel and store backends for joblib that can be used on\na Hadoop cluster.\n\nIf you don't know joblib already, its user documentation is located at\nhttps://pythonhosted.org/joblib\n\nJoblib-hadoop supports Python 2.7, 3.4 and 3.5.\n\nGetting the latest code\n=======================\n\nTo get the latest code, use git::\n\n    git clone git://github.com/joblib/joblib-hadoop.git\n\nInstalling joblib-hadoop\n========================\n\nWe recommend using\n`Python Anaconda 3 distribution \u003chttps://www.continuum.io/Downloads\u003e`_ for\nfull support of the HDFS store backends.\n\n1. Create an Anaconda environment (use Python 2.7, 3.4 or 3.5) and activate it:\n\n..  code-block:: bash\n\n    $ conda create -n joblibhadoop-env python==3.5 libhdfs3 -c conda-forge\n    $ . activate joblibhadoop-env\n\nWe recommend using Anaconda because it provides a pre-built version of\nlibhdfs3. See build_libhdfs3_ if you want to install it using pip.\n\n2. From the `joblibhadoop-env` environment, perform installation using pip:\n\n..  code-block:: bash\n\n    $ cd joblib-hadoop\n    $ pip install -r requirements.txt .\n\n\nUsing joblib-hadoop on a Hadoop cluster\n=======================================\n\n1. Use an HDFS storage backend with Joblib memory to cache results (replace\n'namenode' with the name of the HDFS namenode):\n\n..  
code-block:: python\n\n  import numpy as np\n  from joblib import Memory\n  from joblibhadoop.hdfs import register_hdfs_store_backend\n\n  if __name__ == '__main__':\n      register_hdfs_store_backend()\n\n      mem = Memory(location='joblib_cache_hdfs', backend='hdfs',\n                   verbose=100, compress=True,\n                   store_options=dict(host='namenode', port=8020, user='test'))\n\n      multiply = mem.cache(np.multiply)\n      array1 = np.arange(10000)\n      array2 = np.arange(10000)\n\n      result = multiply(array1, array2)\n\n      # Second call should return the cached result\n      result = multiply(array1, array2)\n      print(result)\n\n2. Use a YARN backend with Joblib parallel to parallelize computations:\n\n..  code-block:: python\n\n  from math import sqrt\n  from joblib import (Parallel, delayed,\n                      register_parallel_backend, parallel_backend)\n  from joblibhadoop.yarn import YarnBackend\n\n  if __name__ == '__main__':\n      register_parallel_backend('yarn', YarnBackend)\n\n      # Run in parallel using Yarn backend\n      with parallel_backend('yarn', n_jobs=5):\n          print(Parallel(verbose=100)(\n              delayed(sqrt)(i**2) for i in range(100)))\n\n      # Should be executed in parallel locally\n      print(Parallel(verbose=100, n_jobs=5)(\n          delayed(sqrt)(i**2) for i in range(100)))\n\nThe YARN parallel backend example only works on a host where Hadoop is installed and\ncorrectly configured.\n\n\nAll examples are available in the `examples \u003cexamples\u003e`_ directory.\n\nDeveloping with joblibhadoop\n============================\n\nIn order to run the test suite, you need to set up a local Hadoop cluster inside\nDocker containers. 
This can be achieved very easily using the recipes available\nin the `docker \u003cdocker\u003e`_ directory and with the provided Makefile targets.\n\nTo avoid problems when accessing a Hadoop cluster using `localhost`,\njoblib-hadoop provides the `joblib-hadoop-client` container. This container has\nHadoop 2.7.0 installed and is thus fully functional for playing locally with\nthe Hadoop cluster.\n\nAnother important point is that the root directory of this project is shared\nwith the `/shared` directory inside the Hadoop client container. Thanks to this\ntrick, one can code on the host and test in the container without having to\nrebuild it.\n\nPrerequisites\n-------------\n\nThere are some prerequisites to check before going further.\n\n1. `Install docker-engine \u003chttps://docs.docker.com/engine/installation/\u003e`_:\n\nYou have to be able to run the hello-world container:\n\n..  code-block:: bash\n\n    $ docker run hello-world\n\n2. Install docker-compose with pip:\n\n..  code-block:: bash\n\n    $ pip install docker-compose\n\n\n3. Start your Hadoop cluster using docker-compose:\n\n..  code-block:: bash\n\n    $ cd joblib-hadoop/docker\n    $ docker-compose up\n\nRunning the test suite\n----------------------\n\nThe test suite has to be launched from the `joblib-hadoop-client` container of\nthe docker-compose configuration. This is achieved very easily with the `docker-test`\nMakefile target.\n\n1. First, ensure your Hadoop cluster is already started:\n\n..  code-block:: bash\n\n   $ cd joblib-hadoop/docker\n   $ docker-compose up -d\n   $ docker-compose ps\n\nYour containers should all be in the state *Up* except `joblib-hadoop-client`\nthat should have exited with code 0.\n\n2. You can now start the test suite with:\n\n..  code-block:: bash\n\n   $ cd joblib-hadoop\n   $ make docker-test\n\n\nTo access the container directly and test some customizations or\nrun examples, 
the following additional targets can be\n**run from your host**:\n\n- **make run-container**: start an interactive shell in the\n  `joblib-hadoop-client` container\n\n- **make run-examples**: start a new container, install joblib-hadoop and run\n  the examples\n\nHere we list the helpers to be **run from the container**:\n\n- **make install**: install joblib-hadoop in the container once logged in\n  (you need to be in the container with make run-container first)\n\n- **make run-hdfs-example**: run the HDFS Memory multiply example with the cluster.\n\n- **make run-yarb-example**: run the YARN parallel backend example on the cluster.\n\n\n.. _build_libhdfs3:\n\nBuilding and installing the hdfs3 package by hand\n=================================================\n\nFor the moment, hdfs3 cannot be installed directly using pip: hdfs3 depends on\na C++ library that is not packaged by Linux distributions and must be built\nby hand first.\n\nThe following notes are specific to Ubuntu 16.04 but can also be adapted to\nFedora (package names are slightly different).\n\n1. Clone libhdfs3 from GitHub:\n\n..  code-block:: bash\n\n    $ sudo mkdir /opt/hdfs3\n    $ sudo chown \u003clogin\u003e:\u003clogin\u003e /opt/hdfs3\n    $ cd /opt/hdfs3\n    $ git clone git@github.com:Pivotal-Data-Attic/pivotalrd-libhdfs3.git libhdfs3\n\n\n2. Install required packages\n\n..  code-block:: bash\n\n    $ sudo apt-get install cmake cmake-curses-gui libxml2-dev libprotobuf-dev \\\n    libkrb5-dev uuid-dev libgsasl7-dev protobuf-compiler protobuf-c-compiler \\\n    build-essential -y\n\n\n3. Use CMake to configure and build\n\n..  code-block:: bash\n\n   $ cd /opt/hdfs3/libhdfs3\n   $ mkdir build\n   $ cd build\n   $ ../bootstrap\n   $ make\n   $ make install\n\n\n4. Add the following to your **~/.bashrc** environment file:\n\n::\n\n   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hdfs3/libhdfs3/dist\n\n5. Reload your environment:\n\n..  
code-block:: bash\n\n   $ source ~/.bashrc\n\n6. Use **pip** to install *hdfs3* (use `sudo` if needed):\n\n..  code-block:: bash\n\n   $ pip install hdfs3\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoblib%2Fjoblib-hadoop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoblib%2Fjoblib-hadoop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoblib%2Fjoblib-hadoop/lists"}