{"id":16223740,"url":"https://github.com/lykmapipo/python-joblib-cookbook","last_synced_at":"2025-04-03T04:30:36.529Z","repository":{"id":215079504,"uuid":"738056398","full_name":"lykmapipo/Python-Joblib-Cookbook","owner":"lykmapipo","description":"A step-by-step guide to master various aspects of Joblib for parallel computing in Python","archived":false,"fork":false,"pushed_at":"2024-01-18T16:26:16.000Z","size":46,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-26T17:53:18.592Z","etag":null,"topics":["apache-spark","cache","dask","distributed-computing","joblib","loky","lykmapipo","memoization","multiprocessing","parallel-computing","python","threading"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lykmapipo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-02T10:08:54.000Z","updated_at":"2025-01-20T20:47:31.000Z","dependencies_parsed_at":"2024-01-02T11:48:12.440Z","dependency_job_id":"a92ba432-f229-48cf-8903-6d9cc98d0ea4","html_url":"https://github.com/lykmapipo/Python-Joblib-Cookbook","commit_stats":{"total_commits":16,"total_committers":1,"mean_commits":16.0,"dds":0.0,"last_synced_commit":"77811dcfd867c7592b9788c61031d6db8b7a7ce3"},"previous_names":["lykmapipo/python-joblib-cookbook"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lykmapipo%2FPython-Joblib-Cookbook","tags_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/lykmapipo%2FPython-Joblib-Cookbook/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lykmapipo%2FPython-Joblib-Cookbook/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lykmapipo%2FPython-Joblib-Cookbook/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lykmapipo","download_url":"https://codeload.github.com/lykmapipo/Python-Joblib-Cookbook/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246938928,"owners_count":20857916,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","cache","dask","distributed-computing","joblib","loky","lykmapipo","memoization","multiprocessing","parallel-computing","python","threading"],"created_at":"2024-10-10T12:19:55.838Z","updated_at":"2025-04-03T04:30:36.274Z","avatar_url":"https://github.com/lykmapipo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Python Joblib Cookbook\n\nA step-by-step guide to master various aspects of [Joblib](https://github.com/joblib/joblib), and utilize its functionalities for parallel computing and task handling in Python.\n\n\n## Requirements\n\n- [Python 3.8+](https://www.python.org/)\n- [pip 23.3+](https://github.com/pypa/pip)\n- [joblib 1.3+](https://github.com/joblib/joblib)\n- [numpy 1.24+](https://github.com/numpy/numpy)\n- [scikit-learn 1.3+](https://github.com/scikit-learn/scikit-learn)\n- [dask 2023.5+](https://github.com/dask/dask)\n- [ray 
2.9+](https://github.com/ray-project/ray)\n\n\n---\n\n## Installing Joblib\n\n**Objective:** Learn how to install and verify Joblib using `pip`.\n\n```sh\npip install joblib\n```\n\n```sh\npip show joblib\n```\n\n**Tips:**\n\n- Ensure the appropriate [Python virtual environment](https://docs.python.org/3/library/venv.html) is activated before running the installation command.\n\n- Ensure [pip](https://pip.pypa.io/en/stable/installation/) is installed before running the installation command.\n\n- If you want to use [docker](https://www.docker.com/), run:\n\n```sh\ndocker build -t python-joblib-cookbook:3.8-slim-bookworm .\n\ndocker run -it --rm \\\n    -v $(pwd)/data:/python-joblib-cookbook/data \\\n    -v $(pwd)/tmp:/python-joblib-cookbook/tmp \\\n    -v $(pwd)/scripts:/python-joblib-cookbook/scripts \\\n    python-joblib-cookbook:3.8-slim-bookworm\n\n```\n\n\n---\n\n## Basic Usage\n\n**Objective:** Understand the fundamental usage of Joblib for parallelizing functions.\n\n```python\nfrom joblib import Parallel, delayed\n\n\ndef square(x):\n    return x**2\n\n\nresults = Parallel(n_jobs=-1, verbose=50)(delayed(square)(i) for i in range(10))\n\nprint(results)\n\n```\n\n**Tips:**\n\n- Adjust `n_jobs` to `1, 2, etc.` to control the number of parallel jobs (`-1` uses all available CPU cores).\n\n- Adjust the `verbose` level to `0, 1, 2, 3, 10, 50, etc.` to control the level of progress messages printed.\n\n\n---\n\n## Basic Configuration\n\n**Objective:** Understand how to configure Joblib (i.e., to set `backend`, `n_jobs`, `verbose`, etc.).\n\n```python\nfrom joblib import Parallel, delayed, parallel_config\n\n\ndef square(x):\n    return x**2\n\n\nwith parallel_config(backend=\"loky\", n_jobs=-1, verbose=50):\n    results = Parallel()(delayed(square)(i) for i in range(10))\n\nprint(results)\n\n```\n\n**Tips:**\n\n- It is recommended to use `parallel_config` when configuring Joblib, especially when using libraries (e.g., 
[scikit-learn](https://github.com/scikit-learn/scikit-learn)) that use Joblib internally.\n\n- `backend` specifies the parallelization backend to use. By default, available backends are `loky`, `threading` and `multiprocessing`. Custom backends (e.g., `Dask` and `Ray`) need to be registered before use.\n\n- `n_jobs` specifies the maximum number of parallel jobs. If `-1`, all CPU cores are used.\n\n- `verbose` specifies the level of progress messages printed while executing the jobs.\n\n\n---\n\n## Parallelizing a For Loop\n\n**Objective:** Parallelize a for loop using Joblib.\n\n```python\nfrom joblib import Parallel, delayed, parallel_config\n\n\ndef process_item(item):\n    return item**2\n\n\nitems = list(range(10))\n\nwith parallel_config(backend=\"loky\", n_jobs=-1, verbose=50):\n    results = Parallel()(delayed(process_item)(item) for item in items)\n\nprint(results)\n\n```\n\n**Tips:**\n\n- Adjust the number of items in the list and observe performance changes when parallelizing.\n\n\n---\n\n## Memoizing Function Results\n\n**Objective:** Use Joblib's `Memory` to cache function results and speed up repeated computations.\n\n```python\nfrom joblib import Memory, Parallel, delayed, parallel_config\n\nmem = Memory(\"./tmp/cache\", verbose=10)\n\n\n@mem.cache\ndef process_item(item):\n    return item**2\n\n\nitems = list(range(100))\n\nwith parallel_config(backend=\"loky\", n_jobs=-1, verbose=50):\n    results = Parallel()(delayed(process_item)(item) for item in items)\n\nprint(results)\n\n```\n\n**Tips:**\n\n- Adjust the number of items in the list, re-run the code and observe performance changes when caching.\n\n- Adjust the `Memory` verbose level to `0, 2, 10, 50, etc.` to see whether cached results are used.\n\n\n---\n\n## Memory Mapping Large Arrays\n\n**Objective:** Use memory mapping with Joblib for handling large arrays efficiently.\n\n```python\nimport joblib\nimport numpy as np\n\ndata = np.random.rand(1000, 1000)\nfilename = 
\"./tmp/large_array.dat\"\n\njoblib.dump(data, filename, compress=3, protocol=4)\nloaded_data = joblib.load(filename)\n\nprint(loaded_data)\n\n```\n\n**Tips:**\n\n- Experiment with different compression levels and pickle protocols for optimization.\n\n- Note that compressed files cannot be memory-mapped; to memory-map an array on load, dump it without compression and pass `mmap_mode=\"r\"` to `joblib.load`.\n\n\n---\n\n## Customizing Joblib Parallel Backend\n\n**Objective:** Customize Joblib's parallel backend for specific requirements.\n\n```python\nfrom joblib import Parallel, delayed, parallel_config\n\n\ndef square(x):\n    return x**2\n\n\nwith parallel_config(backend=\"threading\", n_jobs=-1, verbose=50):\n    results = Parallel()(delayed(square)(i) for i in range(10))\n\nprint(results)\n\n```\n\n**Tips:**\n\n- Explore different parallel backends and adjust the number of jobs for performance comparison.\n\n\n---\n\n## Exception Handling\n\n**Objective:** Implement proper exception handling for parallelized tasks.\n\n```python\nfrom joblib import Parallel, delayed, parallel_config\n\n\ndef divide(x, y):\n    try:\n        result = x / y\n    except ZeroDivisionError:\n        result = float(\"nan\")\n    return result\n\n\ndata = [(1, 2), (3, 0), (5, 2)]\n\nwith parallel_config(backend=\"loky\", n_jobs=-1, verbose=50):\n    results = Parallel()(delayed(divide)(x, y) for x, y in data)\n\nprint(results)\n\n```\n\n**Tips:**\n\n- Ensure proper error handling within the parallelized function.\n\n\n---\n\n## Parallelizing Machine Learning Training\n\n**Objective:** Parallelize machine learning model training using Joblib.\n\n```python\nimport joblib\nfrom sklearn.datasets import make_classification\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.metrics import accuracy_score\nfrom sklearn.model_selection import train_test_split\n\nX, y = make_classification(n_samples=1000, n_features=20, random_state=42)\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\n\nwith joblib.parallel_config(backend=\"loky\", n_jobs=-1, verbose=50):\n    clf = 
RandomForestClassifier(n_estimators=100, random_state=42, verbose=50)\n    clf.fit(X_train, y_train)\n\n    y_pred = clf.predict(X_test)\n    accuracy = accuracy_score(y_test, y_pred)\n\nprint(f\"Accuracy: {accuracy}\")\n\n```\n\n**Tips:**\n\n- Experiment with different machine learning models and datasets to observe performance gains.\n\n\n---\n\n## Multi-File Log Processing\n\n**Objective:** Process multiple log files concurrently.\n\n```python\nimport re\nfrom datetime import datetime\nfrom pathlib import Path\n\nfrom joblib import Parallel, delayed, parallel_config\n\n\ndef parse_log_line(log_line):\n    log_pattern = r\"\\[(?P<datetime>.*?)\\] (?P<level>\\w+): (?P<message>.*)\"\n    log_match = re.match(log_pattern, log_line)\n    if log_match is None:  # skip lines that do not match the expected format\n        return None\n\n    log_datetime = datetime.strptime(log_match.group(\"datetime\"), \"%Y-%m-%d %H:%M:%S\")\n    log_level = log_match.group(\"level\")\n    log_message = log_match.group(\"message\")\n    return log_datetime, log_level, log_message\n\n\ndef process_log_file(log_file=None):\n    with open(log_file, \"r\") as file:\n        log_lines = file.readlines()\n        with parallel_config(backend=\"threading\", n_jobs=-1, verbose=50):\n            logs = Parallel()(delayed(parse_log_line)(log_line) for log_line in log_lines)\n        return logs\n\n\ndef glob_log_files(logs_dir=None):\n    logs_dir_path = Path(logs_dir).expanduser().resolve()\n    yield from logs_dir_path.glob(\"*.txt\")\n\n\nlog_files = glob_log_files(logs_dir=\"./data/raw/logs\")\nwith parallel_config(backend=\"loky\", n_jobs=-1, verbose=50):\n    logs = Parallel()(delayed(process_log_file)(log_file) for log_file in log_files)\n\nprint(logs)\n\n```\n\n**Tips:**\n\n- Experiment with different parallel backends and data formats.\n\n\n---\n\n## Distributed Computing with Dask\n\n**Objective:** Utilize `Dask` as a Joblib backend to enable distributed computing capabilities.\n\n```sh\npip install dask distributed\n```\n\n```python\nfrom 
dask.distributed import Client, LocalCluster\nfrom joblib import Parallel, delayed, parallel_config\n\n\ndef square(x):\n    return x**2\n\n\n# See: https://docs.dask.org/en/stable/deploying.html#distributed-computing\nif __name__ == \"__main__\":\n    with LocalCluster() as cluster:\n        with Client(cluster) as client:\n            with parallel_config(backend=\"dask\", n_jobs=-1, verbose=50):\n                results = Parallel()(delayed(square)(i) for i in range(10))\n\n    print(results)\n\n```\n\n**Tips:**\n\n- Experiment with the various ways to [deploy and run Dask clusters](https://docs.dask.org/en/stable/deploying.html#distributed-computing) and observe performance gains.\n\n\n---\n\n## Distributed Computing with Ray\n\n**Objective:** Utilize `Ray` as a Joblib backend to enable distributed computing capabilities.\n\n```sh\npip install ray\n```\n\n```python\nfrom joblib import Parallel, delayed, parallel_config\nfrom ray.util.joblib import register_ray\n\n\ndef square(x):\n    return x**2\n\n\n# Register the Ray backend so it can be selected with parallel_config(backend=\"ray\")\nregister_ray()\n\n# See: https://docs.ray.io/en/latest/ray-core/walkthrough.html\nif __name__ == \"__main__\":\n    with parallel_config(backend=\"ray\", n_jobs=-1, verbose=50):\n        results = Parallel()(delayed(square)(i) for i in range(10))\n\n    print(results)\n\n```\n\n**Tips:**\n\n- Experiment with the various ways to [deploy and run Ray clusters](https://docs.ray.io/en/latest/cluster/getting-started.html) and observe performance gains.\n\n\n---\n\n## What's Next\n\n1. **Explore Advanced Joblib Features:** Delve deeper into Joblib's advanced features such as caching, lazy evaluation, and distributed computing for more complex tasks.\n\n2. **Apply Joblib to Real-world Projects:** Implement Joblib in your own projects involving data processing, machine learning, or any CPU-intensive tasks to experience its benefits firsthand.\n\n3. 
**Discover Related Libraries:** Explore other Python libraries for parallel computing and optimization, such as Dask, Ray, or the built-in `multiprocessing` module, to broaden your toolkit.\n\n4. **Stay Updated:** Keep an eye on Joblib's updates and enhancements in future releases to leverage the latest functionalities and optimizations.\n\n\n## Gotchas\n\n1. **Choose the Right Backend:** Select the appropriate Joblib backend based on your task and available resources. For CPU-bound tasks, `loky` or `multiprocessing` might be suitable. For I/O-bound tasks, `threading` or specific distributed computing backends like `dask` might be better.\n\n2. **Optimal Number of Workers:** Experiment with the number of workers (`n_jobs`) to find the optimal configuration. Too many workers can lead to resource contention, while too few might underutilize resources.\n\n3. **Data Transfer Overhead:** Minimize data transfer overhead between processes/threads. Large data transfers between parallel workers can become a bottleneck. Avoid unnecessary data sharing or copying if possible.\n\n4. **Memory Considerations:** Be mindful of memory usage, especially when processing large datasets in parallel. Parallelism can increase memory consumption, potentially leading to resource contention or out-of-memory issues.\n\n5. **Cleanup Resources:** Ensure proper cleanup of resources (e.g., closing files, releasing memory) after the parallel tasks complete to avoid resource leaks.\n\n6. **Proper Error Handling:** Implement proper error handling mechanisms, especially when dealing with parallel tasks, to manage exceptions and prevent deadlocks or crashes.\n\n7. **Benchmark and Profile:** Measure the performance of your parallelized code using benchmarking tools (`timeit`, `time`, etc.) 
to identify bottlenecks and areas for improvement.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flykmapipo%2Fpython-joblib-cookbook","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flykmapipo%2Fpython-joblib-cookbook","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flykmapipo%2Fpython-joblib-cookbook/lists"}