# Python Joblib Cookbook
A step-by-step guide to mastering [Joblib](https://github.com/joblib/joblib) and using its functionality for parallel computing and task handling in Python.
## Requirements
- [Python 3.8+](https://www.python.org/)
- [pip 23.3+](https://github.com/pypa/pip)
- [joblib 1.3+](https://github.com/joblib/joblib)
- [numpy 1.24+](https://github.com/numpy/numpy)
- [scikit-learn 1.3+](https://github.com/scikit-learn/scikit-learn)
- [dask 2023.5+](https://github.com/dask/dask)
- [ray 2.9+](https://github.com/ray-project/ray)

---
## Installing Joblib
**Objective:** Learn how to install and verify Joblib using `pip`.
```sh
pip install joblib
```

```sh
pip show joblib
```

**Tips:**
- Ensure the appropriate [Python virtual environment](https://docs.python.org/3/library/venv.html) is activated before running the installation command.
- Ensure [pip](https://pip.pypa.io/en/stable/installation/) is installed before running the installation command.
- If you want to use [docker](https://www.docker.com/), run:
```sh
docker build -t python-joblib-cookbook:3.8-slim-bookworm .

docker run -it --rm \
    -v $(pwd)/data:/python-joblib-cookbook/data \
    -v $(pwd)/tmp:/python-joblib-cookbook/tmp \
    -v $(pwd)/scripts:/python-joblib-cookbook/scripts \
    python-joblib-cookbook:3.8-slim-bookworm
```
---
## Basic Usage
**Objective:** Understand the fundamental usage of Joblib for parallelizing functions.
```python
from joblib import Parallel, delayed

def square(x):
    return x**2

results = Parallel(n_jobs=-1, verbose=50)(delayed(square)(i) for i in range(10))
print(results)
```
**Tips:**
- Adjust `n_jobs` (e.g. `1`, `2`, `4`) to control the number of parallel jobs (`-1` uses all available CPU cores); see the timing sketch below.
- Adjust `verbose` (e.g. `0`, `1`, `2`, `3`, `10`, `50`) to control the level of progress messages that are printed.
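A minimal sketch comparing wall-clock time across `n_jobs` values; the sleep-based `busy` function and the item count are illustrative, not part of the cookbook's scripts:

```python
import time

from joblib import Parallel, delayed

def busy(x):
    time.sleep(0.1)  # simulate a slow task
    return x**2

# Compare wall-clock time for different worker counts.
for n_jobs in (1, 2, -1):
    start = time.perf_counter()
    Parallel(n_jobs=n_jobs)(delayed(busy)(i) for i in range(20))
    print(f"n_jobs={n_jobs}: {time.perf_counter() - start:.2f}s")
```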
---
## Basic Configuration
**Objective:** Understand how to configure Joblib (i.e. to set `backend`, `n_jobs`, `verbose`, etc.).

```python
from joblib import Parallel, delayed, parallel_config

def square(x):
    return x**2

with parallel_config(backend="loky", n_jobs=-1, verbose=50):
    results = Parallel()(delayed(square)(i) for i in range(10))

print(results)
```
**Tips:**
- Using `parallel_config` is the recommended way to configure joblib, especially when using libraries (e.g. [scikit-learn](https://github.com/scikit-learn/scikit-learn)) that use joblib internally.
- `backend` specifies the parallelization backend to use. By default, the available backends are `loky`, `threading`, and `multiprocessing`. Custom backends (e.g. `Dask`, `Ray`) need to be registered before use.
- `n_jobs` specifies the maximum number of parallel jobs. If `-1`, all CPU cores are used.
- `verbose` specifies the level of progress messages printed while executing the jobs.
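Configuration made with `parallel_config` applies only inside its `with` block; a minimal sketch (the values are illustrative):

```python
from joblib import Parallel, delayed, parallel_config

def square(x):
    return x**2

with parallel_config(backend="threading", n_jobs=2):
    # Inside the block, Parallel picks up the threading backend and 2 workers.
    threaded = Parallel()(delayed(square)(i) for i in range(5))

# Outside the block, Parallel falls back to its defaults (the loky backend).
default = Parallel(n_jobs=2)(delayed(square)(i) for i in range(5))
print(threaded, default)
```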
---
## Parallelizing a For Loop
**Objective:** Parallelize a for loop using Joblib.
```python
from joblib import Parallel, delayed, parallel_config

def process_item(item):
    return item**2

items = list(range(10))
with parallel_config(backend="loky", n_jobs=-1, verbose=50):
    results = Parallel()(delayed(process_item)(item) for item in items)

print(results)
```
**Tips:**
- Adjust the number of items in the list and observe performance changes when parallelizing; for many cheap items, see the batching sketch below.
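When the loop body is cheap, per-task dispatch overhead can outweigh the parallel speed-up. `Parallel` accepts a `batch_size` argument (default `"auto"`) that groups items per worker dispatch; a sketch with an illustrative batch size:

```python
from joblib import Parallel, delayed

def process_item(item):
    return item**2

items = list(range(10_000))
# Group cheap items into larger batches to cut per-task dispatch overhead.
results = Parallel(n_jobs=-1, batch_size=256)(
    delayed(process_item)(item) for item in items
)
print(results[:5])
```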
---
## Memoizing Function Results
**Objective:** Use Joblib's `Memory` to cache function results and speed up repeated computations.
```python
from joblib import Memory, Parallel, delayed, parallel_config

mem = Memory("./tmp/cache", verbose=10)

@mem.cache
def process_item(item):
    return item**2

items = list(range(100))
with parallel_config(backend="loky", n_jobs=-1, verbose=50):
    results = Parallel()(delayed(process_item)(item) for item in items)

print(results)
```
**Tips:**
- Adjust the number of items in the list, re-run the code, and observe performance changes when caching; a timing sketch follows this list.
- Adjust the `Memory` verbose level (e.g. `0`, `2`, `10`, `50`) to see whether cached results are used.
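A minimal timing sketch of the cache at work; the sleep-based `slow_square` function is illustrative:

```python
import time

from joblib import Memory

mem = Memory("./tmp/cache", verbose=0)

@mem.cache
def slow_square(x):
    time.sleep(1)  # simulate expensive work
    return x**2

start = time.perf_counter()
slow_square(4)  # first call: computed and written to the on-disk cache
first = time.perf_counter() - start

start = time.perf_counter()
slow_square(4)  # second call: served from the cache almost instantly
second = time.perf_counter() - start

print(f"first={first:.2f}s, cached={second:.4f}s")
mem.clear()  # drop cached results once they are no longer needed
```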
---
## Memory Mapping Large Arrays
**Objective:** Use memory mapping with Joblib for handling large arrays efficiently.
```python
import joblib
import numpy as np

data = np.random.rand(1000, 1000)

filename = "./tmp/large_array.dat"
joblib.dump(data, filename, compress=3, protocol=4)
loaded_data = joblib.load(filename)
print(loaded_data)
```
**Tips:**
- Experiment with different compression levels and pickle protocols for optimization. Note that compressed files cannot be memory-mapped; see the sketch below.
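Memory mapping pays off when several workers read the same large array: each process shares pages from disk instead of receiving a pickled copy. A minimal sketch, assuming the array is dumped uncompressed (the file name is illustrative):

```python
import joblib
import numpy as np
from joblib import Parallel, delayed

data = np.random.rand(1000, 1000)
filename = "./tmp/shared_array.dat"
joblib.dump(data, filename)  # uncompressed, so the file can be memory-mapped

# mmap_mode="r" returns a read-only numpy memmap instead of loading into RAM.
shared = joblib.load(filename, mmap_mode="r")

row_sums = Parallel(n_jobs=-1)(delayed(np.sum)(shared[i]) for i in range(10))
print(row_sums)
```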
---
## Customizing Joblib Parallel Backend
**Objective:** Customize Joblib's parallel backend for specific requirements.
```python
from joblib import Parallel, delayed, parallel_config

def square(x):
    return x**2

with parallel_config(backend="threading", n_jobs=-1, verbose=50):
    results = Parallel()(delayed(square)(i) for i in range(10))

print(results)
```
**Tips:**
- Explore different parallel backends and adjust the number of jobs for performance comparison.
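A comparison sketch: for pure-Python CPU-bound work, `threading` is limited by the GIL, so process-based backends usually win (the `cpu_bound` function and task counts are illustrative):

```python
import time

from joblib import Parallel, delayed, parallel_config

def cpu_bound(x):
    return sum(i * i for i in range(50_000))  # pure-Python CPU work

if __name__ == "__main__":  # required by the multiprocessing backend on some platforms
    for backend in ("loky", "threading", "multiprocessing"):
        with parallel_config(backend=backend, n_jobs=-1):
            start = time.perf_counter()
            Parallel()(delayed(cpu_bound)(i) for i in range(100))
            print(f"{backend}: {time.perf_counter() - start:.2f}s")
```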
---
## Exception Handling
**Objective:** Implement proper exception handling for parallelized tasks.

```python
from joblib import Parallel, delayed, parallel_config

def divide(x, y):
    try:
        result = x / y
    except ZeroDivisionError:
        result = float("nan")
    return result

data = [(1, 2), (3, 0), (5, 2)]
with parallel_config(backend="loky", n_jobs=-1, verbose=50):
    results = Parallel()(delayed(divide)(x, y) for x, y in data)

print(results)
```
**Tips:**
- Ensure proper error handling within the parallelized function; otherwise an unhandled exception is re-raised in the caller, as the sketch below shows.
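A minimal sketch of what happens without in-function handling: joblib re-raises the worker's exception in the calling process, where it can be caught:

```python
from joblib import Parallel, delayed

def divide(x, y):
    return x / y  # no handling here, so ZeroDivisionError propagates

try:
    Parallel(n_jobs=2)(delayed(divide)(x, y) for x, y in [(1, 2), (3, 0)])
except ZeroDivisionError as error:
    print(f"A worker raised: {error}")
```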
---
## Parallelizing Machine Learning Training
**Objective:** Parallelize machine learning model training using Joblib.
```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with joblib.parallel_config(backend="loky", n_jobs=-1, verbose=50):
    clf = RandomForestClassifier(n_estimators=100, random_state=42, verbose=50)
    clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
```
**Tips:**
- Experiment with different machine learning models and datasets to observe performance gains.
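Since scikit-learn routes its internal parallelism through joblib, the same `parallel_config` block also parallelizes cross-validation. A minimal sketch:

```python
from joblib import parallel_config
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# The five CV folds are dispatched through joblib and run in parallel.
with parallel_config(backend="loky", n_jobs=-1):
    scores = cross_val_score(clf, X, y, cv=5)

print(scores.mean())
```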
---
## Multi log-files Data Processing
**Objective:** Process multiple log files concurrently.
```python
import re
from datetime import datetime
from pathlib import Path

from joblib import Parallel, delayed, parallel_config

def parse_log_line(log_line):
    log_pattern = r"\[(?P<datetime>.*?)\] (?P<level>\w+): (?P<message>.*)"
    log_match = re.match(log_pattern, log_line)
    log_datetime = datetime.strptime(log_match.group("datetime"), "%Y-%m-%d %H:%M:%S")
    log_level = log_match.group("level")
    log_message = log_match.group("message")
    return log_datetime, log_level, log_message

def process_log_file(log_file=None):
    with open(log_file, "r") as file:
        log_lines = file.readlines()
    with parallel_config(backend="threading", n_jobs=-1, verbose=50):
        logs = Parallel()(delayed(parse_log_line)(log_line) for log_line in log_lines)
    return logs

def glob_log_files(logs_dir=None):
    logs_dir_path = Path(logs_dir).expanduser().resolve()
    yield from logs_dir_path.glob("*.txt")

log_files = glob_log_files(logs_dir="./data/raw/logs")
with parallel_config(backend="loky", n_jobs=-1, verbose=50):
    logs = Parallel()(delayed(process_log_file)(log_file) for log_file in log_files)

print(logs)
```
**Tips:**
- Experiment with different parallel backends and data formats.
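For very large inputs, joblib 1.3+ can also stream results back instead of materializing the whole list, via `Parallel(return_as="generator")`. A sketch with a placeholder parser:

```python
from joblib import Parallel, delayed

def parse(line):
    return line.strip().upper()  # stand-in for a real log parser

lines = (f"line {i}\n" for i in range(1_000))
results = Parallel(n_jobs=2, return_as="generator")(
    delayed(parse)(line) for line in lines
)
for parsed in results:  # results are consumed lazily, one at a time
    pass
```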
---
## Distributed Computing with Dask
**Objective:** Utilize `Dask` as a Joblib backend, to enable distributed computing capabilities.
```sh
pip install dask distributed
```

```python
from dask.distributed import Client, LocalCluster
from joblib import Parallel, delayed, parallel_config

def square(x):
    return x**2

# See: https://docs.dask.org/en/stable/deploying.html#distributed-computing
if __name__ == "__main__":
    with LocalCluster() as cluster:
        with Client(cluster) as client:
            with parallel_config(backend="dask", n_jobs=-1, verbose=50):
                results = Parallel()(delayed(square)(i) for i in range(10))

    print(results)
```
**Tips:**
- Experiment with many ways to [deploy and run Dask clusters](https://docs.dask.org/en/stable/deploying.html#distributed-computing) and observe performance gains.
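The same pattern works against a remote cluster; only the `Client` construction changes. A sketch, where the scheduler address is a placeholder for your own deployment:

```python
from dask.distributed import Client
from joblib import Parallel, delayed, parallel_config

def square(x):
    return x**2

if __name__ == "__main__":
    # Replace with the address of a running Dask scheduler.
    with Client("tcp://127.0.0.1:8786") as client:
        with parallel_config(backend="dask", n_jobs=-1):
            results = Parallel()(delayed(square)(i) for i in range(10))

    print(results)
```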
---
## Distributed Computing with Ray
**Objective:** Utilize `Ray` as a Joblib backend, to enable distributed computing capabilities.
```sh
pip install ray
```

```python
from joblib import Parallel, delayed, parallel_config
from ray.util.joblib import register_ray

def square(x):
    return x**2

# Register the Ray backend so it can be selected with parallel_config(backend="ray")
register_ray()

# See: https://docs.ray.io/en/latest/ray-core/walkthrough.html
if __name__ == "__main__":
    with parallel_config(backend="ray", n_jobs=-1, verbose=50):
        results = Parallel()(delayed(square)(i) for i in range(10))

    print(results)
```
**Tips:**
- Experiment with many ways to [deploy and run Ray clusters](https://docs.ray.io/en/latest/cluster/getting-started.html) and observe performance gains.
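To run against an existing Ray cluster rather than a local one, initialize Ray before registering the backend. A sketch, assuming a cluster is already running:

```python
import ray
from joblib import Parallel, delayed, parallel_config
from ray.util.joblib import register_ray

def square(x):
    return x**2

if __name__ == "__main__":
    ray.init(address="auto")  # attach to a running cluster; omit address to start a local one
    register_ray()
    with parallel_config(backend="ray", n_jobs=-1):
        results = Parallel()(delayed(square)(i) for i in range(10))

    print(results)
```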
---
## What's Next
1. **Explore Advanced Joblib Features:** Delve deeper into Joblib's advanced features such as caching, lazy evaluation, and distributed computing for more complex tasks.
2. **Apply Joblib to Real-world Projects:** Implement Joblib in your own projects involving data processing, machine learning, or any CPU-intensive tasks to experience its benefits firsthand.
3. **Discover Related Libraries:** Explore other Python libraries for parallel computing and optimization, such as Dask, Ray or Multiprocessing, to broaden your toolkit.
4. **Stay Updated:** Keep an eye on Joblib's updates and enhancements in future releases to leverage the latest functionalities and optimizations.
## Gotchas
1. **Choose the Right Backend:** Select the appropriate Joblib backend based on your task and available resources. For CPU-bound tasks, `loky` or `multiprocessing` might be suitable. For I/O-bound tasks, `threading` or specific distributed computing backends like `dask` might be better.
2. **Optimal Number of Workers:** Experiment with the number of workers (`n_jobs`) to find the optimal configuration. Too many workers can lead to resource contention, while too few might underutilize resources.
3. **Data Transfer Overhead:** Minimize data transfer overhead between processes/threads. Large data transfers between parallel workers can become a bottleneck. Avoid unnecessary data sharing or copying if possible.
4. **Memory Consideration:** Be mindful of memory usage, especially when processing large datasets in parallel. Parallelism can increase memory consumption, potentially leading to resource contention or out-of-memory issues.
5. **Cleanup Resources:** Ensure proper cleanup of resources (e.g., closing files, releasing memory) after the parallel tasks complete to avoid resource leaks.
6. **Proper Error Handling:** Implement proper error handling mechanisms, especially when dealing with parallel tasks, to manage exceptions and prevent deadlocks or crashes.
7. **Benchmark and Profile:** Measure the performance of your parallelized code using benchmarking tools (`timeit`, `time`, etc.) to identify bottlenecks and areas for improvement; a minimal serial-versus-parallel comparison is sketched below.
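A minimal comparison sketch; the `work` function and task count are illustrative, and parallel only wins when each task is expensive enough to amortize worker overhead:

```python
import time

from joblib import Parallel, delayed

def work(x):
    return sum(i * i for i in range(100_000))  # illustrative CPU-bound task

start = time.perf_counter()
serial = [work(i) for i in range(50)]
print(f"serial:   {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
parallel = Parallel(n_jobs=-1)(delayed(work)(i) for i in range(50))
print(f"parallel: {time.perf_counter() - start:.2f}s")
```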