https://github.com/neonwatty/quick_batch
ultra simple command line tool for docker-scaling batch processing
containerization data-science deep-learning docker large-scale machine-learning python
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/neonwatty/quick_batch
- Owner: neonwatty
- Created: 2023-05-19T22:43:57.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-06-09T19:23:01.000Z (about 3 years ago)
- Last Synced: 2025-09-04T17:34:04.964Z (10 months ago)
- Topics: containerization, data-science, deep-learning, docker, large-scale, machine-learning, python
- Language: Python
- Homepage:
- Size: 9.24 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# quick_batch
`quick_batch` is an ultra-simple command-line tool for large-batch, Python-driven processing and transformation. It was designed to be fast to deploy, transparent, and portable. It lets you scale any `processor` function that needs to be run over a large set of input data, enabling batch/parallel processing of the input with minimal setup and teardown.
- [quick\_batch](#quick_batch)
- [Getting started](#getting-started)
- [Usage](#usage)
- [Scaling](#scaling)
- [Installation](#installation)
- [The `processor.py` file](#the-processorpy-file)
- [Why use quick\_batch](#why-use-quick_batch)
# Getting started
All you need to scale batch transformations with `quick_batch` is:
- transformation function(s) in a `processor.py` file
- a `Dockerfile` containing a container build appropriate to your processor
- an optional `requirements.txt` file listing the required Python modules
Document paths to these objects as well as other parameters in a `config.yaml` config file of the form below.
Under `processor` you can either define a `dockerfile_path` to your Dockerfile or an `image_name` to a pre-built image to be pulled.
```yaml
data:
  input_path: /path/to/your/input/data
  output_path: /path/to/your/output/data
  log_path: /path/to/your/log/file
queue:
  feed_rate:
  order_files:
processor:
  dockerfile_path: /path/to/your/Dockerfile  # OR
  image_name:
  requirements_path: /path/to/your/requirements.txt
  processor_path: /path/to/your/processor/processor.py
  num_processors:
```
`quick_batch` will point your `processor.py` at the `input_path` defined in this `config.yaml` and process the files listed in it in parallel at a scale given by your choice of `num_processors`.
Output will be written to the `output_path` specified in the configuration file.
You can see the `examples` directory for examples of valid configs, processors, requirements, and dockerfiles.
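As a rough illustration of the rules above, the sketch below checks an already-parsed config (for example, a dict produced by a YAML parser) before handing it to `quick_batch`. It is not part of `quick_batch`: the `check_config` helper, the `REQUIRED_KEYS` table, and the choice of which keys count as required are assumptions made for this example. The only rule taken directly from the text is that `processor` should name either a `dockerfile_path` or an `image_name`.

```python
# Hypothetical pre-flight check; the set of required keys is an assumption.
REQUIRED_KEYS = {
    "data": ["input_path", "output_path", "log_path"],
    "queue": ["feed_rate", "order_files"],
    "processor": ["processor_path", "num_processors"],
}


def check_config(config):
    """Return a list of problems found in a parsed config dict (empty if none)."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section) or {}
        for key in keys:
            # `is None` so that a legitimate False or 0 value is not flagged
            if values.get(key) is None:
                problems.append(f"{section}.{key} is missing")
    processor = config.get("processor") or {}
    has_dockerfile = bool(processor.get("dockerfile_path"))
    has_image = bool(processor.get("image_name"))
    if has_dockerfile == has_image:
        problems.append(
            "set exactly one of processor.dockerfile_path or processor.image_name"
        )
    return problems


example = {
    "data": {
        "input_path": "/path/to/your/input/data",
        "output_path": "/path/to/your/output/data",
        "log_path": "/path/to/your/log/file",
    },
    "queue": {"feed_rate": 2, "order_files": False},
    "processor": {
        "image_name": "my-image",  # placeholder image name
        "processor_path": "/path/to/your/processor/processor.py",
        "num_processors": 3,
    },
}
print(check_config(example))  # prints []
```

An empty list means the config passed these illustrative checks; it does not guarantee that `quick_batch` itself will accept it.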
## Usage
To start processing with your `config.yaml` use `quick_batch`'s `config` command at the terminal by typing
```bash
quick_batch config /path/to/your/config.yaml
```
This will start the build and deploy process for processing your data as defined in your `config.yaml`.
## Scaling
Use the `scale` command to manually scale the number of processors / containers running your process:
```bash
quick_batch scale <num_processors>
```
Here `<num_processors>` is an integer >= 1. For example, to scale to 3 parallel processors / containers: `quick_batch scale 3`
## Installation
To install quick_batch, simply use `pip`:
```bash
pip install quick-batch
```
## The `processor.py` file
Create a `processor.py` file with the following basic pattern:
```python
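# import whatever modules your processing code needs here

def processor(todos):
    # each call receives a batch of input file paths
    for file_name in todos.file_paths_to_process:
        # processing code goes here
        pass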
```
The `todos` object carries `feed_rate` file names to process in its `.file_paths_to_process` attribute.
Note: the function name `processor` is mandatory.
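To make the pattern concrete, here is a minimal sketch of a `processor` that reports the size of each file in its batch, plus a small local harness for trying it outside Docker. The harness (`dry_run`) is hypothetical and is not how `quick_batch` feeds work internally; it only mimics the documented behaviour of passing `feed_rate` file names at a time through `.file_paths_to_process`. `SimpleNamespace` stands in for the real `todos` object, whose other attributes are not described here, and the return value exists only for the harness's convenience.

```python
import os
from types import SimpleNamespace


def processor(todos):
    """Report the size in bytes of each file in the current batch."""
    results = {}
    for file_name in todos.file_paths_to_process:
        results[file_name] = os.path.getsize(file_name)
        print(f"{file_name}: {results[file_name]} bytes")
    return results


def dry_run(file_paths, feed_rate):
    """Hypothetical local harness: call processor on batches of feed_rate paths."""
    outputs = {}
    for start in range(0, len(file_paths), feed_rate):
        batch = file_paths[start:start + feed_rate]
        # SimpleNamespace is a stand-in for the real todos object (assumption)
        outputs.update(processor(SimpleNamespace(file_paths_to_process=batch)))
    return outputs
```

Calling `dry_run(list_of_paths, 2)` on a handful of local files is a quick way to catch bugs in the processing code before building a container around it.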
# Why use quick_batch
quick_batch aims to be
- **dead simple to use:** versus standard cloud service batch transformation services that require significant configuration / service understanding
- **ultra fast setup:** versus setup of heavier orchestration tools like `airflow` or `mlflow`, which may be a hindrance due to time / familiarity / organisational constraints
- **100% portable:** use quick_batch on any machine, anywhere
- **processor-invariant:** quick_batch works with arbitrary processes, not just machine learning or deep learning tasks.
- **transparent and open source:** quick_batch uses Docker under the hood and only abstracts away the not-so-fun stuff, including instantiation, scaling, and teardown. You can still monitor your processing using familiar Docker commands (like `docker service ls`, `docker service logs`, etc.).