https://github.com/jeremyarancio/vlm-batch-deployment
Batch Deployment for Document Parsing with AWS Batch & Qwen-2.5-VL
- Host: GitHub
- URL: https://github.com/jeremyarancio/vlm-batch-deployment
- Owner: jeremyarancio
- Created: 2025-04-17T13:31:37.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-04-28T06:33:29.000Z (5 months ago)
- Last Synced: 2025-04-28T07:33:39.420Z (5 months ago)
- Topics: aws, batch, llm, vllm, vlm
- Language: Jupyter Notebook
- Homepage:
- Size: 398 KB
- Stars: 14
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# VLM for Structured Document Extraction
In this project, we build a Batch inference job to extract data from reports and invoices using Vision Language Models (VLMs) with vLLM.
The Batch inference is deployed and orchestrated on [AWS Batch](https://aws.amazon.com/fr/batch/).
This project is part of the webinar we presented with [Julien Hurault](https://www.linkedin.com/in/julienhuraultanalytics/).
:microphone: Webinar (coming soon) \
:newspaper: [Article](https://medium.com/towards-artificial-intelligence/deploy-an-in-house-vision-language-model-to-parse-millions-of-documents-say-goodbye-to-gemini-and-cdac6f77aff5)

Subscribe to the [Newsletter](https://medium.com/@jeremyarancio/subscribe).
## Quick start
The repository is organized as follows:
```
.
├── src
│   └── llm
│       ├── __init__.py
│       ├── __main__.py
│       ├── parser            // Job module
│       └── settings.py       // Settings and Env variables
├── data
│   └── docs                  // Downloaded documents for testing
├── infra                     // AWS Batch infrastructure deployment
├── Dockerfile
├── Makefile
├── NOTES.md                  // Technical notes
├── README.md
├── assets
├── notebooks                 // Experimentations
├── scripts                   // Various scripts not used in package
├── pyproject.toml
└── uv.lock
```
The module is packaged with [uv](https://github.com/astral-sh/uv).
To install all the dependencies, run:
```bash
uv sync
```
To run the batch job:
1. Use the `.env.template` to create your own `.env` file (a minimal example is sketched after this list).
2. You need to run the job within an environment with a GPU, such as an L4, depending on the size of the model.
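A minimal `.env` could look like the following; the bucket name and prefixes are placeholders, and the authoritative list of variables lives in `.env.template` and `src/llm/settings.py`:
```bash
# Placeholder values: adapt to your own bucket, prefixes, and model
S3_BUCKET=my-invoices-bucket
S3_PREPROCESSED_IMAGES_DIR_PREFIX=preprocessed-images/
S3_PROCESSED_DATASET_PREFIX=processed/invoices.jsonl
MODEL_NAME=Qwen/Qwen2.5-VL-3B-Instruct
```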
Then run:
```bash
uv run run-batch-job
```
## Run online Batch inference
Deploy the module using Docker to AWS ECR with:
```bash
make deploy ECR_ACCOUNT_ID=<account_id>
```
NOTE: You may want to change the ECR repository (`ECR_REPO_NAME`) or the AWS region (`AWS_REGION`).
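For example, assuming these are regular Make variables, you can override them on the command line (the values below are placeholders):
```bash
# Placeholder values: push the image to a custom repository and region
make deploy ECR_ACCOUNT_ID=123456789012 ECR_REPO_NAME=vlm-batch-demo AWS_REGION=eu-west-1
```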
Then, to deploy the Batch infrastructure on AWS with Terraform, run:
```bash
make aws-batch-apply
```
NOTE: Be sure to have Terraform installed.
Once the infrastructure is set up, you can launch a job using the `aws batch` CLI:
```bash
aws batch submit-job \
--job-name <job-name> \
--job-queue demo-job-queue \
--job-definition demo-job-definition
```
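To follow the job once it is submitted, the standard AWS Batch CLI commands work as usual; for example, with the job ID returned by `submit-job`:
```bash
# Check the status of the submitted job (replace <job-id> with the ID returned by submit-job)
aws batch describe-jobs --jobs <job-id>
```
## Process overview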
The Batch process works as follows:
* The documents are loaded from S3 as images. You need to set 3 environment variables:
  * `S3_BUCKET`: the S3 bucket name
  * `S3_PREPROCESSED_IMAGES_DIR_PREFIX`: the prefix under which the invoices are stored; they must already be images, not PDFs.
  * `S3_PROCESSED_DATASET_PREFIX`: the path of the output dataset. Right now, the task only returns a JSONL dataset (`.jsonl`).
* The model `MODEL_NAME` is loaded using **vLLM**. By default, we load *"Qwen/Qwen2.5-VL-3B-Instruct"*, but feel free to use any larger model that fits into memory.
* vLLM is configured to return a structured output using `"GuidedDecoding"`, by providing the expected schema with Pydantic (see the sketch after this list).
* Images are processed by vLLM and a JSON object is extracted for each invoice. If the JSON decoding is not successful, an empty dict is returned instead.
* NOT IMPLEMENTED YET: Pydantic is used to validate the extracted JSONs, and default values are returned if field validation fails.
* The list of dicts, each with a unique identifier (such as the S3 file path), is transformed into a usable dataset (here JSONL, since there's no data type validation with Pydantic yet).
* The dataset is finally exported to S3, at the location given by the environment variable `S3_PROCESSED_DATASET_PREFIX`. Be sure to include the proper file extension (`.jsonl` in this case).
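For reference, here is a minimal sketch of the structured-output step, assuming a recent vLLM version; the `Invoice` fields, the prompt, and the way the image URL is passed are illustrative and not the exact code used in `src/llm/parser`:
```python
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams


class Invoice(BaseModel):
    # Illustrative fields only: the real schema lives in the parser module
    invoice_number: str
    total_amount: float
    currency: str


# Load the VLM with vLLM (the default model mentioned above)
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")

# Constrain generation to the JSON schema derived from the Pydantic model
sampling_params = SamplingParams(
    max_tokens=512,
    guided_decoding=GuidedDecodingParams(json=Invoice.model_json_schema()),
)

# One OpenAI-style chat request per invoice image; the URL is a placeholder
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/invoice_001.png"}},
            {"type": "text", "text": "Extract the invoice fields as JSON."},
        ],
    }
]

outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)  # JSON string matching the Invoice schema
```
In the actual job, the images come from S3 and each decoded JSON is collected, together with its identifier, into the output JSONL dataset.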
## Dataset
For this demo, we used synthetically generated invoices from this [dataset](https://huggingface.co/datasets/mathieu1256/FATURA2-invoices) on Hugging Face.
To download the full dataset:
```bash
make download-data
```
There's also a script in the `scripts/` folder to load a sample of images.