# pimmer
Exploratory code for PDF image mining. A multi-page PDF is split and converted to JPEG files, which are then mined for illustrations and images. Based on https://github.com/megloff1/image-mining with added PDF splitting, a simple GUI and queue management.

## Install

1. Make sure you have Git and [Docker](https://www.docker.com) with docker-compose installed.
2. Get the latest version of this repository: `git clone --depth 1 https://github.com/peterk/pimmer.git`.
3. Copy the `example_env` file to `.env` and edit the settings.
4. Make sure there is a folder called `data` in the project root (jobs and resulting image files end up here). You can map output to a different local folder for the worker in `docker-compose.yml`.
5. Run `docker-compose up -d`. Wait a minute until the queue and worker are up.

The service is now running on http://localhost:7777.
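
Instead of waiting a fixed minute, you could poll the web front end until it answers. The following is a minimal sketch, not part of pimmer: the function name `wait_for_service` and its defaults (30 attempts, 2 seconds apart) are assumptions, and it only confirms that something answers HTTP on the port, not that the queue and worker are ready.

```shell
# Hypothetical helper (not part of pimmer): poll a URL until it answers.
# Usage: wait_for_service [url] [attempts] [delay_seconds]
wait_for_service() {
    url="${1:-http://localhost:7777}"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # curl exits 0 on any HTTP response (even an error page),
        # and non-zero when the connection fails or times out.
        if curl --silent --output /dev/null --max-time 2 "$url"; then
            echo "Service is answering at $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "No answer from $url after $attempts attempts" >&2
    return 1
}
```

After `docker-compose up -d`, calling `wait_for_service` with no arguments would check http://localhost:7777 for up to about a minute.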

If you plan to process a large number of documents, you can start more workers with `docker-compose up -d --scale worker=5` and then post files with curl to the `/process/` endpoint:

`curl -v --silent -F "file=@testdata/hat_catalog.pdf" http://0.0.0.0:7777/process/`
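
To post a whole folder of PDFs, the curl call above can be wrapped in a loop. This is a sketch under assumptions: the function name `post_pdfs`, the `PIMMER_URL` variable and the `DRY_RUN` switch are made up for illustration and are not part of pimmer.

```shell
# Hypothetical helper (not part of pimmer): post every PDF in a folder
# to the /process/ endpoint, one request per file.
PIMMER_URL="${PIMMER_URL:-http://localhost:7777/process/}"

post_pdfs() {
    dir="$1"
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the folder contains no PDFs.
        [ -e "$pdf" ] || continue
        echo "Posting $pdf"
        # With DRY_RUN=1 the files are only listed, nothing is sent.
        if [ "${DRY_RUN:-0}" != "1" ]; then
            curl --silent --show-error --fail -F "file=@$pdf" "$PIMMER_URL" \
                || echo "Failed: $pdf" >&2
        fi
    done
}
```

For example, `post_pdfs ./my_pdfs` would post each PDF in a (hypothetical) `my_pdfs` folder. Setting `DRY_RUN=1` beforehand lists the files without contacting the service.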

Please report bugs and feedback in the GitHub issue tracker.

## Results

The detected images end up as individual image files in job folders under `./data/results`.

Each job folder also contains one JSON file per page with the coordinates of the detected images.
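
For a quick overview after a batch run, you can count the files in each job folder. The sketch below assumes that job folders are direct subfolders of the results folder and that the per-page coordinate files end in `.json`. The function name `summarize_results` is hypothetical, and every non-JSON file is simply counted as "other", since the exact image file names are not specified here.

```shell
# Hypothetical helper (not part of pimmer): per job folder, count the
# JSON files and all remaining files (assumed to be detected images).
summarize_results() {
    results_dir="${1:-./data/results}"
    for job in "$results_dir"/*/; do
        # Skip the unexpanded pattern when there are no job folders.
        [ -d "$job" ] || continue
        json_count=$(find "$job" -type f -name '*.json' | wc -l | tr -d ' ')
        other_count=$(find "$job" -type f ! -name '*.json' | wc -l | tr -d ' ')
        echo "$(basename "$job"): $json_count JSON, $other_count other"
    done
}
```

Calling `summarize_results` with no argument reads `./data/results`; pass another path if you mapped the worker output elsewhere.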

A digitized hat catalog like this:
![Hat catalog page](testdata/hat_catalog_page.jpg?raw=true "Hat catalog page")

... results in all the individual hat images:
![Individual hat images](testdata/hat_catalog_result.jpg?raw=true "Detected hat images")