https://github.com/peterk/pimmer
Exploratory code for PDF image mining
- Host: GitHub
- URL: https://github.com/peterk/pimmer
- Owner: peterk
- License: apache-2.0
- Created: 2018-11-23T10:17:27.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-08-31T21:19:18.000Z (over 1 year ago)
- Last Synced: 2025-03-26T06:11:19.505Z (about 1 year ago)
- Topics: code4lib, datamining, humanities, image-analysis, image-mining, opencv
- Language: Python
- Size: 25 MB
- Stars: 6
- Watchers: 1
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pimmer
Exploratory code for PDF image mining. A multi-page PDF is split and converted to JPEG files, which are then mined for illustrations and images. Based on https://github.com/megloff1/image-mining, with added PDF splitting, a simple GUI, and queue management.
## Install
1. Make sure you have Git and [Docker](https://www.docker.com) with docker-compose installed.
2. Get the latest version of this repository: `git clone --depth 1 https://github.com/peterk/pimmer.git`.
3. Copy the `example_env` file to `.env` and edit the settings.
4. Make sure there is a folder called `data` in the project root (jobs and resulting image files end up here). You can map the worker's output to a different local folder in `docker-compose.yml`.
5. Run `docker-compose up -d` and wait a minute until the queue and worker are up.
The service is now running on http://localhost:7777.
If you plan to process a large number of documents, you can start more workers with `docker-compose up -d --scale worker=5` and then post files with curl to the `/process/` endpoint:
`curl -v --silent -F "file=@testdata/hat_catalog.pdf" http://0.0.0.0:7777/process/`
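The same upload can be scripted from Python. This sketch assumes only what the curl example shows (a multipart field named `file` posted to `/process/`) and uses the third-party `requests` library; the function name is illustrative:

```python
import requests  # pip install requests


def submit_pdf(path, base_url="http://localhost:7777"):
    """POST a PDF to the /process/ endpoint, mirroring the curl example."""
    with open(path, "rb") as fh:
        resp = requests.post(f"{base_url}/process/", files={"file": fh})
    resp.raise_for_status()
    return resp
```

Looping `submit_pdf` over a directory of PDFs is an easy way to fill the queue for several workers.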
Please report bugs and feedback in the GitHub issue tracker.
## Results
The detected images end up as individual image files in per-job folders under `./data/results`.
Each job folder also contains a JSON file per page with the coordinates of the detected images.
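The per-page JSON files can be collected with a few lines of Python. The coordinate schema is not documented here, so this sketch treats each file as opaque JSON; `load_page_boxes` and the sample coordinate keys are illustrative assumptions, not pimmer's actual format:

```python
import json
import tempfile
from pathlib import Path


def load_page_boxes(job_dir):
    """Map each page's JSON filename to its parsed contents for one job folder."""
    return {p.name: json.loads(p.read_text()) for p in sorted(Path(job_dir).glob("*.json"))}


# Demo on a throwaway job folder (coordinate values here are made up).
job = Path(tempfile.mkdtemp())
(job / "page_001.json").write_text(json.dumps([{"x": 80, "y": 50, "w": 140, "h": 100}]))
boxes = load_page_boxes(job)
print(boxes["page_001.json"])  # [{'x': 80, 'y': 50, 'w': 140, 'h': 100}]
```

Point `load_page_boxes` at a job folder under `./data/results` to post-process real output.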
A digitized hat catalog like this:

... results in all the individual hat images:
