https://github.com/peterk/pimmer
Exploratory code for PDF image mining
- Host: GitHub
- URL: https://github.com/peterk/pimmer
- Owner: peterk
- License: apache-2.0
- Created: 2018-11-23T10:17:27.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-08-31T21:19:18.000Z (over 1 year ago)
- Last Synced: 2025-03-26T06:11:19.505Z (about 1 year ago)
- Topics: code4lib, datamining, humanities, image-analysis, image-mining, opencv
- Language: Python
- Size: 25 MB
- Stars: 6
- Watchers: 1
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pimmer
Exploratory code for PDF image mining. A multi-page PDF is split and converted to JPEG files, which are then mined for illustrations and images. Based on https://github.com/megloff1/image-mining, with added PDF splitting, a simple GUI, and queue management.
## Install
1. Make sure you have Git and [Docker](https://www.docker.com) with docker-compose installed.
2. Get the latest version of this repository: `git clone --depth 1 https://github.com/peterk/pimmer.git`.
3. Copy the `example_env` file to `.env` and edit the settings.
4. Make sure there is a folder called `data` in the project root (jobs and resulting image files end up here). You can map the worker's output to a different local folder in `docker-compose.yml`.
5. Run `docker-compose up -d` and wait a minute until the queue and worker are up.
The service is now running on http://localhost:7777.
If you plan to process a large number of documents, you can start more workers with `docker-compose up -d --scale worker=5` and then post files with curl to the `/process/` endpoint:
`curl -v --silent -F "file=@testdata/hat_catalog.pdf" http://0.0.0.0:7777/process/`
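The same upload can be scripted from Python. This sketch assumes only what the curl example shows (a multipart field named `file` posted to `/process/`) and uses the third-party `requests` library; the function name is illustrative:

```python
import requests  # pip install requests


def submit_pdf(path, base_url="http://localhost:7777"):
    """POST a PDF to the /process/ endpoint, mirroring the curl example."""
    with open(path, "rb") as fh:
        resp = requests.post(f"{base_url}/process/", files={"file": fh})
    resp.raise_for_status()
    return resp
```

Looping `submit_pdf` over a directory of PDFs is an easy way to fill the queue for several workers.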
Please report bugs and feedback in the GitHub issue tracker.
## Results
The detected images end up as individual image files in per-job folders under `./data/results`.
Each job folder also contains a JSON file per page with the coordinates of the detected images.
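The per-page JSON files can be collected with a few lines of Python. The coordinate schema is not documented here, so this sketch treats each file as opaque JSON; `load_page_boxes` and the sample coordinate keys are illustrative assumptions, not pimmer's actual format:

```python
import json
import tempfile
from pathlib import Path


def load_page_boxes(job_dir):
    """Map each page's JSON filename to its parsed contents for one job folder."""
    return {p.name: json.loads(p.read_text()) for p in sorted(Path(job_dir).glob("*.json"))}


# Demo on a throwaway job folder (coordinate values here are made up).
job = Path(tempfile.mkdtemp())
(job / "page_001.json").write_text(json.dumps([{"x": 80, "y": 50, "w": 140, "h": 100}]))
boxes = load_page_boxes(job)
print(boxes["page_001.json"])  # [{'x': 80, 'y': 50, 'w': 140, 'h': 100}]
```

Point `load_page_boxes` at a job folder under `./data/results` to post-process real output.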
A digitized hat catalog like this:

... results in all the individual hat images:
