{"id":24429198,"url":"https://github.com/peterk/pimmer","last_synced_at":"2025-04-12T11:21:55.857Z","repository":{"id":66470417,"uuid":"158816147","full_name":"peterk/pimmer","owner":"peterk","description":"Exploratory code for PDF image mining","archived":false,"fork":false,"pushed_at":"2024-08-31T21:19:18.000Z","size":26178,"stargazers_count":6,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-26T06:11:19.505Z","etag":null,"topics":["code4lib","datamining","humanities","image-analysis","image-mining","opencv"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/peterk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-23T10:17:27.000Z","updated_at":"2024-08-31T21:19:21.000Z","dependencies_parsed_at":"2024-08-28T20:54:21.833Z","dependency_job_id":"e168b916-5e16-420c-be35-ffc3486d4a5d","html_url":"https://github.com/peterk/pimmer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterk%2Fpimmer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterk%2Fpimmer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterk%2Fpimmer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/peterk%2Fpimmer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/peterk","download_url":"http
s://codeload.github.com/peterk/pimmer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248558133,"owners_count":21124223,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code4lib","datamining","humanities","image-analysis","image-mining","opencv"],"created_at":"2025-01-20T13:33:46.852Z","updated_at":"2025-04-12T11:21:55.828Z","avatar_url":"https://github.com/peterk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pimmer\nExploratory code for PDF image mining. A multi-page PDF is split and converted to JPEG files, which are then mined for illustrations and images. Based on https://github.com/megloff1/image-mining with added PDF splitting, a simple GUI and queue management.\n\n## Install\n\n1. Make sure you have Git and [Docker](https://www.docker.com) with docker-compose installed.\n2. Get the latest version of this repository: `git clone --depth 1 https://github.com/peterk/pimmer.git`.\n3. Copy the example_env file to `.env` and edit the settings.\n4. Make sure you have a folder called `data` in the project root folder (jobs and resulting image files will end up here). You can map output to a different local folder for the worker in `docker-compose.yml`.\n5. Run `docker-compose up -d`. 
Wait a minute until the queue and worker are up.\n\nThe service is now running on http://localhost:7777.\n\nIf you are planning on processing a large number of documents, you can start more workers with `docker-compose up -d --scale worker=5` and then post files with curl to the `/process/` endpoint:\n\n`curl -v --silent -F \"file=@testdata/hat_catalog.pdf\" http://0.0.0.0:7777/process/`\n\nPlease report bugs and feedback in the GitHub issue tracker.\n\n## Results\n\nThe detected images will end up as individual image files in job folders under `./data/results`.\n\nEach job folder will also contain one JSON file per page with the coordinates of the detected images.\n\nA digitized hat catalog like this:\n![Hat catalog page](testdata/hat_catalog_page.jpg?raw=true \"Hat catalog page\")\n\n... results in all the individual hat images:\n![Individual hat images](testdata/hat_catalog_result.jpg?raw=true \"Detected hat images\")\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeterk%2Fpimmer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpeterk%2Fpimmer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpeterk%2Fpimmer/lists"}