https://github.com/hintak/pymupdf-jbig2-extract
Convert masks in pdf to images
- Host: GitHub
- URL: https://github.com/hintak/pymupdf-jbig2-extract
- Owner: HinTak
- Created: 2024-02-22T01:41:17.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-05-11T23:40:24.000Z (6 months ago)
- Last Synced: 2024-05-12T23:24:22.072Z (6 months ago)
- Language: Python
- Size: 7.81 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Some PDFs (those from the Internet Archive, apparently) consist mostly of
scans. Structurally, every page is a background image with a mask, occasionally with
an invisible OCR text layer, and sometimes a lower-res preview/thumbnail image.

This is mostly a script that tries to convert the mask into an image losslessly.
The background image (often just yellowing paper) makes the document less readable.
A smaller PDF with a single image per page (no compositing required) also loads faster.

The final `extract-all-mask.py` script keeps the original front and back cover, but converts every
page from 2 to N-1 from image+mask to just the mask.

Out of about 400 such PDFs:
* There is one (also apparently the oldest, judging by its creation/modification date)
  which has a strange problem: each of its 308 pages refers to every image and mask
  (excerpt below, indented and spaced for readability from the original).
  Hence the `page.clean_contents()` line and `mutool clean -sg ...`.

```
4668 0 obj
<< /Type /Page
/Parent 924 0 R
...
/Resources<<...
/XObject<>
>>
/MediaBox[0 0 487 832]
/StructParents 0
>>
```

* There is one other PDF which seems to have (blank) pages with mask only.
* The script under `gist.github.com` has been modified in small ways (migrated to Python 3, plus changes to
  resolution and color inversion; see below). The original can be found at that location.
* For about 90% (i.e. about 360) of the 400, the heuristic seems to work: some masks have a "DeviceGray"
  color space name and need to be inverted to be black-text-on-white-background; such images also have a
  resolution of 360 dpi instead of 300 dpi.
  But 5 PDFs come out white-text-on-black-background, and about 40 have the wrong resolution, with values of
  150 dpi, 300 dpi, 350 dpi, 360 dpi, 500 dpi, and ~642 dpi being seen.
  The script `convert-page2.py` is intended to be run on the resulting PDF to quickly check the color of page 2.
  `resolution-page2.py` is intended to be run on the source PDF to quickly report the resolution, plus the
  every-page-references-every-image and mask-only-page problems.
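
The heuristic in the last bullet, and the kind of check `resolution-page2.py` is described as doing, can be sketched roughly as follows. This is not the code from the repository: the names `guess_mask_settings`, `estimate_dpi` and `report_page2` are made up for illustration. It assumes the page image spans the full page width, and that the color space name can be read from PyMuPDF's `page.get_images(full=True)`.

```python
from __future__ import annotations

from typing import NamedTuple


class MaskSettings(NamedTuple):
    invert: bool  # True if the mask must be inverted to give black text on white
    dpi: int      # resolution guessed from the color space name


def guess_mask_settings(colorspace_name: str) -> MaskSettings:
    """Heuristic as described above: DeviceGray masks are inverted and 360 dpi."""
    if colorspace_name == "DeviceGray":
        return MaskSettings(invert=True, dpi=360)
    return MaskSettings(invert=False, dpi=300)


def estimate_dpi(pixel_width: int, page_width_pt: float) -> int:
    """Estimate resolution, assuming the image spans the full page width.

    PDF page sizes are in points (1/72 inch), so dpi = pixels / (points / 72).
    """
    return round(pixel_width * 72 / page_width_pt)


def report_page2(pdf_path: str) -> list[tuple[int, int, str, int]]:
    """List (xref, smask xref, color space name, estimated dpi) for page 2."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    with fitz.open(pdf_path) as doc:
        page = doc[1]  # pages are 0-based, so index 1 is page 2
        return [
            # get_images(full=True) items start with:
            # (xref, smask, width, height, bpc, colorspace, ...)
            (img[0], img[1], img[5], estimate_dpi(img[2], page.rect.width))
            for img in page.get_images(full=True)
        ]
```

For the 487-point-wide page in the excerpt above, `estimate_dpi(2435, 487)` gives 360. Comparing such an estimate with the `dpi` guessed from the color space name is one way to spot the files where the heuristic picks the wrong resolution.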