https://github.com/hintak/pymupdf-jbig2-extract

Convert masks in pdf to images

Some PDFs (apparently those from the Internet Archive) consist mostly of
scans. Structurally, every page is a background image with a mask, occasionally with
an invisible OCR text layer, and sometimes with a lower-res preview/thumbnail image.

This is mostly a script that tries to convert the mask into an image losslessly.
The background image (often just yellowing paper) makes the document less readable.
A smaller PDF with a single image per page (requiring no compositing) also loads faster.

The final `extract-all-mask.py` script keeps the original front and back covers, but converts every
page from 2 to N-1 from image+mask to just the mask.
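
The page selection described above can be sketched in plain Python. This is only an illustration, not the code in `extract-all-mask.py`: the function name `pages_to_convert` is hypothetical, and it assumes 0-based page indices (as PDF libraries commonly use), so that "page 2 to N-1" becomes indices 1 to N-2.

```python
def pages_to_convert(page_count: int) -> list[int]:
    """Return the 0-based indices of the pages between the two covers.

    Index 0 (front cover) and index page_count - 1 (back cover) are
    left out, so they would be kept as they are in the source PDF.
    """
    # range(1, page_count - 1) is empty when there are two pages or fewer,
    # i.e. when the document has nothing between its covers.
    return list(range(1, page_count - 1))


if __name__ == "__main__":
    # A 5-page document: only the three inner pages are selected.
    print(pages_to_convert(5))
```

A document with only one or two pages yields an empty list, so nothing would be converted.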

Out of about 400 such PDFs:

* There is one (also apparently the oldest, judging by its creation/modification date)
which has a strange problem: each of its 308 pages refers to every image and mask
(the object below is indented and spaced for readability from the original). Hence the `page.clean_contents()` line
and `mutool clean -sg ...`.

```
4668 0 obj
<< /Type /Page
/Parent 924 0 R
...
/Resources<<...
/XObject<>
>>
/MediaBox[0 0 487 832]
/StructParents 0
>>
```

* There is one other PDF which seems to have (blank) pages with a mask only.

* The script under `gist.github.com` has been modified in small ways (migrated to Python 3, with changes to resolution and color inversion; see below). The original can be found at that location.

* For about 90% (i.e. about 360) of the 400, the heuristics seem to work: some masks have a "DeviceGray" color space name and need to
be inverted to be black-text-on-white-background; such images also have a resolution of 360 dpi instead of 300 dpi.
But 5 PDFs appear white-text-on-black-background, and about 40 have the wrong resolution, with values of 150 dpi, 300 dpi, 350 dpi, 360 dpi, 500 dpi,
and ~642 dpi being seen. The script `convert-page2.py` is intended to be run on the result PDF to quickly see the color of page 2 of the result.
`resolution-page2.py` is intended to be run on the source PDF to quickly tell the resolution, plus the
every-page-references-every-image and mask-only-page problems.
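
The color-space heuristic and a resolution check can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the repository's scripts: the names `MaskSettings`, `guess_mask_settings` and `estimate_dpi` are hypothetical, the default of "300 dpi, no inversion" for every other color space is an assumption read off the description above, and the resolution estimate relies only on the fact that PDF page units default to 1/72 inch.

```python
from typing import NamedTuple


class MaskSettings(NamedTuple):
    invert: bool  # True if the mask must be inverted to get black text on white
    dpi: int      # resolution to record in the output image


def guess_mask_settings(colorspace_name: str) -> MaskSettings:
    """Heuristic from the README: "DeviceGray" masks are inverted and 360 dpi.

    Every other color space name falls back to 300 dpi without inversion,
    which is an assumption made for this sketch.
    """
    if colorspace_name == "DeviceGray":
        return MaskSettings(invert=True, dpi=360)
    return MaskSettings(invert=False, dpi=300)


def estimate_dpi(pixel_width: int, page_width_pt: float) -> float:
    """Estimate the resolution of an image that spans the full page width.

    page_width_pt is the MediaBox width in PDF units (1/72 inch by default).
    """
    if page_width_pt <= 0:
        raise ValueError("page width must be positive")
    return pixel_width * 72 / page_width_pt


if __name__ == "__main__":
    print(guess_mask_settings("DeviceGray"))
    # A mask 3000 pixels wide on a page 720 units (10 inches) wide.
    print(estimate_dpi(3000, 720))
```

Comparing the estimate with the heuristic's `dpi` value is one way to spot the PDFs where the guessed resolution is wrong, assuming the mask covers the whole page.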