https://github.com/hintak/pymupdf-jbig2-extract

Convert masks in pdf to images

Host: GitHub
URL: https://github.com/hintak/pymupdf-jbig2-extract
Owner: HinTak
Created: 2024-02-22T01:41:17.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-05-11T23:40:24.000Z (about 1 year ago)
Last Synced: 2025-06-14T23:43:08.568Z (30 days ago)
Language: Python
Size: 7.81 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

Some PDFs (those from the Internet Archive, apparently) consist mostly of
scans. Structurally, every page is a background image with a mask, occasionally with
an invisible OCR text layer, and sometimes a lower-resolution preview/thumbnail image.

This is mostly a script that tries to convert the mask into an image losslessly.
The background image (often just yellowing paper) makes the document less readable.
A smaller PDF with a single image per page (no compositing required) also loads faster.

The final `extract-all-mask.py` script keeps the original front and back covers, but converts every
page from 2 to N-1 from image+mask to just the mask.
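
The page selection described above can be sketched roughly as follows. This is only an illustration, not code taken from `extract-all-mask.py`: the helper name `pages_to_convert` is hypothetical, and page numbers are assumed to be 1-based, as in the text.

```python
def pages_to_convert(page_count):
    """Return the 1-based page numbers that would go from image+mask to
    mask only: every page except the first (front cover) and the last
    (back cover).

    Hypothetical helper for illustration; not part of the repository.
    """
    if page_count < 3:
        # With fewer than three pages there is nothing between the covers.
        return []
    return list(range(2, page_count))  # pages 2 .. N-1 inclusive


if __name__ == "__main__":
    # A 5-page document: pages 1 and 5 are kept as they are.
    print(pages_to_convert(5))  # [2, 3, 4]
```

A library with 0-based page indices would need the range shifted by one.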

Out of about 400 such PDFs:

* There is one (also apparently the oldest, judging by its creation/modification date)
which has this strange problem (excerpt below indented and spaced for readability from the original):
every one of its 308 pages refers to every image and mask. Hence the `page.clean_contents()` line
and `mutool clean -sg ...`.

```
4668 0 obj
<< /Type /Page
/Parent 924 0 R
...
/Resources<<...
/XObject<>
>>
/MediaBox[0 0 487 832]
/StructParents 0
>>
```

* There is one other PDF which seems to have (blank) pages with a mask only.

* The script under `gist.github.com` has been modified in small ways (migrated to Python 3, and changes to
resolution and color inversion; see below). The original can be found at that location.

* For about 90% (i.e. about 360) of the 400, the heuristics seem to work: some masks have a "DeviceGray" color space name and need to
be inverted to be black-text-on-white-background; such images also have a resolution of 360 dpi instead of 300 dpi.
But 5 PDFs come out white-text-on-black-background, and about 40 have the wrong resolution, with values of 150 dpi, 300 dpi, 350 dpi, 360 dpi, 500 dpi,
and ~642 dpi being seen. The script `convert-page2.py` is intended to be run on the result PDF to quickly see the color of page 2 of the result.
`resolution-page2.py` is intended to be run on the source PDF to quickly tell the resolution, plus the
every-page-references-every-image and mask-only-page problems.
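
The color-space heuristic from the last item can be written down as a small sketch. The function names `guess_mask_settings` and `estimate_dpi` are hypothetical and do not come from the scripts in this repository. The resolution formula is an assumption too: it supposes the mask spans the full page width, and it may not be how `resolution-page2.py` works.

```python
def guess_mask_settings(colorspace_name):
    """Guess (invert, dpi) for a mask from its color space name.

    Encodes the heuristic described above: a mask reported as
    "DeviceGray" is inverted and treated as 360 dpi; anything else is
    left uninverted at 300 dpi. As noted, this guess fails for roughly
    10% of the files.
    """
    if colorspace_name == "DeviceGray":
        return True, 360
    return False, 300


def estimate_dpi(pixels, points):
    """Resolution implied by an image `pixels` wide stretched across
    `points` of page width (1 PDF point = 1/72 inch).

    Assumption: the mask covers the whole page width.
    """
    if points <= 0:
        raise ValueError("page dimension must be positive")
    return pixels * 72.0 / points


if __name__ == "__main__":
    print(guess_mask_settings("DeviceGray"))  # (True, 360)
    # A 2029-pixel-wide mask on the 487-point-wide page from the excerpt
    # above comes out at roughly 300 dpi.
    print(round(estimate_dpi(2029, 487)))
```

Comparing the estimated resolution with the guessed one is one way to flag files where the heuristic picks the wrong value.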
