https://github.com/applicaai/digital-born-pdf-scanner

Host: GitHub
URL: https://github.com/applicaai/digital-born-pdf-scanner
Owner: applicaai
License: agpl-3.0
Created: 2020-09-10T10:51:39.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2020-09-10T11:07:45.000Z (over 4 years ago)
Last Synced: 2024-08-03T17:08:14.788Z (7 months ago)
Language: Java
Size: 35.2 KB
Stars: 7
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-document-understanding - Born digital pdf scanner - checking if pdf is born-digital (Resources)

README

# Digital-born PDF Scanner

## Genesis

Many of the PDF files that we have downloaded are digital-born, that is, they contain an easily accessible text layer
that PDF viewers use to display text. Some are plainly scanned documents that have no text layer at all, and some
are searchable, OCR-processed scans that contain a lot of hidden text.

Since we want to tell these categories apart, we need a tool to detect them. Hence this tool.

## Usage

To run the tool, use `java -jar`:

`java -jar digital-born-pdf-scanner-0.0.1-SNAPSHOT-jar-with-dependencies.jar`

Invoking the jar without parameters produces the following output:

```
Options:
  -f, --filename
    Filename to check in single file mode
  -d, --input-dir
    Input directory to look for PDF files
  -o, --output-file-name
    File to write results to. Supported extensions are *.tsv, *.csv
    Default: results.tsv
  -r, --recursive
    Whether to search for PDF files recursively
    Default: false
  --sort
    Whether to sort file name in results.
    Default: false
  -v, --verbose
    Whether to print processed file names.
    Default: false
```

There are two modes of operation:
* single file mode (use `-f path-to-pdf-file`)
* directory scan mode (use `-d path-to-directory-with-pdf-files`)

The latter can be combined with a recursive directory scan (use `-r`), which also searches subdirectories for PDF files.

The output is written to the TAB-separated file `results.tsv` unless a different file name is provided with `-o`.
The output is either semicolon-separated (`*.csv`) or TAB-separated (`*.tsv`), depending on the file extension.
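
If you want to post-process the results in Java, the delimiter rule above can be mirrored when reading the file back. The following is a minimal sketch, not part of the scanner: the class `ResultsReader` and its methods are hypothetical names, and it assumes that `*.csv` output is semicolon-separated, that fields are not quoted, and that the file is UTF-8 encoded. It makes no assumption about the column names or about whether a header row is present.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper for reading the scanner's results file; not part of the tool. */
final class ResultsReader {

    /** Assumption: *.csv means semicolon-separated, anything else is TAB-separated. */
    static String delimiterFor(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".csv") ? ";" : "\t";
    }

    /** Splits one line into fields; assumes fields are not quoted. */
    static String[] splitLine(String line, String delimiter) {
        // Limit -1 keeps trailing empty fields instead of dropping them.
        return line.split(Pattern.quote(delimiter), -1);
    }

    /** Reads all non-empty lines of the results file as rows of raw string fields. */
    static List<String[]> read(Path file) throws IOException {
        String delimiter = delimiterFor(file.getFileName().toString());
        List<String[]> rows = new ArrayList<>();
        for (String line : Files.readAllLines(file)) { // UTF-8 by default
            if (!line.isEmpty()) {
                rows.add(splitLine(line, delimiter));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the tool's default output name when no path is given.
        Path file = Paths.get(args.length > 0 ? args[0] : "results.tsv");
        for (String[] row : read(file)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Fields are kept as raw strings so that the caller decides how to interpret each column.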

### Handling error log

Since failed files can produce a lot of error messages, it makes sense to redirect the error log to a file,
for example:

`java -jar digital-born-pdf-scanner-0.0.1-SNAPSHOT-jar-with-dependencies.jar -d dir-with-pdfs -r --sort -v 2> error.log`

### Tracking progress

Currently the only way to track processing progress is to enable verbose output (`-v`) and combine it with file sorting
(`--sort`). This produces color output showing whether each file was processed successfully or whether processing
failed. See the example above.

## Output interpretation

The output file consists of the following columns:

At the time of writing, the tool does not tell you whether a document is scanned, searchable scanned, or digital-born.
However, certain heuristics can be derived from the output:

* No text in the document (both visible and hidden text length equal 0) - this is a scanned document with 99% probability
* No visible text, a lot of hidden text, max covered area ≈ 1.0 - this must be a searchable scanned document
* No hidden text, a lot of visible text - this is probably a digital-born document

The open question is what counts as "a lot of text". That has to be determined empirically on real data.
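
The heuristics above can be sketched as a small classifier. This is an illustration only and not something the tool does itself: the class `PdfKindHeuristic`, its `Kind` enum, and both thresholds (`LOT_OF_TEXT`, `NEAR_FULL_COVERAGE`) are hypothetical, and the threshold values are placeholders that should be tuned on your own documents. The three inputs are assumed to be the visible text length, hidden text length, and max covered area reported for one file.

```java
/** Hypothetical classifier built on the README heuristics; not part of the scanner. */
final class PdfKindHeuristic {

    enum Kind { SCANNED, SEARCHABLE_SCAN, DIGITAL_BORN, UNKNOWN }

    // Placeholder thresholds (assumptions): tune both on your own data.
    static final long LOT_OF_TEXT = 500;            // characters
    static final double NEAR_FULL_COVERAGE = 0.9;   // stands in for "max covered area close to 1.0"

    static Kind classify(long visibleLen, long hiddenLen, double maxCoveredArea) {
        // No text at all: very likely a plain scan.
        if (visibleLen == 0 && hiddenLen == 0) {
            return Kind.SCANNED;
        }
        // Only hidden text, and the covered area is close to the full page: searchable scan.
        if (visibleLen == 0 && hiddenLen >= LOT_OF_TEXT && maxCoveredArea >= NEAR_FULL_COVERAGE) {
            return Kind.SEARCHABLE_SCAN;
        }
        // Only visible text, and plenty of it: probably digital-born.
        if (hiddenLen == 0 && visibleLen >= LOT_OF_TEXT) {
            return Kind.DIGITAL_BORN;
        }
        // Anything else is not covered by the heuristics.
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify(0, 0, 0.0));       // SCANNED
        System.out.println(classify(0, 12000, 0.99));  // SEARCHABLE_SCAN
        System.out.println(classify(8000, 0, 0.0));    // DIGITAL_BORN
        System.out.println(classify(40, 30, 0.5));     // UNKNOWN
    }
}
```

Returning `UNKNOWN` for mixed cases keeps the sketch honest: documents with both visible and hidden text, or with only a little text, fall outside the three rules and need manual inspection.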