https://github.com/hansmi/dossier
Extract textual information from PDF documents
extraction golang ocr ocrmypdf paperless pdf
- Host: GitHub
- URL: https://github.com/hansmi/dossier
- Owner: hansmi
- License: bsd-3-clause
- Created: 2023-12-25T20:31:21.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-13T15:17:23.000Z (2 months ago)
- Last Synced: 2025-03-13T16:26:27.207Z (2 months ago)
- Topics: extraction, golang, ocr, ocrmypdf, paperless, pdf
- Language: Go
- Homepage:
- Size: 369 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Extract information from PDF documents

Dossier is a library for extracting textual information from PDF documents. It
is written using the Go programming language.

Currently PDF is the only supported format (using [MuPDF][mupdf]). Other
formats can be implemented using custom parsers or by amending the library.

[Sketches](#sketches) provide a declarative approach to locating information as
an alternative to imperative/procedural access.

## Sketches

[Protocol buffers][protobuf] are used to define a sketch. The [sketch protobuf
definition](proto/sketch.proto) documents available configuration options.
Usually [textproto][textproto] will be the format used for writing sketches.

A web-based viewer is included in the command line utility. Screenshot of the
viewer with an [example sketch for
invoices](/pkg/sketch/testdata/acme-invoice.textproto):

Invocation:

```shell
$ dossiercli web ./invoice.pdf ./sketch.textproto
2023/12/31 00:00:00 HTTP server listening on http://[::1]:8080
```

## Installation

```shell
go get github.com/hansmi/dossier
```

Command line utility:

```shell
go install github.com/hansmi/dossier/cmd/dossiercli@latest
```

[releases]: https://github.com/hansmi/dossier/releases/latest
[mupdf]: https://mupdf.com/
[protobuf]: https://protobuf.dev/
[textproto]: https://protobuf.dev/reference/protobuf/textformat-spec/