https://github.com/hansmi/paperminer
Amend Paperless documents with extracted information
- Host: GitHub
- URL: https://github.com/hansmi/paperminer
- Owner: hansmi
- License: bsd-3-clause
- Created: 2024-02-16T22:58:01.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-13T15:11:09.000Z (about 1 year ago)
- Last Synced: 2025-03-13T16:25:22.424Z (about 1 year ago)
- Topics: paperless, paperless-ngx, pdf
- Language: Go
- Homepage:
- Size: 240 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Amend Paperless documents with extracted information
[Latest release][releases]
[CI workflow](https://github.com/hansmi/paperminer/actions/workflows/ci.yaml)
[Go reference](https://pkg.go.dev/github.com/hansmi/paperminer)
Paperminer is a system for amending documents stored in
[Paperless-ngx][paperless] with additional information ("facts") extracted from
the documents themselves or other sources.
The [`hansmi/dossier` package][dossier] is used to parse PDF documents (support
for other formats could be implemented).
The Go standard library's [`plugin` package][gopkgplugin] comes with
a number of caveats which make it unsuitable here. Paperminer instead uses
compile-time plugins via the [`hansmi/staticplug` package][staticplug], so
plugins are compiled into the binary and you need to set up your own build.
An example of a program with a plugin can be found in the
[`example/myminer` directory](./example/myminer).
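
To illustrate what "compile-time plugin" means in general, here is a minimal sketch of the pattern using only the standard library. It does not use the `staticplug` API; the `Facter` interface, `Register`, `PluginNames` and `invoiceFacter` are hypothetical names made up for this example. See `example/myminer` for how Paperminer actually wires plugins.

```go
package main

import (
	"fmt"
	"sort"
)

// Facter is a hypothetical plugin interface for this sketch only.
// Paperminer's real interfaces may look different.
type Facter interface {
	Name() string
}

// registry holds every plugin that was linked into the binary.
var registry = map[string]Facter{}

// Register adds a plugin to the registry and panics on duplicate names.
// Plugin packages would typically call it from an init function.
func Register(f Facter) {
	if _, dup := registry[f.Name()]; dup {
		panic("duplicate plugin: " + f.Name())
	}
	registry[f.Name()] = f
}

// PluginNames returns the names of all registered plugins, sorted.
func PluginNames() []string {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// invoiceFacter stands in for a plugin that would normally live in its
// own package.
type invoiceFacter struct{}

func (invoiceFacter) Name() string { return "invoice" }

// In a real build this init function would sit in the plugin package, and
// the main program would import that package for its side effects.
func init() {
	Register(invoiceFacter{})
}

func main() {
	fmt.Println(PluginNames())
}
```

Because registration happens when the program is compiled and linked, adding or removing a plugin means changing the imports and rebuilding, which is why a custom build is required.
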
Plugins may use [dossier sketches][dossiersketch] to look for specific regular
expressions at absolute or relative positions on pages. The [`sketchfacts`
package](./pkg/sketchfacts/) is often sufficient even though it ignores pages
beyond the first. Custom logic can produce document facts from the findings.
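
The idea of looking for a regular expression only within a certain area of a page can be sketched as follows. This is a conceptual illustration and not the dossier sketch format or API: `Rect`, `TextBlock`, `findInRegion`, `sampleBlocks` and the coordinate system (origin at the top left) are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// Rect is an axis-aligned rectangle in page coordinates. The origin is
// assumed to be at the top left (hypothetical type for this sketch).
type Rect struct{ Left, Top, Right, Bottom float64 }

// Contains reports whether o lies entirely within r.
func (r Rect) Contains(o Rect) bool {
	return o.Left >= r.Left && o.Right <= r.Right &&
		o.Top >= r.Top && o.Bottom <= r.Bottom
}

// TextBlock is a piece of extracted text together with its position.
type TextBlock struct {
	Bounds Rect
	Text   string
}

// dateRe matches an ISO-style date following a "Date:" label.
var dateRe = regexp.MustCompile(`Date:\s*(\d{4}-\d{2}-\d{2})`)

// findInRegion returns the submatches of re for the first block that lies
// fully inside region and matches, or nil if there is none.
func findInRegion(blocks []TextBlock, region Rect, re *regexp.Regexp) []string {
	for _, b := range blocks {
		if !region.Contains(b.Bounds) {
			continue
		}
		if m := re.FindStringSubmatch(b.Text); m != nil {
			return m
		}
	}
	return nil
}

// sampleBlocks returns made-up text blocks from a single page.
func sampleBlocks() []TextBlock {
	return []TextBlock{
		{Rect{20, 10, 90, 20}, "ACME Corp."},
		{Rect{120, 10, 190, 20}, "Date: 2024-02-16"},
		{Rect{20, 200, 190, 210}, "Date: 2023-01-01"},
	}
}

func main() {
	topRight := Rect{Left: 100, Top: 0, Right: 200, Bottom: 50}
	if m := findInRegion(sampleBlocks(), topRight, dateRe); m != nil {
		fmt.Println(m[1]) // prints "2024-02-16"
	}
}
```

Restricting the search to a region keeps the second, unrelated date further down the page from being picked up.
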
Plugins may also extract arbitrary document pages and implement their own data
extraction, which may involve external APIs.
Normalizing extracted text before parsing it further is generally recommended,
not just for dates and times: remove extraneous whitespace, separators, and the
like. Regular expressions should also be written to be flexible where possible,
because OCR-derived text often differs slightly from the original.
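
A minimal sketch of both recommendations, using only the Go standard library. The "invoice number" field, the pattern, and the names `normalizeText`, `invoiceRe` and `extractInvoiceNumber` are assumptions chosen for illustration and are not part of Paperminer.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// normalizeText collapses every run of Unicode whitespace (including line
// breaks and non-breaking spaces) into a single space and trims the ends.
func normalizeText(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// invoiceRe is deliberately tolerant: it ignores case, accepts "No" or
// "Number", and allows optional punctuation and spaces that OCR may add
// or drop. Digits may be separated by spaces.
var invoiceRe = regexp.MustCompile(
	`(?i)invoice\s*(?:no|number)\s*\.?\s*:?\s*([0-9](?:[0-9 ]*[0-9])?)`)

// extractInvoiceNumber normalizes raw text, applies invoiceRe and returns
// the digits without embedded spaces. The boolean reports whether a
// number was found.
func extractInvoiceNumber(raw string) (string, bool) {
	m := invoiceRe.FindStringSubmatch(normalizeText(raw))
	if m == nil {
		return "", false
	}
	return strings.ReplaceAll(m[1], " ", ""), true
}

func main() {
	// Double space, stray space before the dot, non-breaking space and a
	// trailing newline, as might come out of OCR.
	fmt.Println(extractInvoiceNumber("Invoice  No .:\u00a012 345\n")) // prints "12345 true"
}
```

A pattern this lenient can also match too much, for example when other digits directly follow the number, so real patterns need to be tuned against sample documents.
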
Useful packages for writing document facters:
* [`hansmi/zyt`][zyt]: Parse language/locale-specific date and time formats.
* [`hansmi/aurum`][aurum]: Golden tests. Used for generic document facter tests
by the [`factertest` package](./pkg/factertest).
[aurum]: https://github.com/hansmi/aurum/
[dossier]: https://github.com/hansmi/dossier/
[dossiersketch]: https://github.com/hansmi/dossier/#sketches
[gopkgplugin]: https://pkg.go.dev/plugin@go1.22.0
[paperless]: https://docs.paperless-ngx.com/
[releases]: https://github.com/hansmi/paperminer/releases/latest
[staticplug]: https://github.com/hansmi/staticplug/
[zyt]: https://github.com/hansmi/zyt/