https://github.com/devcsrj/docparsr-jvm
JVM client for https://github.com/axa-group/Parsr
- Host: GitHub
- URL: https://github.com/devcsrj/docparsr-jvm
- Owner: devcsrj
- License: apache-2.0
- Created: 2020-03-03T08:48:46.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2020-08-01T18:14:25.000Z (over 5 years ago)
- Last Synced: 2025-01-14T08:24:01.477Z (about 1 year ago)
- Topics: data, document, extraction, nlp, ocr, pdf
- Language: Kotlin
- Size: 218 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Parsr



This project is a JVM client for [Axa group's Parsr](https://github.com/axa-group/Parsr) project.
## Download
Grab via Maven:
```
<dependency>
  <groupId>com.github.devcsrj</groupId>
  <artifactId>docparsr</artifactId>
  <version>(version)</version>
</dependency>
```
or Gradle:
```
implementation("com.github.devcsrj:docparsr:$version")
```
## Usage
Assuming you have the [API server running](https://github.com/axa-group/Parsr#usage), you can communicate
with it using:
```kotlin
val uri = URI.create("http://localhost:3001")
val parser = DocParsr.create(uri)
```
Then, submit your file with:
```kotlin
val file = File("hello.pdf") // your pdf or image file
val config = Configuration() // or parser.getDefaultConfig()
val job = parser.newParsingJob(file, config)
```
At this point, the `job` object offers either a synchronous method:
```kotlin
val result = job.execute()
```
...or an asynchronous method:
```kotlin
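val callback = object : Callback {
    // Called when the job fails
    override fun onFailure(job: ParsingJob, e: Exception) {}
    // Called as the server reports progress
    override fun onProgress(job: ParsingJob, progress: Progress) {}
    // Called once the result is available
    override fun onSuccess(job: ParsingJob, result: ParsingResult) {}
}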
job.enqueue(callback)
```
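If you would rather wait on the asynchronous variant than handle callbacks directly, one option is to adapt the callback to a `CompletableFuture`. The sketch below is self-contained and uses hypothetical stand-in types (`SketchCallback`, `SketchJob`, `ImmediateJob`) that only mirror the shape of the snippet above. They are not the library's declarations, and the real `Callback` methods also receive the `ParsingJob`.

```kotlin
import java.util.concurrent.CompletableFuture

// Hypothetical stand-ins that mirror the shape of the README's Callback and
// ParsingJob. They are NOT the library's real declarations.
interface SketchCallback<R> {
    fun onFailure(e: Exception)
    fun onProgress(percent: Int)
    fun onSuccess(result: R)
}

interface SketchJob<R> {
    fun enqueue(callback: SketchCallback<R>)
}

// Adapts the callback style to a future that completes on success or failure.
fun <R> SketchJob<R>.toFuture(): CompletableFuture<R> {
    val future = CompletableFuture<R>()
    enqueue(object : SketchCallback<R> {
        override fun onFailure(e: Exception) { future.completeExceptionally(e) }
        override fun onProgress(percent: Int) { /* progress is ignored in this sketch */ }
        override fun onSuccess(result: R) { future.complete(result) }
    })
    return future
}

// A fake job that succeeds immediately, used only to exercise the adapter.
class ImmediateJob<R>(private val value: R) : SketchJob<R> {
    override fun enqueue(callback: SketchCallback<R>) {
        callback.onProgress(100)
        callback.onSuccess(value)
    }
}
```

With these stand-ins, `ImmediateJob("done").toFuture().get()` blocks until the callback fires and then returns the value. The same adapter idea should carry over to the real `Callback` once its exact signatures are filled in.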
Regardless of the approach you choose, you end up with a `ParsingResult`. You can then
access the [various generated outputs](https://github.com/axa-group/Parsr/blob/master/docs/api-guide.md#3-get-the-results)
from the server with:
```kotlin
result.source(Text).use {
    // copy the InputStream
}
```
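As a minimal sketch of the "copy the InputStream" step, the helper below writes a stream to a file using only the JDK. The names `saveTo` and `demoSave` are hypothetical, and the demo uses an in-memory stream in place of real server output.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption

// Writes everything from the stream to `target`, replacing an existing file.
// Returns the number of bytes copied and closes the stream when done.
fun saveTo(input: InputStream, target: Path): Long =
    input.use { Files.copy(it, target, StandardCopyOption.REPLACE_EXISTING) }

// Demo with an in-memory stream standing in for the server output.
fun demoSave(): String {
    val target = Files.createTempFile("parsr-output", ".txt")
    saveTo(ByteArrayInputStream("hello".toByteArray()), target)
    return String(Files.readAllBytes(target), Charsets.UTF_8)
}
```

Assuming the lambda passed to `use` above receives an `InputStream`, as the comment in the snippet suggests, the body could call something like `saveTo(it, Paths.get("hello.txt"))`.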
If you are instead interested in the [JSON schema](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md), this
library provides a [Visitor](https://en.wikipedia.org/wiki/Visitor_pattern)-based API:
```kotlin
val visitor = object: DocumentVisitor {
    // override methods
}
val document = Document.from(result)
document.accept(visitor)
```
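For readers unfamiliar with the pattern, here is a self-contained sketch of how a visitor walks a document tree. Every type in it (`SketchVisitor`, `SketchDocument`, `SketchPage`, `SketchParagraph`, `TextCollector`) is hypothetical and exists only for illustration. The library's `DocumentVisitor` and `Document` probably expose different methods and element types.

```kotlin
// Hypothetical visitor with no-op defaults, so implementations override
// only the callbacks they care about.
interface SketchVisitor {
    fun visitPage(page: SketchPage) {}
    fun visitParagraph(paragraph: SketchParagraph) {}
}

class SketchParagraph(val text: String) {
    fun accept(visitor: SketchVisitor) = visitor.visitParagraph(this)
}

class SketchPage(val number: Int, val paragraphs: List<SketchParagraph>) {
    // Visits the page first, then each paragraph in order.
    fun accept(visitor: SketchVisitor) {
        visitor.visitPage(this)
        paragraphs.forEach { it.accept(visitor) }
    }
}

class SketchDocument(val pages: List<SketchPage>) {
    fun accept(visitor: SketchVisitor) = pages.forEach { it.accept(visitor) }
}

// Collects paragraph text in document order and ignores page events.
class TextCollector : SketchVisitor {
    val lines = mutableListOf<String>()
    override fun visitParagraph(paragraph: SketchParagraph) {
        lines.add(paragraph.text)
    }
}
```

The traversal logic lives in the `accept` methods, so a visitor such as `TextCollector` only states what to do at each element and never walks the tree itself.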
## Building
Like any other [Gradle](https://gradle.org/)-based project, you can build the artifacts
with:
```
$ ./gradlew build
```
This project also includes functional tests, which run against an actual Parsr server. Assuming
you have [Docker](https://www.docker.com/) installed, run the tests with:
```
$ ./gradlew functionalTest
```
## Future work
* [Key-Value pair metadata](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md#31-key-value-pair-metadata)
* [Drawing](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md#126-drawing-type)
* [Image](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md#125-image-type)
* [Barcode](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md#125-image-type)
* [Table](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md#125-image-type)
## Motivation
When I was working on the [Klerk](https://github.com/devcsrj/klerk) project, I realized how difficult
and time-consuming it is to scrape data from PDF documents. My approach then also involved the use of
heavy witchcraft using [Tesseract](https://github.com/tesseract-ocr), because typical PDF-to-text libraries
just don't cut it (especially on skewed or garbled sections).
The [Parsr project](https://github.com/axa-group/Parsr) seems to also tackle the challenges I faced,
and more. To keep the data extraction out of my [Beam](https://beam.apache.org/) pipeline, I wrote this
library.