https://github.com/devcsrj/docparsr-jvm

JVM client for https://github.com/axa-group/Parsr

Host: GitHub
URL: https://github.com/devcsrj/docparsr-jvm
Owner: devcsrj
License: apache-2.0
Created: 2020-03-03T08:48:46.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-08-01T18:14:25.000Z (almost 6 years ago)
Last Synced: 2025-01-14T08:24:01.477Z (over 1 year ago)
Topics: data, document, extraction, nlp, ocr, pdf
Language: Kotlin
Size: 218 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Parsr

![](https://img.shields.io/travis/devcsrj/docparsr-jvm)

![](https://img.shields.io/github/license/devcsrj/docparsr-jvm)

![](https://img.shields.io/maven-central/v/com.github.devcsrj/docparsr)

This project is a JVM client for [Axa group's Parsr](https://github.com/axa-group/Parsr) project.

## Download

Grab via Maven:

```xml
<dependency>
    <groupId>com.github.devcsrj</groupId>
    <artifactId>docparsr</artifactId>
    <version>(version)</version>
</dependency>
```

or Gradle:

```
implementation("com.github.devcsrj:docparsr:$version")
```

## Usage

Assuming you have the [API server running](https://github.com/axa-group/Parsr#usage), you can communicate with it using:

```kotlin
import java.net.URI

val uri = URI.create("http://localhost:3001")   // address of the running Parsr API server
val parser = DocParsr.create(uri)
```

Then, submit your file with:

```kotlin
val file = File("hello.pdf")    // your PDF or image file
val config = Configuration()    // or `parser.getDefaultConfig()`
val job = parser.newParsingJob(file, config)
```

At this point, the `job` object offers either a synchronous method:

```kotlin
val result = job.execute()
```

...or an asynchronous one:

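```kotlin
val callback = object : Callback {
    // Called if the job fails
    override fun onFailure(job: ParsingJob, e: Exception) {}
    // Called as the server reports progress
    override fun onProgress(job: ParsingJob, progress: Progress) {}
    // Called once the result is available
    override fun onSuccess(job: ParsingJob, result: ParsingResult) {}
}
job.enqueue(callback)
```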
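
If you prefer futures over callbacks, the asynchronous method can be adapted with a small wrapper. The following is a minimal sketch, not part of the library: `toFuture` is a hypothetical helper, and the `Callback`, `ParsingJob`, `Progress`, and `ParsingResult` declarations are stand-ins that only mirror the shapes shown above so the example compiles on its own. The real types may differ.

```kotlin
import java.util.concurrent.CompletableFuture

// Stand-ins mirroring the shapes shown in this README (assumptions, not the
// library's actual declarations).
class Progress
class ParsingResult(val id: String)

interface Callback {
    fun onFailure(job: ParsingJob, e: Exception)
    fun onProgress(job: ParsingJob, progress: Progress)
    fun onSuccess(job: ParsingJob, result: ParsingResult)
}

interface ParsingJob {
    fun enqueue(callback: Callback)
}

// Hypothetical adapter: completes a future from the callback's terminal events.
fun ParsingJob.toFuture(): CompletableFuture<ParsingResult> {
    val future = CompletableFuture<ParsingResult>()
    enqueue(object : Callback {
        override fun onFailure(job: ParsingJob, e: Exception) {
            future.completeExceptionally(e)
        }

        override fun onProgress(job: ParsingJob, progress: Progress) {
            // Progress updates do not complete the future.
        }

        override fun onSuccess(job: ParsingJob, result: ParsingResult) {
            future.complete(result)
        }
    })
    return future
}

fun main() {
    // Fake job that reports progress once and then succeeds immediately.
    val job = object : ParsingJob {
        override fun enqueue(callback: Callback) {
            callback.onProgress(this, Progress())
            callback.onSuccess(this, ParsingResult("demo"))
        }
    }
    println(job.toFuture().get().id)   // prints "demo"
}
```

With the real library on the classpath, you would drop the stand-in declarations and keep only the extension function, adjusting it to the actual signatures.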

Regardless of the approach you choose, you end up with a `ParsingResult`. You can then access the [various generated outputs](https://github.com/axa-group/Parsr/blob/master/docs/api-guide.md#3-get-the-results) from the server with:

```kotlin
result.source(Text).use {
   // copy the InputStream
}
```
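
As a minimal sketch of the "copy the InputStream" step, the following writes a stream to a file using only the JDK. It assumes the value handed to the `use` block is a `java.io.InputStream`, as the comment above suggests. The `saveTo` helper is hypothetical, and a `ByteArrayInputStream` stands in for `result.source(Text)` so the example runs without a server.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption

// Hypothetical helper (not part of docparsr): copies the stream to `target`,
// closes the stream afterwards, and returns the number of bytes written.
fun saveTo(source: InputStream, target: Path): Long =
    source.use { Files.copy(it, target, StandardCopyOption.REPLACE_EXISTING) }

fun main() {
    // Stand-in for `result.source(Text)`; any InputStream works the same way.
    val stream = ByteArrayInputStream("hello parsr".toByteArray())
    val target = Files.createTempFile("parsr-output", ".txt")
    val written = saveTo(stream, target)
    println("wrote $written bytes to $target")
}
```

`REPLACE_EXISTING` is needed here because `createTempFile` has already created the target file.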

If you are instead interested in the [JSON schema](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md), this library provides a [Visitor](https://en.wikipedia.org/wiki/Visitor_pattern)-based API:

```kotlin
val visitor = object : DocumentVisitor {
   // override methods
}
val document = Document.from(result)
document.accept(visitor)
```
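
For readers unfamiliar with the pattern, here is a self-contained sketch of how a visitor walks a document tree. Every type in it (`Element`, `ElementVisitor`, `Page`, `Paragraph`, `TextCollector`) is a toy invented for illustration. It does not reproduce the library's `Document` model or the methods of `DocumentVisitor`, which you should look up in the source.

```kotlin
// Toy types for illustration only; docparsr's own model will differ.
interface ElementVisitor {
    fun visitPage(page: Page)
    fun visitParagraph(paragraph: Paragraph)
}

interface Element {
    fun accept(visitor: ElementVisitor)
}

class Paragraph(val text: String) : Element {
    override fun accept(visitor: ElementVisitor) = visitor.visitParagraph(this)
}

class Page(val number: Int, val children: List<Element>) : Element {
    override fun accept(visitor: ElementVisitor) {
        visitor.visitPage(this)
        children.forEach { it.accept(visitor) }   // visit children in order
    }
}

// A visitor that collects page markers and paragraph text in document order.
class TextCollector : ElementVisitor {
    val lines = mutableListOf<String>()

    override fun visitPage(page: Page) {
        lines += "-- page ${page.number} --"
    }

    override fun visitParagraph(paragraph: Paragraph) {
        lines += paragraph.text
    }
}

fun main() {
    val doc = Page(1, listOf(Paragraph("Hello"), Paragraph("World")))
    val collector = TextCollector()
    doc.accept(collector)
    println(collector.lines.joinToString("\n"))
}
```

The benefit of this design is that new operations over the document (text extraction, statistics, conversion) can be added as new visitors without touching the element classes.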

## Building

Like any other [Gradle](https://gradle.org/)-based project, you can build the artifacts with:

```
$ ./gradlew build
```

This project also includes functional tests, which run against an actual Parsr server. Assuming you have [Docker](https://www.docker.com/) installed, run the tests with:

```
$ ./gradlew functionalTest
```

## Future work

* [Key-Value pair metadata](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md#31-key-value-pair-metadata)

* [Drawing](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md#126-drawing-type)

* [Image](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md#125-image-type)

* [Barcode](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md#125-image-type)

* [Table](https://github.com/axa-group/Parsr/blob/master/docs/json-output.md#125-image-type)

## Motivation

When I was working on the [Klerk](https://github.com/devcsrj/klerk) project, I realized how difficult and time-consuming it is to scrape data from PDF documents. My approach then also involved heavy witchcraft with [Tesseract](https://github.com/tesseract-ocr), because typical PDF-to-text libraries just don't cut it (especially on skewed or garbled sections).

The [Parsr project](https://github.com/axa-group/Parsr) seems to also tackle the challenges I faced, and more. To keep the data extraction out of my [Beam](https://beam.apache.org/) pipeline, I wrote this library.
