Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/jina-ai/executor-pdfsegmenter

Jina Executor used for extracting images and text as chunks from PDF data

Host: GitHub
URL: https://github.com/jina-ai/executor-pdfsegmenter
Owner: jina-ai
Created: 2022-03-14T14:11:25.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-03-28T09:31:33.000Z (over 1 year ago)
Last Synced: 2024-05-30T00:59:12.530Z (5 months ago)
Language: Python
Homepage: https://hub.jina.ai/executor/x9w7lcwg
Size: 3.72 MB
Stars: 14
Watchers: 25
Forks: 2
Open Issues: 3
Metadata Files:
- Readme: README.md

README

# ✨ PDFSegmenter

PDFSegmenter is an Executor that extracts images and text from PDF data as chunks. It stores the images and the text of each page as separate chunks, each with its respective MIME type. It uses the [pdfplumber](https://github.com/jsvine/pdfplumber) library.

## Loading data

The `PDFSegmenter` expects data to be found in the `Document`'s `.blob` attribute. This can be loaded from a PDF file like so:

```python
from docarray import DocumentArray, Document
from jina import Flow

doc = DocumentArray([Document(uri='cats_are_awesome.pdf')])  # adjust to your own pdf
doc[0].load_uri_to_blob()  # read the file's bytes into `.blob`
print(doc[0])

f = Flow().add(
    uses='jinahub+docker://PDFSegmenter',
)

with f:
    resp = f.post(on='/craft', inputs=doc)
    # list the MIME type of every chunk extracted from the first document
    print(f'{[c.mime_type for c in resp[0].chunks]}')
```

```
>>  # notice `.blob` field is set
>> ['image/*', 'image/*', 'text/plain'] # we get both images and text from a PDF
```
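
Since image and text chunks come back in a single list, it can be convenient to separate them by MIME type before further processing. The following is a minimal sketch, not part of the Executor itself: `group_chunks_by_mime_type` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `mime_type` attribute of the chunks shown in the output above, so the snippet runs without Jina installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_chunks_by_mime_type(chunks):
    """Group chunk-like objects by their `mime_type` attribute.

    Hypothetical helper: it relies only on each chunk exposing `mime_type`.
    """
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk.mime_type].append(chunk)
    return dict(groups)


# Stand-in objects mimicking the MIME types from the example output above.
# With a real Flow you would pass `resp[0].chunks` instead (assumption).
demo_chunks = [
    SimpleNamespace(mime_type='image/*'),
    SimpleNamespace(mime_type='image/*'),
    SimpleNamespace(mime_type='text/plain'),
]

grouped = group_chunks_by_mime_type(demo_chunks)
print({mime: len(items) for mime, items in grouped.items()})
# prints {'image/*': 2, 'text/plain': 1}
```

Grouping by the exact MIME string keeps the helper generic: any additional type appearing in the chunks simply gets its own key.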