https://github.com/jina-ai/executor-pdfsegmenter
Jina Executor used for extracting images and text as chunks from PDF data
- Host: GitHub
- URL: https://github.com/jina-ai/executor-pdfsegmenter
- Owner: jina-ai
- Created: 2022-03-14T14:11:25.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-03-28T09:31:33.000Z (over 1 year ago)
- Last Synced: 2024-05-30T00:59:12.530Z (5 months ago)
- Language: Python
- Homepage: https://hub.jina.ai/executor/x9w7lcwg
- Size: 3.72 MB
- Stars: 14
- Watchers: 25
- Forks: 2
- Open Issues: 3
Metadata Files:
- Readme: README.md
# ✨ PDFSegmenter
PDFSegmenter is an Executor used for extracting images and text as chunks from PDF data. It stores the images and text of each page as separate chunks, each with its respective MIME type. It uses the [pdfplumber](https://github.com/jsvine/pdfplumber) library.
## Loading data
The `PDFSegmenter` expects the PDF data in the `Document`'s `.blob` attribute. You can load it from a PDF file like so:
```python
from docarray import DocumentArray, Document
from jina import Flow

# Wrap the PDF in a Document and read the file contents into `.blob`
doc = DocumentArray([Document(uri='cats_are_awesome.pdf')])  # adjust to your own pdf
doc[0].load_uri_to_blob()
print(doc[0])

# Run the Executor from Jina Hub inside a Flow
f = Flow().add(
    uses='jinahub+docker://PDFSegmenter',
)
with f:
    resp = f.post(on='/craft', inputs=doc)
    # Each extracted image or text block is a chunk with its own MIME type
    print(f'{[c.mime_type for c in resp[0].chunks]}')
```

```
>> # notice `.blob` field is set
>> ['image/*', 'image/*', 'text/plain'] # we get both images and text from a PDF
```
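
Since images and text come back as separate chunks, a common next step is to split them by MIME type. The following is a minimal sketch rather than part of the Executor: `group_by_mime_type` is a hypothetical helper, and it assumes only that each chunk exposes the `.mime_type` attribute used in the example above. Plain stand-in objects replace real chunks so the snippet runs without a Flow.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_mime_type(chunks):
    """Group chunk-like objects by their `.mime_type` attribute.

    Hypothetical helper for illustration; not provided by PDFSegmenter.
    """
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk.mime_type].append(chunk)
    return dict(groups)


# Stand-in objects mimicking the MIME types shown in the output above.
# With a real response you would pass `resp[0].chunks` instead.
demo_chunks = [
    SimpleNamespace(mime_type='image/*'),
    SimpleNamespace(mime_type='image/*'),
    SimpleNamespace(mime_type='text/plain'),
]

grouped = group_by_mime_type(demo_chunks)
print({mime: len(items) for mime, items in grouped.items()})
```

Grouping on the attribute instead of on chunk position keeps the helper independent of the order in which the Executor emits images and text.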