https://github.com/unstructured-io/pipeline-document-layout
Pipeline for layout extraction
Last synced: 11 months ago
- Host: GitHub
- URL: https://github.com/unstructured-io/pipeline-document-layout
- Owner: Unstructured-IO
- License: apache-2.0
- Created: 2022-11-23T21:16:53.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-07-03T05:42:04.000Z (almost 3 years ago)
- Last Synced: 2025-02-15T20:56:32.407Z (over 1 year ago)
- Language: Python
- Homepage:
- Size: 1.6 MB
- Stars: 1
- Watchers: 5
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
Pre-Processing Pipeline for Layout Detection
The description for the pipeline repository goes here.
The API is hosted at `https://api.unstructured.io`.
## Developer Quick Start
* Using `pyenv` to manage virtualenvs is recommended
  * Mac install instructions. See [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions.
    * `brew install pyenv-virtualenv`
    * `pyenv install 3.8.15`
  * Linux instructions are available [here](https://github.com/Unstructured-IO/community#linux).
  * Create a virtualenv to work in and activate it, e.g. for one named `document_layout`:
    * `pyenv virtualenv 3.8.15 document_layout`
    * `pyenv activate document_layout`
* Run `make install`
* Run `pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'`
* Start a local Jupyter notebook server with `make run-jupyter`, **or** just start the FastAPI app locally with `make run-web-app`
#### Extracting layout from a document
For example:
```
curl -X 'POST' \
'http://localhost:8000/document-layout/v1.0.0/layout' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/example.png' -F 'model_type=yolox'| jq -C . | less -R
```
Here `files` is the file to process and `model_type` can be 'default' (or blank) or 'yolox'.
You can also set `force_ocr` to 'auto' to try text extraction from your file, or to
'true', in which case OCR will be used.
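
The same request can be built from Python. The following is a minimal sketch using only the standard library, assuming the app is running locally via `make run-web-app` and accepts the form fields shown in the curl example. The helper names `build_layout_request` and `post_layout` are hypothetical and not part of this repository, and the response is assumed to be JSON because the curl example pipes it to `jq`.

```python
import json
import mimetypes
import uuid
from urllib import request

# Endpoint taken from the curl example; assumes a local server on port 8000.
LAYOUT_URL = "http://localhost:8000/document-layout/v1.0.0/layout"


def build_layout_request(filename, file_bytes, model_type="yolox",
                         force_ocr=None, url=LAYOUT_URL):
    """Build (but do not send) a multipart/form-data POST (hypothetical helper)."""
    boundary = uuid.uuid4().hex

    # Plain form fields; force_ocr is only included when explicitly set.
    fields = {"model_type": model_type}
    if force_ocr is not None:
        fields["force_ocr"] = force_ocr

    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )

    # File part, sent under the field name `files`; the content type is
    # guessed from the file extension, with a generic fallback.
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        f'Content-Type: {content_type}\r\n\r\n'
    ).encode()
    parts.append(file_header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())

    return request.Request(
        url,
        data=b"".join(parts),
        method="POST",
        headers={
            "accept": "application/json",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def post_layout(path, **kwargs):
    """Send a file to the local API and return the decoded JSON response."""
    with open(path, "rb") as f:
        req = build_layout_request(path.split("/")[-1], f.read(), **kwargs)
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `post_layout("sample-docs/example.png", model_type="yolox", force_ocr="auto")` would send the same file as the curl command above, with `force_ocr` added.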
### Generating Python files from the pipeline notebooks
You can generate the FastAPI APIs from your pipeline notebooks by running `make generate-api`.
## Security Policy
See our [security policy](https://github.com/Unstructured-IO/pipeline-document_layout/security/policy) for
information on how to report security vulnerabilities.
## Learn more
| Section | Description |
|-|-|
| [Unstructured Community GitHub](https://github.com/Unstructured-IO/community) | Information about Unstructured.io community projects |
| [Unstructured GitHub](https://github.com/Unstructured-IO) | Unstructured.io open source repositories |
| [Company Website](https://unstructured.io) | Unstructured.io product and company info |