https://github.com/axa-group/parsr
Transforms PDF, Documents and Images into Enriched Structured Data
data document extraction hacktoberfest images nlp ocr parsr pdf python typescript
- Host: GitHub
- URL: https://github.com/axa-group/parsr
- Owner: axa-group
- License: apache-2.0
- Created: 2019-08-05T12:43:53.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-12-03T13:27:21.000Z (12 months ago)
- Last Synced: 2024-04-14T14:57:45.801Z (7 months ago)
- Topics: data, document, extraction, hacktoberfest, images, nlp, ocr, parsr, pdf, python, typescript
- Language: JavaScript
- Homepage:
- Size: 52.6 MB
- Stars: 5,634
- Watchers: 82
- Forks: 300
- Open Issues: 70
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Turn your documents into data!
Français | Portuguese | Spanish | 中文

- **Parsr** is a minimal-footprint document (**image, pdf, docx, eml**) cleaning, parsing and extraction toolchain that generates readily available, organized and usable data in **JSON, Markdown (MD), CSV/Pandas DF** or **TXT** formats.
- It provides analysts, data scientists and developers with clean, structured and label-enriched information sets for ready-to-use applications, ranging from data entry and document analysis automation to archival, among many others.
- Currently, Parsr can perform: document cleaning, _hierarchy regeneration_ (words, lines, paragraphs), detection of _headings, tables, lists, table of contents, page numbers, headers/footers, links_, and others. Check out [all the features](server/src/processing/README.md#1-current-processing-modules).
# Table of Contents
- [Table of Contents](#table-of-contents)
- [Getting Started](#getting-started)
  - [Installation](#installation)
  - [Usage](#usage)
- [Documentation](#documentation)
- [Contribute](#contribute)
- [Third Party Licenses](#third-party-licenses)
- [License](#license)

# Getting Started
## Installation
_-- The advanced installation guide is available [here](docs/installation.md) --_
The quickest way to install and run the Parsr API is through the [docker image](https://hub.docker.com/r/axarev/parsr):
```sh
docker pull axarev/parsr
```
If you also wish to install the GUI for sending documents and visualising results:
```sh
docker pull axarev/parsr-ui-localhost
```
Note: Parsr can also be installed bare-metal (without Docker containers). The procedure is documented in the [installation guide](docs/installation.md).
## Usage
_-- The advanced usage guide is available [here](docs/usage.md) --_
To run the [API](docs/api-guide.md), issue:
```sh
docker run -p 3001:3001 axarev/parsr
```
This launches the API at [http://localhost:3001](http://localhost:3001).
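Before sending documents, it can help to confirm that the container is actually answering on the mapped port. The following is a minimal sketch using only the Python standard library. The helper names `endpoint` and `api_is_reachable` are hypothetical, and the `api/v1/document` path in the example is an assumption that should be checked against the [API guide](docs/api-guide.md).

```python
import http.client
import urllib.error
import urllib.request

BASE_URL = "http://localhost:3001"  # matches the -p 3001:3001 port mapping


def endpoint(base_url: str, *parts: str) -> str:
    """Join a base URL and path segments without doubling slashes."""
    return "/".join([base_url.rstrip("/"), *(p.strip("/") for p in parts)])


def api_is_reachable(base_url: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, malformed reply, etc.
        return False


if __name__ == "__main__":
    print("Parsr API reachable:", api_is_reachable())
    # Assumed upload path; verify it in docs/api-guide.md before relying on it.
    print("Upload endpoint:", endpoint(BASE_URL, "api", "v1", "document"))
```

If the check reports `False`, make sure the `docker run` command is still running and that nothing else occupies port 3001.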
Consult the documentation on the [usage of the API](docs/api-guide.md).

1. To install the **Python** client for the Parsr API, issue:
```sh
pip install parsr-client
```
To try the **Jupyter Notebook** demo, which uses the Python client, head over to the [jupyter demo](demo/parsr-jupyter-demo).
2. To use the GUI tool (the API must already be running), issue:
```sh
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
```
Then, access it through [http://localhost:8080](http://localhost:8080).
Refer to the [Configuration documentation](docs/configuration.md) to interpret the configurable options in the GUI viewer.
The [API based usage](docs/usage.md#3-api) and the [command line usage](docs/usage.md#23-command-line-usage) are documented in the [advanced usage](docs/usage.md) guide.
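As an illustration of the Python client installed with `pip install parsr-client`, here is a minimal sketch that sends one document to a running API and returns the Markdown output. The `ParsrClient` calls (`send_document` with `wait_till_finished` and `save_request_id`, then `get_markdown`) reflect my recollection of the client's README and are assumptions to verify against the [jupyter demo](demo/parsr-jupyter-demo). The helpers `server_address` and `parse_to_markdown` and the file paths are hypothetical.

```python
from urllib.parse import urlsplit


def server_address(url_or_host: str) -> str:
    """Reduce 'http://localhost:3001/' or 'localhost:3001' to 'host:port'."""
    # urlsplit only fills netloc when the string contains '//'.
    candidate = url_or_host if "//" in url_or_host else "//" + url_or_host
    return urlsplit(candidate).netloc


def parse_to_markdown(file_path: str, config_path: str,
                      server: str = "localhost:3001") -> str:
    """Send one document to the Parsr API and return its Markdown output."""
    # Imported lazily so this module loads even without parsr-client installed.
    from parsr_client import ParsrClient

    # Assumption: the client expects 'host:port' without a URL scheme.
    parsr = ParsrClient(server_address(server))
    parsr.send_document(
        file_path=file_path,
        config_path=config_path,   # a Parsr configuration file (JSON)
        document_name="sample",
        wait_till_finished=True,   # assumed flag: block until processing ends
        save_request_id=True,      # assumed flag: reuse the id in later calls
    )
    return parsr.get_markdown()


if __name__ == "__main__":
    # Hypothetical paths; replace them with a real document and config file.
    print(parse_to_markdown("./sample.pdf", "./config.json"))
```

The API container must already be running for `parse_to_markdown` to succeed; the configuration file format is described in the [Configuration documentation](docs/configuration.md).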
# Documentation
All documentation files can be found [here](docs/README.md).
# Contribute
Please refer to the [contribution guidelines](CONTRIBUTING.md).
# Third Party Licenses
Licenses of the third-party libraries among Parsr's [dependencies](docs/dependencies.md):
1. **QPDF**: Apache [http://qpdf.sourceforge.net](http://qpdf.sourceforge.net/)
2. **ImageMagick**: Apache 2.0 [https://imagemagick.org/script/license.php](https://imagemagick.org/script/license.php)
3. **Pdfminer.six**: MIT [https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE](https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE)
4. **PDF.js**: Apache 2.0 [https://github.com/mozilla/pdf.js](https://github.com/mozilla/pdf.js)
5. **Tesseract**: Apache 2.0 [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
6. **Camelot**: MIT [https://github.com/camelot-dev/camelot](https://github.com/camelot-dev/camelot)
7. **MuPDF** (Optional dependency): AGPL [https://mupdf.com/license.html](https://mupdf.com/license.html)
8. **Pandoc** (Optional dependency): GPL [https://github.com/jgm/pandoc](https://github.com/jgm/pandoc)

# License
Copyright 2020 AXA Group Operations S.A.
Licensed under the [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) license (see the [LICENSE](LICENSE) file).