{"id":13394215,"url":"https://github.com/axa-group/Parsr","last_synced_at":"2025-03-13T20:31:24.104Z","repository":{"id":35086418,"uuid":"200653543","full_name":"axa-group/Parsr","owner":"axa-group","description":"Transforms PDF, Documents and Images into Enriched Structured Data","archived":false,"fork":false,"pushed_at":"2023-12-03T13:27:21.000Z","size":55148,"stargazers_count":5804,"open_issues_count":72,"forks_count":310,"subscribers_count":81,"default_branch":"master","last_synced_at":"2024-10-29T15:10:20.676Z","etag":null,"topics":["data","document","extraction","hacktoberfest","images","nlp","ocr","parsr","pdf","python","typescript"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/axa-group.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-05T12:43:53.000Z","updated_at":"2024-10-28T09:26:17.000Z","dependencies_parsed_at":"2024-05-28T13:51:11.138Z","dependency_job_id":null,"html_url":"https://github.com/axa-group/Parsr","commit_stats":{"total_commits":1403,"total_committers":38,"mean_commits":"36.921052631578945","dds":0.6985032074126871,"last_synced_commit":"b1fc36fc91531704235438e844bbf6315e889f86"},"previous_names":[],"tags_count":26,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/axa-group%2FParsr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/axa-group%2FParsr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/axa-group%2FParsr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/axa-group%2FParsr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/axa-group","download_url":"https://codeload.github.com/axa-group/Parsr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243179900,"owners_count":20249186,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","document","extraction","hacktoberfest","images","nlp","ocr","parsr","pdf","python","typescript"],"created_at":"2024-07-30T17:01:12.751Z","updated_at":"2025-03-13T20:31:24.092Z","avatar_url":"https://github.com/axa-group.png","language":"JavaScript","readme":"\u003cp align='center'\u003e\n  \u003cimg src=\"assets/logo.png\" width=\"275\"\u003e\u003cbr /\u003e\n\u003c/p\u003e\n\n\u003ch2 align=\"center\"\u003e\u003ci\u003eTurn your documents into data!\u003c/i\u003e\u003c/h2\u003e\n\n\u003cp align=\"center\"\u003e\n\t\u003ca href=\"https://cloud.drone.io/axa-group/Parsr\"\u003e\u003cimg src=\"https://cloud.drone.io/api/badges/axa-group/Parsr/status.svg\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\t\u003ca href=\"README_fr.md\"\u003eFrançais\u003c/a\u003e |\n  \u003ca href=\"README_pt.md\"\u003ePortuguese\u003c/a\u003e |\n  \u003ca href=\"README_sp.md\"\u003eSpanish\u003c/a\u003e |\n\t\u003ca href=\"README_zh-cn.md\"\u003e中文\u003c/a\u003e\n\u003c/p\u003e\n\n\u003c!--p align='center'\u003e\n  \u003cimg src=\"assets/demo_screen.gif\"\u003e\n\u003c/p--\u003e\n\n- **Parsr**, is a 
minimal-footprint document (**image, pdf, docx, eml**) cleaning, parsing and extraction toolchain which generates readily available, organized and usable data in **JSON, Markdown (MD), CSV/Pandas DF** or **TXT** formats.\n\n- It provides analysts, data scientists and developers with a clean, structured and label-enriched information set for ready-to-use applications, ranging from data entry and document analyst automation to archival and many others.\n\n- Currently, Parsr can perform: document cleaning, _hierarchy regeneration_ (words, lines, paragraphs), detection of _headings, tables, lists, table of contents, page numbers, headers/footers, links_, and others. Check out [all the features](server/src/processing/README.md#1-current-processing-modules).\n\n# Table of Contents\n\n- [Table of Contents](#table-of-contents)\n- [Getting Started](#getting-started)\n  - [Installation](#installation)\n  - [Usage](#usage)\n- [Documentation](#documentation)\n- [Contribute](#contribute)\n- [Third Party Licenses](#third-party-licenses)\n- [License](#license)\n\n# Getting Started\n\n## Installation\n\n_-- The advanced installation guide is available [here](docs/installation.md) --_\n\nThe quickest way to install and run the Parsr API is through the [docker image](https://hub.docker.com/r/axarev/parsr):\n\n```sh\ndocker pull axarev/parsr\n```\n\nIf you also wish to install the GUI for sending documents and visualising results:\n\n```sh\ndocker pull axarev/parsr-ui-localhost\n```\n\nNote: Parsr can also be installed bare-metal (without Docker containers); the procedure is documented in the [installation guide](docs/installation.md).\n\n## Usage\n\n_-- The advanced usage guide is available [here](docs/usage.md) --_\n\nTo run the [API](docs/api-guide.md), issue:\n\n```sh\ndocker run -p 3001:3001 axarev/parsr\n```\n\nwhich will launch it on [http://localhost:3001](http://localhost:3001).  \nConsult the documentation on the [usage of the API](docs/api-guide.md).\n\n1. 
To install the **Python** client for the Parsr API, issue:\n\n   ```sh\n   pip install parsr-client\n   ```\n\n   To try the **Jupyter Notebook** that uses the Python client, head over to the [jupyter demo](demo/parsr-jupyter-demo).\n\n2. To use the GUI tool (the API needs to already be running), issue:\n   ```sh\n   docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest\n   ```\n   Then, access it through [http://localhost:8080](http://localhost:8080).\n\nRefer to the [Configuration documentation](docs/configuration.md) to interpret the configurable options in the GUI viewer.\n\nThe [API based usage](docs/usage.md#3-api) and the [command line usage](docs/usage.md#23-command-line-usage) are documented in the [advanced usage](docs/usage.md) guide.\n\n# Documentation\n\nAll documentation files can be found [here](docs/README.md).\n\n# Contribute\n\nPlease refer to the [contribution guidelines](CONTRIBUTING.md).\n\n# Third Party Licenses\n\nLicenses of Parsr's third-party [dependencies](docs/dependencies.md):\n\n1. **QPDF**: Apache [http://qpdf.sourceforge.net](http://qpdf.sourceforge.net/)\n2. **ImageMagick**: Apache 2.0 [https://imagemagick.org/script/license.php](https://imagemagick.org/script/license.php)\n3. **Pdfminer.six**: MIT [https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE](https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE)\n4. **PDF.js**: Apache 2.0 [https://github.com/mozilla/pdf.js](https://github.com/mozilla/pdf.js)\n5. **Tesseract**: Apache 2.0 [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)\n6. **Camelot**: MIT [https://github.com/camelot-dev/camelot](https://github.com/camelot-dev/camelot)\n7. **MuPDF** (Optional dependency): AGPL [https://mupdf.com/license.html](https://mupdf.com/license.html)\n8. **Pandoc** (Optional dependency): GPL [https://github.com/jgm/pandoc](https://github.com/jgm/pandoc)\n\n# License\n\nCopyright 2020 AXA Group Operations S.A.  
\nLicensed under the [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) license (see the [LICENSE](LICENSE) file).\n","funding_links":[],"categories":["JavaScript","Repository","data","Ferramentas"],"sub_categories":["OCR"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faxa-group%2FParsr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faxa-group%2FParsr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faxa-group%2FParsr/lists"}