{"id":18369599,"url":"https://github.com/unstructured-io/pipeline-paddleocr","last_synced_at":"2025-08-14T05:04:52.139Z","repository":{"id":66398482,"uuid":"575663850","full_name":"Unstructured-IO/pipeline-paddleocr","owner":"Unstructured-IO","description":"Pipeline for converting PDFs to raw text with PaddleOCR","archived":false,"fork":false,"pushed_at":"2023-08-21T01:45:59.000Z","size":6871,"stargazers_count":23,"open_issues_count":6,"forks_count":7,"subscribers_count":21,"default_branch":"main","last_synced_at":"2025-08-14T05:04:15.918Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Unstructured-IO.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-08T02:44:37.000Z","updated_at":"2025-04-22T13:31:27.000Z","dependencies_parsed_at":"2024-11-05T23:32:11.556Z","dependency_job_id":"7bf95676-52e0-42cf-966a-7a8b47f37014","html_url":"https://github.com/Unstructured-IO/pipeline-paddleocr","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Unstructured-IO/pipeline-paddleocr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Fpipeline-paddleocr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Fpipeline-paddleocr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Fpipeline-paddleocr/releases","manifests_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Fpipeline-paddleocr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Unstructured-IO","download_url":"https://codeload.github.com/Unstructured-IO/pipeline-paddleocr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Unstructured-IO%2Fpipeline-paddleocr/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270364971,"owners_count":24571423,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-14T02:00:10.309Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T23:29:54.984Z","updated_at":"2025-08-14T05:04:52.100Z","avatar_url":"https://github.com/Unstructured-IO.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch3 align=\"center\"\u003e\n  \u003cimg src=\"img/unstructured_logo.png\" height=\"200\"\u003e\n\u003c/h3\u003e\n\n\u003ch3 align=\"center\"\u003e\n  \u003cp\u003ePre-Processing OCR Pipeline for PaddleOCR\u003c/p\u003e\n\u003c/h3\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n  \u003ca href=\"https://github.com/Unstructured-IO/pipeline-paddleocr/blob/main/LICENSE.md\"\u003e![https://pypi.python.org/pypi/unstructured/](https://img.shields.io/pypi/l/unstructured.svg)\u003c/a\u003e\n  \u003ca 
href=\"https://pypi.python.org/pypi/unstructured/\"\u003e![https://pypi.python.org/pypi/unstructured/](https://img.shields.io/pypi/pyversions/unstructured.svg)\u003c/a\u003e\n  \u003ca href=\"https://GitHub.com/unstructured-io/pipeline-paddleocr/graphs/contributors\"\u003e![https://GitHub.com/unstructured-io/unstructured.js/graphs/contributors](https://img.shields.io/github/contributors/unstructured-io/unstructured)\u003c/a\u003e\n  \u003ca href=\"https://github.com/Unstructured-IO/pipeline-paddleocr/blob/main/CODE_OF_CONDUCT.md\"\u003e![code_of_conduct.md](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg) \u003c/a\u003e\n  \u003ca href=\"https://pypi.python.org/pypi/unstructured/\"\u003e![https://github.com/Naereen/badges/](https://badgen.net/badge/Open%20Source%20%3F/Yes%21/blue?icon=github)\u003c/a\u003e\n\n\u003c/div\u003e\n\n\nThis pipeline processes English-language image documents using [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR).\nThe pipeline works on `x86_64` CPUs.\n\n## Developer Quick Start\n\n* Using `pyenv` to manage virtualenvs is recommended\n  * Mac install instructions. See [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions.\n    * `brew install pyenv-virtualenv`\n    * `pyenv install 3.8.15`\n  * Linux instructions are available [here](https://github.com/Unstructured-IO/community#linux).\n\n  * Create a virtualenv to work in and activate it, e.g. 
for one named `paddleocr`:\n\n    `pyenv virtualenv 3.8.15 paddleocr` \u003cbr /\u003e\n    `pyenv activate paddleocr`\n\n* If you are on a Mac with an M1 chip, run `brew install mupdf swig freetype` to install\n  the required non-Python dependencies.\n* Run `make install`\n* Start a local Jupyter notebook server with `make run-jupyter` \u003cbr /\u003e\n\t**OR** \u003cbr /\u003e\n\tjust start the FastAPI app locally with `make run-web-app`\n\n### Performing OCR on a JPG image\n\nTo run OCR on a JPG image, start the web app with `make run-web-app`, then run the following `curl` command,\nreplacing `sample-docs/sample-receipt.jpg` with your filename:\n\n```\ncurl -X 'POST' \\\n  'http://localhost:8000/paddleocr/v0.0.1/paddleocr' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: multipart/form-data' \\\n  -F 'files=@sample-docs/sample-receipt.jpg' | jq -C . | less -R\n```\n\nThe response should look like the following:\n\n```\n\"{\\\"result\\\": [[[[162.0, 111.0], [429.0, 110.0], [429.0, 138.0], [162.0, 139.0]], [\\\"PETRON BKT\nLANJAN SB\\\", 0.918]], [[[162.0, 142.0], [418.0, 141.0], [418.0, 170.0], [162.0, 171.0]], [\\\"ALSERKAM\nENTERPRISE\\\", 0.9785]], [[[44.0, 178.0], [562.0, 175.0], [562.0, 199.0], [44.0, 202.0]], [\\\"Te1\n03-6156 8757 Co No 001083069-M\\\", 0.9282]], [[[121.0, 209.0], [467.0, 209.0], [467.0, 232.0],\n[121.0, 232.0]], [\\\"KM 458.4 BKT LANJAN UTARA,\\\", 0.9205]], [[[95.0, 239.0], [484.0, 237.0], [484.0,\n264.0], [95.0, 267.0]], [\\\"L/RAYA UTARA SELATAN,SG BULOH\\\", 0.9525]], [[[188.0, 270.0], [403.0,\n270.0], [403.0, 298.0], [188.0, 298.0]], [\\\"47000 SUNGAI BUL\\\", 0.9704]], [[[139.0, 335.0], [443.0,\n335.0], [443.0, 359.0], [139.0, 359.0]], [\\\"GST ID No001210736640\\\", 0.9619]], [[[217.0, 397.0],\n[366.0, 397.0], [366.0, 424.0], [217.0, 424.0]], [\\\"TAX INVOICE\\\", 0.9886]], [[[29.0, 491.0],\n[351.0, 490.0], [351.0, 518.0], [29.0, 519.0]], [\\\"TAX INVOICE NO 19729058\\\", 0.963]], [[[28.0,\n523.0], [129.0, 523.0], [129.0, 552.0], [28.0, 552.0]], 
[\\\"POS1\\\", 0.9617]], [[[29.0, 554.0],\n[272.0, 552.0], [272.0, 582.0], [29.0, 583.0]], [\\\"Store No.:129077\\\", 0.9439]], [[[492.0, 552.0],\n[553.0, 552.0], [553.0, 584.0], [492.0, 584.0]], [\\\"Babu\\\", 0.9968]], [[[28.0, 586.0], [169.0,\n589.0], [169.0, 618.0], [27.0, 615.0]], [\\\"01/02/2018\\\", 0.9972]], [[[162.0, 587.0], [340.0, 587.0],\n[340.0, 615.0], [162.0, 615.0]], [\\\"4:43:17PM\\\", 0.8981]], [[[28.0, 683.0], [311.0, 683.0], [311.0,\n711.0], [28.0, 711.0]], [\\\"A 2 doublemint te\\\", 0.9652]], [[[506.0, 679.0], [566.0, 679.0], [566.0,\n710.0], [506.0, 710.0]], [\\\"3.00\\\", 0.9931]], [[[25.0, 714.0], [313.0, 712.0], [314.0, 742.0],\n[25.0, 743.0]], [\\\"A1sandwich vanill\\\", 0.9318]], [[[507.0, 711.0], [566.0, 711.0], [566.0, 743.0],\n[507.0, 743.0]], [\\\"1.90\\\", 0.9937]], [[[69.0, 778.0], [165.0, 778.0], [165.0, 807.0], [69.0,\n807.0]], [\\\"GST RM\\\", 0.9119]], [[[505.0, 775.0], [566.0, 775.0], [566.0, 807.0], [505.0, 807.0]],\n[\\\"0.28\\\", 0.9929]], [[[70.0, 811.0], [296.0, 811.0], [296.0, 839.0], [70.0, 839.0]], [\\\"Total RM\ninc.GST:\\\", 0.9176]], [[[506.0, 807.0], [566.0, 807.0], [566.0, 839.0], [506.0, 839.0]], [\\\"4.90\\\",\n0.9949]], [[[67.0, 873.0], [128.0, 873.0], [128.0, 905.0], [67.0, 905.0]], [\\\"Cash\\\", 0.9938]],\n[[[505.0, 868.0], [568.0, 868.0], [568.0, 905.0], [505.0, 905.0]], [\\\"5.00\\\", 0.992]], [[[67.0,\n904.0], [154.0, 908.0], [153.0, 938.0], [66.0, 935.0]], [\\\"Change\\\", 0.9971]], [[[506.0, 903.0],\n[566.0, 903.0], [566.0, 935.0], [506.0, 935.0]], [\\\"0.10\\\", 0.9981]], [[[29.0, 968.0], [179.0,\n973.0], [178.0, 1002.0], [29.0, 998.0]], [\\\"GsT Summary\\\", 0.8839]], [[[242.0, 969.0], [387.0,\n966.0], [388.0, 996.0], [242.0, 999.0]], [\\\"AnountRM\\\", 0.895]], [[[454.0, 969.0], [562.0, 969.0],\n[562.0, 998.0], [454.0, 998.0]], [\\\"Tax (RM)\\\", 0.8915]], [[[29.0, 1002.0], [128.0, 1002.0], [128.0,\n1033.0], [29.0, 1033.0]], [\\\"A=6.00%\\\", 0.9756]], [[[241.0, 1001.0], [301.0, 1001.0], 
[301.0,\n1033.0], [241.0, 1033.0]], [\\\"4.62\\\", 0.9949]], [[[452.0, 999.0], [513.0, 999.0], [513.0, 1031.0],\n[452.0, 1031.0]], [\\\"0.28\\\", 0.9955]], [[[29.0, 1070.0], [47.0, 1070.0], [47.0, 1092.0], [29.0,\n1092.0]], [\\\"A\\\", 0.9864]], [[[106.0, 1066.0], [418.0, 1066.0], [418.0, 1094.0], [106.0, 1094.0]],\n[\\\"ITAL INCLUDES 6.00%GST\\\", 0.9485]], [[[151.0, 1166.0], [429.0, 1166.0], [429.0, 1190.0], [151.0,\n1190.0]], [\\\"Use 3000 Petron Miles\\\", 0.9395]], [[[176.0, 1197.0], [403.0, 1194.0], [403.0, 1223.0],\n[176.0, 1226.0]], [\\\"points to pay for\\\", 0.9474]], [[[228.0, 1227.0], [351.0, 1227.0], [351.0,\n1257.0], [228.0, 1257.0]], [\\\"RM45 Fue1\\\", 0.932]]]}\"\n```\n\nYou can also run OCR through the Python API using the following code:\n\n```python\nfrom prepline_paddleocr.api.paddleocr import pipeline_api\n\nfilename = \"sample-docs/sample-receipt.jpg\"\n\n# Open the image in binary mode and keep the value returned by the pipeline\nwith open(filename, \"rb\") as f:\n    result = pipeline_api(file=f)\n\nprint(result)\n```\n\n\n### Generating Python files from the pipeline notebooks\n\nYou can generate the FastAPI APIs from your pipeline notebooks by running `make generate-api`.\n\n## Security Policy\n\nSee our [security policy](https://github.com/Unstructured-IO/pipeline-paddleocr/security/policy) for\ninformation on how to report security vulnerabilities.\n\n## Learn more\n\n| Section | Description |\n|-|-|\n| [Unstructured Community GitHub](https://github.com/Unstructured-IO/community) | Information about Unstructured.io community projects |\n| [Unstructured GitHub](https://github.com/Unstructured-IO) | Unstructured.io open source repositories |\n| [Company Website](https://unstructured.io) | Unstructured.io product and company info 
|\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funstructured-io%2Fpipeline-paddleocr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funstructured-io%2Fpipeline-paddleocr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funstructured-io%2Fpipeline-paddleocr/lists"}