https://github.com/ts-azure-services/batch-doc-pipeline
azure-ai-document-intelligence azure-ml batch-pipeline form-recognizer pdf-processing
- Host: GitHub
- URL: https://github.com/ts-azure-services/batch-doc-pipeline
- Owner: ts-azure-services
- Created: 2024-05-20T22:39:06.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-05T06:36:07.000Z (almost 2 years ago)
- Last Synced: 2025-01-28T23:29:44.915Z (over 1 year ago)
- Topics: azure-ai-document-intelligence, azure-ml, batch-pipeline, form-recognizer, pdf-processing
- Language: Python
- Homepage:
- Size: 52.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# batch-doc-pipeline
This repo sets up a batch pipeline that converts PDFs into text files, using Azure ML's native pipeline
capabilities and Azure Form Recognizer (soon to be Azure AI Document Intelligence). It is a customized version of this [repo](https://github.com/ts-azure-services/document-extraction-pipeline); unlike that one, it does not split PDFs by page.
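
As a rough illustration of the per-file step such a pipeline performs, the sketch below sends one PDF to Form Recognizer and writes the extracted text next to it. It is not taken from this repo's code: the helper names `build_client` and `pdf_to_text`, the output layout, and the choice of the `prebuilt-read` model are assumptions, and it relies on the `azure-ai-formrecognizer` package being installed.

```python
from pathlib import Path


def build_client(endpoint: str, key: str):
    """Create a Form Recognizer client (requires `pip install azure-ai-formrecognizer`)."""
    # Imported lazily so pdf_to_text can be exercised without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    return DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))


def pdf_to_text(client, pdf_path, out_dir, model_id="prebuilt-read"):
    """Analyze one PDF and write its text to <out_dir>/<pdf stem>.txt.

    `model_id` is an assumption; the repo may use a different model.
    """
    pdf_path = Path(pdf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with pdf_path.open("rb") as f:
        # begin_analyze_document starts a long-running operation and returns a poller.
        poller = client.begin_analyze_document(model_id, document=f)
        result = poller.result()

    out_path = out_dir / (pdf_path.stem + ".txt")
    # result.content holds the concatenated text of the whole document.
    out_path.write_text(result.content or "", encoding="utf-8")
    return out_path
```

In an Azure ML pipeline step, a function like `pdf_to_text` would typically be called in a loop over the files of the mounted input folder, with the endpoint and key supplied through the step's environment or a key vault.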
### Other considerations
- Make sure special characters in PDF file names, such as `+`, don't cause issues during processing. The pipeline does not
handle them explicitly.
- Large PDF files can sometimes cause out-of-memory issues. Either change the compute
configuration or filter out the larger files and process them independently.
- As of the current update (May 2024), [azure-ai-formrecognizer](https://pypi.org/project/azure-ai-formrecognizer/) was version 3.1 and GA. Over time, however, this will give way to
[azure-ai-documentintelligence](https://pypi.org/project/azure-ai-documentintelligence/), which is currently version 4.0 and in preview. This repo uses the former.
- In terms of RBAC, both the Azure ML workspace and the service principal have `Contributor` access to the storage account.
Additionally, the workspace has `Storage Blob Data Contributor` access to the storage account.
- For Form Recognizer, you can enable [auto-scale](https://learn.microsoft.com/en-us/azure/ai-services/autoscale?tabs=portal) to avoid throttling issues.
- It is critical to understand which SDK version maps to which API version, as listed [here](https://learn.microsoft.com/en-us/python/api/overview/azure/ai-formrecognizer-readme?view=azure-python).
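
The first two considerations above (special characters in file names and oversized PDFs) could be handled in a small pre-processing step. The following is a minimal sketch under stated assumptions, not part of this repo: the helper names `safe_name` and `split_by_size`, the allowed character set, and the 50 MB threshold are all hypothetical and should be adjusted to the actual storage and compute setup.

```python
import re
from pathlib import Path

# Hypothetical threshold; tune it to the memory available on the compute target.
MAX_BYTES = 50 * 1024 * 1024


def safe_name(filename: str) -> str:
    """Replace runs of characters outside [A-Za-z0-9._-] with a single underscore."""
    path = Path(filename)
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", path.stem).strip("_")
    # Fall back to a placeholder if nothing usable is left in the stem.
    return (stem or "file") + path.suffix.lower()


def split_by_size(folder, max_bytes=MAX_BYTES):
    """Return (regular, oversized) lists of the PDF paths found directly in `folder`."""
    regular, oversized = [], []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Files above the threshold are set aside for separate processing.
        (oversized if pdf.stat().st_size > max_bytes else regular).append(pdf)
    return regular, oversized
```

For example, `safe_name("Q1+Q2 report.PDF")` yields `Q1_Q2_report.pdf`. The `oversized` list can then be routed to a larger compute target or processed one file at a time.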