https://github.com/phongcao/pdf-azure-blob-extracter
This notebook reads a PDF file, extracts its images and content, and uploads them to Azure Blob Storage.
- Host: GitHub
- URL: https://github.com/phongcao/pdf-azure-blob-extracter
- Owner: phongcao
- License: agpl-3.0
- Created: 2023-07-19T15:46:33.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-07-27T19:31:57.000Z (over 1 year ago)
- Last Synced: 2024-10-27T12:32:08.427Z (3 months ago)
- Topics: azure, blob, openai, parser, pdf
- Language: Jupyter Notebook
- Homepage:
- Size: 20.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pdf-azure-blob-extracter
This notebook reads a PDF file, extracts its images and content, and uploads them to
Azure Blob Storage.

After running, you can find the following blobs on Azure Blob Storage:
```
[Container]/[PDF file name without ext]/content.md
[Container]/[PDF file name without ext]/img_1.png
[Container]/[PDF file name without ext]/img_2.png
[Container]/[PDF file name without ext]/img_3.png
...
```

In the `content.md` file, the embedded images are extracted and replaced by Markdown image links that reference the extracted file names:
```
Text 1
![img_1.png](img_1.png)
Text 2
![img_2.png](img_2.png)
Text 3
![img_3.png](img_3.png)
```

It's easy to manipulate those image links to inject
[SAS tokens](https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview)
so that they can be fully rendered.

## How to run
1. Copy the `.env.template` file to `.env` and fill in the required info.
2. Upgrade pip: `pip install --upgrade pip`.
3. Install dependencies: `pip install -r requirements.txt`.
4. Modify the input file name (default: `userguide.pdf`).
5. Modify the output file name (default: `userguide.md`).
6. Modify the header and footer font sizes under `Special settings` if needed.
7. Run the notebook and check the blob storage for the extracted images and content.
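
## Example: injecting SAS tokens into image links

The link manipulation mentioned above can be sketched as follows. This is a minimal illustration and not part of the notebook. The function name `inject_sas`, the regular expression, and the account, container, and token values are all assumptions chosen for the example. It assumes the image links in `content.md` are relative file names, and that the blobs are reachable under a base URL of the form `https://<account>.blob.core.windows.net/<container>/<folder>`.

```python
import re

# Matches Markdown image links such as ![img_1.png](img_1.png).
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def inject_sas(markdown: str, base_url: str, sas_token: str) -> str:
    """Rewrite relative image links to absolute blob URLs with a SAS token appended.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    token = sas_token.lstrip("?")  # accept tokens with or without a leading "?"
    base = base_url.rstrip("/")

    def rewrite(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        if "://" in target:  # leave links that are already absolute untouched
            return match.group(0)
        return f"![{alt}]({base}/{target}?{token})"

    return IMAGE_LINK.sub(rewrite, markdown)


if __name__ == "__main__":
    content = "Text 1\n![img_1.png](img_1.png)\nText 2\n"
    # Placeholder account, container, folder, and token values.
    print(
        inject_sas(
            content,
            "https://myaccount.blob.core.windows.net/mycontainer/userguide",
            "sv=PLACEHOLDER&sig=PLACEHOLDER",
        )
    )
```

Because the token is appended as a query string at render time, the stored `content.md` can keep its short relative links, and a fresh token can be applied whenever the old one expires.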