https://github.com/phongcao/pdf-azure-blob-extracter

This notebook reads a PDF file, extracts its images and content, and uploads them to Azure Blob Storage.

Host: GitHub
URL: https://github.com/phongcao/pdf-azure-blob-extracter
Owner: phongcao
License: agpl-3.0
Created: 2023-07-19T15:46:33.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2023-07-27T19:31:57.000Z (over 1 year ago)
Last Synced: 2024-10-27T12:32:08.427Z (3 months ago)
Topics: azure, blob, openai, parser, pdf
Language: Jupyter Notebook
Homepage:
Size: 20.5 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# pdf-azure-blob-extracter

This notebook reads a PDF file, extracts its images and content, and uploads them to
Azure Blob Storage.

After the notebook runs, the following blobs are available in Azure Blob Storage:

```
[Container]/[PDF file name without ext]/content.md
[Container]/[PDF file name without ext]/img_1.png
[Container]/[PDF file name without ext]/img_2.png
[Container]/[PDF file name without ext]/img_3.png
...
```
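The layout above can be sketched as a small helper. This is only an illustration of the naming scheme, not code taken from the notebook; the function name `blob_names` and its parameters are hypothetical.

```python
from pathlib import PurePosixPath


def blob_names(pdf_file: str, image_count: int) -> list[str]:
    """Return blob names, relative to the container, for one extracted PDF."""
    # The virtual folder is the PDF file name without its extension.
    folder = PurePosixPath(pdf_file).stem
    names = [f"{folder}/content.md"]
    # Images are numbered from 1, matching img_1.png, img_2.png, ...
    names += [f"{folder}/img_{i}.png" for i in range(1, image_count + 1)]
    return names


print(blob_names("userguide.pdf", 3))
# ['userguide/content.md', 'userguide/img_1.png', 'userguide/img_2.png', 'userguide/img_3.png']
```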

In the `content.md` file, each embedded image is extracted and replaced by a Markdown image link to its file name:

```
Text 1

![img_1.png](img_1.png)

Text 2

![img_2.png](img_2.png)

Text 3

![img_3.png](img_3.png)
```

Because the links are plain relative file names, they are easy to rewrite, for example to inject
[SAS tokens](https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview)
so that the images render in full.
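One possible way to do that rewriting is a regular-expression pass over `content.md`. The sketch below is an assumption about how this could be done, not part of the notebook: `inject_sas`, `base_url`, and `sas_token` are hypothetical names, and the base URL is assumed to point at the blob folder that holds the images.

```python
import re

# Matches Markdown image links such as ![img_1.png](img_1.png).
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def inject_sas(markdown: str, base_url: str, sas_token: str) -> str:
    """Rewrite relative image links into absolute blob URLs carrying a SAS token."""
    token = sas_token.lstrip("?")
    base = base_url.rstrip("/")

    def rewrite(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        # Leave links that are already absolute untouched.
        if "://" in target:
            return match.group(0)
        return f"![{alt}]({base}/{target}?{token})"

    return IMAGE_LINK.sub(rewrite, markdown)
```

For example, calling `inject_sas(text, "https://<account>.blob.core.windows.net/<container>/userguide", "<sas-token>")` would turn `![img_1.png](img_1.png)` into a link to `.../userguide/img_1.png?<sas-token>`. The account, container, and token here are placeholders to fill in.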

## How to run

1. Copy the `.env.template` file to `.env` and fill in the required info.
2. Upgrade pip: `pip install --upgrade pip`.
3. Install dependencies: `pip install -r requirements.txt`.
4. Set the input file name (default: `userguide.pdf`).
5. Set the output file name (default: `userguide.md`).
6. Adjust the header and footer font sizes under `Special settings` if needed.
7. Run the notebook and check the blob storage for the extracted images and content.
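For orientation, the upload performed in the last step might look roughly like the sketch below, which uses the `azure-storage-blob` SDK. It is not the notebook's actual code. The environment variable names `AZURE_STORAGE_CONNECTION_STRING` and `AZURE_STORAGE_CONTAINER` are assumptions, so use whatever `.env.template` defines, and the helper names are hypothetical.

```python
import os


def make_container_client():
    """Build a container client from environment variables (names are assumed)."""
    # Imported lazily so upload_all can be exercised without the SDK installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]  # hypothetical variable name
    )
    return service.get_container_client(
        os.environ["AZURE_STORAGE_CONTAINER"]  # hypothetical variable name
    )


def upload_all(container, uploads: dict[str, bytes]) -> None:
    """Upload each payload under its blob name, replacing existing blobs."""
    for name, data in uploads.items():
        container.upload_blob(name=name, data=data, overwrite=True)
```

With the values from `.env` loaded into the environment, `upload_all(make_container_client(), {"userguide/content.md": b"..."})` would write a single blob. Passing the client in as a parameter keeps the upload loop easy to check against a stand-in object.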