https://github.com/pilarcode/pdf_lab

Having fun with pdf document processing libraries 🧐

pdf-document pdf2csv pdf2txt


# Pdf & images lab
A project that explores libraries for extracting text from PDFs, such as:
* pdfminer
* PyMuPDF
* PyPDF2
* pypdfium2
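
As a rough point of comparison, the sketch below wraps each library's basic text-extraction call behind one function signature. The function names, the `BACKENDS` registry, and `compare_backends` are illustrative assumptions, not code from this repo. The calls assume recent releases: `pdfminer.six`, PyMuPDF imported as `fitz`, a PyPDF2 release that provides `PdfReader`, and pypdfium2 4.x or later.

```python
from typing import Callable, Dict, Optional


def extract_with_pdfminer(path: str) -> str:
    # Lazy import so the other backends still work if pdfminer.six is absent.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def extract_with_pymupdf(path: str) -> str:
    import fitz  # PyMuPDF
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pypdf2(path: str) -> str:
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    # extract_text() can return an empty value for pages without a text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_pypdfium2(path: str) -> str:
    import pypdfium2 as pdfium
    pdf = pdfium.PdfDocument(path)
    texts = []
    for index in range(len(pdf)):
        textpage = pdf[index].get_textpage()
        texts.append(textpage.get_text_range())
    return "\n".join(texts)


# Hypothetical registry: backend name -> extraction function.
BACKENDS: Dict[str, Callable[[str], str]] = {
    "pdfminer": extract_with_pdfminer,
    "pymupdf": extract_with_pymupdf,
    "pypdf2": extract_with_pypdf2,
    "pypdfium2": extract_with_pypdfium2,
}


def compare_backends(
    path: str, backends: Dict[str, Callable[[str], str]] = BACKENDS
) -> Dict[str, Optional[str]]:
    """Run every backend on the same file; None marks a library that is not installed."""
    results: Dict[str, Optional[str]] = {}
    for name, extract in backends.items():
        try:
            results[name] = extract(path)
        except ImportError:
            results[name] = None
    return results
```

For example, `compare_backends("sample.pdf")` would return a dict mapping each backend name to its extracted text (`sample.pdf` is a placeholder path).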

I also explore libraries for extracting text from images, such as:
* pytesseract
* easyocr
* Transformers models from Hugging Face
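
A minimal OCR sketch covering pytesseract and easyocr is shown below. The `ocr_image` helper and its `engine` parameter are hypothetical names introduced for illustration. It assumes the Tesseract binary is installed for pytesseract and that English is the target language for easyocr.

```python
def ocr_image(path: str, engine: str = "pytesseract") -> str:
    """Return the text found in an image using the chosen OCR engine."""
    if engine == "pytesseract":
        # Requires the Tesseract executable to be installed on the system.
        import pytesseract
        from PIL import Image
        with Image.open(path) as image:
            return pytesseract.image_to_string(image)
    if engine == "easyocr":
        # Model weights are downloaded the first time a Reader is created.
        import easyocr
        reader = easyocr.Reader(["en"])  # assumption: English text
        # detail=0 returns only the recognized strings, without boxes or scores.
        return "\n".join(reader.readtext(path, detail=0))
    raise ValueError(f"Unknown OCR engine: {engine}")
```

Calling `ocr_image("scan.png", engine="easyocr")` would run easyocr on a placeholder file named `scan.png`.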

I also explore how to extract text from PDFs using LLMs:
* Gemini
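
One possible way to send a PDF to Gemini is sketched below. It assumes the `google-generativeai` package, an API key in the `GOOGLE_API_KEY` environment variable, and the model name `gemini-1.5-flash`. None of these are confirmed by this repo, and the function name and prompt are hypothetical.

```python
import os

# Hypothetical prompt; adjust to the kind of output you want.
PROMPT = "Extract all text from this PDF as plain text, preserving reading order."


def extract_text_with_gemini(
    pdf_path: str,
    model_name: str = "gemini-1.5-flash",  # assumption: any available Gemini model
    prompt: str = PROMPT,
) -> str:
    api_key = os.environ.get("GOOGLE_API_KEY")
    if not api_key:
        raise RuntimeError("Set GOOGLE_API_KEY before calling Gemini")

    # Imported lazily so the key check above works without the package installed.
    import google.generativeai as genai

    genai.configure(api_key=api_key)
    uploaded = genai.upload_file(pdf_path)  # upload the PDF through the File API
    model = genai.GenerativeModel(model_name)
    response = model.generate_content([uploaded, prompt])
    return response.text
```

The result is model-generated, so it is worth checking it against the output of one of the libraries above.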

## Setup

**Step 1**. Navigate to the root directory of the repository and create a new virtual environment for development with `uv`:

```bash
uv venv .venv
```

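**Step 2**. Activate the environment. The path below applies to Git Bash on Windows; on Linux or macOS, use `source .venv/bin/activate` instead: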

```bash
source .venv/Scripts/activate
```

**Step 3**. Install the dependencies:

```bash
uv pip install -e .
```
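
To confirm the install worked, a small check like the one below can report which libraries are not importable from the active environment. The `MODULES` list is an assumption based on the libraries named in this README, not the project's actual dependency list, and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names assumed from the libraries listed in this README.
MODULES = [
    "pdfminer",
    "fitz",  # PyMuPDF
    "PyPDF2",
    "pypdfium2",
    "pytesseract",
    "easyocr",
    "transformers",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(MODULES)
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All imports resolved")
```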

## Usage
Open the notebook, select the `.venv` environment as its kernel, and run the cells.