https://github.com/pilarcode/pdf_lab
Having fun with pdf document processing libraries 🧐
pdf-document pdf2csv pdf2txt
- Host: GitHub
- URL: https://github.com/pilarcode/pdf_lab
- Owner: pilarcode
- Created: 2024-12-21T12:14:56.000Z (10 months ago)
- Default Branch: master
- Last Pushed: 2025-03-22T15:44:59.000Z (7 months ago)
- Last Synced: 2025-05-30T01:38:45.713Z (4 months ago)
- Topics: pdf-document, pdf2csv, pdf2txt
- Language: Jupyter Notebook
- Homepage:
- Size: 5.85 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Pdf & images lab
A project to explore libraries for extracting text from PDFs, such as:
* pdfminer
* PyMuPDF
* PyPDF2
* pypdfium2

It also explores libraries for extracting text from images, such as:
* pytesseract
* easyocr
* transformers models from Hugging Face

Additionally, it explores how to extract text from PDFs using LLMs:
* Gemini

## Setup
**Step 1**. Navigate to the root directory of the repository and create a new virtual environment for development with uv:
```bash
uv venv .venv
```

**Step 2**. Activate the environment:
```bash
# Path for Windows (e.g. Git Bash); on Linux/macOS use .venv/bin/activate instead
source .venv/Scripts/activate
```

**Step 3**. Install the dependencies:
```bash
uv pip install -e .
```

## Usage
Open the notebook, select the `.venv` environment as its kernel, and run the cells.
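
As a rough idea of what a notebook cell comparing the PDF libraries listed above might look like, here is a minimal sketch. It is not taken from the repository's notebooks: the function names, the `compare_extractors` helper, and the `sample.pdf` path are all hypothetical. It assumes PyMuPDF and pdfminer.six are installed; a library that is missing is reported as a failure instead of stopping the run.

```python
from typing import Callable, Dict


def extract_with_pymupdf(pdf_path: str) -> str:
    """Extract text page by page with PyMuPDF (imported as `fitz`)."""
    import fitz  # lazy import: only required when this extractor runs

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdfminer(pdf_path: str) -> str:
    """Extract text with the high-level helper from pdfminer.six."""
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def compare_extractors(
    pdf_path: str, extractors: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    """Run every extractor on the same file; keep its text or its error."""
    results: Dict[str, str] = {}
    for name, extractor in extractors.items():
        try:
            results[name] = extractor(pdf_path)
        except Exception as exc:  # missing library, unreadable file, ...
            results[name] = f"<failed: {exc}>"
    return results


if __name__ == "__main__":
    # "sample.pdf" is a placeholder path, not a file from the repository.
    outputs = compare_extractors(
        "sample.pdf",
        {"pymupdf": extract_with_pymupdf, "pdfminer": extract_with_pdfminer},
    )
    for name, text in outputs.items():
        # Show a short preview so the outputs can be compared side by side.
        print(f"{name}: {text[:80]!r}")
```

Keeping each library behind a function with the same signature makes it easy to add further extractors (for example PyPDF2 or pypdfium2) to the dictionary without changing the comparison loop.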