https://github.com/vikparuchuri/pdf_to_md


# Convert PDFs to markdown

- Extract text from the PDF with PyMuPDF
- Remove headers/footers by clustering with the DBSCAN algorithm
- Convert the text to markdown with a fine-tuned LLM
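
The header/footer step above can be sketched as follows. This is a minimal illustration, not the repository's actual code. It assumes each text line has already been reduced to a `(y, text)` pair, where `y` is the vertical position normalized to 0..1 (PyMuPDF's `page.get_text("blocks")` returns bounding boxes that could be used for this). The names `dbscan_1d` and `strip_headers_footers`, the parameters `eps`, `margin`, and `min_page_fraction`, and the choice to cluster only on `y` are all assumptions made for the example. A small pure-Python DBSCAN is used so the sketch has no dependencies.

```python
import math


def dbscan_1d(values, eps, min_samples):
    """Minimal DBSCAN over 1-D points. Returns one label per value (-1 = noise)."""
    n = len(values)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1

    def neighbors(i):
        # Every point within eps of point i, including i itself.
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # j is a core point, keep expanding
    return labels


def strip_headers_footers(pages, eps=0.01, margin=0.1, min_page_fraction=0.6):
    """Drop lines whose vertical position repeats across pages in the margins.

    pages: list of pages, each a list of (y, text) with y normalized to 0..1.
    All thresholds are illustrative assumptions, not tuned values.
    """
    points = [
        (y, page_idx, line_idx)
        for page_idx, page in enumerate(pages)
        for line_idx, (y, _text) in enumerate(page)
    ]
    min_pages = max(2, math.ceil(min_page_fraction * len(pages)))
    labels = dbscan_1d([p[0] for p in points], eps, min_pages)

    clusters = {}
    for point, label in zip(points, labels):
        if label != -1:
            clusters.setdefault(label, []).append(point)

    drop = set()
    for members in clusters.values():
        mean_y = sum(y for y, _, _ in members) / len(members)
        page_count = len({page_idx for _, page_idx, _ in members})
        in_margin = mean_y < margin or mean_y > 1 - margin
        if in_margin and page_count >= min_pages:
            drop.update((page_idx, line_idx) for _, page_idx, line_idx in members)

    return [
        [text for line_idx, (_y, text) in enumerate(page)
         if (page_idx, line_idx) not in drop]
        for page_idx, page in enumerate(pages)
    ]


# Made-up sample data: three pages sharing a running title and a page number.
pages = [
    [(0.030, "Report title"), (0.20, "Intro text"), (0.50, "More text"), (0.960, "Page 1")],
    [(0.031, "Report title"), (0.25, "Second page text"), (0.961, "Page 2")],
    [(0.029, "Report title"), (0.30, "Third page text"), (0.70, "Closing text"), (0.959, "Page 3")],
]
cleaned = strip_headers_footers(pages)
print(cleaned)
```

Clustering on position rather than on text lets the sketch catch footers such as page numbers, whose text changes on every page. The margin check keeps body lines that happen to align across pages from being removed.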

Known issues: the model can repeat text when generation goes off the rails. I need to retrain it using some lessons from the Nougat paper.

## Installation

- `poetry install`

## Usage