Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/vikparuchuri/pdf_to_md
Last synced: about 10 hours ago
- Host: GitHub
- URL: https://github.com/vikparuchuri/pdf_to_md
- Owner: VikParuchuri
- Created: 2023-09-24T23:01:33.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2023-09-25T02:54:16.000Z (about 1 year ago)
- Last Synced: 2024-10-12T13:45:37.124Z (about 1 month ago)
- Language: Python
- Size: 796 KB
- Stars: 8
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Convert PDFs to markdown
- Extract text from the PDF with PyMuPDF
- Remove headers/footers by clustering with the DBSCAN algorithm
- Convert the text to markdown with a finetuned LLM

Known issues: it will repeat text if the generation goes off the rails. I need to retrain the model using some lessons from the Nougat paper.
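
The header/footer step in the list above could look roughly like the sketch below. It is not the code from this repo: the README does not say which features are clustered, so the sketch assumes each line is described only by its vertical position on the page (normalized to 0..1), and that a header or footer is a cluster that sits in the top or bottom margin and shows up on most pages. `dbscan_1d`, `strip_headers_footers`, and the `eps`, `page_fraction`, and `margin` parameters are hypothetical names. The DBSCAN here is a small pure-Python version so the example runs without dependencies; `sklearn.cluster.DBSCAN` would be the usual choice in practice, and the positions could come from PyMuPDF's `page.get_text("blocks")`.

```python
import math


def dbscan_1d(values, eps, min_samples):
    """Minimal DBSCAN over scalar values. Returns one label per value (-1 = noise)."""
    n = len(values)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual DBSCAN definition.
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:  # core point: keep growing the cluster
                queue.extend(nbrs)
    return labels


def strip_headers_footers(pages, eps=0.01, page_fraction=0.6, margin=0.1):
    """pages: list of pages, each a list of (y, text) with y normalized to 0..1.

    Assumption: a header/footer is a cluster of lines at nearly the same height,
    located in the top or bottom margin, and present on most pages.
    """
    lines = [(p, y, text) for p, page in enumerate(pages) for (y, text) in page]
    ys = [y for _, y, _ in lines]
    min_samples = max(2, math.ceil(page_fraction * len(pages)))
    labels = dbscan_1d(ys, eps, min_samples)

    # For each cluster, record which pages it appears on and its y positions.
    clusters = {}
    for (p, y, _), label in zip(lines, labels):
        if label != -1:
            pages_seen, positions = clusters.setdefault(label, (set(), []))
            pages_seen.add(p)
            positions.append(y)

    repeated = set()
    for label, (pages_seen, positions) in clusters.items():
        mean_y = sum(positions) / len(positions)
        in_margin = mean_y < margin or mean_y > 1 - margin
        if in_margin and len(pages_seen) >= min_samples:
            repeated.add(label)

    # Keep every line that is not part of a repeated margin cluster.
    cleaned = [[] for _ in pages]
    for (p, _, text), label in zip(lines, labels):
        if label not in repeated:
            cleaned[p].append(text)
    return cleaned


# Made-up sample data: three pages with a running header and a page-number footer.
pages = [
    [(0.030, "My Report"), (0.20, "Intro text"), (0.35, "More intro"), (0.960, "Page 1")],
    [(0.031, "My Report"), (0.25, "Body text"), (0.50, "More body"), (0.961, "Page 2")],
    [(0.029, "My Report"), (0.40, "Closing text"), (0.959, "Page 3")],
]
cleaned = strip_headers_footers(pages)
print(cleaned)
```

Clustering on position rather than on exact text is what lets this catch footers like page numbers, whose text changes on every page. The margin check is there so that body lines that happen to align across pages are not removed.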
## Installation
- `poetry install`
## Usage