An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with pdf-text-extraction

A curated list of projects in awesome lists tagged with pdf-text-extraction.

https://github.com/mazzasaverio/pipeline-docs-data-extractor

(Let's build a) Robust pipeline for extracting structured data from various documents

airflow data-engineer data-engineering etl-pipeline large-language-models pdf-text-extraction unstructured

Last synced: 13 Aug 2025

https://github.com/hyeonsangjeon/pdf2llm-tuning-studio

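A solution that automatically generates high-quality question-answering (QA) data from PDF documents using GPU-accelerated processing and efficiently fine-tunes LLMs. It generates domain-specific QA pairs with the Unstructured library and AWS Bedrock Claude, and trains lightweight models using the LoRA technique.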

aws bedrock claude cuda data-argumantation data-extraction distillation docker finetuning gpu llm pdf-generation pdf-text-extraction processing processing-job sagemaker text-disti unsloth unstructured

Last synced: 15 Jun 2025

https://github.com/prathameshdhande22/pdftxtbot

A Telegram bot that extracts text from PDFs and also extracts images of PDF pages. Made with Python.

image-extractor pdf-image pdf-text pdf-text-extraction python python-telegram python-telegram-bot python3 telegram telegram-bot

Last synced: 22 Sep 2025

https://github.com/virajmadhu/pdf_key_matcher

Highlights the key matches between your given PDF and the description text.

ats cv open-source pdf pdf-text-extraction python python-script python3 terminal-based text-compression text-extraction virajmadhu

Last synced: 12 Apr 2025

https://github.com/simonpierreboucher/crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

concurrent-crawling content-extraction data-collection data-extraction-pipeline data-preservation-and-recovery data-scraping error-handling html-parsing http-requests metadata-storage modular-design pdf-text-extraction python-crawler rate-limiting structured-data-storage text-processing url-normalization web-crawling yaml-configuration

Last synced: 30 Mar 2025

https://github.com/rmottanet/unchainedtext

UnchainedText: Break free from PDFs! Easily extract raw text to .txt for preprocessing.

data-extraction extractor pdf-text-extraction text-extraction text-extraction-tool text-processing

Last synced: 27 Jun 2025