Projects in Awesome Lists tagged with pdf-text-extraction

https://github.com/mazzasaverio/pipeline-docs-data-extractor

(Let's build a) Robust pipeline for extracting structured data from various documents

airflow data-engineer data-engineering etl-pipeline large-language-models pdf-text-extraction unstructured

Last synced: 13 Aug 2025

https://github.com/hyeonsangjeon/pdf2llm-tuning-studio

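A solution that automatically generates high-quality question-answering (QA) data from PDF documents using GPU-accelerated processing and efficiently fine-tunes LLMs. It generates domain-specific QA pairs with the Unstructured library and AWS Bedrock Claude, and trains lightweight models using LoRA.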

aws bedrock claude cuda data-argumantation data-extraction distillation docker finetuning gpu llm pdf-generation pdf-text-extraction processing processing-job sagemaker text-disti unsloth unstructured

Last synced: 15 Jun 2025

https://github.com/prathameshdhande22/pdftxtbot

A Telegram bot that extracts text from PDFs and also extracts images of PDF pages. Made with Python.

image-extractor pdf-image pdf-text pdf-text-extraction python python-telegram python-telegram-bot python3 telegram telegram-bot

Last synced: 22 Sep 2025

https://github.com/eli64s/pdflex

CLI for merging PDF contexts.

pdf-automation pdf-converter pdf-data-extraction pdf-document pdf-document-parser pdf-document-processor pdf-extractor pdf-generator pdf-library pdf-manipulation pdf-parser pdf-processor pdf-python pdf-regex pdf-search pdf-text-extraction pdf-tools python-pdf python-pdf-tools

Last synced: 07 Oct 2025

https://github.com/zeeshanahmad4/nlp-pdf-minning-extracting-text-from-pdf

NLP PDF mining: extracting text from PDFs

extract-text pdf pdf-converter pdf-document-processor pdf-files pdf-format pdf-text-extraction pdfcon pdfkit pdftohtml pdftoimage pdftools pdftotext python text-extraction

Last synced: 01 Apr 2025

https://github.com/virajmadhu/pdf_key_matcher

Highlights the key matches between a given PDF and the description text

ats cv open-source pdf pdf-text-extraction python python-script python3 terminal-based text-compression text-extraction virajmadhu

Last synced: 12 Apr 2025

https://github.com/nsourlos/ocr_and_rag

Tests of OCR and RAG with LLMs

cohere colpali document-processing gemini information-retrieval mistral ocr openai pdf-text-extraction qwen2-vl rag

Last synced: 19 Aug 2025

https://github.com/simonpierreboucher/crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

concurrent-crawling content-extraction data-collection data-extraction-pipeline data-preservation-and-recovery data-scraping error-handling html-parsing http-requests metadata-storage modular-design pdf-text-extraction python-crawler rate-limiting structured-data-storage text-processing url-normalization web-crawling yaml-configuration