Projects in Awesome Lists tagged with pdf-text-extraction
A curated list of projects in awesome lists tagged with pdf-text-extraction.
https://github.com/mazzasaverio/pipeline-docs-data-extractor
(Let's build a) Robust pipeline for extracting structured data from various documents
airflow data-engineer data-engineering etl-pipeline large-language-models pdf-text-extraction unstructured
Last synced: 13 Aug 2025
https://github.com/hyeonsangjeon/pdf2llm-tuning-studio
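A solution that automatically generates high-quality question-answering (QA) data from PDF documents using GPU-accelerated processing and fine-tunes LLMs efficiently. It generates domain-specific QA pairs with the Unstructured library and AWS Bedrock Claude, and trains lightweight models using LoRA.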
aws bedrock claude cuda data-argumantation data-extraction distillation docker finetuning gpu llm pdf-generation pdf-text-extraction processing processing-job sagemaker text-disti unsloth unstructured
Last synced: 15 Jun 2025
https://github.com/prathameshdhande22/pdftxtbot
A Telegram bot that extracts text from PDFs and also extracts images of the PDF pages. Made with Python.
image-extractor pdf-image pdf-text pdf-text-extraction python python-telegram python-telegram-bot python3 telegram telegram-bot
Last synced: 22 Sep 2025
https://github.com/eli64s/pdflex
CLI for merging PDF contexts.
pdf-automation pdf-converter pdf-data-extraction pdf-document pdf-document-parser pdf-document-processor pdf-extractor pdf-generator pdf-library pdf-manipulation pdf-parser pdf-processor pdf-python pdf-regex pdf-search pdf-text-extraction pdf-tools python-pdf python-pdf-tools
Last synced: 07 Oct 2025
https://github.com/zeeshanahmad4/nlp-pdf-minning-extracting-text-from-pdf
NLP PDF mining: extracting text from PDF
extract-text pdf pdf-converter pdf-document-processor pdf-files pdf-format pdf-text-extraction pdfcon pdfkit pdftohtml pdftoimage pdftools pdftotext python text-extraction
Last synced: 01 Apr 2025
https://github.com/virajmadhu/pdf_key_matcher
Highlights the key matches between your given PDF and the description text.
ats cv open-source pdf pdf-text-extraction python python-script python3 terminal-based text-compression text-extraction virajmadhu
Last synced: 12 Apr 2025
https://github.com/nsourlos/ocr_and_rag
Tests of OCR and RAG with LLMs
cohere colpali document-processing gemini information-retrieval mistral ocr openai pdf-text-extraction qwen2-vl rag
Last synced: 19 Aug 2025
https://github.com/simonpierreboucher/crawler
A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.
concurrent-crawling content-extraction data-collection data-extraction-pipeline data-preservation-and-recovery data-scraping error-handling html-parsing http-requests metadata-storage modular-design pdf-text-extraction python-crawler rate-limiting structured-data-storage text-processing url-normalization web-crawling yaml-configuration
Last synced: 30 Mar 2025
https://github.com/rmottanet/unchainedtext
UnchainedText: Break free from PDFs! Easily extract raw text to .txt for preprocessing.
data-extraction extractor pdf-text-extraction text-extraction text-extraction-tool text-processing
Last synced: 27 Jun 2025