Projects in Awesome Lists tagged with text-extraction
A curated list of projects in awesome lists tagged with text-extraction.
https://github.com/kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.
bun csharp document-intelligence elixir ffi golang java metadata-extraction node pdf-extraction pdfium php python rag ruby rust table-extraction tesseract text-extraction wasm
Last synced: 17 May 2026
https://github.com/run-llama/liteparse
A fast, helpful, and open-source document parser
document-ocr document-processing ocr ocr-recognition pdf pdf-parser text-extraction
Last synced: 30 May 2026
https://github.com/miso-belica/sumy
Module for automatic summarization of text documents and HTML pages.
html-extraction html-extractor html-page lsa nlp pagerank-algorithm python reduction summarization summarizer summary sumy text-extraction textteaser
Last synced: 14 Feb 2026
https://github.com/adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
article-extractor corpus corpus-builder corpus-tools crawler html-to-markdown html2text news news-aggregator news-crawler nlp readability rss-feed scraping tei text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 24 Dec 2025
https://github.com/unidoc/unipdf
Golang PDF library for creating and processing PDF files (pure go)
golang pdf pdf-compression pdf-document-processor pdf-generation pdf-generator pdf-library pdf-manipulation pdf-reader pdf-reports pdf-sign signing text-extraction
Last synced: 12 May 2025
https://github.com/Goldziher/kreuzberg
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
async document-intelligence mcp metadata-extraction ocr pandoc pdf-extraction pdfium python rag table-extraction tesseract text-extraction
Last synced: 21 Oct 2025
https://github.com/goldziher/kreuzberg
A text extraction library supporting PDFs, images, office documents and more
asyncio docx ocr pdf text-extraction
Last synced: 14 May 2025
https://github.com/chrismattmann/tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
buffer covid-19 detection extraction memex mime nlp nlp-library nlp-machine-learning parse parser-interface python recognition text-extraction text-recognition tika-python tika-server tika-server-jar translation-interface usc
Last synced: 14 May 2025
https://github.com/whitelok/image-text-localization-recognition
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.
awesome convolutional-neural-networks deep-learning deep-learning-algorithms machine-learning ocr scene-texts text-detection text-extraction text-recognition
Last synced: 20 Mar 2025
https://github.com/kreuzberg-dev/html-to-markdown
High performance and CommonMark compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.
hocr html html-converter markdown markdown-converter rag text-extraction text-processing
Last synced: 28 May 2026
https://github.com/yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
data-extraction document-processing fast image-extraction llm markdown pdf pdf-editor pdf-generation pdf-library pdf-parser pdf-to-markdown pdf-to-text pyo3 python rag rust text-extraction
Last synced: 13 May 2026
https://github.com/miso-belica/justext
Heuristic based boilerplate removal tool
html-parser html-parsing python text-extraction
Last synced: 13 Apr 2025
https://github.com/miso-belica/jusText
Heuristic based boilerplate removal tool
html-parser html-parsing python text-extraction
Last synced: 14 Mar 2025
https://github.com/unidoc/unidoc
This repository has moved! https://github.com/unidoc/unipdf
golang pdf pdf-files pdf-invoice pdf-library text-extraction unidoc
Last synced: 01 Apr 2025
https://github.com/ICIJ/datashare
A self-hosted search engine for documents.
datashare docker elasticsearch extract investigative-journalism named-entity-recognition text-extraction web-gui
Last synced: 15 Apr 2025
https://github.com/icij/datashare
A self-hosted search engine for documents.
datashare docker elasticsearch extract investigative-journalism named-entity-recognition text-extraction web-gui
Last synced: 25 Feb 2026
https://github.com/ropensci/pdftools
Text Extraction, Rendering and Converting of PDF Documents
pdf-files pdf-format pdftools poppler poppler-library r r-package rstats text-extraction
Last synced: 27 Aug 2025
https://github.com/cdown/srt
A simple library and set of tools for parsing, modifying, and composing SRT files.
command-line command-line-tool library mit-license python srt subtitle subtitle-fixer subtitle-parser subtitles subtitles-parsing text-extraction tools
Last synced: 14 May 2025
https://github.com/iamarunbrahma/vision-parse
Parse PDFs into markdown using Vision LLMs
document-parser pdf-parser pdf-to-markdown text-extraction
Last synced: 13 Dec 2025
https://github.com/Shixzie/nlp
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
go golang natural-language-processing nlp parse text text-extraction
Last synced: 14 Mar 2025
https://github.com/shixzie/nlp
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
go golang natural-language-processing nlp parse text text-extraction
Last synced: 01 Apr 2025
https://github.com/flairnlp/fundus
A very simple news crawler with a funny name
cc-news commoncrawl corpus corpus-tools crawler datasets image-classification image-extraction news-crawler news-scraping nlp python rss scraper sitemap text-extraction web-corpus web-scraping
Last synced: 08 Jan 2026
https://github.com/pd3f/pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
extract-text language-model machine-learning ocr parsr pd3f pdf pdf-to-text pipeline python text-extraction
Last synced: 08 Apr 2026
https://github.com/flairNLP/fundus
A very simple news crawler with a funny name
cc-news commoncrawl corpus crawler news-crawler news-scraping nlp python rss scraper sitemap text-extraction web-corpus web-scraping
Last synced: 04 Mar 2025
https://github.com/py-pdf/benchmarks
Benchmarking PDF libraries
benchmark data-extraction mupdf pdf poppler-utils pypdf2 text-extraction
Last synced: 28 Jul 2025
https://github.com/bookieio/breadability
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
html-extraction html-extractor html-parsing python text-extraction text-mining
Last synced: 21 Oct 2025
https://github.com/weareprestatech/hotpdf
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
pdf python text-extraction text-search
Last synced: 03 Mar 2026
https://github.com/sapienzanlp/extend
Entity Disambiguation as text extraction (ACL 2022)
acl acl2022 entity-disambiguation entity-disambiguation-models entity-linking natural-language-processing nlp pytorch text-extraction
Last synced: 03 Aug 2025
https://github.com/skylander86/lambda-text-extractor
AWS Lambda functions to extract text from various binary formats.
aws-lambda lambda-functions ocr pdf pdf-ocr-extraction searchable-pdfs tesseract text-extraction
Last synced: 16 Jan 2026
https://github.com/bzsanti/oxidizePdf
A PDF library for Rust
crates-io data-extraction digital-signatures document-processing encryption invoice ocr pdf pdf-generation pdf-library pdf-manipulation pdf-parser pdf-reader pdfa rust rust-library table-extraction text-extraction
Last synced: 29 Apr 2026
https://github.com/vsymbol/CUTIE
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
computer-vision deep-learning text-extraction
Last synced: 02 Apr 2025
https://github.com/archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives
Last synced: 13 Apr 2025
https://github.com/vaites/php-apache-tika
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
apache ocr php-library text-extraction text-recognition tika
Last synced: 28 Jan 2026
https://github.com/victorqribeiro/ocr
Simple app to extract text from pictures using Tesseract
image-recognition ocr tesseract text-extraction text-recognition
Last synced: 07 Dec 2025
https://github.com/lu4p/cat
Extract text from plaintext, .docx, .odt and .rtf files. Pure go.
cat cross-platform docx2txt extract-text go golang odt2txt pdf2txt pdftotext rtf-to-text text-extraction textextracting
Last synced: 25 Jun 2025
https://github.com/gamemaker1/office-text-extractor
Yet another library to extract text from MS Office and PDF files
docx get-text ms-excel ms-office ms-powerpoint ms-word parser pdf pptx text-extraction xlsx
Last synced: 16 Mar 2026
https://github.com/iscc/mobi
Python-based software to unpack KindleGen-generated ebooks
Last synced: 17 Feb 2026
https://github.com/jonathanraiman/wikipedia_ner
📖 Labeled examples from wiki dumps in Python
dataset named-entity-recognition python text-extraction wikipedia
Last synced: 11 Jul 2025
https://github.com/iamarunbrahma/pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
document-conversion document-processing information-retrieval pdf-converter pdf-extraction pdf-parsing pdf-to-markdown python rag retrieval-augmented-generation text-extraction
Last synced: 10 Apr 2025
https://github.com/ckorzen/pdf-text-extraction-benchmark
A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.
arxiv benchmark evaluation extraction pdf tex text-extraction
Last synced: 11 May 2025
https://github.com/abhinaba-ghosh/any-text
Get text content from any file
file-reader reader text text-extraction text-extractor
Last synced: 22 Jul 2025
https://github.com/fourdigits/wagtail_textract
Text extraction for Wagtail document search
django search tesseract text-extraction textract wagtail
Last synced: 06 Oct 2025
https://github.com/pt-perkasa-pilar-utama/ppu-paddle-ocr
A lightweight PaddleOCR implementation in Bun/Deno/Node.js for text detection and recognition in JavaScript environments.
bun computer-vision deno image-processing image-text-extraction javascript-ocr node-js nodejs ocr onnx onnxruntime optical-character-recognition paddleocr paddlepaddle text-detection text-extraction text-recognition typescript-ocr
Last synced: 24 May 2026
https://github.com/pd3f/pd3f-core
📑 Python Package to reconstruct the original continuous text from PDFs with language models
dehyphenation language-model machine-learning pd3f pdf text-extraction
Last synced: 08 Apr 2026
https://github.com/goldziher/html-to-markdown
HTML to markdown converter
html-converter markdown-converter rag text-extraction text-processing
Last synced: 24 Apr 2025
https://github.com/hscspring/pnlp
NLP pre-/post-processing tools.
chinese-nlp concurrency nlp nlp-enhancer nlp-preprocess normalization preprocessing text-cleaning text-extraction text-length text-processing
Last synced: 27 Sep 2025
https://github.com/amenezes/aiopytesseract
A Python asyncio wrapper for Tesseract-OCR.
asyncio ocr optical-character-recognition pdftotext pytesseract pytesseract-ocr tesseract tesseract-ocr text-extraction
Last synced: 04 Oct 2025
https://github.com/spences10/mcp-jinaai-reader
🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader
content-extraction documentation-tool jinaai llm-tools mcp model-context-protocol text-extraction web-content web-scraping
Last synced: 15 Apr 2025
https://github.com/altaidevorg/llm-food
Serving files for hungry LLMs
batch-prediction gemini llm pdf-to-markdown text-extraction
Last synced: 04 Apr 2026
https://github.com/arshad-yaseen/ocr-llm
⚡️ Fast, ultra-accurate text extraction from any image or PDF—including challenging ones—with structured markdown output powered by vision models.
Last synced: 05 May 2025
https://github.com/ingmarboeschen/jatsdecoder
A text extraction and manipulation toolset for NISO-JATS coded XML files
cermine niso-jats pubmedcentral r text-extraction text-mining xml-files
Last synced: 12 Apr 2025
https://github.com/greed2411/tokyo
tokyo, a REST API that, given any type of document 📄, identifies the MIME type 🧐, suggests an extension 🦔, and extracts text 💪.
apache-tika clojure document-processing extension extract-text filetype mime-types ring text-extraction text-parser text-parsing
Last synced: 07 May 2025
https://github.com/Altabeh/tesseract-ocr-wrapper
This is a highly efficient Python wrapper for tesseract-ocr.
leptonica multiprocessing tesseract-ocr text-extraction xpdf
Last synced: 09 Jul 2025
https://github.com/ad-freiburg/pdftotext-plus-plus
A fast and accurate command line tool for extracting text from PDF files.
c-plus-plus cli document-analysis metadata-extraction pdf text-extraction
Last synced: 16 May 2025
https://github.com/dotfurther/OpenDiscoverSDK
.NET 6 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.
archive csharp dotnet email embedded-objects entity-extraction extraction file-deduplication file-format-detection file-identification indexing metadata microsoft-office phi pii pii-detection pst sdk text text-extraction
Last synced: 12 Apr 2025
https://github.com/ssciwr/ammico
AI-based Media and Misinformation Content Analysis Tool: Analyze text and images
classification computer-vision nlp text-extraction translation
Last synced: 06 Mar 2026
https://github.com/shelfio/apache-tika-lambda-layer
AWS Lambda layer containing latest version of Apache Tika
apache-tika aws-lambda lambda-layer text-extraction
Last synced: 10 Jun 2025
https://github.com/coregx/gxpdf
GxPDF - Enterprise-grade PDF library for Go. Table extraction, text parsing, encryption, document creation.
go golang open-source pdf pdf-encryption pdf-generation pdf-library pdf-parser table-extraction text-extraction
Last synced: 02 Apr 2026
https://github.com/mrgrd56/textractor-translator
Translate visual novels in real time
anime electron games javascript text-extraction textractor textractor-extension translation translator typescript visual-novel
Last synced: 04 May 2025
https://github.com/bmoscon/articleparse
Heuristic text extraction from news sites in Python3
analysis boilerplate-removal heuristics python text-analysis text-extraction
Last synced: 07 May 2025
https://github.com/dotfurther/OpenDiscoverPlatformCaseStudy
Case study using dotfurther's Open Discover Platform with the RavenDB document store to rapidly create a full-text search/eDiscovery/information governance capable demonstration application.
archive-extractor data-breach document-ingestion ediscovery file-deduplication file-format-detection file-identification full-text full-text-extraction full-text-search indexing-engine information-governance information-governance-catalog metadata personally-identifiable-information pii pii-detection ravendb text-extraction
Last synced: 12 Apr 2025
https://github.com/funinkina/gnome-ocr-screenshot
A simple Python script to extract text from screenshots in the GNOME desktop environment using pytesseract.
gnome gnome-shell linux ocr screenshot text-extraction tools utility
Last synced: 30 Jul 2025
https://github.com/typo3-solr/ext-tika
A TYPO3 CMS extension that provides Apache Tika functionality
cms cms-extension file-indexing language-detection metadata php search text-extraction tika typo3 typo3-cms-extension
Last synced: 04 Apr 2025
https://github.com/andrealenzi11/py-poppleract
Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents
ocr optical-character-recognition pdf-reader pdf-splitting pdf-to-text pdf2text pdftotext poppler poppleract py-poppleract tesseract tesseract-ocr text-extraction
Last synced: 26 Mar 2025
https://github.com/heussd/pdftotext-go
Extract texts + their page numbers from PDF
Last synced: 14 Jan 2026
https://github.com/globality-corp/deboiler
Deboiler - Boilerplate Identification and Removal
boilerplate-identification boilerplate-removal deboiler python text-extraction
Last synced: 29 Jan 2026
https://github.com/pt-perkasa-pilar-utama/ppu-pdf
PDF utilities for extracting text from digital PDFs and converting scanned PDFs into canvas.
jsr npm pdf-canvas pdf-digital pdf-reader pdfjs rag scanned-pdf text-extraction
Last synced: 01 Aug 2025
https://github.com/manofstrong/sitescrapper
A PHP library to scrape websites from their sitemaps, extract relevant content from each webpage, and upload it to a database
keywords-extraction php scraper scraping-websites sitemap-xml text-extraction
Last synced: 13 Jan 2026
https://github.com/fisseha-estifanos/llm-api
A repository to demonstrate some of the concepts behind large language models, transformer (foundation) models, in-context learning, and prompt engineering using open source large language models like Bloom and co:here.
api bloom cohere in-context-learning llm news-score prompt-engineering text-extraction transformer
Last synced: 08 Mar 2026
https://github.com/rithulkamesh/docproc
Document Intelligence Platform — Extract, refine, and query documents with vision LLMs and config-driven RAG.
content-extraction data-extraction document-analysis document-parsing equation-detection layout-analysis machine-learning mathematical-symbols ocr pdf-processing pdf-text-extraction python region-detection text-classification text-extraction
Last synced: 02 Apr 2026
https://github.com/apyhub/apyhub.js
ApyHub SDK for Node.js is a library for accessing the ApyHub APIs.
api document-generation file-conversion image-generation image-processing nodejs text-extraction
Last synced: 26 Jan 2026
https://github.com/andythefactory/article-extraction-dataset
Article title, authors, date and body extraction dataset.
article-extractor corpus corpus-builder corpus-tools dataset datasets html-to-markdown html2text news news-aggregator news-crawler readability scraping scraping-websites text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 27 Jan 2026
https://github.com/saidsef/tika-document-to-text
Apache Tika: extract text and metadata from any document format with this pre-built containerised solution. Kubernetes-ready deployment with an intuitive UI, API, and text-to-speech capabilities, suited to content indexing, analysis, and document processing workflows.
docker-container document-to-text document-to-text-ui extract-text helm-chart kubernetes kubernetes-deployment nodejs python text-extraction text-to-speech
Last synced: 02 Apr 2026
https://github.com/lihanghang/tecroom
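Online summary documentation of a technology stack, covering programming languages, data structures and algorithms, machine learning, databases, and more.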
artificial-intelligence coding data-mining data-structures deep-learning design-patterns docker java machine-learning nlp python3 summary text-extraction text-mining tools
Last synced: 09 Apr 2025
https://github.com/rajdeep2804/automated_invoice_processing
The number of types of physical documents being digitized is increasing; medical bills, bank documents, and personal documents are examples. The objective of this repo is to implement and understand such use cases through an example of extracting text information from invoice receipts.
automation computer-vision detectron2 digitalization image-processing image-segmentation maskrcnn ocr opencv-python polygon python3 pytorch tesseract-ocr text-extraction
Last synced: 05 Oct 2025
https://github.com/kind-unes/flutter-translation-application
Flutter Android & iOS Translation Education Application. It utilizes ObjectBox as a local database and Google API for translations, and is powered by GEMINI-ULTRA for AI capabilities
ai android chatbot computer-vision dart educational-application flutter gemini-api google-translate-api hive ios langugage-recognition nosql open-source phrasebook source-code sqlite text-extraction translation
Last synced: 09 Apr 2025
https://github.com/rosette-api-community/text-embeddings-sample
A little Python code to show how to get similarity between word embeddings returned from the Rosette API's new /text-embedding endpoint.
machine-learning natural-language-processing nlp python text-embedding text-extraction text-similarity word-similarity
Last synced: 25 May 2026
https://github.com/gatenlp/wpextract
Create datasets from WordPress sites for research or archiving
corpus crawler nlp text-extraction text-mining web-scraping wordpress
Last synced: 25 Jun 2025
https://github.com/gursv/url-summ
A URL summarizer, which summarizes the content of a URL with proper formatting. It uses 'sshleifer/distilbart-cnn-12-6', which is a distilled version of the BART model, specifically optimized for text summarization tasks, including CNN summarization.
ai beautifulsoup chunking formatted-text huggingface-models python3 smtp star-rating streamlit text-extraction text-summarization transformers url-summarization
Last synced: 23 Apr 2025
https://github.com/atahanuz/yt2text
Extract text from a YouTube video in a single command, using OpenAI's Whisper speech recognition model.
artificial-intelligence python text-extraction transcription whisper whisper-ai youtube
Last synced: 13 May 2025
https://github.com/utachicodes/pyshotter
A Python library for smart, annotated, and shareable screenshots.
annotation cross-platform ocr python screenshot sharing smart-detection text-extraction
Last synced: 05 Feb 2026
https://github.com/lykmapipo/us-inaugural-addresses
Python scripts to download, process, and analyze US Inaugural Addresses
beautifulsoup4 gensim joblib lykmapipo natural-language-processing nlp nltk python python-scripts requests spacy text-analysis text-analytics text-extraction text-processing web-scraping
Last synced: 16 May 2026
https://github.com/zeeshanahmad4/nlp-pdf-minning-extracting-text-from-pdf
NLP PDF mining: extracting text from PDFs
extract-text pdf pdf-converter pdf-document-processor pdf-files pdf-format pdf-text-extraction pdfcon pdfkit pdftohtml pdftoimage pdftools pdftotext python text-extraction
Last synced: 01 Apr 2025
https://github.com/dataiku/dss-plugin-tesseract-ocr
Dataiku DSS plugin to perform optical character recognition (OCR) using the Tesseract engine.
dataiku dss-plugin ocr optical-character-recognition tesseract tesseract-ocr text-extraction
Last synced: 04 Apr 2026
https://github.com/rushi-balapure/pdf_2_json_extractor
A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction. pdf_to_json preserves document structure including headings (H1-H6) and body text, outputting clean JSON format.
cli-tool cpu-only cross-platform data-extraction document-parsing document-processing json layout-analysis nlp offline pdf pdf-extraction pdf-parser pdf-processing pdf-to-json python python-library structure-extraction text-extraction
Last synced: 21 Apr 2026
https://github.com/ceodaniyal/free-llm-image-to-text
Free OCR powered by LLMs using OpenRouter — extract text from images with no API costs. Works with image URLs and Base64 inputs using free vision-capable models.
ai-ocr api-integration computer-vision free-ai free-ocr image-processing image-to-text llm ocr openrouter python text-extraction vision-llm
Last synced: 04 May 2026
https://github.com/nishant2018/text-extraction-ocr-opencv
Text extraction is the process of automatically extracting text from images or documents. Optical Character Recognition (OCR) is a technology that enables computers to convert images of text into machine-readable text.
ocr opencv python text-extraction
Last synced: 04 May 2026
https://github.com/ceodaniyal/universal-llm-ocr
This repository contains a Python script to extract text from images using OpenAI's GPT-4 API. The script supports text extraction from both online image URLs and locally stored images (converted to base64). It ensures accurate and structured text extraction, making it a powerful tool for OCR-like tasks. The extracted text is saved to a file.
api-integration base64 gpt-4 gpt-4o gpt-4o-mini image-ocr image-processing image-to-text ocr openai python text-analysis text-extraction
Last synced: 04 May 2026
https://github.com/nbdy/prntscrngrb
prnt.sc / lightshot crawler, nudity detection and text extraction to a sqlite database
crawler nudity-detection prntsc text-extraction
Last synced: 04 Oct 2025
https://github.com/h0neyp0t-466/pen2pdf
📝 Pen2PDF – AI-powered web app to transform handwritten notes, slides, PDFs & images into editable Markdown ✏️ → export as polished PDFs 📄. Features drag & drop 📤, real-time editing ⚡, responsive UI 📱, and Google Gemini 🤖 integration. Perfect for students, creators & pros 🚀.
ai-app ai-text-extraction document-processing express file-converter google-gemini handwritten-notes javascript markdown-editor nodejs ocr pdf-converter pdf-to-markdown pdf-tools pen2pdf ppt-to-pdf react text-extraction vite web-app
Last synced: 15 Mar 2026
https://github.com/chchench/textract
Golang module for extracting text from XML-based MS Office documents
golang msoffice msoffice-parser msword text-extraction unarchive
Last synced: 13 Jan 2026
https://github.com/virajmadhu/pdf_key_matcher
Highlights the key matches between your given PDF and the description text
ats cv open-source pdf pdf-text-extraction python python-script python3 terminal-based text-compression text-extraction virajmadhu
Last synced: 01 Feb 2026
https://github.com/anyparser/anyparserjs
Anyparser Typescript SDK for RAG/ETL Pipelines - File Content Extraction. Supports extraction from various file formats including PDF, Microsoft Office documents, OCR/Image to Text, Audio to Text, and Website to Text.
anyparser artificial-intelligence cache-augmented-generation crawler etl-pipeline graph-rag knowledgebase langchain microsoft-office microsoft-word ms-office n8n-nodes ocr pdf-extraction rag retrieval-augmented-generation text-extraction web-crawler
Last synced: 17 Feb 2026
https://github.com/mazzasaverio/url2md4ai
Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.
html-to-markdown html-to-markdown-converter openai playwright text-extraction trafilatura
Last synced: 05 Jul 2025
https://github.com/fieldcure/fieldcure-document-parsers
Document text extraction library for DOCX, HWPX, XLSX, PPTX, and PDF. Supports OOXML math-to-LaTeX conversion, Hancom equation parsing, and IMediaDocumentParser for image extraction.
csharp document-parser docx dotnet equation-parser hwpx latex nuget pdf text-extraction
Last synced: 27 Apr 2026
https://github.com/cmhac/chat-extract
Experimental tool to extract data from screen recordings of text chats
chat-app osint text-extraction
Last synced: 02 Mar 2026
https://github.com/caesariodito/mp-assignment-automation
Mini Personal Project to Automate Assignments (Provide Insights Only)
automation chatgpt chatgpt-api homework-assignments image pdf python text-extraction
Last synced: 27 Jul 2025
https://github.com/importcjj/go-readability
Go package that cleans an HTML page for better readability.
extractor go golang html html-extractor html2text readability text text-extraction
Last synced: 14 Jan 2026
https://github.com/nationallibraryofnorway/maalfrid_toolkit
Toolkit for the Målfrid project
corpus crawling language-detection text-extraction
Last synced: 22 Jan 2026
https://github.com/lightbridge-ks/radreportparser-app
A Python web app for extracting key sections from radiology report text
radiology-report regex text-extraction
Last synced: 24 Apr 2026